
Showing posts with the label BigData

Apache Hadoop Admin Tricks and Tips

In this post I will share some tips I learned after using Apache Hadoop for several years and running many workshops and courses. The information here applies to Apache Hadoop around version 2.9, but it can probably be extended to other similar versions. These are considerations for building or operating a Hadoop cluster; some apply specifically to the Cloudera distribution. Anyway, hope it helps! Don't use Hadoop for millions of small files: it overloads the namenode and makes it slower, and it is not difficult to overload the namenode. Always check capacity vs. number of files. Files on Hadoop should usually be larger than 100 MB. Plan for about 1 GB of namenode memory for every million files. Node failures are one of the most frequent problems in Hadoop; nodes typically start failing after about five years, and big companies like Facebook and Google likely see node failures by the minute. The MySQL on Cloudera Manager does not have redunda...
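The rule of thumb above (roughly 1 GB of namenode heap per million files) can be turned into a quick capacity check. This is a minimal sketch; the function name and the linear scaling factor are illustrative assumptions, not part of any Hadoop tooling.

```python
# Rough namenode heap estimate based on the rule of thumb above:
# about 1 GB of namenode memory per 1 million files.
# Hypothetical helper, not part of Hadoop itself.

def estimate_namenode_heap_gb(num_files, gb_per_million=1.0):
    """Return an approximate namenode heap requirement in GB."""
    return num_files / 1_000_000 * gb_per_million

# 250 million small files would need around 250 GB of namenode heap --
# a strong hint to merge them into fewer, larger (>100 MB) files.
print(estimate_namenode_heap_gb(250_000_000))
```

Running the check before ingesting a large batch of small files makes it obvious when a merge step (e.g. into sequence files) is needed.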

BigData White Papers

I don't know about you, but I always like to read the white papers that originated open-source projects (when available, of course :) ). I have been working with BigData quite a lot lately, and this area is mostly dominated by Apache open-source projects. So, naturally (given the nerd that I am), I tried to investigate their history. I created a list of the articles and companies that originated most BigData Apache projects. Here it is! Hope you guys find it interesting too. :)

Apache Hadoop
Based on: Google MapReduce and GFS
Papers:
https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

Apache Spark
Created by: University of California, Berkeley
Papers:
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
http://peo...

Genetic Algorithm for Knapsack using Hadoop

Development of a genetic algorithm using the Apache Hadoop framework to solve optimization problems. Introduction: I developed this project during a course in my Master's program; it constructs a genetic algorithm to solve optimization problems, focusing on the knapsack problem. It is built on the distributed framework Apache Hadoop. The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-complete optimization problems. Like knapsack, many such problems have a simple structure and converge to optimal solutions given a proper amount of computation. Genetic algorithm: the algorithm follows the genetic paradigm. It starts with an initial random population (random instances of the problem). Then the best individuals are selected from the population (the instances that yield the best profits for the knapsack). A crossover phase was then implemented to generate new instances as combinations of the selected indi...
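The genetic phases described above (random population, selection by knapsack profit, crossover, mutation) can be sketched on a single machine as follows. This is not the project's actual Hadoop code: in the MapReduce version, fitness evaluation would run in mappers and selection/crossover in reducers. The knapsack instance and all names here are illustrative.

```python
import random

# Illustrative 0/1 knapsack instance (weights, profits, capacity).
WEIGHTS = [12, 7, 11, 8, 9]
PROFITS = [24, 13, 23, 15, 16]
CAPACITY = 26

def fitness(individual):
    """Total profit of the selected items, or 0 if over capacity."""
    weight = sum(w for w, bit in zip(WEIGHTS, individual) if bit)
    profit = sum(p for p, bit in zip(PROFITS, individual) if bit)
    return profit if weight <= CAPACITY else 0

def crossover(a, b):
    """Single-point crossover of two parent bit strings."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(individual, rate=0.05):
    """Flip each bit independently with the given probability."""
    return [bit ^ 1 if random.random() < rate else bit
            for bit in individual]

def evolve(pop_size=40, generations=60):
    """Run the GA and return the best individual found."""
    n = len(WEIGHTS)
    population = [[random.randint(0, 1) for _ in range(n)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        # Crossover + mutation refill the population.
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

With this instance, the optimum (profit 51, taking items 2, 3, and 4) is tiny enough that the GA reliably finds it; the point of the Hadoop version is that the same fitness/selection/crossover structure parallelizes when populations grow large.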

IBM BigData approach: BigInsights

Hadoop and BigData have been tremendously hot topics lately. Although many people want to dig into Hadoop and enjoy the benefits of Big Data, most of them don't know exactly how to do it or where to start. This is where BigInsights is most beneficial. BigInsights is the Apache Hadoop related software from IBM, and its many built-in features and capabilities give you a head start. First, besides having all the Hadoop ecosystem components (Hadoop, HBase, Hive, Pig, Oozie, ZooKeeper, Flume, Avro and Lucene) already tested and working together, it has a very easy-to-use install utility. If you have ever downloaded and installed Hadoop and all its components, and tried to make sure everything was working, you know how much time an automatic installer can save. The principal value brought by BigInsights is, in my opinion, the friendly web interface to the Hadoop tools. You don't have to program in "vim" or create MapReduce Java applications. You c...