
Showing posts with the label Map Reduce

Genetic Algorithm for Knapsack using Hadoop

Development of a Genetic Algorithm using the Apache Hadoop framework to solve optimization problems

Introduction

This project, which I developed during a course in my Master's program, constructs a Genetic Algorithm to solve optimization problems, focusing on the Knapsack Problem. It is built on the distributed framework Apache Hadoop. The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-Complete optimization problems. Like knapsack, many of these problems have a simple structure and converge to optimal solutions given a proper amount of computation.

Genetic Algorithm

The algorithm follows the genetic paradigm. It starts with an initial random population (random instances of the problem). Then, the best individuals are selected from the population (the instances that yield the best profits for the knapsack). A crossover phase then generates new instances as combinations of the selected indi...
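As a rough illustration of the selection step, here is a minimal sketch of a knapsack fitness function, the measure by which individuals would be ranked. The capacity, weights and profits are illustrative values, not the ones used in the project:

    public class KnapsackFitness {
        static final int CAPACITY = 100;                      // knapsack weight limit (assumed)
        static final int[] WEIGHTS = {12, 7, 11, 8, 9};       // item weights (illustrative)
        static final int[] PROFITS = {24, 13, 23, 15, 16};    // item profits (illustrative)

        // An individual is a bit string: gene[i] == true means item i is packed.
        static int fitness(boolean[] gene) {
            int weight = 0, profit = 0;
            for (int i = 0; i < gene.length; i++) {
                if (gene[i]) { weight += WEIGHTS[i]; profit += PROFITS[i]; }
            }
            // Individuals that exceed the capacity are infeasible: fitness 0.
            return weight <= CAPACITY ? profit : 0;
        }
    }

In a MapReduce setting, a function like this would naturally run inside the map phase, with each mapper scoring a slice of the population in parallel.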

BigData Free Course Online

Coursera offers several great online courses from the best universities around the world. The courses involve video lectures released weekly, assignments for the student, and suggested reading material. I enrolled in this course about BigData a couple of months ago, and I confess I didn't find time to start it until last week. Once I started the course, I was pleased with the content presented. They talk about important Data Mining algorithms for dealing with large amounts of data, such as PageRank. MapReduce and Distributed File Systems are also two very well explained topics in this course. So, for those who want to know more about computing related to BigData, this course is certainly recommended! https://www.coursera.org/course/bigdata PS: The course has been offered since March, and its enrollment period will soon be over. But keep watching the course page, because they open new sessions often.

Frequent Itemset problem for MapReduce

I have received many emails asking for tips on starting Hadoop projects with Data Mining. In this post I describe how the Apriori algorithm solves the frequent itemset problem, and how it can be applied in a MapReduce framework.

The Problem

The frequent itemset problem consists of mining a set of items to find a subset of items that have a strong connection between them. A simple example to clarify the concept: given a set of baskets in a supermarket, a frequent itemset would be hamburgers and ketchup. These items appear frequently in the baskets, and very often together. In general, a set of items that appears in many baskets is said to be frequent. In the computing world, we could use this algorithm to recommend purchases to a user. If A and B form a frequent itemset, then once a user buys A, B would certainly be a good recommendation. In this problem, the number of "baskets" is assumed to be very large: greater than what could fit in memory. The ...
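To give an idea of how the counting passes of Apriori map onto Hadoop, here is a minimal sketch of the first pass, which counts individual items word-count style. The comma-separated basket format and the MIN_SUPPORT threshold are assumptions made for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FrequentItems {
        // Map: one input line is one basket, items separated by commas (assumed format).
        public static class ItemMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text item = new Text();
            public void map(Object key, Text basket, Context ctx)
                    throws IOException, InterruptedException {
                for (String s : basket.toString().split(",")) {
                    item.set(s.trim());
                    ctx.write(item, ONE);   // each occurrence counts toward support
                }
            }
        }

        // Reduce: keep only items whose total count reaches the support threshold.
        public static class SupportReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private static final int MIN_SUPPORT = 1000;  // illustrative threshold
            public void reduce(Text item, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                if (sum >= MIN_SUPPORT) ctx.write(item, new IntWritable(sum));
            }
        }
    }

Later Apriori passes follow the same pattern, with the mapper emitting only the candidate pairs (or larger sets) built from the items that survived the previous pass.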

Dependencies on a Hadoop Ecosystem

When building a Hadoop cluster and the other Apache projects related to it, it can be tricky to know what to install and how to install it. You first should understand your data and what you want to do with it. If you have log-like data that keeps growing all the time, and you have to keep feeding it to the Hadoop cluster, you might want to consider Flume. Apache Flume is a distributed, daemon-like software (and for that reason offers High Availability) that can keep feeding data to the Hadoop cluster. If you need a NoSQL random read/write database, you can use HBase, implemented based on Google's BigTable database. If you have relatively structured data and you want to run query-like analyses on it, you might consider Pig or Hive. Both work on top of the Hadoop cluster, executing commands instead of hand-written Java MapReduce jobs. They both provide their own language for such commands: Pig uses a textual language called Pig Latin, and Hive uses a syntax ver...
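For the HBase case, a random read/write boils down to a few client calls. Here is a minimal sketch using the classic HTable client API, where the table name, column family, row key and values are all illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");   // "users" is an illustrative table

            // Random write: one row, one column family, one qualifier.
            Put put = new Put(Bytes.toBytes("user42"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Result row = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            table.close();
        }
    }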

Apache Hadoop for Beginners

Apache Hadoop is a framework for distributed computing applications, inspired by Google's MapReduce and GFS papers. It is Open Source software that enables the processing of massive amounts of data on commodity hardware. First introduced by Doug Cutting, who named the project after his son's toy (a yellow elephant), Hadoop is now one of the greatest Apache projects. It involves many contributors and users around the world, such as Yahoo!, IBM, Facebook and many others. The framework presents a master/worker shared-nothing architecture. The Hadoop cluster is composed of a group of single nodes (computers), one of these nodes being the master server and the other nodes the workers. On the master node, the NameNode daemon and the JobTracker daemon usually run. The NameNode daemon keeps file metadata, and the JobTracker manages the MapReduce tasks executed on the cluster. The management and monitoring of tasks are done by the Hadoop serv...
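To make the master/worker layout concrete, a minimal (assumed) single-master setup points the workers at the NameNode and JobTracker through two classic Hadoop 1.x properties; the hostname and ports below are illustrative:

    <!-- core-site.xml: where the NameNode listens -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>

    <!-- mapred-site.xml: where the JobTracker listens -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
    </configuration>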

MapReduce

"Easy distributed computing" MapReduce is a framework introduced by Google for processing larges amounts of data . The framework uses a simple idea derived from the commonly known map and reduce functions used in functional programming (ex: LISP). It divides the main problem into smaller sub-problems and distribute these to a cluster of computers . It then combines the answers to these sub-problems to obtain a final answer . MapReduce facilitates the process of distributed computing making possible that users with no knowledge on the subject create their own distributed applications. The framework hides all the details of parallelization , data distribution load balancing and fault tolerance and the user basically has only to specify the Map and the Reduce functions. In the process, the inp ut is divided into small independent chunks. The map function receives a piece of the input, processes it, and passes the input in the format key/value pair as answer. These k...