Tuesday, December 21, 2010

MapReduce

"Easy distributed computing"

MapReduce is a framework introduced by Google for processing larges amounts of data.
The framework uses a simple idea derived from the commonly known map and reduce functions used in functional programming (ex: LISP). It divides the main problem into smaller sub-problems and distribute these to a cluster of computers. It then combines the answers to these sub-problems to obtain a final answer.

MapReduce facilitates the process of distributed computing making possible that users with no knowledge on the subject create their own distributed applications. The framework hides all the details of parallelization, data distribution load balancing and fault tolerance and the user basically has only to specify the Map and the Reduce functions.

In the process, the input is divided into small independent chunks. The map function receives a piece of the input, processes it, and passes the input in the format key/value pair as answer. These key/values are grouped in a certain way and given as input to the reduce function. This in its turn merges the values, giving the final answer.

Each map and each reduce may be processed by a different node (Computer) in the cluster. The quantity of nodes in the cluster may be as big as the availability of computers in your network. The framework is responsible for dividing the input and feeding the map function. Afterwards, it collects  map's outputs, group and send them to the reduce function. After the work of reduce if done, the framework gather the answers in a final output.

The following picture shows the MapReduce flow.


Example of Map and Reduce functions:

 This is a canonical example of the Map and the Reduce functions. It is an application to count the number of occurrence of words in a large collection of documents.

void map(String name, String document):
   // name: document name
   // document: document contents
   for each word w in document:
     EmitIntermediate(w, 1);


void reduce(String word, Iterator partialCounts):
   // word: a word
   // partialCounts: a list of aggregated partial counts
   int result = 0;
   for each pc in partialCounts:
     result += ParseInt(pc);
   Emit(result);

In this example, each map called receives a document and outputs the occurrence "1" to each word.
 The reduce function is going to  get a list with all the occurrences of a certain word and count it. Then as output it will give the total number of occurrences of that word.


Useful Links:


http://code.google.com/intl/pt-BR/edu/submissions/mapreduce-minilecture/listing.html
http://labs.google.com/papers/mapreduce.html
http://code.google.com/intl/pt-BR/edu/parallel/mapreduce-tutorial.html
http://code.google.com/intl/fr/edu/submissions/mapreduce/listing.html
http://www.mapreduce.org/
http://en.wikipedia.org/wiki/MapReduce

1 comment:

  1. Great writeup!

    The thing that I miss in every non-academical essay about MapReduce are (slightly) more complicated examples. The canonical example is _too_ simple IMO for one to really grasp the concept.

    One example which I find really great is the Anagram problem (Given a dataset, find the words which are an anagram of another) - It's more complex and seems more "real".

    That said, a suggestion: write a bit about real-world cases that would greatly benefit from using MapReduce.

    Cheers!

    ReplyDelete