Frequent Itemset problem for MapReduce
I have received many emails asking for tips on starting Hadoop projects in Data Mining. In this post I describe how the Apriori algorithm solves the frequent itemset problem, and how it can be adapted to a MapReduce framework.

The Problem

The frequent itemset problem consists of mining a set of items to find subsets of items that have a strong connection between them. A simple example to clarify the concept: given a set of baskets in a supermarket, a frequent itemset would be hamburgers and ketchup. These items appear frequently in the baskets and, very often, together.

In general, a set of items that appears in many baskets is said to be frequent.

In the computer world, we could use this algorithm to recommend items for a user to purchase. If A and B form a frequent itemset, then once a user buys A, B is likely to be a good recommendation.

In this problem, the number of "baskets" is assumed to be very large, greater than what could fit in memory. The ...