Tuesday, January 29, 2013

Frequent Itemset problem for MapReduce

I have received many emails asking for tips for starting Hadoop projects with Data Mining. In this post I describe how the Apriori algorithm solves the frequent itemset problem, and how it can be applied to a MapReduce framework.

The Problem

The frequent itemset problem consists of mining a set of items to find a subset of items that have a strong connexion between them.
A simple example to clear the concept would be: given a set of baskets in a supermarket, a frequent itemset would be hamburgers and ketchup.
These items appear frequently in the baskets, and very often, together.
In the general a set of items that appear in many baskets is said to be frequent.

In the computer world, we could use this algorithm to recommend items of purchase for a user. If A and B are a frequent itemset,
once a user buys A, B would certainly be a good recommendation.

In this problem, the number of "baskets" in assumed to be very large. Greater than what could fit in memory. The number of items in a basket, on the other hand, is considered small.

The main challenge in this problem is the amount of data to be put in memory. In a set of N items per basket for example,
there are n!/2!(n-2)! pair combinations of items. We would have to keep all these combinations for all baskets and iterate through them
to find the frequent pairs.

This is where the Apriori algorithm enters!
The Apriori algorithm is based on the idea that for a pair o items to be frequent, each individual item should also be frequent.
If the hamburguer-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.

The Solution

So for the algorithm, it is established a "threshold X" to define what is or it is not frequent. If an item appears more than X times, it is considered frequent.

The first step of the algorithm is to pass for each item in each basket, and calculate their frequency (count how many time it appears).
This can be done with a hash of size N, where the position y of the hash, refers to the frequency of Y.

If item y has a frequency greater than X, it is said to be frequent.

In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that
we compute only for items that are individually frequent. So if item y and item z are frequent on itselves,
we then compute the frequency of the pair. This condition greatly reduces the pairs to compute, and the amount of memory taken.

Once this is calculated, the frequencies greater than the threshold are said frequent itemset.

How to Map-and-Reduce it???

To use this algorithm in the MapReduce model, you can follow these instructions described in the "Mining of Massive Datasets" book:

First Map Function: Take the assigned subset of the baskets and find the
itemsets frequent in the subset using the algorithm of Section 6.4.1. As described
there, lower the support threshold from s to ps if each Map task gets
fraction p of the total input file. The output is a set of key-value pairs (F, 1),
where F is a frequent itemset from the sample. The value is always 1 and is

First Reduce Function: Each Reduce task is assigned a set of keys, which
are itemsets. The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first
Reduce function is the candidate itemsets.

Second Map Function: The Map tasks for the second Map function take
all the output from the first Reduce Function (the candidate itemsets) and a
portion of the input data file. Each Map task counts the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that it was assigned. The output is a set of key-value pairs (C, v), where C is one of the candidate sets and v is the support for that itemset among the baskets that were input to this Map task.

Second Reduce Function: The Reduce tasks take the itemsets they are
given as keys and sum the associated values. The result is the total support
for each of the itemsets that the Reduce task was assigned to handle. Those
itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce task outputs these itemsets with their counts. Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task.


For this post I used the book "Mining of Massive Datasets", from Anand Rajaraman and Jeff Ullman. This book is really really good. I definitely recommend it, and you can get it for free here:


Besides the frequent itemset problem, it shows how to model many data mining algorithms to MapReduce framework.

Friday, January 11, 2013

Error when building Apache components using non-Sun Java

Sometimes, when building some Apache Hadoop related components with non-Sun Java (such as IBM Java or OpenJDK) you may encounter the following error:

java.lang.NoClassDefFoundError: org.apache.hadoop.security.UserGroupInformation

I got this error while building Hbase, Hive and Oozie. And all times the problem was the same. When building a component that depends on Hadoop, your Hadoop jar should build with non-Sun Java as well.

That means that you should rename some com.sun imports in Hadoop code, so it is buildable with other JVMs. Replace this com.sun imports with the correspondent non-Sun package.

Usually the jar that causes the problem is hadoop-core.jar.

Hope it save you all some work!