Posts

Showing posts from January, 2013

Frequent Itemset problem for MapReduce

I have received many emails asking for tips on starting Hadoop projects with Data Mining. In this post I describe how the Apriori algorithm solves the frequent itemset problem, and how it can be applied to a MapReduce framework.

The Problem

The frequent itemset problem consists of mining a set of items to find a subset of items that have a strong connection between them. A simple example to clarify the concept: given a set of baskets in a supermarket, a frequent itemset would be hamburgers and ketchup. These items appear frequently in the baskets and, very often, together. In general, a set of items that appears in many baskets is said to be frequent. In the computer world, we could use this algorithm to recommend items for a user to purchase. If A and B form a frequent itemset, then once a user buys A, B is likely to be a good recommendation. In this problem, the number of "baskets" is assumed to be very large, greater than what could fit in memory. The
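To make the basket example concrete, here is a minimal in-memory sketch of Apriori-style counting in Python. It is only an illustration, not the MapReduce version the post goes on to describe: the function name `frequent_itemsets`, the `min_support` threshold, and the sample baskets are all assumptions made for this example, and the baskets here do fit in memory. Each counting pass plays the role of a map step (emit a candidate for every basket that contains it) followed by a reduce step (sum the counts).

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: count} for itemsets found in at least min_support baskets.

    Itemsets are sorted tuples. All names and parameters are hypothetical.
    """
    baskets = [set(b) for b in baskets]

    # Pass 1: count single items.
    counts = Counter(item for basket in baskets for item in basket)
    frequent = {(item,): n for item, n in counts.items() if n >= min_support}
    result = dict(frequent)

    size = 2
    while frequent and size <= max_size:
        # Only items that survived the previous pass can appear in a candidate.
        alive = {item for itemset in frequent for item in itemset}
        counts = Counter()
        for basket in baskets:
            for candidate in combinations(sorted(alive & basket), size):
                # Apriori pruning: every smaller subset must already be frequent.
                if all(sub in frequent for sub in combinations(candidate, size - 1)):
                    counts[candidate] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        size += 1
    return result


sample_baskets = [
    ["hamburger", "ketchup", "bread"],
    ["hamburger", "ketchup"],
    ["bread", "milk"],
    ["hamburger", "ketchup", "milk"],
]
result = frequent_itemsets(sample_baskets, min_support=3)
```

With these sample baskets and a threshold of 3, only hamburger, ketchup, and the pair (hamburger, ketchup) come out as frequent, which matches the supermarket example above.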

Error when building Apache components using non-Sun Java

Sometimes, when building Apache Hadoop related components with a non-Sun Java (such as IBM Java or OpenJDK), you may encounter the following error: java.lang.NoClassDefFoundError: org.apache.hadoop.security.UserGroupInformation. I got this error while building HBase, Hive and Oozie, and every time the problem was the same. When building a component that depends on Hadoop, your Hadoop jar should be built with the non-Sun Java as well. That means you should change some com.sun imports in the Hadoop code so that it can be built with other JVMs. Replace these com.sun imports with the corresponding non-Sun packages. Usually the jar that causes the problem is hadoop-core.jar. Hope it saves you all some work!
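As a starting point for finding the imports to change, here is a small Python sketch that lists every com.sun import in a Java source tree. The function name `find_com_sun_imports` and the idea of scanning a local checkout of the Hadoop sources are assumptions for this example; the script only reports the lines, and choosing the right replacement package for each one is still a manual step.

```python
import re
from pathlib import Path

# Matches lines such as "import com.sun.foo.Bar;" or "import static com.sun.foo.*;"
COM_SUN_IMPORT = re.compile(
    r"^\s*import\s+(?:static\s+)?(com\.sun\.[\w.]+(?:\.\*)?)\s*;"
)


def find_com_sun_imports(source_root):
    """Return (file path, line number, imported name) for each com.sun import.

    source_root is a hypothetical path to a Java source checkout.
    """
    hits = []
    for path in sorted(Path(source_root).rglob("*.java")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                match = COM_SUN_IMPORT.match(line)
                if match:
                    hits.append((str(path), lineno, match.group(1)))
    return hits
```

Calling `find_com_sun_imports` on the directory that holds the Hadoop sources returns a list you can work through one import at a time before rebuilding the jar.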