Friday, December 20, 2013

Overview of Digital Cloning


The growth of the image processing and editing software availability has made it easy to manipulate digital images.
With the amount of digital content being generated nowadays, developing  techniques to verify the authenticity and integrity of digital content might be essential to provide truthful evidences in a forensics case.
In this context, copy-move is a type of forgery in which a part of an image is copied and pasted somewhere else in the same image. This forgery might be particularly challenging to discover due to properties like illumination  and noise matching on the source and the tampered regions. An example of copy-move forgery can be seen in picture 1. First we can see the original image, followed by the tampered one, and then a picture with the indication of the cloned areas.

Several techniques have been proposed to solve this problem. The Block-based methods [1] divide an image in blocks of pixels and compare them to find a forgery.

Keypoint-based methods [2] on the other hand extract keypoints of an image  and use these to find tampered regions.

While keypoints might generate better and computationally efficient detectors [3]  they also present difficulty in finding tampering in homogeneous regions.


Many block-based algorithms have been proposed over the years. Involving mainly dividing the image in a fixed size b of pixels, they differ on how they compare the blocks.
Once blocks are extracted from a NxM image, they are usually inserted  lexicographically in a M - b + 1 x N - b + 1 matrix. This matrix is latter on analysed to see if any two lines match as a cloned region. Authors have proposed ways to improve this method such as using PCA to reduce the blocks dimensions.

Among several approaches in the literature, the cloning detection via multiscale analysis and voting method can be highlighted, proposed by Ewerton Silva et al. The first step on the process here is to extract keypoints of interest in an image using SURF, robust to scaling and rotating, and then match such points between themselves. These points are then grouped based on their physic distance, to limit the search space. The image is then redimensioned,
creating a type of pyramid of images, representing the several scales of the image. In each level of the pyramid, a search is made for possible duplicated regions. This search only occurs in the point groups discovered. The final decision is made based on a voting process, if a certain region is considered cloned in more than a threshold of levels of the pyramid.

Another work worth mentioning is an evaluation made on copy-move forgery approaches, by Christlein et al. 15 different copy-move detection approaches were compared in their work such as SIFT, SURF, PCA, KPCA, Zerkine,
DCT and DWT. They aimed to answer which algorithm performed best against realistic scenarios like different compressions and noise. Results showed that keypoints methods (SIFT and SURF) had clear computational complexity advantage, but other block based methods like Zernike achieved quite precise results.


[1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and
Dan B. Goldman. The generalized patchmatch correspon-
dence algorithm. 2010.
[2] Ewerton Silva and Anderson Rocha. Cloning detection. In
Elsevier JVCI 2013 (Submited).
[3] Christian Riess Johannes Jordan Corinna Riess Elli An-
gelopoulou Vincent Christlein. Evaluation of popular copy-
move forgery detection approaches. In IEEE Transactions on
Information Forensics and Security (TIFS) 2012, pages 015–
021, Graz, Austria, 2010.
[4] Andrea Vedaldi. An implementation of multi-dimensional
maximally stable extremal regions. 2007.

This topic was inspired from a class I took with Anderson Rocha. See his publications on digital forensics here:

Tuesday, August 20, 2013

Understanding Apache Hive


  BigData and Hive

Apache Hive is a software application created to facilitate data analyses on Apache Hadoop. It is a Java framework that helps extracting knowledge from data placed on a HDFS cluster by providing a SQL-like interface to it.

The Apache Hadoop platform is a major project on distributed computing and it is commonly assumed to be the best approach when dealing with BigData challenges.
It is now very well established that great volume of data is produced everyday. Whether it is by system logs or by users purchases, the amount of information generated is such that previous existing Databases and Datawarehouses solutions don’t seem to scale well enough.
The MapReduce programming paradigm was uncovered in 2004 as a new approach on processing large datasets. In 2005 its OpenSource version, Hadoop, was created by Doug Cutting. Although Hadoop is not set for substituting relational databases, it is a good solution for big data analyses.

Hadoop facilitates large data processing, but it still requires skillful programmers to create the Map and Reduce functions to analyze the data. All analyzes made through Hadoop had to be condensed on these two functions. Creating this type of applications might be challenging and difficult to maintain. Previous data developers had difficulty on extracting intelligence from their data. Hive was created to overcome this issues.

Apache Hive

First introduced by Facebook and latter donated to the Apache Software Foundation, it is a data warehouse interface for Hadoop. With Hive, users can create SQL statements that will be automatically converted to MapReduce jobs and run on a HDFS cluster.

Data can be inserted or dealt with on the Hadoop cluster through command line interface using statements from the Hive Quey Language, or HiveQL, such as SELECT, INSERT or CREATE TABLE. Users can also create their own User Defined Functions, by extending the UDF class already provided. Within these statements tables can be defined using primitive types as integers, floating points, strings, dates and booleans. Furthermore, new types can be created by grouping these primitives types into maps and arrays. Please check for more information on HiveQL.

Although Hive presents a data warehouse interface for Hadoop, it is still a batch processing framework. As Hive’s data is located on Hadoop, it is limited to Hadoop’s constraints.
Hadoop does not index data it is not made for editing Data. There is no UPDATE on Hive, because this functionality could not be executed on data over HDFS. Hive does not support transactions. If you want these kind of database on top of Hadoop you should look for options such as HBase. Check to read more about this Hadoop limitations.

Even so, Apache Hive made it possible for developers with basic SQL knowledge to create complicated meaningful queries and quickly extract value from big data.


Users can start interacting with Hive though a Command Line Interface (CLI), Hive Web Interface (HWI), JDBC or ODBC.
The CLI interface is a command line tool accessed through a terminal. It can be initiated by calling the HIVE_HOME/bin/hive script, inside Hive downloaded source code. Hive also provides a Hive server, so that users can use JDBC or ODBC to communicate with it.

When you type a query through the CLI interface, this HiveQL statement will be handled by the Driver component. The Driver connects a bunch of modules that transform the statement into MapReduce jobs to be run in Hadoop. It is importante to note that the query is not transformed in Java code in this process. Its goes direclty to MapReduce jobs. The modules involved in this process are: Parser, Semantic Analyzes, Logical Plan generator, Optimizer, Physical Plan Generator and Executor.

First, the Driver creates a session to remember details about the process, to maintain dates and statistics.

Some metadata (information about tables and columns) is then collected and stored on Metastore as soon as the input data (tables) are created. This metadata is actually stored in a relational database and it is latter on used on the Semantic Analyses.

ANTLR software is used to create a parser on the Parser module and parse the query. As in a compiler, the statement in broken down into token values and a Abstract Syntax Tree (AST) is created.

The following HiveQL statement

FROM src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON (src1.key + src2.key = src3.key)

INSERT OVERWRITE TABLE dest1 SELECT src1.key, src3.value

would became this AST

(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_JOIN (TOK_TABREF (TOK_TABNAME src) src1) (TOK_TABREF (TOK_TABNAME src) src2) (= (. (TOK_TABLE_OR_COL src1) key) (. (TOK_TABLE_OR_COL src2) key))) (TOK_TABREF (TOK_TABNAME src) src3) (= (+ (. (TOK_TABLE_OR_COL src1) key) (. (TOK_TABLE_OR_COL src2) key)) (. (TOK_TABLE_OR_COL src3) key)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME dest1))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src1) key)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL src3) value))))) null

The process follows on with a Semantic Analyses on the generated AST. The information provided on the query is verified to be valid by confronting the schema information from the input tables, stored on the Metastore component.
Type checking is a example of operations performed by the Semantic Analyzes.

After this analyses, a operator tree is created by the Logical Plan Generator, based on the parsed information and on the AST created. This operator tree is then, passed to the Optimizer procedure, which will perform a set of transformations to, not surprisingly, optimize the operations. The improvements accomplished by the Optimizer include column pruning (only column really needed will be fetched) and join reordering (to make sure only small tables are kept in memory).
The Physical Plan Generator gets the optimized operator tree and creates a Directed Acyclic Graph of MapReduce jobs of it. This physical plan is displayed in a XML file, and it is delivered to the Executor to be executed into the Hadoop cluster finally.

Hive and the different Hadoop versions

Hive can be built with Hadoop 1.x or with Hadoop 2.x. It presents interfaces for this purpose, and these interfaces are defined in the Shims interface.

There are three interfaces for Hadoop: 0.20, 0.20S, 0.23. 0.20 is supposed to work with Hadoop 1.x, 0.20s is for a secure version of Hadoop 1.x and 0.23 is for building Hive against Hadoop 2.x.

You can prevent a interface to be built by editing the property shims.include on HIVE_HOME/shims/build.xml:

<property name="shims.include" value="0.20,0.20S,0.23"/>                             

Hive uses a Factory Method to decide which Hadoop interface to use, based on the version of Hadoop on the classpath. This is situated on



encapsulates the interfaces.

hadoop.version is defined on HIVE_HOME/ but you can overwrite it by using the flag -Dhadoop.version.

To build Hive with Hadoop 1.1.2, use

$ ant clean package -Dhadoop.version=1.1.2                   

To build Hive with Hadoop 2.0.4-alpha, use

$ ant clean package -Dhadoop.version=2.0.4-alpha -Dhadoop-0.23.version=2.0.4-alpha


To use some of Hive features such as UDF, or even to adapt Hive’s code to your own needs, you might have to build the source code from source.
Before start, make sure you have a Java JDK, Ant and Subversion installed on your computer.

Then, start by downloading the last stable release version from Hive repository.

$ svn checkout hive-0.11.0

Enter on your Hive home directory (which from now on, we will call HIVE_HOME):

$ cd hive-0.11.0                                               

And finally build the code with:

$ ant package                                                  

This will automatically download and install all dependencies required for Hive’s use.

Hive depends on (or uses) other Hadoop-related components. As from Hive 0.11 version, these are:

Apache Hadoop
Apache HBase
Apache Avro
Apache Zookeper

This components will be automatically downloaded by Ant and Ivy, when you run the ant package command. You can check which version of each component will be downloaded in


and, as explained on last session, Hadoop version can be chekced here:


To check all ant command possibilities with Hive, type:

$ ant -p                                                       

This should show how to built it, test it and even how to create a tar file from the source. The testing will be explained a little bit further next.

Unit Tests

Hive provides several buitltin Unit Tests to verify its own modules and features functionalities. They are constructed using JUnit 4 and run queries (.q files) already provided by the framework.

To create the JUnit classes execute:

$ ant package                                                  

To run the unit tests type:

$ ant test                                                     

To run a specific test run:

$ ant test -Dtestcase=TestCliDriver                            

To run a specific query inside one Unit Test run:

$ ant test -Dtestcase=TestCliDriver -Dqfile=alter5.q           

The command described above will produce a output that will be compared with Hive’s expected output. It will also generate a .xml log file, very helpuf for debbugging purposes:


If you are having troubles with a certain testcase, and trying to debug it, pay attention: some java test files (all files under ql module) for Hive are created on build time from Velocity Templates (.vm). If you want to modify this tests you have to change the .vm file, not the .java one.


Book: Programming Hive:Data Warehouse and Query Language for Hadoop. Edward Capriolo, Dean Wampler, Jason Rutherglen

Article: Hive A Warehousing Solution Over a MapReduce

Framework. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao,

Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy.

Facebook Data Infrastructure Team

Saturday, July 27, 2013

Is there such a thing as "best" Recommender System algorithm?

I received emails from users asking which recommender system algorithm they should use. Usually people start looking for articles on which approach has a better performance, and once they find something convincing they start to implement it.

I believe that the best recommender system depends on the data and the problem you have to deal with.

With that in mind, I decided to publish here some pros and cons for each recommender type (collaborative, content and hybrid), so people can decide for their own what algoritms better suit their needs.

I've already presented these approaches here, so if you know nothing about recommender systems, you can read it there first.

Collaborative Filtering


  • Recommends diverse items to users, being innovative;
  • Good practical results (read Amazon's article);
  • It is widely used, and you can find several OpenSource  implementations of it (Apache Mahout);
  • It can be used on ratings from users on items;
  • It can deal with video and audio data;


  •   It suffers with scarcity of data, if you don't have many ratings for example you might end up with bad results;
  •   When the number of ratings grow, scalability becomes an issue, it might be hard to calculate similarity for all users;

Content Based Filtering


  • It works better with smaller amount of information than Collaborative Filtering;
  • It uses description of items, so it works well with tagged items, and it usually matches well users preferences profile;


  • It doesn't work so well for video or audio data with no text tags;
  • Frequently recommends repetitive items, staying only on similar things that the user has already seen;

Hybrid Systems


  • Usually the most effective approach (more accuracy on results);
  • It overcomes the single approaches;


  • Hard to find a balance when combining the two approaches;
  • Challenging to implement;

Friday, July 26, 2013

Recommender Systems Online Free Course on Coursera

I already talked about Coursera's great courses here.
There is a new course on Recommender Systems starting in September:

I don't know how it is going to be, but based on the courses I've done so far, it looks good.

Tuesday, June 18, 2013

Apache Hive .orig test file and "#### A masked pattern was here ####"

Just a quick information about something in Hive.

If you ever typed:

$ ant clean package test

to run Apache Hive unit tests, you may have seen that Hive sometimes creates two output files.
If you run for example:

$ ant test -Dtestcase=TestCliDriver -Dqfile=alter5.q

Hive sometimes generates a alter5.q.out and a alter5.q.out.orig :


This happens because Hive uses a method to mask any local information, as local time, or local path, with the following sentence:

#### A masked pattern was here ####

So, if you check your .q.out file it should have a bunch of this sentence above covering several local information. This information needs to be covered so that the tests outputs are the same in all computers.

The .q.out.orig file has the original test output, with all the local information non covered.

Out of curiosity, the method to mask the local patterns (private void maskPatterns(String[] patterns, String fname) throws Exception) is located on:


Tuesday, May 21, 2013

BigData Free Course Online

Coursera offers several great online courses from the best universities around the world. The courses involve video lectures being released weekly, work assignments for the student, and reading material indications.

 I had enrolled on this course about BigData a couple of months ago, and I confess I didn't have time to start doing it since last week.

Once I started the course I was pleased with the content presented.
They talk about important Data Mining algorithms for dealing with great amount of data such as PageRank.
MapReduce and Distributed File Systems are also two very well explained topics on this course.

So, for those who want to know more about computing related to BigData this course is certainly recommended!

PS: The course is being offered since march, and its inscriptions period must soon be over. But keep watching the course page, because they open new courses often.

Tuesday, April 23, 2013

How to Build Oozie with Different Versions of Hadoop

After downloading Oozie code with

svn checkout .

and then building it with Hadoop 1.1.0 with the familiar

mvn clean compile -Dhadoop.version=1.1.0

I got the following error:

[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:06.497s
[INFO] Finished at: Tue Apr 23 12:36:53 BRT 2013
[INFO] Final Memory: 20M/67M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project oozie-sharelib-distcp: Could not resolve dependencies for project org.apache.oozie:oozie-sharelib-distcp:jar:3.3.0: Could not find artifact org.apache.oozie:oozie-hadoop-distcp:jar:1.1.0.oozie-3.3.0 in central ( -> [Help 1]

Reading a bit about it, and checking some pom files, I realized that inside the hadoolibs directory (inside oozie home), there are three sub-directories with the hadoop version hard coded on their poms.
So when you pass the -Dhadoop.version, these pom don't "change"! And they continue on using their pre-defined version of Hadoop!

I talked to the community guys from Oozie, and they say that the recommended thing to do is to change the pom files itself, and not pass by parameter.

Resuming, if you want to build oozie 3.3 with a different Hadoop, edit these pom files:


Setting the desired version of Hadoop. This off courseif you are building against Hadoop 1.x. If you are building oozie with Hadoop 2.x, edit:


Friday, April 5, 2013

HashMap JVM Differences

Although Java slogan's is "Write once, run everywhere" , to emphasize the cross-platform benefit, in practice unfortunately this is not totally true.

One known difference between Sun and other JVMs is the HashMap order output.

When  executing the exact same program and iterating though  the same exact same HashMap input, a Sun JVM will produce a different output than another JVM.

See as example the code below:

import java.util.LinkedHashMap;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class HashMapTest {

        static HashMap<String, String> result = new HashMap<String, String>();
        static Iterator<Map.Entry<String, String>> entryIter;
        static HashMap<String, String> thash = new HashMap<String, String>();

        public static void main(String[] args) {

                for (int i = 0; i < 10; i++){
                        thash.put(Integer.toString(10 - i), "abc");


                entryIter = result.entrySet().iterator();
                while (entryIter.hasNext()) {

                        Map.Entry<String, String>  entry =;
                        String val1 = entry.getKey();
                        String val =  entry.getValue();
                        System.out.println("Key: "+ val1 + " Value: "+val);


Compiling and executing this code with Sun Java will create the following output:

Key: 3 Value: abc
Key: 2 Value: abc
Key: 10 Value: abc
Key: 1 Value: abc
Key: 7 Value: abc
Key: 6 Value: abc
Key: 5 Value: abc
Key: 4 Value: abc
Key: 9 Value: abc
Key: 8 Value: abc

While whether doing the same thing with IBM Java you should get:

Key: 10 Value: abc
Key: 9 Value: abc
Key: 8 Value: abc
Key: 7 Value: abc
Key: 6 Value: abc
Key: 5 Value: abc
Key: 4 Value: abc
Key: 3 Value: abc
Key: 2 Value: abc
Key: 1 Value: abc

I don't want to enter in merits of which one is right and which one is wrong. Just want to alert people that this issue can cause serious differences in programs output.

Thursday, March 28, 2013

IBM BigData approach: BigInsights

Hadoop and BigData have been two tremendous hot topic lately.

Although many people want to dig into Hadoop and enjoy the benefits of Big Data, most of them don't know exactly how to do it or where to start it. This is where BigInsights is most beneficial.

BigInsights is the Apache Hadoop related software from IBM, and its many built-in features and capabilities leverage your start point.

First, besides having all Hadoop  ecosystem components (Hadoop, Hbase, Hive, Pig, Oozie, Zookeeper, Flume, Avro and Lucene) already working together and tested, it has a very easy-to-use install utility.

If you have ever downloaded and installed Hadoop and all its components, and tried to make sure everything was working, you should know how much time a automatic installer can save.

The principal value brought by BigInsights is, in my opinion, the friendly web-interface of the Hadoop tools. You don't have to program on "vim" or create MapReduce Java applications. You can use web tools, in a spreasheet-interface utility, to run queries on you data.
You can import and export data to your cluster through the web-interface, and manage it too.

I wrote a book about BigInsights, describing what it is, how to install it and how to use it.
You can find it here:

You can download the free version here.

Wednesday, March 20, 2013

Dummy Mahout Recommender System Example

I already talked about the Open Source Apache Mahout here, and now I'll show a dummy dummy first example of how to use its recommender system.

It is a basic Java example that I used to try out Mahout. Hope it helps people starting to work with it.


package myexample;


import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.util.List;

 * Renata Ghisloti - Dummy Mahout Example

public class GeneralRecommender {

  public static void main(String[] args) throws IOException, TasteException, SAXException, ParserConfigurationException {

    String recsFile = args[0];
    long userId = Long.parseLong(args[1]);
    String categoriesFile = args[2];
    String outputPlace = args[3];
    Integer neighborhoodSize = Integer.parseInt(args[4]);
    Integer method = 0;
    String version = null;

    if(args.length >= 6 )
        method  = Integer.parseInt(args[5]);
        version = args[6];

    //Default - needed to initiate the recommendation
    InputSource is = new InputSource(new FileInputStream(recsFile));
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser sp = factory.newSAXParser();
    ContentHandler handler = new ContentHandler();
    sp.parse(is, handler);

    //Here is were you should load your own input
    XmlFile dataModel = new XmlFile(new File(recsFile));

      case 0:
       recommenderItemBased(dataModel, userId , categoriesFile, outputPlace, handler, version);
      case 1:
       recommenderItemBased(dataModel, userId , categoriesFile, outputPlace, handler, version);
      case 2:
       recommenderSlopeOne(dataModel, userId , categoriesFile, outputPlace, handler);
      case 3:
       recommenderUserBased(dataModel, userId , categoriesFile, outputPlace, handler, neighborhoodSize, version);

    //Item Based Recommender System
    public static void recommenderItemBased(XmlFile dataModel, long userId ,
        String categoriesFile, String outputPlace, ContentHandler handler, String version) throws  TasteException{

        System.out.println("Recommending with Item Based");
        ItemSimilarity itemSimilarity;

        if(version == "LogLikelihoodSimilarity")
            itemSimilarity = new LogLikelihoodSimilarity(dataModel);
        else {
            itemSimilarity = new PearsonCorrelationSimilarity(dataModel); 
            System.out.println("Recommending with Item Based Pearson");
        ItemBasedRecommender recommender =
            new GenericItemBasedRecommender(dataModel, itemSimilarity);

        //Just get top 5 recommendations
        List recommendations =
            recommender.recommend(userId, 5);

        //This is were you should add your own print output method
        PrintXml.printRecs(dataModel, userId, recommendations,, categoriesFile, outputPlace);

    //Slope One Recommender System
    public static void recommenderSlopeOne(XmlFile dataModel, long userId ,
        String categoriesFile, String outputPlace, ContentHandler handler) throws  TasteException{

        System.out.println("Recommending with Slope One");

        CachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(dataModel));

        List recommendations =
            cachingRecommender.recommend(userId, 5);

        PrintXml.printRecs(dataModel, userId, recommendations,, categoriesFile, outputPlace);

    //User based Recommender System
    public static void recommenderUserBased(XmlFile dataModel, long userId ,
        String categoriesFile, String outputPlace, ContentHandler handler, Integer neighborhoodSize, String version) throws  TasteException{

        System.out.println("Recommending with User Based");
        UserSimilarity userSimilarity;

        if(version == "LogLikelihoodSimilarity")
            userSimilarity = new LogLikelihoodSimilarity(dataModel);
            userSimilarity = new PearsonCorrelationSimilarity(dataModel);

        userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel));

        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(neighborhoodSize, userSimilarity, dataModel);

        Recommender recommender =
            new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);

        List recommendations =
            recommender.recommend(userId, 5);

    PrintXml.printRecs(dataModel, userId, recommendations,, categoriesFile, outputPlace);

Wednesday, February 20, 2013

Why are there three Hadoop svn repositories (common, hdfs and mapreduce)? Where is the repository for YARN?

When developers start reading about Hadoop, one of the first info they get is:

"The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets."
So it might be a little confusing when trying to build Hadoop code from source, they are indicated to check out only a repository called hadoop-common.

It might became even more confusing when you realize that there are two other repositories for Hadoop: hadoop-hdfs and hadoop-mapreduce.

So what repositories to use? 

The answer is: hadoop-common encompasses all these Hadoop modules.

When looking at the hadoop-hdfs or hadoop-mapreduce repos you should see that they haven't been modified since 2009 (more or less). What happened was that until version 0.21 Hadoop repositories were divided between modules. From version 0.22 on, they were combined into a single SVN repository, as documented on Jira:

What about the YARN thing? And what is MapReduce 2?

A few years ago, there was a "split" between Hadoop releases: release 1.x continued on as classic Hadoop from version 0.21, and release 2.x was created based on 0.22, with different features.

Hadoop 2.x includes a couple of new modules that enables MapReduce running on a general resource management system for running distributed applications. This system is YARN, and the MapReduce that runs on it is called MapReduce 2.

This new 2.x release does not contain the old classic MapReduce, only the  MapReduce 2.

The hadoop-common repository includes all of these modules: YARN, HDFS, MapReduce, MapReduce2 and Common utilities libraries. Just pay attention that the presence of these modules will vary from release to release.

Tuesday, January 29, 2013

Frequent Itemset problem for MapReduce

I have received many emails asking for tips for starting Hadoop projects with Data Mining. In this post I describe how the Apriori algorithm solves the frequent itemset problem, and how it can be applied to a MapReduce framework.

The Problem

The frequent itemset problem consists of mining a set of items to find a subset of items that have a strong connexion between them.
A simple example to clear the concept would be: given a set of baskets in a supermarket, a frequent itemset would be hamburgers and ketchup.
These items appear frequently in the baskets, and very often, together.
In the general a set of items that appear in many baskets is said to be frequent.

In the computer world, we could use this algorithm to recommend items of purchase for a user. If A and B are a frequent itemset,
once a user buys A, B would certainly be a good recommendation.

In this problem, the number of "baskets" in assumed to be very large. Greater than what could fit in memory. The number of items in a basket, on the other hand, is considered small.

The main challenge in this problem is the amount of data to be put in memory. In a set of N items per basket for example,
there are n!/2!(n-2)! pair combinations of items. We would have to keep all these combinations for all baskets and iterate through them
to find the frequent pairs.

This is where the Apriori algorithm enters!
The Apriori algorithm is based on the idea that for a pair o items to be frequent, each individual item should also be frequent.
If the hamburguer-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.

The Solution

So for the algorithm, it is established a "threshold X" to define what is or it is not frequent. If an item appears more than X times, it is considered frequent.

The first step of the algorithm is to pass for each item in each basket, and calculate their frequency (count how many time it appears).
This can be done with a hash of size N, where the position y of the hash, refers to the frequency of Y.

If item y has a frequency greater than X, it is said to be frequent.

In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that
we compute only for items that are individually frequent. So if item y and item z are frequent on itselves,
we then compute the frequency of the pair. This condition greatly reduces the pairs to compute, and the amount of memory taken.

Once this is calculated, the frequencies greater than the threshold are said frequent itemset.

How to Map-and-Reduce it???

To use this algorithm in the MapReduce model, you can follow these instructions described in the "Mining of Massive Datasets" book:

First Map Function: Take the assigned subset of the baskets and find the
itemsets frequent in the subset using the algorithm of Section 6.4.1. As described
there, lower the support threshold from s to ps if each Map task gets
fraction p of the total input file. The output is a set of key-value pairs (F, 1),
where F is a frequent itemset from the sample. The value is always 1 and is

First Reduce Function: Each Reduce task is assigned a set of keys, which
are itemsets. The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first
Reduce function is the candidate itemsets.

Second Map Function: The Map tasks for the second Map function take
all the output from the first Reduce Function (the candidate itemsets) and a
portion of the input data file. Each Map task counts the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that it was assigned. The output is a set of key-value pairs (C, v), where C is one of the candidate sets and v is the support for that itemset among the baskets that were input to this Map task.

Second Reduce Function: The Reduce tasks take the itemsets they are
given as keys and sum the associated values. The result is the total support
for each of the itemsets that the Reduce task was assigned to handle. Those
itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce task outputs these itemsets with their counts. Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task.


For this post I used the book "Mining of Massive Datasets", from Anand Rajaraman and Jeff Ullman. This book is really really good. I definitely recommend it, and you can get it for free here:

Besides the frequent itemset problem, it shows how to model many data mining algorithms to MapReduce framework.

Friday, January 11, 2013

Error when building Apache components using non-Sun Java

Sometimes, when building some Apache Hadoop related components with non-Sun Java (such as IBM Java or OpenJDK) you may encounter the following error:


I got this error while building Hbase, Hive and Oozie. And all times the problem was the same. When building a component that depends on Hadoop, your Hadoop jar should build with non-Sun Java as well.

That means that you should rename some com.sun imports in Hadoop code, so it is buildable with other JVMs. Replace this com.sun imports with the correspondent non-Sun package.

Usually the jar that causes the problem is hadoop-core.jar.

Hope it save you all some work!