Posts

Showing posts from 2013

Overview of Digital Cloning

Introduction The growing availability of image processing and editing software has made it easy to manipulate digital images. With the amount of digital content being generated nowadays, developing techniques to verify the authenticity and integrity of that content is essential to provide truthful evidence in a forensics case. In this context, copy-move is a type of forgery in which a part of an image is copied and pasted somewhere else in the same image. This forgery can be particularly hard to detect because properties like illumination and noise match between the source and the tampered regions. An example of copy-move forgery can be seen in picture 1: first the original image, followed by the tampered one, and then a picture indicating the cloned areas. Several techniques have been proposed to solve this problem. Block-based methods [1] divide an image into blocks of pixels and compare them to find a forgery. Keypoint-ba
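To make the block-based idea concrete, here is a minimal, naive sketch in Java (not from the original post): it slides a fixed-size window over the image, builds a coarse signature for each block, and reports positions whose signatures collide. Real detectors use robust block features (DCT, PCA) and sorting to stay tractable and tolerate post-processing; the signature below is only illustrative.

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.imageio.ImageIO;

// Naive block-based copy-move check: identical (or near-identical) blocks that
// appear at more than one position are candidate cloned regions.
public class NaiveCopyMoveDetector {

    static final int BLOCK = 8; // block side in pixels
    static final int STEP = 4;  // stride between blocks (blocks overlap)

    public static void main(String[] args) throws Exception {
        BufferedImage img = ImageIO.read(new File(args[0]));
        Map<String, List<int[]>> blocksBySignature = new HashMap<String, List<int[]>>();

        for (int y = 0; y + BLOCK <= img.getHeight(); y += STEP) {
            for (int x = 0; x + BLOCK <= img.getWidth(); x += STEP) {
                String sig = signature(img, x, y);
                if (!blocksBySignature.containsKey(sig)) {
                    blocksBySignature.put(sig, new ArrayList<int[]>());
                }
                blocksBySignature.get(sig).add(new int[] { x, y });
            }
        }

        // Any signature shared by more than one block position is a cloning candidate.
        for (List<int[]> positions : blocksBySignature.values()) {
            if (positions.size() > 1) {
                System.out.println("Possible cloned blocks:");
                for (int[] p : positions) {
                    System.out.println("  (" + p[0] + ", " + p[1] + ")");
                }
            }
        }
    }

    // Very coarse signature: the block's gray levels, quantized to tolerate a
    // small amount of noise. Real methods use DCT or PCA features instead.
    static String signature(BufferedImage img, int x0, int y0) {
        StringBuilder sb = new StringBuilder();
        for (int y = y0; y < y0 + BLOCK; y++) {
            for (int x = x0; x < x0 + BLOCK; x++) {
                int rgb = img.getRGB(x, y);
                int gray = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                sb.append(gray / 16).append(',');
            }
        }
        return sb.toString();
    }
}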

Understanding Apache Hive

Introduction: BigData and Hive Apache Hive is a software application created to facilitate data analysis on Apache Hadoop. It is a Java framework that helps extract knowledge from data stored on an HDFS cluster by providing a SQL-like interface to it. The Apache Hadoop platform is a major project in distributed computing and is commonly considered the best approach for dealing with BigData challenges. It is now very well established that a great volume of data is produced every day. Whether it comes from system logs or from user purchases, the amount of information generated is such that existing database and data warehouse solutions don't seem to scale well enough. The MapReduce programming paradigm was introduced in 2004 as a new approach to processing large datasets. In 2005 its open-source implementation, Hadoop, was created by Doug Cutting. Although Hadoop is not meant to substitute relational databases, it is a good solution for big data analyses
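To give an idea of what this SQL-like interface looks like from Java, here is a small, hedged example using the HiveServer2 JDBC driver; the host, port, table and column names are placeholders and are not part of the original post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Connects to a HiveServer2 instance and runs a simple HiveQL query.
// Adjust the JDBC URL and the query to your own cluster and tables.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

Behind the scenes, Hive compiles a query like this one into MapReduce jobs that run over the files stored on HDFS, which is exactly the convenience the post describes.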

Is there such a thing as "best" Recommender System algorithm?

I have received emails from users asking which recommender system algorithm they should use. Usually people start looking for articles on which approach performs better, and once they find something convincing they start to implement it. I believe that the best recommender system depends on the data and the problem you have to deal with. With that in mind, I decided to publish here some pros and cons for each recommender type (collaborative, content-based and hybrid), so people can decide for themselves which algorithms better suit their needs. I've already presented these approaches here, so if you know nothing about recommender systems, you can read about them there first. Collaborative Filtering Pros: Recommends diverse items to users, being innovative; Good practical results (read Amazon's article); It is widely used, and you can find several open-source implementations of it (Apache Mahout); It can be used on ratings from users on items; It can deal with video and

Recommender Systems Online Free Course on Coursera

I already talked about Coursera's great courses here. There is a new course on Recommender Systems starting in September: https://www.coursera.org/course/recsys I don't know how it is going to be, but based on the courses I've taken so far, it looks good.

Apache Hive .orig test file and "#### A masked pattern was here ####"

Just a quick note about something in Hive. If you have ever typed: $ ant clean package test to run the Apache Hive unit tests, you may have seen that Hive sometimes creates two output files. If you run, for example: $ ant test -Dtestcase=TestCliDriver -Dqfile=alter5.q Hive sometimes generates an alter5.q.out and an alter5.q.out.orig: build/ql/test/logs/clientpositive/alter5.q.out build/ql/test/logs/clientpositive/alter5.q.out.orig This happens because Hive uses a method to mask any local information, such as the local time or local paths, with the following sentence: #### A masked pattern was here #### So, if you check your .q.out file, it should have many occurrences of the sentence above covering local information. This information needs to be covered so that the test outputs are the same on all computers. The .q.out.orig file has the original test output, with the local information not masked. Out of curiosity, the method to mask the local patterns (private void
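As an illustration of the masking idea (this is not Hive's actual implementation, just a hedged sketch with hypothetical patterns), a method like the following could replace local paths and timestamps with the fixed marker so that every machine produces identical output:

import java.util.regex.Pattern;

// Illustration only: replace machine-specific details (paths, timestamps)
// with a fixed marker so test output compares equal on any computer.
public class MaskLocalPatterns {

    private static final String MASK = "#### A masked pattern was here ####";

    // Hypothetical patterns for absolute warehouse/tmp paths and timestamps.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("/[\\w./-]*(warehouse|tmp)[\\w./-]*"),
        Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}")
    };

    public static String mask(String line) {
        for (Pattern p : PATTERNS) {
            line = p.matcher(line).replaceAll(MASK);
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(mask("location hdfs://host/user/hive/warehouse/alter5"));
        System.out.println(mask("transient_lastDdlTime 2013-04-23 12:36:53"));
    }
}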

BigData Free Course Online

Coursera offers several great online courses from the best universities around the world. The courses involve video lectures released weekly, work assignments for the student, and reading material suggestions. I had enrolled in this course about BigData a couple of months ago, but I confess I only found time to start it last week. Once I started the course I was pleased with the content presented. They cover important Data Mining algorithms for dealing with large amounts of data, such as PageRank. MapReduce and Distributed File Systems are also two very well explained topics in this course. So, for those who want to know more about computing related to BigData, this course is certainly recommended! https://www.coursera.org/course/bigdata PS: The course has been offered since March, and its enrollment period must be over soon. But keep watching the course page, because they open new courses often.

How to Build Oozie with Different Versions of Hadoop

After downloading the Oozie code with svn checkout http://svn.apache.org/repos/asf/oozie/tags/release-3.3.0/ . and then building it against Hadoop 1.1.0 with the familiar mvn clean compile -Dhadoop.version=1.1.0 I got the following error:
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:06.497s
[INFO] Finished at: Tue Apr 23 12:36:53 BRT 2013
[INFO] Final Memory: 20M/67M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project oozie-sharelib-distcp: Could not resolve dependencies for project org.apache.oozie:oozie-sharelib-distcp:jar:3.3.0: Could not find artifact org.apache.oozie:oozie-hadoop-distcp:jar:1.1.0.oozie-3.3.0 in central (http://repo1.maven.org/maven2) -> [Help 1]
Reading a bit about it, and checking some pom files, I realized that inside the hadooplibs directory (inside the Oozie home), there are three sub-directories with

HashMap JVM Differences

Although Java's slogan is "Write once, run anywhere", to emphasize the cross-platform benefit, in practice this is unfortunately not entirely true. One known difference between the Sun JVM and other JVMs is the HashMap iteration order. When executing the exact same program and iterating through the exact same HashMap, a Sun JVM can produce a different output than another JVM. See as an example the code below:
import java.util.LinkedHashMap;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class HashMapTest {
        static HashMap<String, String> result = new HashMap<String, String>();
        static Iterator<Map.Entry<String, String>> entryIter;
        static HashMap<String, String> thash = new HashMap<String, String>();

        public static void main(String[] args) {
                for (int i = 0; i < 10; i++){
                        thash.put(Integer.toString(10 - i), "ab
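Since the excerpt cuts the code off, here is a minimal, self-contained sketch of the same idea (not the author's original file): it fills a HashMap and prints its entries, and the printed order may differ between JVM implementations because HashMap iteration order is unspecified.

import java.util.HashMap;
import java.util.Map;

// HashMap iteration order is unspecified, so the same program can print
// entries in a different order on different JVM implementations or versions.
public class HashMapOrderDemo {
    public static void main(String[] args) {
        Map<String, String> thash = new HashMap<String, String>();
        for (int i = 0; i < 10; i++) {
            thash.put(Integer.toString(10 - i), "value" + i);
        }
        // The order of the lines printed below is NOT guaranteed.
        for (Map.Entry<String, String> entry : thash.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
        // If a stable, insertion-ordered iteration is needed, use LinkedHashMap.
    }
}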

IBM BigData approach: BigInsights

Hadoop and BigData have been two tremendously hot topics lately. Although many people want to dig into Hadoop and enjoy the benefits of Big Data, most of them don't know exactly how to do it or where to start. This is where BigInsights is most beneficial. BigInsights is IBM's Apache Hadoop related software, and its many built-in features and capabilities give you a head start. First, besides having all the Hadoop ecosystem components (Hadoop, HBase, Hive, Pig, Oozie, ZooKeeper, Flume, Avro and Lucene) already working together and tested, it has a very easy-to-use install utility. If you have ever downloaded and installed Hadoop and all its components, and tried to make sure everything was working, you know how much time an automatic installer can save. The principal value brought by BigInsights is, in my opinion, the friendly web interface to the Hadoop tools. You don't have to program in "vim" or create MapReduce Java applications. You c

Dummy Mahout Recommender System Example

I already talked about the open-source Apache Mahout here, and now I'll show a dummy first example of how to use its recommender system. It is a basic Java example that I used to try out Mahout. Hope it helps people starting to work with it.
package myexample;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.XmlFile;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.cf.taste.
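The excerpt above is truncated, so here is a separate, minimal sketch of a Mahout Taste recommender (not the author's original code): it assumes a hypothetical ratings.csv file with "userID,itemID,rating" lines and uses a user-based recommender rather than the item-based and slope-one classes imported above.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// User-based recommender over a CSV file of "userID,itemID,rating" lines.
// The file name, user id and recommendation count are placeholders.
public class SimpleMahoutRecommender {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}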

Why are there three Hadoop svn repositories (common, hdfs and mapreduce)? Where is the repository for YARN?

When developers start reading about Hadoop, one of the first pieces of information they get is: "The project includes these modules: Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets." So it might be a little confusing when, trying to build the Hadoop code from source, they are directed to check out only a repository called hadoop-common. It might become even more confusing when you realize that there are two other repositories for Hadoop: hadoop-hdfs and hadoop-mapreduce. So which repositories should you use? The answer is: hadoop-common encompasses all these Hadoop modules. When looking at the hadoop-hdfs or hadoop-mapreduce repo

Frequent Itemset problem for MapReduce

I have received many emails asking for tips on starting Hadoop projects with Data Mining. In this post I describe how the Apriori algorithm solves the frequent itemset problem, and how it can be applied to a MapReduce framework. The Problem The frequent itemset problem consists of mining a set of items to find a subset of items that have a strong connection between them. A simple example to illustrate the concept: given a set of baskets in a supermarket, a frequent itemset could be hamburgers and ketchup. These items appear frequently in the baskets, and very often together. In general, a set of items that appears in many baskets is said to be frequent. In the computer world, we could use this algorithm to recommend items for purchase to a user. If A and B are a frequent itemset, once a user buys A, B would likely be a good recommendation. In this problem, the number of "baskets" is assumed to be very large, greater than what could fit in memory. The
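To sketch how the first Apriori pass maps onto MapReduce (a hedged example, not the post's original code; the support threshold and the input format of one space-separated basket per line are assumptions), each mapper emits the items of a basket and the reducer keeps only the items whose support reaches the threshold. The surviving frequent 1-itemsets are then used to generate candidate pairs for the next pass.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// First Apriori pass as a MapReduce job: count the support of every single
// item across all baskets and keep only those above the support threshold.
public class FrequentItemsPass1 {

    static final int MIN_SUPPORT = 100; // support threshold, adjust to your data

    public static class BasketMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(Object key, Text basket, Context context)
                throws IOException, InterruptedException {
            // Assumes each basket lists every item at most once.
            StringTokenizer items = new StringTokenizer(basket.toString());
            while (items.hasMoreTokens()) {
                item.set(items.nextToken());
                context.write(item, ONE);
            }
        }
    }

    public static class SupportReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text item, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int support = 0;
            for (IntWritable c : counts) {
                support += c.get();
            }
            if (support >= MIN_SUPPORT) {
                context.write(item, new IntWritable(support));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "frequent items pass 1");
        job.setJarByClass(FrequentItemsPass1.class);
        job.setMapperClass(BasketMapper.class);
        job.setReducerClass(SupportReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}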

Error when building Apache components using non-Sun Java

Sometimes, when building Apache Hadoop related components with non-Sun Java (such as IBM Java or OpenJDK) you may encounter the following error: java.lang.NoClassDefFoundError: org.apache.hadoop.security.UserGroupInformation I got this error while building HBase, Hive and Oozie, and every time the problem was the same: when building a component that depends on Hadoop, your Hadoop jar should be built with non-Sun Java as well. That means you should rename some com.sun imports in the Hadoop code so that it is buildable with other JVMs, replacing these com.sun imports with the corresponding non-Sun packages. Usually the jar that causes the problem is hadoop-core.jar. Hope it saves you all some work!