Tuesday, August 5, 2014

Lecture on Recommender Systems

Great lecture on Recommender Systems by Xavier Amatriain, researcher at Netflix.

https://www.youtube.com/watch?v=bLhq63ygoU8


Wednesday, July 16, 2014

Genetic Algorithm for Knapsack using Hadoop


Development of Genetic Algorithm using Apache Hadoop framework to solve optimization problems



Introduction


I developed this project during a course in my Master's program. It builds a genetic algorithm to solve optimization problems, focusing on the Knapsack Problem, and uses the distributed framework Apache Hadoop as its base.
The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-Complete optimization problems. Like the knapsack, many of these problems have a simple structure and converge to optimal solutions given a proper amount of computation.



Genetic Algorithm


The algorithm was developed following the genetic paradigm. It starts with an initial random population (random instances of the problem). Then, the best individuals are selected from the population (the instances that yield the best profits for the knapsack). A crossover phase then generates new instances by combining the selected individuals: parts of each selected individual are chosen at random to form a new instance.
The next step is the mutation phase, in which "genes" are also selected at random and changed. In the knapsack context, this means that random items are swapped in and out of the solutions found.
This cycle can be repeated several times, until good individuals are generated.
The steps are reproduced below:

Initial Population → Selection → Crossover → Mutation
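
To make the phases concrete, here is a minimal, hypothetical sketch of how one generation could be expressed as a Hadoop 1.x MapReduce job: the mapper scores each candidate solution and a single reducer performs selection, crossover, and mutation. The class names, the bit-string encoding, and the hard-coded item data (the P01 validation instance used later in this post) are illustrative assumptions, not the exact code of the project.

package org.myorg;

// Hypothetical sketch (not the exact project code) of one GA generation as a
// Hadoop 1.x MapReduce job. Each individual is a bit string in which bit i == 1
// means that item i is placed in the knapsack.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GeneticSketch {

  // Fitness phase: compute the profit of each individual (0 if it exceeds the
  // knapsack capacity) and emit everything under one key, so that a single
  // reducer sees the whole population. The item data is the P01 instance.
  public static class FitnessMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final int[] PROFITS = {92, 57, 49, 68, 60, 43, 67, 84, 87, 72};
    private static final int[] WEIGHTS = {23, 31, 29, 44, 53, 38, 63, 85, 89, 82};
    private static final int CAPACITY = 165;

    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String bits = line.toString().trim();              // e.g. "1111010000"
      if (bits.isEmpty()) return;
      int profit = 0, weight = 0;
      for (int i = 0; i < bits.length(); i++) {
        if (bits.charAt(i) == '1') { profit += PROFITS[i]; weight += WEIGHTS[i]; }
      }
      int fitness = (weight <= CAPACITY) ? profit : 0;   // infeasible -> fitness 0
      ctx.write(new IntWritable(0), new Text(fitness + "\t" + bits));
    }
  }

  // Selection, crossover and mutation: keep the fittest individuals and breed
  // the next generation from them.
  public static class EvolveReducer
      extends Reducer<IntWritable, Text, Text, NullWritable> {
    private final Random rnd = new Random();

    protected void reduce(IntWritable key, Iterable<Text> population, Context ctx)
        throws IOException, InterruptedException {
      List<String[]> scored = new ArrayList<String[]>();
      for (Text t : population) scored.add(t.toString().split("\t"));
      Collections.sort(scored, new Comparator<String[]>() {  // best fitness first
        public int compare(String[] a, String[] b) {
          return Integer.parseInt(b[0]) - Integer.parseInt(a[0]);
        }
      });
      int survivors = Math.min(10, scored.size());           // selection
      for (int i = 0; i + 1 < survivors; i++) {
        String child = crossover(scored.get(i)[1], scored.get(i + 1)[1]);
        ctx.write(new Text(mutate(child)), NullWritable.get());
      }
    }

    private String crossover(String a, String b) {           // one-point crossover
      int cut = rnd.nextInt(a.length());
      return a.substring(0, cut) + b.substring(cut);
    }

    private String mutate(String s) {                        // flip one random gene
      int i = rnd.nextInt(s.length());
      char flipped = s.charAt(i) == '1' ? '0' : '1';
      return s.substring(0, i) + flipped + s.substring(i + 1);
    }
  }
}

In this layout, each run of the job produces the next generation as its output, which can then be fed back as the input of a new run to repeat the cycle.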




The Knapsack Problem


The problem can be stated as the following integer program:

        maximize    ∑i vi xi
        subject to  ∑i wi xi ≤ W
                    xi ∈ {0, 1}   for all i


The goal is to maximize the objective function, where vi represents the value of each element i and the variable xi indicates whether element i is taken or not. The maximization has to respect the first constraint, which accounts for the weight wi of each element and the knapsack capacity W.

This problem was chosen because solving it only requires a sufficient amount of processing, and that processing can be done in parallel. The space of possible solutions can be seen as a set of local maxima to be explored by the genetic algorithm; hopefully, with enough computation, the global maximum can be reached.




Validation Results


To demonstrate that the algorithm finds the correct result with high probability, it was tested against two validation datasets: one composed of 10 items and another of 15 items.

With the validation input P01, we have:

P01 is a set of 10 weights and profits for a knapsack of capacity 165:

Item    Weight    Profit
 0        23        92
 1        31        57
 2        29        49
 3        44        68
 4        53        60
 5        38        43
 6        63        67
 7        85        84
 8        89        87
 9        82        72


The optimal value is 309, obtained by selecting items 0, 1, 2, 3, and 5.

The algorithm finds the correct answer:

Best 5 answers:

Value    Items Selected
 270     0 1 2 9
 276     0 2 3 6
 277     0 1 3 4
 284     0 1 3 6
 309     0 1 2 3 5

The input P02 has optimal profit value 1458, and the algorithm also finds the correct answer.

These datasets were found at:


Setting up the Hadoop Cluster on a Single Machine



For testing purposes, here is how to get started with Hadoop and set it up on a single machine.

Download Hadoop 1.2.1 from:

http://hadoop.apache.org/releases.html

Install Java 6:

$ sudo apt-get install oracle-java6-installer

Inside the Hadoop home directory, update the hadoop-env.sh configuration file in conf/:

$ vim conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-oracle/
export HADOOP_HOME=/home/renata/hadoop-1.2.1
export HADOOP_VERSION=1.2.1

Create an input directory, to place your input file:

$ mkdir input

Put some content in:

$ cp conf/*.xml input

To test whether your Hadoop installation is working properly:

$ bin/hadoop jar hadoop-examples*.jar grep input output 'dfs[a-z.]+'

Don't forget to remove the output directory after running a job; Hadoop doesn't do this for you.

$ rm -rf output

Now, to test the Genetic jar file:

$ export HADOOP_HOME=/home/renata/hadoop-1.2.1
$ export HADOOP_VERSION=1.2.1

To compile the code, run:

$ javac -classpath ${HADOOP_HOME}/hadoop-core*.jar -d genetic_classes Genetic.java

And create the jar:

$ jar -cvf genetic.jar -C genetic_classes/ .

Now, to run the code once you have placed the input data in the "input" directory:

$ bin/hadoop jar genetic.jar org.myorg.Genetic input output 
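
For reference, a driver for a class invoked this way might look like the minimal sketch below. It assumes the hypothetical FitnessMapper and EvolveReducer classes from the sketch earlier in this post; the real org.myorg.Genetic class may be organized differently (for example, looping over several generations inside main).

// Hypothetical driver (Hadoop 1.x new API) wiring one GA generation together.
package org.myorg;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Genetic {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "knapsack-genetic");                // Hadoop 1.x constructor
    job.setJarByClass(Genetic.class);
    job.setMapperClass(GeneticSketch.FitnessMapper.class);
    job.setReducerClass(GeneticSketch.EvolveReducer.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));       // "input"
    FileOutputFormat.setOutputPath(job, new Path(args[1]));     // "output"
    System.exit(job.waitForCompletion(true) ? 0 : 1);           // one generation per run
  }
}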


PS: A large dataset consisting of 10,000 items, with an optimal profit value of 7145 and a weight capacity of 431, can be found at: https://github.com/jaredtking/knapsack-lisp/blob/master/datasets/large01.

Friday, January 31, 2014

Dealing with NP-Hard Problems: An Introduction to Approximation Algorithms


This is just a quick overview of approximation algorithms; it is a broad topic. For more information, see the References.

The famous NP-Complete class is known for its possible intractability. NP means nondeterministic polynomial time, and for a problem to be NP-Complete it has to be
  •   in NP (its solutions can be verified in polynomial time) and
  •   NP-Hard (at least as hard as any other problem in NP).

Among the several important problems that are NP-Complete or NP-Hard (in their optimization forms), we can name the Knapsack, the Traveling Salesman, and the Set Cover problems.

Even though no efficient optimal algorithm may exist for NP-Complete problems, we still need to address them because of the number of practical problems that lie in this class.

Considering that exponential brute force is impractical even for medium-sized inputs, the alternative is to give up as little of the optimal solution as possible in exchange for an efficient algorithm.

Approximation algorithms are one way to deal with NP-Hard problems. They are polynomial-time algorithms that return solutions guaranteed to be near optimal for any instance of the problem. The objective is to make the output as close to the optimal value as possible. We say that an α-approximation is an algorithm whose output is always within a factor α of the optimal value.

Measures of Approximation


Among the measures of approximation, we can highlight:
  • Absolute approximation
  • Approximation factor
We say that an algorithm A is an absolute approximation if it stays within an absolute distance k of the optimal value OPT:

                         |A(I) − OPT(I)| ≤ k  ∀ instance I.

An algorithm has an approximation factor of α if:

                       A(I) ≤ α · OPT(I) ∀I, for minimization
                       A(I) ≥ α · OPT(I) ∀I, for maximization

where α > 1 for minimization and α < 1 for maximization problems.

Approximation Techniques


Greedy approaches are highly intuitive and usually easy-to-implement approximation algorithms. They consist of making the "best" local choice at each step. For example, for the Set Cover problem, a greedy algorithm would be:

"At each round choose the set that minimizes the ratio weight of its weight to the number of currently uncovered elements it contains."

You can find a proof of the approximation guarantee in [1].
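
As an illustration of the quoted rule, here is a minimal sketch in Java (the data structures and names are assumptions made for this example, not code from [1]): each set is given as a set of element indices with an associated weight.

// Greedy Set Cover sketch: repeatedly pick the set with the smallest
// weight-per-newly-covered-element ratio until the universe is covered.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreedySetCover {
  public static List<Integer> cover(List<Set<Integer>> sets, double[] weights,
                                    Set<Integer> universe) {
    Set<Integer> uncovered = new HashSet<Integer>(universe);
    List<Integer> chosen = new ArrayList<Integer>();
    while (!uncovered.isEmpty()) {
      int bestSet = -1;
      double bestRatio = Double.MAX_VALUE;
      for (int i = 0; i < sets.size(); i++) {
        Set<Integer> newlyCovered = new HashSet<Integer>(sets.get(i));
        newlyCovered.retainAll(uncovered);                 // elements this set would add
        if (newlyCovered.isEmpty()) continue;
        double ratio = weights[i] / newlyCovered.size();   // weight per uncovered element
        if (ratio < bestRatio) { bestRatio = ratio; bestSet = i; }
      }
      if (bestSet == -1) break;                            // remaining elements are uncoverable
      chosen.add(bestSet);
      uncovered.removeAll(sets.get(bestSet));
    }
    return chosen;                                         // indices of the selected sets
  }
}

This simple rule is already enough to give the classical logarithmic (harmonic-number) approximation guarantee for weighted Set Cover.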


Another way to create an approximation is by making a linear relaxation of an integer program. Recall that many problems can be written in mathematical form, where we have an objective function to maximize (or minimize), constraints, and variables.
The Knapsack is a good example:

        maximize    ∑i vi xi
        subject to  ∑i wi xi ≤ W
                    xi ∈ {0, 1}   for all i

The goal here is to maximize the objective function, where vi represents the value of each element i and the variable xi indicates whether element i is taken or not. The maximization has to take into account the weight wi of each element, as captured by the first constraint.

Approximation through linear relaxation can be done in a very simple way. While integer programming is NP-Hard, linear programming is not. If we relax (substitute) the integrality constraint xi ∈ {0, 1} to 0 ≤ xi ≤ 1, we can obtain a solution (not necessarily optimal for the integer program!) in polynomial time.

Here is what it would look like:

        maximize    ∑i vi xi
        subject to  ∑i wi xi ≤ W
                    0 ≤ xi ≤ 1   for all i

The linear relaxation serves as a lower bound on the optimal solution in minimization problems and as an upper bound in maximization problems.
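
For the knapsack relaxation in particular, the relaxed optimum does not even require a general LP solver: sorting the items by value-to-weight ratio and filling the knapsack greedily, taking a fraction of the first item that does not fit, solves the relaxed problem exactly. A small sketch (the array names are illustrative):

// Greedy optimum of the knapsack LP relaxation (0 <= x_i <= 1): take items in
// decreasing value/weight order and split the first item that does not fit.
import java.util.Arrays;
import java.util.Comparator;

public class FractionalKnapsack {
  public static double relaxedOptimum(final double[] v, final double[] w, double capacity) {
    Integer[] idx = new Integer[v.length];
    for (int i = 0; i < idx.length; i++) idx[i] = i;
    Arrays.sort(idx, new Comparator<Integer>() {          // best value/weight ratio first
      public int compare(Integer a, Integer b) {
        return Double.compare(v[b] / w[b], v[a] / w[a]);
      }
    });
    double total = 0.0;
    for (int k = 0; k < idx.length && capacity > 0; k++) {
      int i = idx[k];
      double take = Math.min(1.0, capacity / w[i]);        // fraction of item i taken
      total += take * v[i];
      capacity -= take * w[i];
    }
    return total;                                          // upper bound on the 0/1 optimum
  }
}

The value returned is exactly the upper bound mentioned above; only one variable ends up fractional, which is what makes this relaxation a convenient starting point for rounding.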


There is also a very powerful method called the Primal-Dual schema.

It is well known that, in some cases, optimal solutions satisfy the complementary slackness conditions.

The Primal-Dual schema works with a relaxed version of the complementary slackness conditions. It starts, for example, with an integral feasible solution for the primal and a feasible solution for the dual. Iteratively, it satisfies the slackness conditions, making sure that no constraint is violated in the process. When all conditions are satisfied, the program ends.

It can be a little complex, so I am going to write a post about it in the future. For now, to understand it in detail, please read [2].

Classes of Approximation


Algorithms such as the ones mentioned above can be classified by their approximation "degree", that is, how closely they can approximate the optimal solution.
The classes are listed below:

  • PO - problems for which there are exact polynomial algorithms. This is an extension of the P class to optimization problems.
  • PTAS (Polynomial Time Approximation Scheme) - problems that admit an approximation scheme achieving (1 + ε)OPT for minimization and (1 − ε)OPT for maximization, for any fixed relative error ε > 0 (a rational number).
  • FPTAS (Fully Polynomial Time Approximation Scheme) - like PTAS, but the scheme's running time is also polynomial in the inverse of the error (1/ε).
  • APX - there is at least one polynomial α-approximation (for some constant α).
  • NPO - an extension of the NP class to optimization problems. A problem is in the NP Optimization class if its instances and solutions are polynomial in size, solutions can be verified in polynomial time, and the objective function can be computed in polynomial time.
This is the relation among the classes:

PO ⊆  FPTAS ⊆ PTAS ⊆ APX ⊆ NPO

If a problem is in the FPTAS class, it usually admits better approximations than one that is only in the APX class.

The knapsack problem, for example, has been proved to be in FPTAS [3]. If someone were to prove that it is also in PO (that is, solvable exactly in polynomial time), then P = NP.
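
For intuition, here is a rough sketch of the classic knapsack FPTAS (not taken from [3]; the method and variable names are illustrative): profits are scaled down by K = ε·vmax/n, and a dynamic program indexed by scaled profit finds the lightest subset for each profit level, which yields a solution of value at least (1 − ε)·OPT in time polynomial in n and 1/ε.

// Sketch of the classic knapsack FPTAS: scale profits by K = eps * vmax / n,
// then run a dynamic program over scaled profits (illustrative, unoptimized).
import java.util.Arrays;

public class KnapsackFptas {
  public static long approximate(long[] v, long[] w, long capacity, double eps) {
    int n = v.length;
    long vmax = 0;
    for (int i = 0; i < n; i++) if (w[i] <= capacity) vmax = Math.max(vmax, v[i]);
    if (vmax == 0) return 0;                         // no item fits on its own

    double K = eps * vmax / n;                       // profit scaling factor
    long[] scaled = new long[n];
    int sum = 0;
    for (int i = 0; i < n; i++) {
      scaled[i] = (long) Math.floor(v[i] / K);
      sum += (int) scaled[i];
    }

    long INF = Long.MAX_VALUE / 4;
    long[] minWeight = new long[sum + 1];            // min weight reaching scaled profit p
    long[] origValue = new long[sum + 1];            // original value of that solution
    Arrays.fill(minWeight, INF);
    minWeight[0] = 0;

    for (int i = 0; i < n; i++) {
      for (int p = sum; p >= scaled[i]; p--) {       // 0/1 semantics: iterate p downwards
        int q = p - (int) scaled[i];
        if (minWeight[q] < INF && minWeight[q] + w[i] < minWeight[p]) {
          minWeight[p] = minWeight[q] + w[i];
          origValue[p] = origValue[q] + v[i];
        }
      }
    }

    long best = 0;
    for (int p = 0; p <= sum; p++)
      if (minWeight[p] <= capacity) best = Math.max(best, origValue[p]);
    return best;                                     // guaranteed >= (1 - eps) * OPT
  }
}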


References


[1] D. P. Williamson and D. B. Shmoys. The Design of Approximation Algorithms. 2010. Cambridge University Press.
 
[2] V. Vazirani. Approximation Algorithms. 2001. Springer-Verlag. 


[3] (in Portuguese) M.H. Carvalho, M.R. Cerioli, R. Dahab, P. Feofiloff, C.G. Fernandes, C.E. Ferreira, K.S. Guimarães, F.K. Miyazawa, J.C. Pina Jr., J. Soares, Y. Wakabayashi. Uma introdução sucinta a algoritmos de aproximação. 23º Colóquio Brasileiro de Matemática, IMPA, Rio de Janeiro.

[4]  Giorgio Ausiello, Pierluigi Crescenzi, and Marco Protasi. Approximate Solution of NP Optimization Problems. Theor. Comput. Sci. 150(1):1-55 (1995)


Friday, December 20, 2013

Overview of Digital Cloning

Introduction


The growing availability of image processing and editing software has made it easy to manipulate digital images.
With the amount of digital content being generated nowadays, developing techniques to verify the authenticity and integrity of digital content is essential to provide trustworthy evidence in forensic cases.
In this context, copy-move is a type of forgery in which a part of an image is copied and pasted somewhere else in the same image. This forgery can be particularly challenging to detect because properties such as illumination and noise match between the source and the tampered regions. An example of copy-move forgery can be seen in Picture 1: first the original image, followed by the tampered one, and then an image indicating the cloned areas.


Several techniques have been proposed to solve this problem. Block-based methods [1] divide an image into blocks of pixels and compare them to find a forgery.

Keypoint-based methods [2], on the other hand, extract keypoints of an image and use them to find tampered regions.

While keypoints can yield more accurate and computationally efficient detectors [3], they have difficulty finding tampering in homogeneous regions.

State-of-the-Art


Many block-based algorithms have been proposed over the years. They mainly involve dividing the image into blocks of a fixed size of b pixels and differ in how they compare the blocks.
Once blocks are extracted from an N x M image, they are usually inserted lexicographically into an (M − b + 1) x (N − b + 1) matrix. This matrix is later analysed to see whether any two rows match as a cloned region. Authors have proposed ways to improve this method, such as using PCA to reduce the block dimensions. A simplified sketch of this idea is shown below.
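
The sketch below is a toy illustration of that pipeline, assuming a grayscale image, raw pixel values as the block feature, and exact matching of neighbouring rows (the class and method names are mine; real detectors use more robust features such as PCA or DCT coefficients and similarity thresholds instead of exact equality).

// Toy block-based matcher: slide a b x b window over a grayscale image, use the
// raw pixels of each block as its feature row, sort the rows lexicographically
// and report identical neighbouring rows as candidate cloned block pairs.
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BlockMatchSketch {
  public static List<int[]> candidatePairs(BufferedImage gray, int b) {
    int cols = gray.getWidth() - b + 1, rows = gray.getHeight() - b + 1;
    final List<int[]> features = new ArrayList<int[]>();
    List<int[]> positions = new ArrayList<int[]>();
    for (int y = 0; y < rows; y++) {
      for (int x = 0; x < cols; x++) {
        int[] f = new int[b * b];
        for (int dy = 0; dy < b; dy++)
          for (int dx = 0; dx < b; dx++)
            f[dy * b + dx] = gray.getRaster().getSample(x + dx, y + dy, 0);
        features.add(f);
        positions.add(new int[]{x, y});
      }
    }
    Integer[] order = new Integer[features.size()];
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, new Comparator<Integer>() {          // lexicographic block sort
      public int compare(Integer a, Integer c) {
        int[] fa = features.get(a), fc = features.get(c);
        for (int k = 0; k < fa.length; k++)
          if (fa[k] != fc[k]) return fa[k] - fc[k];
        return 0;
      }
    });
    List<int[]> pairs = new ArrayList<int[]>();             // each entry: {x1, y1, x2, y2}
    for (int i = 1; i < order.length; i++)
      if (Arrays.equals(features.get(order[i - 1]), features.get(order[i]))) {
        int[] p1 = positions.get(order[i - 1]), p2 = positions.get(order[i]);
        pairs.add(new int[]{p1[0], p1[1], p2[0], p2[1]});
      }
    return pairs;
  }
}

A real detector would also discard pairs whose positions are very close to each other, since overlapping blocks in smooth regions trivially match.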

Among the several approaches in the literature, the cloning detection via multiscale analysis and voting proposed by Ewerton Silva et al. can be highlighted. The first step of the process is to extract keypoints of interest from the image using SURF, which is robust to scaling and rotation, and then match these points among themselves. The points are then grouped based on their physical distance, to limit the search space. The image is then resized, creating a kind of image pyramid representing its several scales. At each level of the pyramid, a search is made for possible duplicated regions; this search only occurs within the point groups already discovered. The final decision is made through a voting process: a region is flagged if it is considered cloned in more than a threshold number of pyramid levels.

Another work worth mentioning is an evaluation of copy-move forgery approaches by Christlein et al. Fifteen different copy-move detection approaches were compared in their work, including SIFT, SURF, PCA, KPCA, Zernike, DCT, and DWT. They aimed to answer which algorithm performed best in realistic scenarios such as different compression levels and noise. The results showed that keypoint methods (SIFT and SURF) had a clear computational advantage, but block-based methods such as Zernike achieved quite precise results.


References

[1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B. Goldman. The Generalized PatchMatch Correspondence Algorithm. 2010.
[2] Ewerton Silva and Anderson Rocha. Cloning Detection. In Elsevier JVCI 2013 (submitted).
[3] Vincent Christlein, Christian Riess, Johannes Jordan, Corinna Riess, and Elli Angelopoulou. Evaluation of Popular Copy-Move Forgery Detection Approaches. In IEEE Transactions on Information Forensics and Security (TIFS), 2012.
[4] Andrea Vedaldi. An Implementation of Multi-Dimensional Maximally Stable Extremal Regions. 2007.


This topic was inspired by a class I took with Anderson Rocha. See his publications on digital forensics here: http://www.ic.unicamp.br/~rocha/pub/index.html.

Tuesday, August 20, 2013

Understanding Apache Hive


  
Introduction

  BigData and Hive

Apache Hive is a software application created to facilitate data analysis on Apache Hadoop. It is a Java framework that helps extract knowledge from data stored on an HDFS cluster by providing a SQL-like interface to it.

The Apache Hadoop platform is a major distributed computing project and is commonly considered one of the best approaches for dealing with BigData challenges.
It is now well established that a great volume of data is produced every day. Whether from system logs or user purchases, the amount of information generated is such that existing database and data warehouse solutions do not seem to scale well enough.
The MapReduce programming paradigm was introduced in 2004 as a new approach to processing large datasets. In 2005 its open-source implementation, Hadoop, was created by Doug Cutting. Although Hadoop is not meant to replace relational databases, it is a good solution for big data analysis.

Hadoop facilitates large-scale data processing, but it still requires skilled programmers to write the Map and Reduce functions that analyze the data. All analyses made through Hadoop have to be condensed into these two functions. Creating this type of application can be challenging and the result difficult to maintain, so data developers struggled to extract intelligence from their data. Hive was created to overcome these issues.

Apache Hive

First introduced by Facebook and later donated to the Apache Software Foundation, Hive is a data warehouse interface for Hadoop. With Hive, users can write SQL statements that are automatically converted to MapReduce jobs and run on an HDFS cluster.

Data can be inserted into and manipulated on the Hadoop cluster through the command line interface using statements from the Hive Query Language, or HiveQL, such as SELECT, INSERT, or CREATE TABLE. Users can also create their own User Defined Functions by extending the UDF class already provided; a minimal example is shown below. Within these statements, tables can be defined using primitive types such as integers, floating-point numbers, strings, dates, and booleans. Furthermore, new types can be created by grouping these primitive types into maps and arrays. Please check https://cwiki.apache.org/Hive/languagemanual.html for more information on HiveQL.
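
As an illustration of what such a function looks like, here is a minimal sketch (the class name and behaviour are made up for this example, but the pattern of extending the provided UDF class and defining an evaluate() method is the standard one in Hive 0.11):

// Minimal Hive UDF sketch: upper-cases a string column.
// Hypothetically registered from the Hive CLI with:
//   ADD JAR /path/to/my-udfs.jar;
//   CREATE TEMPORARY FUNCTION to_upper AS 'org.myorg.hive.ToUpper';
package org.myorg.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class ToUpper extends UDF {
  public Text evaluate(Text input) {
    if (input == null) return null;                  // pass NULLs through
    return new Text(input.toString().toUpperCase());
  }
}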

Although Hive presents a data warehouse interface for Hadoop, it is still a batch processing framework. Since Hive's data lives on Hadoop, it is limited by Hadoop's constraints.
Hadoop does not index data, and it is not made for editing data. There is no UPDATE in Hive, because this functionality could not be executed on data stored in HDFS, and Hive does not support transactions. If you want this kind of database on top of Hadoop, you should look at options such as HBase. Check http://wiki.apache.org/hadoop/HadoopIsNot to read more about these Hadoop limitations.

Even so, Apache Hive makes it possible for developers with basic SQL knowledge to create complex, meaningful queries and quickly extract value from big data.

Architecture

Users can interact with Hive through a Command Line Interface (CLI), the Hive Web Interface (HWI), JDBC, or ODBC.
The CLI is a command line tool accessed through a terminal. It can be started by calling the HIVE_HOME/bin/hive script inside the downloaded Hive source code. Hive also provides a Hive server, so that users can communicate with it via JDBC or ODBC.




When you type a query through the CLI, the HiveQL statement is handled by the Driver component. The Driver connects a series of modules that transform the statement into MapReduce jobs to be run on Hadoop. It is important to note that the query is not translated into Java code in this process; it goes directly to MapReduce jobs. The modules involved are: Parser, Semantic Analyzer, Logical Plan Generator, Optimizer, Physical Plan Generator, and Executor.


First, the Driver creates a session to keep track of details about the process, such as dates and statistics.

Metadata (information about tables and columns) is collected and stored in the Metastore as soon as the input data (tables) are created. This metadata is actually kept in a relational database and is later used in the Semantic Analysis.

The ANTLR software is used to generate the parser in the Parser module and to parse the query. As in a compiler, the statement is broken down into tokens and an Abstract Syntax Tree (AST) is created.

The following HiveQL statement

FROM src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON (src1.key + src2.key = src3.key)

INSERT OVERWRITE TABLE dest1 SELECT src1.key, src3.value

would become the following AST:


(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_JOIN (TOK_TABREF (TOK_TABNAME src) src1) (TOK_TABREF (TOK_TABNAME src) src2) (= (. (TOK_TABLE_OR_COL src1) key) (. (TOK_TABLE_OR_COL src2) key))) (TOK_TABREF (TOK_TABNAME src) src3) (= (+ (. (TOK_TABLE_OR_COL src1) key) (. (TOK_TABLE_OR_COL src2) key)) (. (TOK_TABLE_OR_COL src3) key)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME dest1))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src1) key)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL src3) value))))) null

The process continues with a Semantic Analysis of the generated AST. The information provided in the query is checked for validity against the schema information of the input tables, stored in the Metastore component.
Type checking is an example of an operation performed by the Semantic Analysis.

After this analysis, an operator tree is created by the Logical Plan Generator, based on the parsed information and the AST. This operator tree is then passed to the Optimizer, which performs a set of transformations to, not surprisingly, optimize the operations. The improvements accomplished by the Optimizer include column pruning (only the columns really needed are fetched) and join reordering (to make sure only small tables are kept in memory).
The Physical Plan Generator takes the optimized operator tree and creates a Directed Acyclic Graph of MapReduce jobs from it. This physical plan is serialized to an XML file and delivered to the Executor to finally be run on the Hadoop cluster.




Hive and the different Hadoop versions

Hive can be built against Hadoop 1.x or Hadoop 2.x. It provides interfaces for this purpose, defined in the Shims layer.

There are three Hadoop interfaces: 0.20, 0.20S, and 0.23. 0.20 works with Hadoop 1.x, 0.20S with a secure version of Hadoop 1.x, and 0.23 is for building Hive against Hadoop 2.x.

You can prevent an interface from being built by editing the shims.include property in HIVE_HOME/shims/build.xml:

<property name="shims.include" value="0.20,0.20S,0.23"/>                             

Hive uses a Factory Method to decide which Hadoop interface to use, based on the version of Hadoop found on the classpath. This logic is located in

HIVE_HOME/shims/src/common/java/org/apache/hadoop/hive/shims/ShimLoader.java.

HIVE_HOME/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java

encapsulates the interfaces.

hadoop.version is defined in HIVE_HOME/build.properties, but you can override it with the -Dhadoop.version flag.

To build Hive with Hadoop 1.1.2, use


$ ant clean package -Dhadoop.version=1.1.2                   


To build Hive with Hadoop 2.0.4-alpha, use

$ ant clean package -Dhadoop.version=2.0.4-alpha -Dhadoop-0.23.version=2.0.4-alpha -Dhadoop.mr.rev=2


Building

To use some Hive features such as UDFs, or to adapt Hive's code to your own needs, you might have to build Hive from source.
Before starting, make sure you have a Java JDK, Ant, and Subversion installed on your computer.

Then start by downloading the latest stable release from the Hive repository.

$ svn checkout https://svn.opensource.ibm.com/svn/stg-hadoop/hive/0.11.0/trunk hive-0.11.0

Enter your Hive home directory (which from now on we will call HIVE_HOME):

$ cd hive-0.11.0                                               

And finally build the code with:

$ ant package                                                  

This will automatically download and install all the dependencies required to use Hive.

Hive depends on (or uses) other Hadoop-related components. As of Hive 0.11, these are:

Apache Hadoop
Apache HBase
Apache Avro
Apache ZooKeeper

These components will be automatically downloaded by Ant and Ivy when you run the ant package command. You can check which version of each component will be downloaded in

HIVE_HOME/ivy/libraries.properties

and, as explained in the previous section, the Hadoop version can be checked here:

HIVE_HOME/build.properties

To check all ant command possibilities with Hive, type:

$ ant -p                                                       

This shows how to build it, test it, and even how to create a tar file from the source. Testing is explained a little further below.


Unit Tests

Hive provides several built-in unit tests to verify the functionality of its own modules and features. They are built with JUnit 4 and run queries (.q files) already provided by the framework.

To create the JUnit classes execute:

$ ant package                                                  

To run the unit tests type:

$ ant test                                                     

To run a specific test run:


$ ant test -Dtestcase=TestCliDriver                            

To run a specific query inside one Unit Test run:

$ ant test -Dtestcase=TestCliDriver -Dqfile=alter5.q           


The command described above produces an output that is compared against Hive's expected output. It also generates an .xml log file, which is very helpful for debugging purposes:

HIVE_HOME/build/ql/test/TEST-org.apache.hadoop.hive.cli.TestCliDriver.xml

If you are having trouble with a certain test case and trying to debug it, pay attention: some Java test files for Hive (all files under the ql module) are created at build time from Velocity templates (.vm). If you want to modify these tests, you have to change the .vm file, not the .java one.



References:




Book: Programming Hive: Data Warehouse and Query Language for Hadoop. Edward Capriolo, Dean Wampler, Jason Rutherglen.

Article: Hive - A Warehousing Solution Over a MapReduce Framework. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Facebook Data Infrastructure Team.


Saturday, July 27, 2013

Is there such a thing as "best" Recommender System algorithm?

I have received emails from readers asking which recommender system algorithm they should use. Usually people start looking for articles on which approach performs better and, once they find something convincing, they start implementing it.

I believe that the best recommender system depends on the data and the problem you have to deal with.

With that in mind, I decided to publish here some pros and cons of each recommender type (collaborative, content-based, and hybrid), so people can decide on their own which algorithms better suit their needs.

I've already presented these approaches here, so if you know nothing about recommender systems, you can read about them there first.

Collaborative Filtering

Pros

  • Recommends diverse items to users, which makes it innovative;
  • Good practical results (read Amazon's article);
  • It is widely used, and you can find several open-source implementations of it (Apache Mahout);
  • It can work directly with ratings from users on items;
  • It can deal with video and audio data;

Cons

  •   It suffers from data sparsity: if you don't have many ratings, you might end up with bad results;
  •   When the number of ratings grows, scalability becomes an issue: it might be hard to calculate similarity for all users;


Content Based Filtering

Pros

  • It works better than Collaborative Filtering when only a small amount of information is available;
  • It uses descriptions of items, so it works well with tagged items and usually matches users' preference profiles well;

Cons

  • It doesn't work so well for video or audio data with no text tags;
  • It frequently recommends repetitive items, staying close to things similar to what the user has already seen;

Hybrid Systems

Pros

  • Usually the most effective approach (more accurate results);
  • It overcomes the limitations of the single approaches;

Cons

  • Hard to find a balance when combining the two approaches;
  • Challenging to implement;


Friday, July 26, 2013

Recommender Systems Online Free Course on Coursera

I already talked about Coursera's great courses here.
There is a new course on Recommender Systems starting in September:

https://www.coursera.org/course/recsys

I don't know how it is going to be, but based on the courses I've done so far, it looks good.