Thursday, November 22, 2012

Dependencies on a Hadoop Ecosystem

When building a Hadoop cluster together with the other Apache projects related to it, it can be tricky to know what to install and how to install it.

You should first understand your data and what you want to do with it. If you have log-like data that keeps growing all the time, and you have to keep loading it into the Hadoop cluster, you might want to consider Flume.

Apache Flume is a distributed, daemon-like service (and, being distributed, offers high availability) that can keep feeding data into the Hadoop cluster.

If you need a NoSQL database with random read/write access, you can use HBase, which is modeled on Google's Bigtable.

If you have relatively structured data and you want to run query-like analyses on it, you might consider Pig or Hive. Both work on top of the Hadoop cluster and let you run commands instead of writing MapReduce jobs in Java. Each provides its own language for these commands: Pig uses a textual language called Pig Latin, and Hive uses a syntax very similar to SQL.

To clarify the dependencies among these and other projects, I created a dependency graph (actually a DAG). Here it is!
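As a rough companion to the graph, here is a minimal sketch of how such dependencies can be written down as a DAG and turned into an installation order with a topological sort. The edges below are assumptions about a typical setup (for example, HBase relying on HDFS and ZooKeeper), not a transcription of the graph, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class InstallOrder {

    // Component -> components it depends on.
    // These edges are illustrative assumptions, not an official dependency list.
    static Map<String, List<String>> exampleDependencies() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("HDFS", Collections.<String>emptyList());
        deps.put("ZooKeeper", Collections.<String>emptyList());
        deps.put("MapReduce", Arrays.asList("HDFS"));
        deps.put("HBase", Arrays.asList("HDFS", "ZooKeeper"));
        deps.put("Pig", Arrays.asList("MapReduce"));
        deps.put("Hive", Arrays.asList("MapReduce"));
        deps.put("Flume", Arrays.asList("HDFS"));
        return deps;
    }

    // Depth-first topological sort: every component is listed after its dependencies.
    static List<String> installOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String component : deps.keySet()) {
            visit(component, deps, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String component, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(component)) {
            return;
        }
        // Seeing a component again while it is still being visited means the graph is not a DAG.
        if (!inProgress.add(component)) {
            throw new IllegalStateException("Cycle detected at " + component);
        }
        List<String> dependencies = deps.get(component);
        if (dependencies != null) {
            for (String dependency : dependencies) {
                visit(dependency, deps, done, inProgress, order);
            }
        }
        inProgress.remove(component);
        done.add(component);
        order.add(component);
    }

    public static void main(String[] args) {
        System.out.println(installOrder(exampleDependencies()));
    }
}
```

Any order in which each project comes after the things it relies on is a valid installation order, so the exact output depends on how the edges are listed.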

Just let me know if you have any questions! :)

Saturday, November 10, 2012

Duine Open Source Recommender

Duine is an open source recommender system. It is a collection of software libraries developed by Telematica Instituut/Novay that tries to predict how interesting a piece of information is to a user. It provides collaborative filtering and content-based recommenders, plus other features such as an Explanation API (explanations of why a given recommendation was made).
Each recommendation is quantified by a number ranging from -1 to +1: the greater the result, the more interesting the item should be to the user.
One of the main advantages of Duine is its well-designed architecture. When it makes a recommendation, it can incorporate user feedback into its system. It also has a switching engine that analyses which method (content-based or collaborative) works better for the current state of the data and changes to it dynamically.

a. Architecture

The following picture describes the main concept of the Duine framework.

b. Installation

To install the framework, follow these steps. First, download the code. Then build and run the project with Maven, the project management tool. The example you will run uses the MovieLens dataset, which is already included in the Duine package.

c. Examples

To create a Duine recommender you should create and run a scenario involving users and items.
Here follows a simple example with no previous dataset. First we create an item (a movie) and set a characteristic on it (the genre “horror”). Then we create a user and try to calculate a recommendation for this user on the item. This prediction should not work, since we have no rating information on the user or the item. Afterwards, we explicitly enter feedback from the user on the genre of the item (we give “horror” a rating of 0.4) and try the prediction again. This tests Duine's ability to correlate the genre with the item, and the predicted rating for the item is exactly the same number as the feedback given by the user. Finally, we enter a rating from the user for the item itself. As expected, when making the recommendation Duine returns the given rating as the result.
The example code:

public void runScenarioTest() {
    // Assumes `recommender` is an already configured Duine recommender field.
    // Needs java.util.ArrayList; the Duine imports are omitted, as in the original listing.
    // The output calls were lost in the original listing; System.out.println is assumed.
    System.out.println("********* SCENARIO TEST - RENATA **********");
    System.out.println("Create a new movie with genre 'horror'");
    RatableItemId item1 = new RatableItemId("1");
    Movie movie1 = new Movie(item1);

    ArrayList<String> genres = new ArrayList<String>();
    genres.add("horror");
    movie1.setGenres(genres);
    movie1.setTitle("movie1");

    System.out.println("Create a user id 'user1'");
    UserId user1 = new UserId("user1");

    // Step 1: no feedback exists yet, so there is nothing to base a prediction on.
    System.out.println("Calculate a prediction for the interest of this user in this movie");
    Prediction prediction = recommender.predict(user1, movie1);
    System.out.println("Prediction result: " + prediction);

    // Step 2: feedback on the genre term only.
    System.out.println("Enter term feedback for user1: term='horror', value=0.4, certainty=0.8");
    ITermFeedback termFeedback = new TermFeedback("horror", 0.4, 0.8);
    recommender.enterFeedback(user1, termFeedback);
    System.out.println("Calculate a prediction for the interest of this user in this movie");
    Prediction prediction3 = recommender.predict(user1, movie1);
    System.out.println("Prediction result: " + prediction3);

    // Step 3: explicit rating of the movie itself.
    System.out.println("Give a rating to the movie (value=0.9, certainty=0.8)");
    IRatableItemFeedback feedback = new RatableItemFeedback(movie1, 0.9, 0.8);
    recommender.enterFeedback(user1, feedback);
    System.out.println("Calculate a prediction for the interest of this user in this movie");
    Prediction prediction2 = recommender.predict(user1, movie1);
    System.out.println("Prediction result: " + prediction2);
}

Friday, November 9, 2012

Open Source Recommendation Systems Survey

Here follows a survey I did back in 2010 when I was studying Recommender Systems. Hope it is useful.

The growth of web content and the expansion of e-commerce have greatly increased interest in recommender systems, which has led to the development of several open source projects in the area.
Among the recommender systems available on the web, we can distinguish the following:

All of these projects offer collaborative filtering implementations, in different programming languages.

The Duine Framework also supplies a hybrid implementation. It is a Java package that combines content-based and collaborative filtering in a switching engine: it dynamically switches between the two predictors according to the current state of the data. For example, if there aren't many ratings available, it uses the content-based approach, and it switches to the collaborative one when the scenario changes. It also offers an Explanation API, which can be used to create user-friendly recommendations, and a demo application with a Java client example.
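To make the switching idea concrete, here is a minimal sketch. It is not Duine's actual API: the class name, the `minRatings` threshold and the two predictor functions are all hypothetical, and a real switching engine is likely to look at more than a single rating count.

```java
import java.util.function.ToDoubleFunction;

final class SwitchingRecommender {
    private final int minRatings;                          // hypothetical threshold
    private final ToDoubleFunction<String> contentBased;   // item id -> score in [-1, 1]
    private final ToDoubleFunction<String> collaborative;  // item id -> score in [-1, 1]

    SwitchingRecommender(int minRatings,
                         ToDoubleFunction<String> contentBased,
                         ToDoubleFunction<String> collaborative) {
        this.minRatings = minRatings;
        this.contentBased = contentBased;
        this.collaborative = collaborative;
    }

    // With few ratings, fall back to content-based; otherwise use collaborative filtering.
    double predict(String itemId, int ratingsAvailable) {
        ToDoubleFunction<String> chosen =
                ratingsAvailable < minRatings ? contentBased : collaborative;
        return chosen.applyAsDouble(itemId);
    }

    public static void main(String[] args) {
        // Stub predictors that return fixed scores, just to show which one is picked.
        SwitchingRecommender recommender =
                new SwitchingRecommender(5, item -> 0.4, item -> 0.9);
        System.out.println(recommender.predict("movie1", 2));   // content-based stub
        System.out.println(recommender.predict("movie1", 50));  // collaborative stub
    }
}
```

The point of the design is that callers only ever ask for a prediction; which technique answers is decided per request from the state of the data.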

Apache Mahout is a Java framework for data mining. It has incorporated the Taste Recommender System, a collaborative filtering engine for personalized recommendations.

Vogoo is a PHP framework that implements a collaborative filtering recommender system. It also includes a Slope One implementation.

A Java version of the collaborative filtering method is implemented in the Cofi library. It was developed by Daniel Lemire, one of the creators of the Slope One algorithms. There is also a PHP version available on Lemire's webpage.

OpenSlopeOne offers a Slope One implementation in PHP that focuses on performance.
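Since Slope One shows up in several of these projects, a small sketch of the weighted variant may help. This is a from-scratch illustration, not code taken from Cofi, Vogoo or OpenSlopeOne; the class name and the tiny dataset are made up for the example. The idea is to store, for each pair of items, the average rating difference among users who rated both, and to predict a rating as the user's known ratings shifted by those differences, weighted by how many users support each pair.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class SlopeOne {
    // diffSum.get(a).get(b) = sum of (rating of a - rating of b) over users who rated both.
    private final Map<String, Map<String, Double>> diffSum = new HashMap<>();
    // count.get(a).get(b) = number of users who rated both a and b.
    private final Map<String, Map<String, Integer>> count = new HashMap<>();

    SlopeOne(Map<String, Map<String, Double>> ratingsByUser) {
        for (Map<String, Double> ratings : ratingsByUser.values()) {
            for (Map.Entry<String, Double> a : ratings.entrySet()) {
                for (Map.Entry<String, Double> b : ratings.entrySet()) {
                    if (a.getKey().equals(b.getKey())) {
                        continue;
                    }
                    diffSum.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                           .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    count.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                         .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
    }

    // Weighted Slope One prediction; returns NaN when no co-rated item is available.
    double predict(Map<String, Double> userRatings, String target) {
        Map<String, Double> sums =
                diffSum.getOrDefault(target, Collections.<String, Double>emptyMap());
        Map<String, Integer> counts =
                count.getOrDefault(target, Collections.<String, Integer>emptyMap());
        double numerator = 0.0;
        int denominator = 0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            Integer c = counts.get(rated.getKey());
            if (c == null) {
                continue; // nobody rated both the target and this item
            }
            double averageDiff = sums.get(rated.getKey()) / c;
            numerator += (averageDiff + rated.getValue()) * c;
            denominator += c;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }

    // Made-up ratings for three users and three items.
    static Map<String, Map<String, Double>> exampleRatings() {
        Map<String, Double> john = new HashMap<>();
        john.put("A", 5.0);
        john.put("B", 3.0);
        john.put("C", 2.0);
        Map<String, Double> mark = new HashMap<>();
        mark.put("A", 3.0);
        mark.put("B", 4.0);
        Map<String, Double> lucy = new HashMap<>();
        lucy.put("B", 2.0);
        lucy.put("C", 5.0);
        Map<String, Map<String, Double>> all = new HashMap<>();
        all.put("john", john);
        all.put("mark", mark);
        all.put("lucy", lucy);
        return all;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> data = exampleRatings();
        SlopeOne model = new SlopeOne(data);
        System.out.println(model.predict(data.get("lucy"), "A"));
    }
}
```

In this toy data, A is rated on average 0.5 above B (two users) and 3 above C (one user), so Lucy's predicted rating for A is ((0.5 + 2) * 2 + (3 + 5) * 1) / 3 = 13/3, roughly 4.33.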

SUGGEST is a recommendation library made by George Karypis and distributed in binary format.

Many of these projects are built with Maven, a project management tool from Apache that can be downloaded from its website.
In this project, they were tested with the MovieLens dataset, made available by GroupLens Research. It is offered in three packages with 100,000, 1 million and 10 million ratings from users on items, on a scale that goes up to 5.
For my specific project, I had to choose one of these open source packages. It was therefore natural to compare them, analysing which one was a better fit for our requirements.



Analysing software in the recommendation area is not a simple task, since it is difficult to define measurement standards. In this work, we proposed some evaluation criteria: the types of recommendation implemented by the project, the programming language, the level of documentation and the magnitude of the project.
The documentation was evaluated based on its volume and clarity. The volume of documentation offered by Mahout and Duine is remarkably larger than that of the other systems. Both offer installation and usage guides and come with a demonstration example. It must be taken into account that OpenSlopeOne and Cofi are smaller projects, so their documentation tends to be smaller.
The Downloads column represents the magnitude of the project: it shows the number of times the software, in any version, was downloaded from its source. Although Mahout does not publish this number, its very active mailing lists show that it is a widely used project.

The two projects that stood out were Apache Mahout and Duine. We tested them in order to verify which one was more applicable to our work. Both are Java frameworks and come with a demonstration example that uses the MovieLens dataset.
The fact that Mahout is a larger project and offers multiple machine-learning algorithms made it more interesting for our research. Its modular structure also encouraged us to choose it.

Here follow the main advantages and characteristics of the two most qualified projects for our needs.

To read more about Mahout.