Wednesday, December 26, 2012

Apache HBase unit test error

While building and testing HBase 0.94.3 on Linux (Red Hat and SLES), I came across an error a couple of times and solved it with three different approaches that I think are worth sharing.
Several test cases were failing because of it, such as TestCatalogTrackerOnCluster and TestLogRollAbort.

I ran mvn clean compile package and got the following error:



java.io.IOException: Shutting down
 at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:203)
 at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:76)
 at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:632)
 at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:606)
 at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:554)
 at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:541)
 at org.apache.giraffa.TestRestartGiraffa.beforeClass(TestRestartGiraffa.java:39)
Caused by: java.lang.RuntimeException: Master not initialized after 200 seconds
 at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:206)
 at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:420)
 at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:196)



The first thing you should do is run the following command:

 $ umask 0022

Next, make sure your JVM has enough memory to work with:

 $ export MAVEN_OPTS="-Xmx512m"

Finally, it is possible that this error is caused by different loopback addresses in the /etc/hosts file.
Open /etc/hosts and check whether you have different loopback addresses, such as 127.0.0.1 and 127.0.0.2. If so, change both to 127.0.0.1 so they are the same.
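For example, on SLES the machine's hostname is often mapped to 127.0.0.2 by default; in that case the corrected /etc/hosts would contain something like this (the hostname below is just a placeholder):

127.0.0.1   localhost
127.0.0.1   myhostname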




Thursday, November 22, 2012

Dependencies on a Hadoop Ecosystem


When building a Hadoop cluster and the other Apache projects related to it, it can be tricky to know what to install and how to install it.

You should first understand your data and what you want to do with it. If you have log-like data that keeps growing all the time and you need to keep pushing it into the Hadoop cluster, you might want to consider Flume.

Apache Flume is a distributed, daemon-like piece of software (and for that reason provides high availability) that can keep feeding data into the Hadoop cluster.

If you need a NoSQL random read/write database, you can use HBase, which is modeled after Google's BigTable.

If you have relatively structured data and you want to run query-like analyses on it, you might consider Pig or Hive. Both work on top of the Hadoop cluster, executing commands instead of hand-written Java MapReduce jobs, and both provide their own language for those commands: Pig uses a textual language called Pig Latin, and Hive uses a syntax very similar to SQL.

To clarify the dependencies among these and other projects, I created a dependency graph (actually a DAG). Here it is:

[dependency graph of the Hadoop ecosystem projects]

Just let me know if you have any doubts! :)

Saturday, November 10, 2012

Duine Open Source Recommender


Duine is an open source Recommender System. It is a collection of software libraries developed by Telematica Instituut/Novay that aims to predict how interesting a piece of information will be to a user. It provides collaborative filtering and content-based recommenders, as well as other features such as an Explanation API (explanations of why a given recommendation has been made).
Its recommendations are quantified by a number ranging from -1 to +1: the greater the value, the more interesting the item should be to the user.
One of the main advantages of Duine is its well-formed architecture. When it makes a recommendation, it can incorporate user feedback back into the system. It also has a switching engine, which analyses which method (content-based or collaborative) works better for the current state of the data and switches between them dynamically.

a. Architecture

The following picture describes the main concepts of the Duine framework.

[Duine architecture diagram]

b. Installation

To install the framework, it is advised to follow a few steps. First, download the code from http://sourceforge.net/projects/duine/files/. Then build and run the project with the Maven project manager. The file you will be running is MovieLensRecommenderClient.java, which uses the MovieLens dataset already bundled with the Duine package.

c. Examples

To create a Duine recommender, you should create and run a scenario involving users and items.
Here is a simple example with no previous dataset. First we create an item (a movie) and set a characteristic on it (the genre "horror"). Then we create a user and try to calculate a recommendation for this user on that item. This prediction should not work, since we have no rating information for either the user or the item. Next, we explicitly enter feedback from the user on the item's genre (we give "horror" a rating of 0.4) and try a prediction again. This time we are testing Duine's ability to correlate the genre with the item, and the predicted rating for the item is exactly the same number as the feedback given by the user. Finally, we enter a rating from the user for the item itself. As
expected, when making the recommendation Duine returns the given rating as the result.
The example code:

public void runScenarioTest() {
    log.info("********* SCENARIO TEST - RENATA **********");

    log.info("Create a new movie with genre 'horror'");
    RatableItemId item1 = new RatableItemId("1");
    Movie movie1 = new Movie(item1);

    ArrayList<String> genres = new ArrayList<String>();
    genres.add("horror");
    movie1.setGenres(genres);
    movie1.setTitle("movie1");

    log.info("Create a user id 'user1'");
    UserId user1 = new UserId("user1");

    log.info("Calculate a prediction for the interest of this user in this movie");
    Prediction prediction = recommender.predict(user1, movie1);
    log.info("Prediction result: " + prediction);

    log.info("Enter term feedback for user1: term='horror', value=0.4, certainty=0.8");
    ITermFeedback termFeedback = new TermFeedback("horror", 0.4, 0.8);
    recommender.enterFeedback(user1, termFeedback);

    log.info("Calculate a prediction for the interest of this user in this movie");
    Prediction prediction3 = recommender.predict(user1, movie1);
    log.info("Prediction result: " + prediction3);
    log.info(prediction3.getExplanation());

    log.info("Give a rating to the movie (value=0.9, certainty=0.8)");
    IRatableItemFeedback feedback = new RatableItemFeedback(movie1, 0.9, 0.8);
    recommender.enterFeedback(user1, feedback);

    log.info("Calculate a prediction for the interest of this user in this movie");
    Prediction prediction2 = recommender.predict(user1, movie1);
    log.info("Prediction result: " + prediction2);
}

Friday, November 9, 2012

Open Source Recommendation Systems Survey

Here follows a survey I did back in 2010 when I was studying Recommender Systems. Hope it is useful.


The growth of web content and the expansion of e-commerce have greatly increased the interest in Recommender Systems. This fact has led to the development of several open source projects in the area. Among the recommender system implementations available on the web, we can distinguish the following: Duine, Apache Mahout (with its Taste recommender), Vogoo, Cofi, OpenSlopeOne and SUGGEST.

All of these projects offer collaborative filtering implementations, in different programming languages.

The Duine Framework also supplies a hybrid implementation. It is Java software that combines content-based and collaborative filtering in a switching engine: it dynamically switches between the two predictors depending on the current state of the data. For example, if there aren't many ratings
available it uses the content-based approach, and it switches to the collaborative one when the scenario changes. It also provides an Explanation API, which can be used to create user-friendly recommendations, and a demo application with a Java client example.

Apache Mahout is a Java framework in the data mining area. It has incorporated the Taste Recommender System, a collaborative filtering engine for personalized recommendations.
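To give an idea of how Taste is used, here is a minimal sketch of a user-based recommender built with the classic Taste API; the ratings file name, the neighborhood size of 10 and the user id are placeholders of my own, not something prescribed by Mahout:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a placeholder file with one "userID,itemID,rating" per line
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 5);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}

The nice thing about this design is that the data model, the similarity metric and the neighborhood are all pluggable, so swapping Pearson correlation for another similarity is a one-line change.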

Vogoo is a PHP framework that implements a collaborative filtering recommender system. It also includes Slope One code.

A Java version of the Collaborative Filtering method is implemented in the Cofi library. It was developed by Daniel Lemire, the creator of the Slope One algorithms. There is also a PHP version available on Lemire's webpage.

OpenSlopeOne offers a Slope One implementation in PHP with a focus on performance.

SUGGEST is a recommendation library made by George Karypis and distributed in binary format.

Many of these projects build and run with the help of Maven, a project manager by Apache that can be downloaded from its website.
In this survey, they were tested with the MovieLens dataset, a database made available by GroupLens Research. It is offered in three packages, with 100,000, 1 million and 10 million ratings from users on items varying from 0 to 5.
For my specific project, I had to choose one of these open source packages. It was then natural to compare them, analyzing which one was a better fit for our requirements.
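Since Slope One comes up several times above (Vogoo, Cofi, OpenSlopeOne), here is a minimal sketch in Java of the weighted Slope One predictor, just to illustrate the idea these libraries implement; the data structures and names are my own and are not taken from any of those projects:

import java.util.HashMap;
import java.util.Map;

// Minimal weighted Slope One sketch. Ratings are given as Map<userId, Map<itemId, rating>>.
public class SlopeOneSketch {

    // dev.get(i).get(j) = average of (r_ui - r_uj) over users who rated both i and j
    private final Map<String, Map<String, Double>> dev = new HashMap<>();
    // freq.get(i).get(j) = number of users who rated both i and j
    private final Map<String, Map<String, Integer>> freq = new HashMap<>();

    public void train(Map<String, Map<String, Double>> ratings) {
        // accumulate rating differences and co-rating counts
        for (Map<String, Double> userRatings : ratings.values()) {
            for (Map.Entry<String, Double> e1 : userRatings.entrySet()) {
                for (Map.Entry<String, Double> e2 : userRatings.entrySet()) {
                    String i = e1.getKey();
                    String j = e2.getKey();
                    dev.computeIfAbsent(i, k -> new HashMap<>())
                       .merge(j, e1.getValue() - e2.getValue(), Double::sum);
                    freq.computeIfAbsent(i, k -> new HashMap<>())
                        .merge(j, 1, Integer::sum);
                }
            }
        }
        // turn summed differences into average deviations
        for (Map.Entry<String, Map<String, Double>> row : dev.entrySet()) {
            for (Map.Entry<String, Double> cell : row.getValue().entrySet()) {
                int count = freq.get(row.getKey()).get(cell.getKey());
                cell.setValue(cell.getValue() / count);
            }
        }
    }

    // Weighted Slope One prediction of the user's rating for targetItem.
    public double predict(Map<String, Double> userRatings, String targetItem) {
        double numerator = 0.0;
        int denominator = 0;
        for (Map.Entry<String, Double> e : userRatings.entrySet()) {
            String j = e.getKey();
            if (j.equals(targetItem)) {
                continue;
            }
            Map<String, Integer> counts = freq.get(targetItem);
            if (counts == null || !counts.containsKey(j)) {
                continue;
            }
            int count = counts.get(j);
            double deviation = dev.get(targetItem).get(j);
            numerator += (e.getValue() + deviation) * count;
            denominator += count;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }
}

This naive version keeps the full item-item deviation matrix in memory; the libraries above put most of their effort into doing this kind of bookkeeping efficiently.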


  Comparison

 

Analysing software in the recommendation area is not a simple task, since it is difficult to define measurement standards. In this work, we proposed some evaluation criteria, such as: the types of recommendation implemented by the project, programming language, level of documentation and magnitude of the project.

The documentation was evaluated based on its volume and clarity. It is possible to observe that the volume of documentation provided by Mahout and Duine is remarkably larger than that of the other systems. Both offer installation and usage guides and come with a demonstration example. It must be taken into account that OpenSlopeOne and Cofi are smaller projects and, because of that, their documentation tends to be smaller.

In the Downloads column we have a representation of the magnitude of each project: it shows the number of times the software, in any version, was downloaded from its source. Although Mahout does not publish this number, its very active mailing lists show that it is widely used software.

The two projects that stood out were Apache Mahout and Duine. We tested them in order to verify which one was more applicable to our work. Both of them are Java frameworks and provide a demonstration example with the MovieLens data set.
The fact that Mahout is a larger project and has multiple machine-learning algorithms made it more interesting for our research. Its modular structure also encouraged us to choose it.

Here follow the main advantages and characteristics of the two projects most qualified for our needs.

To read more about Mahout.

Monday, October 22, 2012

Introduction to Apache Hive

Hive is a distributed data warehouse that runs on top of Apache Hadoop and enables analyses over huge amounts of data.


It provides its own query language, HiveQL (similar to SQL), for querying data on a Hadoop cluster. It can manage data in HDFS and run jobs in MapReduce without translating the queries into Java programs. The mechanism is explained below:

"When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an
XML file representing the “job plan.” In other words, these generic modules function
like mini language interpreters and the “language” to drive the computation is encoded
in XML.
"

This text was extracted from Programming Hive.

Hive was initially developed by Facebook, with the intention of making it easier to run MapReduce jobs on a Hadoop cluster, since writing Java programs can sometimes be challenging for non-Java developers (and for some Java developers as well). The HiveQL language provides a more approachable way to create MapReduce jobs.

Hive Architecture Overview

Hive can be accessed via command line and web user interfaces. You can also use Hive through the JDBC or ODBC APIs provided. The Thrift server exposes an API to execute HiveQL statements from a different set of languages (PHP, Perl, Python and Java).
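To illustrate the JDBC route, here is a minimal client sketch. It assumes the Hive server of that era (HiveServer1) listening on localhost:10000, the org.apache.hadoop.hive.jdbc.HiveDriver class shipped with Hive's JDBC jar, and the my_test table created later in this post; host, port and table are placeholders for your own setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (HiveServer1-era driver name)
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect to a Hive server assumed to be running on localhost:10000
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Run a simple HiveQL query against the my_test table
        ResultSet rs = stmt.executeQuery("SELECT * FROM my_test");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
        con.close();
    }
}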

The Metastore component is a system catalogue that contains metadata regarding tables, partitions and databases.

Most of the core operations happen in the Driver and Compiler components: they parse, optimize and execute queries.
The SQL statements are converted at run time into a graph (a DAG, actually) of map/reduce jobs, and these are run on the Hadoop cluster.

For more information about the Hive architecture take a look at the Facebook article about it.

Building

Hive has some dependencies that you should get before building it, such as ant, svn and Java. It also depends on Hadoop, HBase and ZooKeeper, but these packages are automatically downloaded by Ivy. If you wish to change the Hadoop version it is built against, take a look at the last section of this post.
  •  Download ant:
#yum install ant.x86_64

 (or apt-get install if you are using Debian-like systems).
  • Download Hive:
$svn co http://svn.apache.org/repos/asf/hive/trunk hive
  • Set the Java environment:
$export JAVA_HOME=/usr/lib/jvm/java-1.6.0-ibm-1.6.0.11.0.x86_64/
$export HIVE_HOME=/my_hive_home
$export PATH=$HIVE_HOME/bin:$PATH
  •  Build Hive with ant:
$ant clean package
  • Run hive:
$build/dist/bin/hive


Troubleshooting


Problem:
ant java.lang.ClassNotFoundException: org.apache.tools.ant.taskdefs.optional.TraXLiaison

Solution:
Install the ant-trax package, which provides the missing TraXLiaison class.

Problem:
 com.sun.tools.javac.Main is not on the classpath. Perhaps JAVA_HOME does not point to the JDK.

The issue here is that tools.jar must be found by Ant when it uses javac.

Solution:
Download or find the tools.jar library (you can use locate tools.jar to find it) and make sure it is in your JAVA_HOME directory.
It might also be the case that your JAVA_HOME points to a JRE installation rather than a JDK.



Building Hive with different versions of Hadoop


When you run the ant package command, Ant by default downloads and builds Hive against version 0.20.0 of Hadoop (check build.properties).


If you, like me, want to use Hive with your own version of Hadoop, specifically a newer one,
you can pass the -Dhadoop.version flag to ant or change the hadoop.version property in build.properties.

You might want to know that Hive has an interface called Shims that was made exactly for this situation.
The Shims interface makes it possible for you to create your own Hadoop compatibility class, or to use one of those already provided.
Hive provides a 0.20 class, a 0.20S (secure) class and a 0.23 class.

If you want to build Hive against Hadoop 1.x, you should use the 0.20S class; if you want to build it against Hadoop 2.x, you should use the 0.23 class.

With the 0.20S shim, you can build against Hadoop 1.0.0 or newer.

If you want to build only one shim interface, then edit this line in the shims/build.xml file:

<property name="shims.include" value="0.20,0.20S,0.23"/>                      

and set this line instead:

<property name="shims.include" value="0.20S"/>                     


It is not strictly necessary to exclude the undesired interfaces, because Hive chooses at run time which interface to use, depending on the Hadoop version present in your classpath.



Using Hive


To start using Hive, run the Hive shell (after configuring the environment as shown above):

 $ hive                                                         

As a response, you should see the Hive shell prompt:

hive>                                                           

Now, you should be able to create your first table:

hive> CREATE TABLE my_test (id INT, test STRING);                

This table uses Hive's default text format, with lines terminated by '\n'.

hive> LOAD DATA LOCAL INPATH 'my_data' INTO TABLE my_test;

With this command, you loaded the file my_data from your local machine into Hive. You can now execute queries over this data:

hive> SELECT * FROM my_test;                                      

 

More Information

 

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide 

http://www.youtube.com/watch?v=U0r9s4iXwo0

http://vimeo.com/29732341

http://www.youtube.com/watch?v=Pn7Sp2-hUXE







Tuesday, October 9, 2012

Working and Testing with Ant

Recently I've been working with Hive and had some trouble working with Ant. For this reason, I bought the book "Ant in Action", and I'll share some things I've learned from it, along with some other experiences I've had working with Ant.

Ant is a tool written in Java to build Java projects. By build I mean compile, run, package and even test them.
It is designed to help software teams develop big projects by automating code-compilation tasks.

To run Ant on your project, you should have a build.xml file that describes how the project should be built.
There should be one build file per project, unless the project is too big; in that case you might have subprojects with "sub-build" files, coordinated by the main build file.

The goals of the build are listed as targets. You can have, for example, an init target that creates the necessary initial directories. A minimal init target might look like this:

<target name="init">
    <mkdir dir="build/classes"/>
</target>

You can also have a compile target that depends on the previous init target and compiles the Java code, for example:

<target name="compile" depends="init">
    <javac srcdir="src" destdir="build/classes"/>
</target>

After installing Ant, if you run the command:

$ant -p

you will see all the available targets in your project, that is, all the actions you can ask Ant to perform.

In build.xml, inside each target tag, you should see one or more tasks. In the compile target you will probably see a javac task; that is the action Ant is going to execute to perform the appropriate step.

For testing, Ant uses JUnit, a unit test framework that verifies that your software components work individually.
JUnit is an API that facilitates writing Java test cases whose execution can be fully automated. To use it, you just need to download the junit.jar file and put it on your Ant classpath. In my case, it was just a matter of adding it to the /usr/share/ant/lib/ directory.

Writing a test for JUnit involves three steps:

• Create a subclass of junit.framework.TestCase.
• Provide a constructor, accepting a single String name parameter, which calls super(name).
• Write some public no-argument void methods prefixed by the word test.

See the example:

public class SimpleTest extends TestCase {
  public SimpleTest(String s) {
    super(s);
  }
  public void testCreation() {
    Event event=new Event();
  }
}
The only part that actually tests anything in this program is the testCreation() method, which simply tries to create an Event.
Beware that methods without the "test" prefix are ignored. Also, your test methods should take no arguments and have no return type.

Note that this applies to JUnit 3; there are several differences between JUnit 3 and 4, and you can check some of them here.
Just so you know, if a test case fails you will be presented with a junit.framework.AssertionFailedError.
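For example, adding a method like the following to the SimpleTest class above would produce exactly that error, since the assertion is deliberately wrong (the method name and values are made up for illustration):

// assertEquals is inherited from junit.framework.TestCase (via Assert);
// when the two values differ it throws junit.framework.AssertionFailedError.
public void testArithmetic() {
    int expected = 2;
    int actual = 1 + 2;             // wrong on purpose
    assertEquals(expected, actual); // fails: expected:<2> but was:<3>
}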

To run your tests with Ant, you should add a test target that uses the <junit> task.
A minimal build.xml target for the example above might look like this (paths and names are only illustrative):

<target name="test-basic" depends="compile">
    <junit printsummary="yes" haltonfailure="yes">
        <classpath>
            <pathelement location="build/classes"/>
        </classpath>
        <test name="SimpleTest"/>
    </junit>
</target>

It should work by just typing:

$ant test-basic

Friday, July 6, 2012

Altering a FOREIGN KEY Data Type

Today another curious issue happened regarding MySQL. 

I had a scenario where the number of rows in a table grew so much that the previous PRIMARY KEY type, a SMALLINT (up to 32767 different ids), could no longer hold the new amount of data.
So I had to modify the PRIMARY KEY's type.

When trying to alter the field, the following error appeared:

ERROR 1025 (HY000): Error on rename of '.\temp\#sql-248_4' to '.\temp\country' (errno: 150)

Checking this error number in a shell:

$ perror 150
MySQL error code 150: Foreign key constraint is incorrectly formed

Basically, this error was a consequence of trying to modify a field that was part of a FOREIGN KEY relation, with the command:

ALTER TABLE mytable MODIFY COLUMN id INT NOT NULL AUTO_INCREMENT;

The problem is that in MySQL there is no way of updating a field's type "on cascade". So, to update a field that is part of a FOREIGN KEY relation, you should drop the FOREIGN KEY constraint, change the field type, and then bring the constraint back.
You could also drop the table that uses the FOREIGN KEY, alter the desired field, and then re-create the dropped table.

Tuesday, April 24, 2012

How to Deal with Duplicate Primary Keys on MySQL

Recently, I came across a problem where I had to copy and insert some MySQL data from one database to another.
Both databases had the same table structure, but different data content. It was vital not to lose any information when importing the dump from one table into the other.
One of the most important details was that the same tuple could appear in both tables, and I could not afford to overwrite any content.

So, here follows the command used:

mysqldump -u user -p  --no-create-info --insert-ignore mydatabase mytable > file.dump

The --no-create-info option makes sure no DROP or CREATE statements are added to the dump, so your table is not deleted and recreated and you don't lose any information.

With the --insert-ignore parameter, the data is written as INSERT IGNORE statements. This way, when a tuple with an already existing primary key is inserted, it is simply ignored: the duplicate-key tuple is discarded and the previously existing tuple remains.

Another useful statement is INSERT ... ON DUPLICATE KEY UPDATE, which also, as the name suggests, deals with duplicate keys. In case of a conflict, this clause updates the existing row instead of inserting a new tuple.

For example, if myvalue is a primary key column, you can write:

INSERT INTO mytable (myvalue) VALUES (10)
ON DUPLICATE KEY UPDATE myvalue=myvalue+1;

Thursday, February 23, 2012

Recommender System Implementation

Hello,

I have just posted an implementation of the SlopeOne Recommender System in Java. I designed it to work with the MovieLens dataset (a bunch of ratings on movies).
I hope it helps everybody who is starting to develop their own recommender system.

Wednesday, January 25, 2012

Articles about Recommender Systems, Mahout and Hadoop Framework

Seeing that Recommender Systems have drawn a lot of attention this past year, I would like to recommend further reading to those who want to gain deeper knowledge of the subject.
I will point out some articles that have helped me study the matter:


This article written by Adomavicius introduces Recommender Systems very well. It explains the three main types of these systems (Content-Based, Collaborative Filtering and Hybrid Recommendation). It also gives a formal mathematical definition of a Recommender System, which for some people can be great. I highly recommend any other article you may find by Adomavicius.


This paper also provides a great overview of Recommender Systems and a very interesting comparison between the Collaborative Filtering and Content-Based approaches.


This article explains the Recommender System approach developed by Amazon. The paper describes how Amazon developed a Collaborative Filtering method that has better practical performance than other Collaborative Filtering methods. It uses the item-item approach, which, instead of comparing the user vectors of the rating matrix, compares the item vectors. This approach is better explained in another post on this blog.
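To make the item-item idea a bit more concrete, here is a tiny sketch of my own (not Amazon's code) that compares two items through the vectors of ratings they received from users, using cosine similarity; the data values are made up:

import java.util.HashMap;
import java.util.Map;

// Item-item illustration: each item is represented by the vector of
// ratings it received from users (one "column" of the rating matrix).
public class ItemItemSketch {

    // Cosine similarity between two items, each given as Map<userId, rating>.
    static double cosine(Map<String, Double> itemA, Map<String, Double> itemB) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : itemA.entrySet()) {
            Double b = itemB.get(e.getKey());
            if (b != null) {
                dot += e.getValue() * b; // only users who rated both items contribute
            }
            normA += e.getValue() * e.getValue();
        }
        for (double b : itemB.values()) {
            normB += b * b;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> itemA = new HashMap<>();
        Map<String, Double> itemB = new HashMap<>();
        itemA.put("user1", 5.0); itemA.put("user2", 3.0);
        itemB.put("user1", 4.0); itemB.put("user2", 1.0); itemB.put("user3", 2.0);
        System.out.println("similarity = " + cosine(itemA, itemB));
    }
}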

If you want to know more about Apache Hadoop, this is a good way to start. I must say that the official Hadoop website is obviously a great reference.



I can't help but point out my own articles on the matter. The first one compares several Open Source Recommender Systems, and the second explains how I tried to build a distributed Recommender System using Hadoop.


This book remains one of the best references on Artificial Intelligence in general. It does not discuss Recommender Systems, but it is still worth reading. It starts by defining Artificial Intelligence as "systems that act rationally" and goes through the history of AI.
It covers most of the main AI algorithms, including the famous Hill Climbing, Simulated Annealing, BFS and DFS. It also covers Machine Learning topics, such as Support Vector Machines and k-means.
I definitely recommend it.