Thursday, November 4, 2010

Apache Mahout



“Scalable machine learning library”

Mahout is a solid Java framework in the artificial intelligence area. It is a machine learning project by the Apache Software Foundation that builds intelligent algorithms which learn from input data.

What is special about Mahout is that it is a scalable library, prepared to deal with huge datasets. Its algorithms are built on top of the Apache Hadoop project and therefore work with distributed computing.

Mahout offers algorithms in three major areas: clustering, categorization and recommender systems. This last part was incorporated on April 4th, 2008, from the former Taste Recommender System project.

Mahout currently implements a collaborative filtering engine that supports the user-based, item-based and Slope-One recommender systems. Other algorithms available in the package are k-Means, Fuzzy k-Means, Canopy, Dirichlet and Mean-Shift clustering. There are also the Naive Bayes, Complementary Naive Bayes and Random Forest decision-tree-based classifiers.

The project has a commercial-friendly license (meaning that modifications to the code can be kept proprietary) and a vast community of users and developers, so it is highly recommended!

a. Taste
Taste is the recommender system part of Mahout, and it provides a very consistent and flexible collaborative filtering engine. It supports the user-based, item-based and Slope-One recommender systems. It can be easily modified thanks to its well-structured module abstractions. The package defines the following interfaces:

  • DataModel
  • UserSimilarity and ItemSimilarity
  • UserNeighborhood
  • Recommender


With these interfaces, it is possible to adapt the framework to read different types of data, personalize your recommendations or even create new recommendation methods. Taste's architecture is presented in the figure below.
The UserSimilarity and ItemSimilarity abstractions are represented here by the box named “Similarity”. These interfaces are responsible for calculating the similarity between a pair of users or items. Their methods return a similarity value, typically in a range such as -1 to 1 or 0 to 1 depending on the measure, where higher values indicate greater resemblance.
Access to the data set is made through the DataModel interface. It is possible to retrieve and store data in databases or file systems (MySQLJDBCDataModel and FileDataModel, respectively). The methods defined in this interface are used by the Similarity abstraction to help compute the similarity.
The main interface in Taste is Recommender. It is responsible for actually making the recommendations to the user, by comparing items or by determining users with similar taste (the item-based and user-based techniques). The Recommender accesses the Similarity interface and uses its methods to compare a pair of users or items. It then collects the highest similarity values to offer as recommendations.
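To illustrate what a Similarity implementation computes, here is a minimal, standalone plain-Java sketch of the Pearson correlation between two users over their co-rated items. This is not Mahout's actual PearsonCorrelationSimilarity code, just the underlying arithmetic; the ratings happen to match users 1 and 5 (items 101, 102 and 103) from the sample data used later in this tutorial.

```java
// Pearson correlation between two users over their co-rated items.
// Standalone sketch of what a Similarity implementation computes.
public class PearsonExample {

    static double pearson(double[] a, double[] b) {
        double meanA = 0, meanB = 0;
        for (int i = 0; i < a.length; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= a.length;
        meanB /= b.length;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            cov += da * db;   // covariance term
            varA += da * da;  // variance of user a's ratings
            varB += db * db;  // variance of user b's ratings
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Users 1 and 5 both rated items 101, 102 and 103.
        double[] user1 = {5.0, 3.0, 2.5};
        double[] user5 = {4.0, 3.0, 2.0};
        System.out.printf("similarity(1, 5) = %.3f%n", pearson(user1, user5));
        // Close to 1.0: the two users rank these items very similarly.
    }
}
```

A value near 1 means the two users tend to rate items the same way, which is exactly what the user-based Recommender exploits.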

The UserNeighborhood is an assistant interface that helps define the neighborhood in the user-based recommendation technique.
For larger data sets, the item-based technique is known to provide better results, which is why many companies, such as Amazon, choose this approach. With the Mahout framework it is no different: the item-based method generally runs faster and provides more accurate recommendations.
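The role of UserNeighborhood can also be illustrated with a hypothetical plain-Java sketch: given similarity scores from the target user to other users, keep only the N nearest ones. This stands in for what an implementation such as Mahout's NearestNUserNeighborhood decides; the names and numbers here are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Keep the N users most similar to the target user,
// mimicking what a UserNeighborhood implementation decides.
public class NeighborhoodExample {

    static List<Long> nearestN(Map<Long, Double> similarities, int n) {
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(similarities.entrySet());
        // Sort by similarity, highest first.
        entries.sort(Comparator.comparingDouble((Map.Entry<Long, Double> e) -> e.getValue()).reversed());
        List<Long> neighborhood = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            neighborhood.add(entries.get(i).getKey());
        }
        return neighborhood;
    }

    public static void main(String[] args) {
        Map<Long, Double> sims = new LinkedHashMap<>();
        sims.put(2L, 0.9);  // similarity of the target user to user 2
        sims.put(3L, 0.1);
        sims.put(4L, 0.5);
        System.out.println(nearestN(sims, 2)); // the 2 most similar users
    }
}
```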

b. Installation

Here follows a step-by-step guide to install and test the Mahout recommender system.
First, it is necessary to have the project manager Maven, which is very easy to use: it runs simply by calling the “mvn install” command. This command compiles the code and downloads missing packages. Maven uses a “pom.xml” file for configuration, and the Mahout project already comes with this file.
To test Taste, it is possible to get a data set from MovieLens. It is packaged as three files in .dat format, containing users, items, ratings and some information on the users and items.


Pre-installation

1. Make sure you have at least Java JDK 1.6:

$ javac -version
javac 1.6.0_26

2. Make sure you have the project manager Maven installed on your computer:

$ mvn -version
Apache Maven 2.2.1 (rdebian-1)

3. Download a Hadoop version from http://www.apache.org/dyn/closer.cgi/hadoop/common/. I downloaded archive.apache.org/dist/hadoop/core/hadoop-0.20.204.0/, which is the version required by Mahout 0.7 in pom.xml (check the hadoop.version property).

Mahout Installation

1. Download the Mahout package:
https://cwiki.apache.org/confluence/display/MAHOUT/Downloads

I downloaded http://ftp.unicamp.br/pub/apache/mahout/0.7/mahout-distribution-0.7-src.tar.gz version (for development purposes it is even better to download with svn co http://svn.apache.org/repos/asf/mahout/trunk).

2. Unpack the Mahout downloaded package

$ tar -xvzf mahout-distribution-0.7-src.tar.gz

3. In a terminal, change to the Mahout directory and compile the code using Maven:

$ cd mahout-distribution-0.7/
$ mvn install

With this, you will have compiled Mahout's code and run the unit tests that come with it, to make sure everything is OK with the component.

If the build was successful, you should see something like:

[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 54 minutes 28 seconds
[INFO] Finished at: Tue Jul 09 11:17:08 BRT 2013
[INFO] Final Memory: 70M/375M
[INFO] ------------------------------------------------------------------------



Executing the Taste Recommender

1. Get your data in the following format:

userid, itemid, rating

For example, copy the following data and name it mydata.dat:

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
   

2. Make sure you set your JAVA_HOME, HADOOP_HOME and MAHOUT_HOME, for example:

$ export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk/
$ export HADOOP_HOME=/home/myuser/Downloads/hadoop-0.20.204.0/

$ export MAHOUT_HOME=/path/to/mahout-distribution-0.7

And put them on your PATH:


$ export PATH=$HADOOP_HOME/bin:$PATH
$ export PATH=$MAHOUT_HOME/bin:$PATH

3. Now, run:

$ bin/mahout recommenditembased --input mydata.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
 

The usersFile is where you list the users for whom you want recommendations. You can change numRecommendations to the number of recommendations you desire.

Also, for similarityClassname, you can choose any one you like from the list below:

  • SIMILARITY_COOCCURRENCE
  • SIMILARITY_LOGLIKELIHOOD
  • SIMILARITY_TANIMOTO_COEFFICIENT
  • SIMILARITY_CITY_BLOCK
  • SIMILARITY_COSINE
  • SIMILARITY_PEARSON_CORRELATION       
  • SIMILARITY_EUCLIDEAN_DISTANCE

Another Example

You can also test it with a real data set, for example the one from MovieLens.
1. Download a copy of the “1 million” data set from MovieLens:

2. Unpack the MovieLens data (you will find three files: movies.dat, ratings.dat and users.dat).
3. Edit the ratings.dat file so that it is in the userid,itemid,rating format, and not in the userid::itemid::rating format.
4. Copy the movies.dat and ratings.dat files to your Mahout directory.
5. Run:

$ bin/mahout recommenditembased --input ratings.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
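The format conversion mentioned above can be done with any text tool; as a sketch, here is a tiny standalone Java converter that turns “::”-separated MovieLens lines (which also carry a trailing timestamp field) into the comma-separated userid,itemid,rating lines Mahout expects. The file names are just examples.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Convert MovieLens "::"-separated ratings into comma-separated
// userid,itemid,rating lines, dropping any extra fields (timestamp).
public class RatingsConverter {

    static String convertLine(String line) {
        String[] fields = line.split("::");
        // Keep only user id, item id and rating.
        return fields[0] + "," + fields[1] + "," + fields[2];
    }

    public static void main(String[] args) throws IOException {
        try (PrintWriter out = new PrintWriter("ratings.csv")) {
            for (String line : Files.readAllLines(Paths.get("ratings.dat"))) {
                out.println(convertLine(line));
            }
        }
    }
}
```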
  
c. Your Own Examples

To create your own recommender system example, it is necessary to construct a Java application defining which approach is going to be used.
If you choose, for example, the Slope-One technique, the code may be:
 import java.io.File;
 import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
 import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
 import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
 import org.apache.mahout.cf.taste.model.DataModel;
 import org.apache.mahout.cf.taste.recommender.Recommender;

 DataModel model = new FileDataModel(new File("data.txt"));
 Recommender recommender = new SlopeOneRecommender(model);
 Recommender cachingRecommender = new CachingRecommender(recommender);

This code reads the data input called “data.txt” and passes it to the Recommender interface. It then provides recommendations to the user based on the Slope-One technique.
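The arithmetic behind Slope-One is simple enough to sketch in a few lines of standalone Java. This is an illustration of the technique, not Mahout's SlopeOneRecommender code, and the numbers are made up: compute the average difference between two items' ratings over users who rated both, then shift the new user's known rating by that difference.

```java
// Slope-One in miniature: predict a user's rating for item B from
// their rating for item A plus the average (B - A) rating difference
// observed over users who rated both items.
public class SlopeOneExample {

    static double averageDiff(double[] ratingsA, double[] ratingsB) {
        double sum = 0;
        for (int i = 0; i < ratingsA.length; i++) {
            sum += ratingsB[i] - ratingsA[i];
        }
        return sum / ratingsA.length;
    }

    public static void main(String[] args) {
        // Two users rated both item A and item B.
        double[] itemA = {5.0, 2.0};
        double[] itemB = {3.0, 2.5};
        double diff = averageDiff(itemA, itemB);  // ((3-5) + (2.5-2)) / 2 = -0.75
        double predicted = 4.0 + diff;            // a new user rated A = 4.0
        System.out.println("Predicted rating for B: " + predicted); // 3.25
    }
}
```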
It is possible to construct similar examples using other approaches, such as a combination of the user-based method with the Pearson correlation similarity, or an item-based approach with the log-likelihood similarity measure.


d. Modifying

To modify the Mahout framework, it is advisable to focus on the interface that you wish to change. The implementations of the interfaces reside in a folder inside the Mahout package called “impl”. So it is necessary to check which interface is going to be modified and then take advantage of Mahout's structure.
If you want, for example, to add a new type of recommendation, it is simply necessary to add another Java file implementing the Recommender interface, and call it in your final Java application.
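To make the pattern concrete, here is a hypothetical standalone sketch: a toy interface standing in for Mahout's Recommender (the real one has a richer signature) plus a trivial new recommendation method that ranks items by how many ratings they received. All names and data here are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of adding a new recommendation type: a stand-in
// interface plus a "most rated items" implementation of it.
public class CustomRecommenderExample {

    interface SimpleRecommender {
        List<Long> recommend(int howMany);
    }

    // The new recommendation method: rank items by rating count.
    static class MostRatedRecommender implements SimpleRecommender {
        private final long[][] userItemPairs; // each row: {userId, itemId}

        MostRatedRecommender(long[][] userItemPairs) {
            this.userItemPairs = userItemPairs;
        }

        public List<Long> recommend(int howMany) {
            Map<Long, Integer> counts = new HashMap<>();
            for (long[] pair : userItemPairs) {
                counts.merge(pair[1], 1, Integer::sum);
            }
            List<Long> items = new ArrayList<>(counts.keySet());
            items.sort((a, b) -> counts.get(b) - counts.get(a)); // most rated first
            return items.subList(0, Math.min(howMany, items.size()));
        }
    }

    public static void main(String[] args) {
        long[][] data = {{1, 101}, {2, 101}, {3, 101}, {1, 102}};
        SimpleRecommender rec = new MostRatedRecommender(data);
        System.out.println(rec.recommend(1)); // item 101 has the most ratings
    }
}
```

In Mahout itself, the equivalent step would be a new class in the Recommender “impl” folder, wired into your application in place of, say, SlopeOneRecommender.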

e. Links

http://mahout.apache.org/
http://www.ibm.com/developerworks/java/library/j-mahout/
http://www.manning.com/owen/
http://blog.jteam.nl/2010/04/15/mahout-taste-part-two-getting-started/
VIDEO on how to use Mahout on Eclipse: https://www.youtube.com/watch?v=yD40rVKUwPI

50 comments:

  1. hello thank u for writing this blog...its rally wonderful..

    I didnot found installing and testing step no 5

    cp.. /examples/target/grouplens.jar ./lib

    cp :cannot stat '../examples/target/grouplens.jar' :No such file directory found.

  2. Everything is great. Just the step 5 is confusing, which library???

    5. Change to the Taste-web folder, copy a library to it and run the maven command:
    $ cd taste-web
    $ cp ../examples/target/grouplens.jar ./lib

    Thank you

  3. For me step 4 was located at examples/src/main/java/org/apache/mahout/cf/taste/example

  4. Yes, "Anonymous" the classes for "grouplens" are on examples/src/main/java/org/apache/mahout/cf/taste/example. Thank you. :)

  5. I was wondering ,I am using windows 7 will it work correctly or I have to download something extra.

  6. in terminal ,do you mean java terminal or windows terminal(command prompt)

  7. Hey Anonymous, I didn't try installing Mahout on Windows. This commands were executed on Linux (Ubuntu), so its was a gnome-terminal.

    You can follow this instructions on windows using Cygwin ( http://en.wikipedia.org/wiki/Cygwin ).

  8. I downloaded cygwin successfully ,now what should i do?

  9. hello, I downloaded cygwin and i follow your instructions the core did not build I use the command (mvn -dskipTests) and the all execute successfully ,but step 4 in testing ,I cannot find taste web ,what should I do, please help me.

  10. wow site look so simple n attractive superb

  11. >I cannot find taste web ,what should I do...


    the directories have changed since the version 0.4 this tutorial was written to. you can find the new directory at:
    examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens

  12. i got problems with executing the step 5 using v0.8:

    cp ../examples/target/grouplens.jar ./lib

    there is no grouplens*.jar in any place in my computer.

  13. i am working on apache mahout ,unable to use itembasedrecommendation ,need an example

  14. Hello guys,

    I just upated the blog to show how to build Mahout 0.7. Hope this answers your questions.

  15. Dear Renata,

    thank you for the example. I followed your 'Taste' example. I have two questions:

    First you mention that I need to have HADOOP_HOME exported, however I can't see you using Hadoop a hadoop cluster for executing these examples, is Mahouts execution dependent on Hadoop libraries in any case?

    Second I get this error when executing the 'Taste Example'. I really cant imagine that I do not have enough heap available, what else could it be?

    Thanks a lot

    Fred

  16. Hello fredericstahl, you didn't post the error. If you could, send another message with the error attached. ;)

    Yes, I used Hadoop for this example. Check in the "Pre-installation" phase, step 3 is to download hadoop. Latter on, I export the HADOOP_HOME to where Hadoop was downloaded.

    Let me know if you have more doubts!

  17. Hi Renata,

    thanks for your reply. I like your tutorial. The error basically told me that no VM with enough heap could be found.

    Well, this error is resolved, I started up my hadoop cluster (single node configuration) and the error disappeared. Now that the error message disappeared I can see that this example already uses hadoop. However, this was not obvious to me from the command you gave to run the recommender. I would have expected some statement explicitly stating to run this on the cluster, if that makes sense?

    Thanks again

    Fred

  18. Nice Article! I had a question though. . Can you also tell how do i run Custom Java code on top of mahout ? I tried running the instructions withought Java application and i got no reccomendations in the output file. IS this because i am missing the Java code ?

  19. This comment has been removed by the author.

    Replies
    1. Hi. Thanks for this useful article! Though i have a question. Can you pls explain the format of UserFile?

    2. thanks Renata!! bt I think I got it... it contains the Product id and number of recommendations hopefully... is it ?? if not plz suggest the correct format...

  20. Hi. Thanks for this useful article! Though i have a question. Can you pls explain the format of UserFile?

  21. when I am running the following command :
    $ bin/mahout recommenditembased --input ratings.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
    then I am getting this error ... would you please help me to get out of this ?

    "hduser@rahul-VPCEB34EN:/usr/local/mahout-distribution-0.7$ bin/mahout recommenditembased --input ratings.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
    Warning: $HADOOP_HOME is deprecated.

    Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
    MAHOUT-JOB: /usr/local/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar
    Warning: $HADOOP_HOME is deprecated.

    14/02/02 14:47:26 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[ratings.dat], --maxPrefsPerUser=[10], --maxPrefsPerUserInItemSimilarity=[1000], --maxSimilaritiesPerItem=[100], --minPrefsPerUser=[1], --numRecommendations=[2], --output=[output/], --similarityClassname=[SIMILARITY_PEARSON_CORRELATION], --startPhase=[0], --tempDir=[temp], --usersFile=[user.dat]}
    14/02/02 14:47:27 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[ratings.dat], --maxPrefsPerUser=[1000], --minPrefsPerUser=[1], --output=[temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}
    14/02/02 14:47:29 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:54310/usr/local/hadoop/tmp/mapred/staging/hduser/.staging/job_201402021436_0002
    14/02/02 14:47:29 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: ratings.dat
    Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: ratings.dat

    Replies
    1. Hey Rahul, have you made sure your java application can access the ratings.dat file? Are they in the same directory? And does it have the right permissions?

    2. sorry I am not getting u ...
      # Hadoop Path :
      export HADOOP_HOME=/usr/local/hadoop
      export PATH=$PATH:$HADOOP_HOME/bin
      # Java Path :
      export JAVA_HOME=/usr/lib/jvm/java-7-oracle
      export PATH=$PATH:$JAVA_HOME/bin
      # Maven Path :
      export M2_HOME=/usr/local/apache-maven-2.2.1
      export PATH=$PATH:$M2_HOME/bin
      # Mahout-path
      export MAHOUT_HOME=/usr/local/mahout-distribution-0.7
      export PATH=$PATH:$MAHOUT_HOME/bin

    3. How would i check that java application can access the rating.dat file and how would i change the permission access ?

    4. I am facing this error while running above mentioned command

      ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: ratings.dat
      Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: ratings.dat

    5. try to give absolute path than relative path and see

  22. Hi,
    I am running recommendation system on a single node hadoop using mahout. It is run on movie data obtained from grouplens (100k data).
    Versions:
    hadoop version - 1.1.1
    mahout-distribution-0.9

    I am executing the following command -

    hadoop jar /home/avatar/Desktop/Dissertation/Mahout/mahout-distribution-0.9/mahout-core-0.9-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input /user/hduser/mahout/u.data --output /user/hduser/mahout/output

    After a few successful mapreduce tasks, the following error is thrown by each job-
    14/02/26 15:10:48 INFO mapred.JobClient: Task Id : attempt_201402261501_0007_m_000000_0, Status : FAILED
    Error: org.apache.lucene.util.PriorityQueue.(I)V

    What does this error mean, and how to get over with it?
    Thanks in advance!

  23. 14/04/08 16:54:21 WARN mapred.JobClient: Error reading task outputhttp://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201404081602_0004_m_000001_0&filter=stdout
    14/04/08 16:54:21 WARN mapred.JobClient: Error reading task outputhttp://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201404081602_0004_m_000001_0&filter=stderr
    14/04/08 16:54:24 INFO mapred.JobClient: Task Id : attempt_201404081602_0004_m_000001_1, Status : FAILED
    Error initializing attempt_201404081602_0004_m_000001_1:
    ENOENT: No such file or directory

  24. I tried as per your tutorial.When am trying to run the mahout with sample data,am getting the below said error.Can you help me in this?


    cyg_server@Manju-PC ~/mahout-distribution-0.7
    $ hadoop jar /C:/cygwin64/var/empty/mahout-distribution-0.7/mahout-core-0.4.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input mydata.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
    Exception in thread "main" java.io.IOException: Error opening job jar: /C:/cygwin64/var/empty/mahout-distribution-0.7/mahout-core-0.4.jar
    at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
    Caused by: java.io.FileNotFoundException: C:\cygwin64\var\empty\mahout-distribution-0.7\mahout-core-0.4.jar (The system cannot find the file specified)
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.(ZipFile.java:214)
    at java.util.zip.ZipFile.(ZipFile.java:144)
    at java.util.jar.JarFile.(JarFile.java:153)
    at java.util.jar.JarFile.(JarFile.java:90)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:88)

  25. 1. Why to take source file of mahout and then compile using maven. Zip files are also available which can be used without compilation.
    2. How to use eclipse for building mahout application.

    Replies
    1. Shan,
      yes, that is true, you may use the compiled code if you are just interested in running Mahout. Nevertheless, sometimes it is necessary to change something in the code itself. Maybe adapt it to ones need. In this case, we have to re-compile the code to make the changes effective.

      Thank you for the interest, I am planning to publish a tutorial on how to use mahout on eclipse in a couple of weeks.

    2. I just found this video showing how to use Mahout with Eclipse:
      https://www.youtube.com/watch?v=yD40rVKUwPI

    3. Hi Renata.. Can you please explain how mahout user based recommendation works using pearson corelation and nearest neighbour algorithms with formulas by taking sample example on ratings?

  26. This comment has been removed by the author.

    Replies
    1. Did you install Mahout on a directory you have permission to use? Maybe you installed it as a sudo, and now you are trying to run it as a regular user. In any case look at the permission and owner of the file, and possibly change them with chmod and chown.

    2. This comment has been removed by the author.

    3. This comment has been removed by the author.

  27. This comment has been removed by the author.

  28. Hi,
    Thank you very much for the article.
    However i have some questions.
    I have configured a cluster hadoop using cloudera and i created a 4 nodes cluster.
    I did download the mahout package and execute it on the master.
    But now i want to run mahout program using files in HDFS.
    How can i do that ?
    And what is the best ay to visualize mahout output ?

    I'm new with the world of hadoop and mahout and really don't know where to begin :s
    Thanks in advance

  30. I need the book MAhout in action plis if somebody help me. This is my email mylen88a@gmail.com thanks

  31. Hi,

    I am working with mahout random forests algorithm. I would want to know if there is a way in which we could calculate classification probabilities.

    Replies
    1. Hum, do you mean the probability of having a given classification?
