Tuesday, December 21, 2010

MapReduce

"Easy distributed computing"

MapReduce is a framework introduced by Google for processing large amounts of data.
The framework uses a simple idea derived from the map and reduce functions commonly used in functional programming (e.g., Lisp). It divides the main problem into smaller sub-problems and distributes these to a cluster of computers. It then combines the answers to these sub-problems to obtain a final answer.

MapReduce simplifies distributed computing, making it possible for users with no background in the subject to create their own distributed applications. The framework hides all the details of parallelization, data distribution, load balancing and fault tolerance; the user basically has only to specify the Map and the Reduce functions.

In the process, the input is divided into small independent chunks. The map function receives a piece of the input, processes it, and emits its answer as key/value pairs. These key/value pairs are grouped by key and given as input to the reduce function, which in turn merges the values, giving the final answer.

Each map and each reduce may be processed by a different node (computer) in the cluster, and the cluster may contain as many nodes as you have computers available in your network. The framework is responsible for dividing the input and feeding the map function. Afterwards, it collects the map outputs, groups them and sends them to the reduce function. After the reduce work is done, the framework gathers the answers into a final output.

The following picture shows the MapReduce flow.


Example of Map and Reduce functions:

 This is a canonical example of the Map and the Reduce functions. It is an application that counts the number of occurrences of each word in a large collection of documents.

void map(String name, String document):
   // name: document name
   // document: document contents
   for each word w in document:
     EmitIntermediate(w, "1");


void reduce(String word, Iterator partialCounts):
   // word: a word
   // partialCounts: a list of aggregated partial counts
   int result = 0;
   for each pc in partialCounts:
     result += ParseInt(pc);
   Emit(result);

In this example, each call to map receives a document and emits the count "1" for every word it contains.
The reduce function then gets the list of all the counts emitted for a certain word and sums them. As output, it gives the total number of occurrences of that word.
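
For readers who want to run this example, here is roughly the same word count written against Hadoop's Java API (a sketch based on the classic Hadoop WordCount example, assuming the 0.20-era "mapreduce" API; check the signatures against your Hadoop version):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word in the input chunk, emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all the partial counts emitted for a word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}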


Useful Links:


http://code.google.com/intl/pt-BR/edu/submissions/mapreduce-minilecture/listing.html
http://labs.google.com/papers/mapreduce.html
http://code.google.com/intl/pt-BR/edu/parallel/mapreduce-tutorial.html
http://code.google.com/intl/fr/edu/submissions/mapreduce/listing.html
http://www.mapreduce.org/
http://en.wikipedia.org/wiki/MapReduce

Monday, December 6, 2010

Datasets

I have been talking about recommender systems and data mining algorithms, and a clear drawback in this area of research is the scarcity of datasets to work with. So here follows a list of open datasets available on the internet to be used as test data. The links below contain different types of data, varying from implicit user web activity to explicit ratings that users have given to items. Note that I have simply gathered this data; I am just providing it here to facilitate access.


This is a well-known dataset provided by MovieLens. It is a set of explicit user ratings on items, and it also contains information about the users and the items.
It provides 3 files in the .dat format.

Dataset with implicit and explicit user ratings on books.
It offers demographic information about the users as well. The files provided are MySQL dumps.

Various types of data provided by Yahoo.

Explicit ratings from an online joke recommender system. The file is in the .xls format.

Explicit user ratings from a dating agency.

Here we have web data from 3 sources.
• Microsoft: This dataset records which areas of www.microsoft.com each user visited in a one-week timeframe in February 1998.
• Msnbc.com: Page visits of users who visited msnbc.com on September 28, 1999.
• Syskill and Webert: This database contains the HTML source of web pages plus the ratings of a single user on these web pages.

This data presents a real query log from AOL. It is an implicit type of data.

Here we have 800,000 search queries from end users' internet search activities.

This set provides data records from a restaurant recommender system.

An implicit dataset with a day's worth of all HTTP requests to the EPA WWW server.

Here we are provided with an implicit dataset with two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server.

Tuesday, November 23, 2010

Slope One

Slope One is a simple and efficient type of recommender system. Introduced by Daniel Lemire and Anna Maclachlan in 2005, it involves a simpler idea than the majority of other collaborative filtering implementations. While those usually calculate the similarity between vectors of items using the cosine or Pearson methods, the Slope One approach recommends items to users based on the average difference in preferences between items.

The main idea of the algorithm is to assume a linear relation between item preferences of the form f(x) = x + b. The name "Slope One" comes from the fact that the slope of this relation (the coefficient of x) is 1.

It basically calculates the difference between the ratings of each pair of items for every user who has rated both, and averages these differences to get an average difference (diff) for every pair of items. To predict the rating of item A for user 1, for example, it takes each rating user 1 has given to another item, adds the corresponding diff between item A and that item, and averages the results.

Let R1B be the rating that user 1 has given to item B, and DiffAB the average difference between the ratings of items A and B. Supposing we have 10 items, A through J:


Prediction(Item A) for User 1 = [(R1B + DiffAB) + (R1C + DiffAC) + ... + (R1J + DiffAJ)] / 9

Example

Suppose that we have a matrix of ratings from users to items. We have a user A that has given item I the rating 1 and item J the rating 1.5. We also have a user B that has given item I the rating 2.
Based on that, how would Slope One calculate the rating from user B to item J?



The algorithm assumes that there is a linear relation between the ratings of the items. So if the preference difference between items J and I by user A is 0.5, we apply this difference to the rating user B has given to I and obtain a rating for J.
Basically, we have:

Pref(J) by user B = Pref(I) by user B + Diff JI = 2 + 0.5 = 2.5



Another example:

The prediction from user A for item 3 would be:

  Diff (item 3, item 1) = 3 – 4 = -1
  Diff (item 3, item 2) = 3 – 3 = 0

  3 is the rating given by user A to item 1 and 5 is the rating given to item 2. So we add his given ratings to the diffs and average the results:

P(item 3) = [(-1 + 3) + (0 + 5)] / 2 = 3.5

Pseudo-Code

Below I present a pseudo-code version of the algorithm. It can be divided into two parts:

   The preprocessing phase, in which the average difference
   between all item-item preference values is calculated

for every item i
  for every other item j
    for every user u expressing preference for both i and j
      add the difference in u's preference for i and j to an average

   The prediction part

for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages
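
To make the two phases concrete, here is a minimal in-memory sketch in Java (my own illustrative code, not taken from any library; all names are hypothetical):

import java.util.HashMap;
import java.util.Map;

// A minimal, in-memory Slope One sketch (illustrative only).
public class SlopeOne {

    // userId -> (itemId -> rating)
    private final Map<Integer, Map<Integer, Double>> data;
    // packed (itemId i, itemId j) -> { sum of (rating_i - rating_j), count }
    private final Map<Long, double[]> diffs = new HashMap<Long, double[]>();

    public SlopeOne(Map<Integer, Map<Integer, Double>> data) {
        this.data = data;
        // Preprocessing: accumulate the rating difference of every
        // item pair co-rated by some user.
        for (Map<Integer, Double> ratings : data.values()) {
            for (Map.Entry<Integer, Double> ri : ratings.entrySet()) {
                for (Map.Entry<Integer, Double> rj : ratings.entrySet()) {
                    long key = pair(ri.getKey(), rj.getKey());
                    double[] sumAndCount = diffs.get(key);
                    if (sumAndCount == null) {
                        sumAndCount = new double[2];
                        diffs.put(key, sumAndCount);
                    }
                    sumAndCount[0] += ri.getValue() - rj.getValue();
                    sumAndCount[1] += 1;
                }
            }
        }
    }

    // Prediction: average (user's rating for j) + (average diff between
    // item and j) over every item j the user has rated.
    public double predict(int user, int item) {
        double sum = 0;
        int n = 0;
        for (Map.Entry<Integer, Double> rj : data.get(user).entrySet()) {
            if (rj.getKey() == item) {
                continue; // skip the item itself
            }
            double[] sumAndCount = diffs.get(pair(item, rj.getKey()));
            if (sumAndCount != null && sumAndCount[1] > 0) {
                sum += rj.getValue() + sumAndCount[0] / sumAndCount[1];
                n++;
            }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    private static long pair(int i, int j) {
        return ((long) i << 32) | (j & 0xffffffffL);
    }
}

With the data from the first example above (user A rated I = 1 and J = 1.5; user B rated I = 2), calling predict for user B and item J returns 2.5, as expected.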


Complexity

In terms of execution, the most expensive phase is the preprocessing one, which can be precomputed. The algorithm is very attractive because its online portion, the prediction phase, is fast.
Its running time does not depend on the number of users, since predictions use only the precomputed average differences between pairs of items. Supposing we have m users and n items, computing the average differences could take up to m·n² time steps. Storing the diff matrix could also be expensive: it could take up to n(n-1)/2 units of storage.
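
To put the storage bound in perspective: with n = 10,000 items, the diff matrix could need up to 10,000 × 9,999 / 2 ≈ 50 million entries, which is why implementations typically store only the pairs of items that have actually been co-rated.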

Useful links:

Daniel Lemire and Anna Maclachlan. Slope One Predictors for Online Rating-Based Collaborative Filtering. SDM 2005.

Thursday, November 4, 2010

Apache Mahout



“Scalable machine learning library”

Mahout is a solid Java framework in the artificial intelligence area. It is a machine learning project by the Apache Software Foundation that builds intelligent algorithms that learn from data input.

What is special about Mahout is that it is a scalable library, prepared to deal with huge datasets. Its algorithms are built on top of the Apache Hadoop project, so they work with distributed computing.

Mahout offers algorithms in three major areas: clustering, categorization and recommender systems. This last part was incorporated on April 4th, 2008, from the previous Taste Recommender System project.

Mahout currently implements a collaborative filtering engine that supports the user-based, item-based and Slope One recommender approaches. Other algorithms available in the package are k-means, fuzzy k-means, Canopy, Dirichlet and Mean-Shift clustering. There are also the Naive Bayes, Complementary Naive Bayes and Random Forest decision-tree-based classifiers.

The project has a commercial-friendly license (meaning that modifications to the code can be kept proprietary) and a vast community of users and developers, so it is highly recommended!

a. Taste
Taste is the recommender system part of Mahout and it provides a very consistent and flexible collaborative filtering engine. It supports the user-based, item-based and Slope One recommender approaches. It can be easily modified due to its well-structured module abstractions. The package defines the following interfaces:

  • DataModel
  • UserSimilarity and ItemSimilarity
  • UserNeighborhood
  • Recommender


With these interfaces, it is possible to adapt the framework to read different types of data, personalize your recommendations or even create new recommendation methods. Below is presented a figure with Taste's architecture.
The UserSimilarity and ItemSimilarity abstractions are represented here by the box named "Similarity". These interfaces are responsible for calculating the similarity between a pair of users or items. Their functions usually return a value from 0 to 1 indicating the level of resemblance, with 1 being the most similar possible.
Access to the dataset is made through the DataModel interface. It is possible to retrieve and store the data in databases or in filesystems (MySQLJDBCDataModel and FileDataModel, respectively). The functions defined in this interface are used by the Similarity abstractions to help compute the similarity.
The main interface in Taste is Recommender. It is responsible for actually making the recommendations to the user, by comparing items or by determining users with similar taste (the item-based and user-based techniques). The Recommender accesses the Similarity interface and uses its functions to compare pairs of users or items. It then collects the highest similarity values to offer as recommendations.

The UserNeighborhood is an assistant interface that helps define the neighborhood in the user-based recommendation technique.
It is known that for larger datasets the item-based technique provides better results, and for that reason many companies, such as Amazon, choose this approach. With the Mahout framework it is not different: the item-based method generally runs faster and provides more accurate recommendations. A small sketch of how these interfaces fit together follows.
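
For illustration, here is a minimal sketch of a user-based recommender wired from these four abstractions (the class names come from Taste's implementation packages in Mahout 0.7; the data file name is just an example):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedExample {
    public static void main(String[] args) throws Exception {
        // DataModel: reads "userid,itemid,rating" lines from a file
        DataModel model = new FileDataModel(new File("mydata.dat"));
        // UserSimilarity: Pearson correlation between two users' rating vectors
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // UserNeighborhood: the 10 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        // Recommender: ranks items preferred by the neighborhood
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 2 recommendations for the user with id 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 2);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}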

b. Installation

Here follows a step-by-step guide to install and test the Mahout recommender system.
First, it is necessary to have the project manager Maven, which is very easy to use: it runs simply by calling the "mvn install" command. This command compiles the code and downloads missing packages. Maven uses a "pom.xml" file for configuration, but the Mahout project already comes with this file.
To test Taste, it is possible to get a dataset from MovieLens. It is a package with 3 files in the .dat format, presenting users, items, ratings and some information about the users and items.


Pre-installation

1. Make sure you have at least the Java JDK 1.6

$ javac -version
javac 1.6.0_26

2. Make sure you have the project manager Maven installed in your computer.

$ mvn -version
Apache Maven 2.2.1 (rdebian-1)

3. Download a Hadoop version from http://www.apache.org/dyn/closer.cgi/hadoop/common/. I downloaded archive.apache.org/dist/hadoop/core/hadoop-0.20.204.0/, which is the version required by Mahout 0.7 in pom.xml (check the hadoop.version property).

Mahout Installation

1. Download the Mahout package:
https://cwiki.apache.org/confluence/display/MAHOUT/Downloads

I downloaded http://ftp.unicamp.br/pub/apache/mahout/0.7/mahout-distribution-0.7-src.tar.gz (for development purposes it is even better to check out the code with svn co http://svn.apache.org/repos/asf/mahout/trunk).

2. Unpack the Mahout downloaded package

$ tar -xvzf mahout-distribution-0.7-src.tar.gz

3. In a terminal, change to the mahout directory and compile the code using Maven:

$ cd mahout-distribution-0.7/
$ mvn install

With this, you will have compiled Mahout's code and run the unit tests that come with it, to make sure everything is ok with the component.

If the build was successful you should see something like:

[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 54 minutes 28 seconds
[INFO] Finished at: Tue Jul 09 11:17:08 BRT 2013
[INFO] Final Memory: 70M/375M
[INFO] ------------------------------------------------------------------------



 Executing Taste Recommender

1. Get your data in the following format:

userid, itemid, rating

For example, copy the following data and name it mydata.dat:

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
   

2. Make sure you set your JAVA_HOME, HADOOP_HOME and MAHOUT_HOME, for example:

$ export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk/
$ export HADOOP_HOME=/home/myuser/Downloads/hadoop-0.20.204.0/

$ export MAHOUT_HOME=/path/to/mahout-distribution-0.7

And put them on your PATH:


$ export PATH=$HADOOP_HOME/bin:$PATH
$ export PATH=$MAHOUT_HOME/bin:$PATH

3. Now, run:

$ bin/mahout recommenditembased --input mydata.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
 

The usersFile is where you list the users for whom you want recommendations. You can change numRecommendations to the number of recommendations you desire.
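
For example, to request recommendations for users 1 and 3, user.dat would contain one user id per line (this is the format I used; adjust it if your Mahout version expects something different):

1
3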

Also, for similarityClassname, you can choose any one you like from the list below:

  • SIMILARITY_COOCCURRENCE
  • SIMILARITY_LOGLIKELIHOOD
  • SIMILARITY_TANIMOTO_COEFFICIENT
  • SIMILARITY_CITY_BLOCK
  • SIMILARITY_COSINE
  • SIMILARITY_PEARSON_CORRELATION       
  • SIMILARITY_EUCLIDEAN_DISTANCE

Another Example

You can also test it with a real dataset, for example the one from MovieLens.
1. Download a copy of the “1 million” dataset from MovieLens:

2. Unpack the MovieLens data (you will find 3 files: movies.dat, ratings.dat and users.dat).
3. Edit the ratings.dat file, so that it is in the userid,itemid,rating format, and not in the userid::itemid::rating format (one way to do this is shown below).
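
One way to do the conversion, assuming GNU sed is available, is to replace the "::" separators in place:

$ sed -i 's/::/,/g' ratings.dat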

4. Copy the movies.dat and ratings.dat to your Mahout directory.
5. Run:

$ bin/mahout recommenditembased --input ratings.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
  
c. Your Own Examples

To create your own recommender system example, it is necessary to construct a Java application defining which approach is going to be used.
If you choose, for example, the Slope One technique, the code may be:
 // Load the "userid,itemid,rating" input file
 DataModel model = new FileDataModel(new File("data.txt"));
 // Slope One recommender built on the average item-item rating differences
 Recommender recommender = new SlopeOneRecommender(model);
 // Wrap it in a cache so repeated queries are answered quickly
 Recommender cachingRecommender = new CachingRecommender(recommender);

This code reads the data input called "data.txt" and passes it to the Recommender interface. It then provides recommendations to the user based on the Slope One technique.
It is possible to construct similar examples using other approaches, such as a combination of the user-based method with the Pearson correlation similarity, or an item-based approach with the log-likelihood similarity measurement.
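
Once the recommender is built, asking it for suggestions is a one-liner. For example, the following call (Taste's Recommender.recommend method) would fetch the top 5 items for the user with id 1:

 List<RecommendedItem> recommendations = cachingRecommender.recommend(1, 5);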


d. Modifying

To modify the Mahout framework, it is advisable to focus on the interface that you are willing to change. The implementations of the interfaces reside in a folder inside the Mahout package called "impl", so it is necessary to check which interface is going to be modified and then take advantage of Mahout's structure.
If it is desired, for example, to add a new type of recommendation, it would simply be necessary to add another Java file implementing the Recommender interface, and call it in your final Java application.

 e. Links

http://mahout.apache.org/
http://www.ibm.com/developerworks/java/library/j-mahout/
http://www.manning.com/owen/
http://blog.jteam.nl/2010/04/15/mahout-taste-part-two-getting-started/
VIDEO on how to use Mahout on Eclipse: https://www.youtube.com/watch?v=yD40rVKUwPI

Thursday, October 21, 2010

Collaborative Filtering

"If an user A has liked the movies "Matrix " and "The Lord of the Rings" and many other users that have liked these two movies also liked "Memento", then it is likely that "Memento" will be recommended to user A."

Collaborative filtering is a widely implemented type of recommender system, and it is known for giving more accurate predictions than other approaches.

The basic idea of the algorithms in the collaborative filtering area is to provide recommendations based on what people with similar taste have liked in the past. These people, the neighbors, are selected by comparing the user's past preferences (usually expressed as ratings on items). So, by measuring the similarity of the ratings, it is possible to recommend items liked by the neighborhood. There are two major techniques for comparing ratings.


User-Based

Let us consider a user as an N-dimensional vector of ratings, where each cell represents the rating of the user on a certain item. Then, in the user-based technique, the similarity between a user A and a user B is measured by the similarity between their vectors. It is possible, for example, to apply the Pearson correlation to the two vectors and find what we call the similarity coefficient between A and B (SimAB). Once we have done this, we can rank each item according to how many similar users have liked it.
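
As an illustration, here is a small Java sketch of the Pearson correlation between two users' rating vectors (my own code; it assumes the two arrays already hold the ratings of the items both users rated, in the same order):

// Pearson correlation between two equally long rating vectors.
// Returns a value in [-1, 1]; values near 1 mean very similar users.
static double pearson(double[] a, double[] b) {
    int n = a.length;
    double sumA = 0, sumB = 0, sumAB = 0, sumA2 = 0, sumB2 = 0;
    for (int i = 0; i < n; i++) {
        sumA += a[i];
        sumB += b[i];
        sumAB += a[i] * b[i];
        sumA2 += a[i] * a[i];
        sumB2 += b[i] * b[i];
    }
    double numerator = sumAB - sumA * sumB / n;
    double denominator = Math.sqrt((sumA2 - sumA * sumA / n) * (sumB2 - sumB * sumB / n));
    return denominator == 0 ? 0 : numerator / denominator;
}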
  

Item-Based

In the item-based method, instead of comparing the vectors of users we compare items. Suppose that all the user vectors are put together in a matrix. It is then possible to look at its columns instead of its rows: if the rows represent the users, then the columns intuitively represent the items. The basic idea of the algorithm is to compare the columns of this matrix and thus find the similarity between the vectors of items. The prediction is then made by selecting items similar to the ones that the target user has preferred. A short sketch with Mahout follows below.
This approach was implemented by Amazon, and it is described in Amazon.com recommendations: Item-to-item collaborative filtering by Greg Linden, Brent Smith, and Jeremy York.
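
In Mahout's Taste, wiring an item-based recommender is even shorter than the user-based one, since no neighborhood is needed. A rough sketch (the class names are from Taste's implementation packages; the file name is just an example):

 DataModel model = new FileDataModel(new File("data.txt"));
 ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
 Recommender recommender = new GenericItemBasedRecommender(model, similarity);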


Links

http://www.serpentine.com/blog/2006/12/12/collaborative-filtering-made-easy/
http://ict.ewi.tudelft.nl/~jun/CollaborativeFiltering.html#bibliography
http://www.daniel-lemire.com/fr/abstracts/TRD01.html

Monday, October 18, 2010

Recommender Systems

"Suggest new items that fit the user’s preference."
 

Introduction

The increasing amount of information on the web has promoted the advance of the recommender systems research area.
These systems help users by offering useful suggestions to them. The aim of recommender systems is to provide personalized recommendations, and they play a fundamental role in e-commerce (widely used by companies such as Amazon, Netflix and Google).
They highlight items that the users have not yet seen and may appreciate. Such items include books, restaurants, webpages or even lifestyles. A suggestion is usually made based on the user's historical preferences.
These preferences may be collected implicitly or explicitly. When a user is buying an item or entering a web page, for example, he is giving implicit preference feedback. When a user gives a rating to an article, he is providing explicit feedback.
A substantial challenge in this area is the volume of information available. A recommendation algorithm may have to deal with enormous datasets, and because of that, scalability and effectiveness have to be taken into account.
These systems are usually classified based on how recommendations are made. The most common categories are: Collaborative Filtering, Content-Based Filtering and Hybrid Filtering.

Recommender Systems Classification 

  • Content-Based Filtering
"If an user has liked the movie "Titanic", a typical recommendation would be "Ghost" because both of the items present the feature "Romance category". "

This approach relates the user's past preferences to content information. Therefore, it identifies objects of interest to the user based on the associated features of items.
The content-based technique creates a user profile based on his preferences on items. This profile is then compared with information about other items. Those with the most similarity, namely the items with descriptions closest to the profile, are then recommended to the user.

  • Collaborative Filtering
"If an user A has liked the movie "Matrix " and "The Lord of the Rings" and many other users that have liked these two movies also liked "Memento", then it is likely that "Memento" will be recommended to user A."

The second technique recommends an item by comparing a user with a neighborhood of users. In the collaborative filtering method, we are presented with a set of ratings a user has given to items. The method then identifies a group of other users and compares the ratings of the first with the ratings of the group.

  • Hybrid Filtering
These two major recommender techniques each present strengths and weaknesses. Collaborative filtering doesn't need much information about the items; however, it presents some problems when a new item is inserted in the system, since no one has rated it yet. Although the content-based method does not have this new-item problem, it can lack originality when giving recommendations: the recommended items are those very similar to the ones that the user has already seen, leading to uncreative suggestions. In addition, this method works well for items with good textual descriptions, but it can fail with other types of multimedia items.
A hybrid approach is then proposed to overcome these problems. Hybrid filtering combines the two methods cited above to avoid the problems found in each. In Hybrid Recommender Systems: Survey and Experiments, Robin Burke presents many options for combining the two approaches and creating a hybrid method.



Where can I get a Recommender System?

If you want your own recommender system, you have two options: you can either build your own from scratch or download a package from the internet and adapt it to your needs.
In case you are planning to build your own, there are many good articles on the subject available. I suggest that you understand your own dataset, discover which approach is better for you and then develop it. As test data, you can use the package from MovieLens (from GroupLens Research), a set of ratings that users have given to movies.
If, however, you can't spend that much time and effort on the subject, you may download an open source project from the internet. I recommend the Apache Mahout project, built in Java.


Some useful links and articles

Articles:
Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: Item-to-item collaborative filtering.

Robin Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction.

Badrul M. Sarwar, George Karypis, Joseph A. Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms.

Links:
    http://mahout.apache.org/

    http://www.grouplens.org/

    http://ict.ewi.tudelft.nl/~jun/CollaborativeFiltering.html

    http://glaros.dtc.umn.edu/gkhome/