Posts

Recommender System Implementation

Hello, I have just posted an implementation of the Slope One recommender system in Java. I designed it to work with the MovieLens dataset (a collection of movie ratings). I hope it helps anyone who is starting to develop their own recommender system. https://github.com/renataghisloti/SlopeOne-with-Movielens-Dataset

Articles about Recommender Systems, Mahout and Hadoop Framework

Seeing that recommender systems have drawn a lot of attention this past year, I would like to recommend further reading for those who want to learn more about the subject. Here are some articles that have helped me study the matter: G. Adomavicius and A. Tuzhilin. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. 2005. This article introduces recommender systems very well. It explains the three main types of these systems (content-based, collaborative filtering and hybrid recommendation). It also gives a formal mathematical definition of a recommender system, which some readers will appreciate. I highly recommend any other article by Adomavicius that you may find. Laurent Candillier, Frank Meyer, Kris Jack, Françoise Fessant. State-of-the-Art Recommender Systems. This paper also provides a great overview of recommender systems and a very interesting comparison between...

Apache Hadoop for Beginners

Apache Hadoop is a framework for distributed computing applications, inspired by Google's MapReduce and GFS papers. It is open source software that enables the processing of massive amounts of data on commodity hardware. First introduced by Doug Cutting, who named the project after his son's toy (a yellow elephant), Hadoop is now one of the major Apache projects. It has many contributors and users around the world, such as Yahoo!, IBM, Facebook and many others. The framework has a master/worker, shared-nothing architecture. A Hadoop cluster is composed of a group of nodes (computers), where one node is the master server and the others are the workers. The NameNode daemon and the JobTracker daemon usually run on the master node. The NameNode daemon keeps file metadata, and the JobTracker daemon manages the MapReduce tasks executed on the cluster. The management and monitoring of tasks are made by the Hadoop serv...

MapReduce

"Easy distributed computing" MapReduce is a framework introduced by Google for processing large amounts of data. The framework uses a simple idea derived from the well-known map and reduce functions used in functional programming (e.g. LISP). It divides the main problem into smaller sub-problems and distributes them to a cluster of computers. It then combines the answers to these sub-problems to obtain a final answer. MapReduce makes distributed computing easier, so that users with no knowledge of the subject can create their own distributed applications. The framework hides all the details of parallelization, data distribution, load balancing and fault tolerance, and the user basically only has to specify the Map and the Reduce functions. In the process, the input is divided into small independent chunks. The map function receives a piece of the input, processes it, and emits its result as key/value pairs. These k...

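To make the map and reduce steps from the MapReduce excerpt more concrete, here is a minimal in-memory sketch of a word count in plain Java. It does not use the Hadoop API and runs on a single machine; the class name `WordCountSketch` and the methods `map`, `reduce` and `run` are hypothetical names chosen for this illustration, and the grouping step in `run` only imitates what a real framework would do across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of the map/group/reduce flow.
class WordCountSketch {

    // Map step: turn one input chunk into (word, 1) key/value pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: combine all values emitted for one key into a single total.
    static int reduce(String key, List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Driver: applies map to every chunk, groups the pairs by key,
    // then applies reduce to each group.
    static Map<String, Integer> run(List<String> chunks) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String chunk : chunks) {
            for (Map.Entry<String, Integer> pair : map(chunk)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each string stands in for one independent chunk of the input.
        System.out.println(run(List.of("the cat", "the dog the")));
    }
}
```

In this sketch only `map` and `reduce` contain problem-specific logic, which mirrors the point above that the user mainly specifies those two functions.
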
Datasets

I have been talking about recommender systems and data mining algorithms, and a clear drawback in this area of research is the scarcity of datasets to work with. So here follows a list of open datasets available on the internet that can be used as test data. The links below contain different types of data, ranging from implicit user web activity to explicit ratings that users have given to items. Note that I have simply gathered this data; I am just providing it here to make access easier. http://grouplens.org/datasets/movielens/ This is a well-known dataset provided by MovieLens. It is a set of explicit user ratings on items. It also contains information about the users and the items. It provides 3 files in the .dat format. http://www.informatik.uni-freiburg.de/~cziegler/BX/ Dataset with implicit and explicit user ratings on books. It offers demographic information about the users as well. The files are provided in MySQL format. http://webscope.sandbox.yahoo.com/ Vario...

Slope One

Slope One is a simple and efficient type of recommender system. Introduced by Daniel Lemire and Anna Maclachlan in 2005, it involves a simpler idea than the majority of other collaborative filtering implementations. While these usually calculate the similarity between vectors of items using the cosine or the Pearson methods, the Slope One approach recommends items to users based on the average difference in preferences between items. The main idea of the algorithm is to create a linear relation between item preferences of the form F(x) = x + b. The name "Slope One" comes from the fact that here the "x" is multiplied by "1". It basically calculates the difference between the ratings of items for each user (for every item the user has rated). Then, it creates an average difference (diff) for every pair of items. To make a prediction of Item A for User 1, for example, it would get the ratings that User 1 has given to other items and a...
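
As a rough illustration of the average-difference idea in the Slope One excerpt, here is a minimal Java sketch of the basic (unweighted) Slope One prediction. It is not the code from the repository linked in the first post; the class name `SlopeOneSketch`, the methods `averageDiff`, `predict` and `sampleRatings`, and the tiny ratings table are all assumptions made for this example.

```java
import java.util.Map;

// Illustrative only: basic (unweighted) Slope One on an in-memory ratings table.
class SlopeOneSketch {

    // Average of (rating of target - rating of other) over all users who
    // rated both items; NaN when no user rated both.
    static double averageDiff(Map<String, Map<String, Double>> ratings,
                              String target, String other) {
        double sum = 0.0;
        int count = 0;
        for (Map<String, Double> userRatings : ratings.values()) {
            Double t = userRatings.get(target);
            Double o = userRatings.get(other);
            if (t != null && o != null) {
                sum += t - o;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    // Predicts the user's rating for target: for every other item the user
    // rated, add the average diff to that rating, then average the results.
    static double predict(Map<String, Map<String, Double>> ratings,
                          String user, String target) {
        Map<String, Double> own = ratings.getOrDefault(user, Map.of());
        double sum = 0.0;
        int used = 0;
        for (Map.Entry<String, Double> rated : own.entrySet()) {
            if (rated.getKey().equals(target)) {
                continue;
            }
            double diff = averageDiff(ratings, target, rated.getKey());
            if (!Double.isNaN(diff)) {
                sum += rated.getValue() + diff;
                used++;
            }
        }
        return used == 0 ? Double.NaN : sum / used;
    }

    // Hypothetical toy data: user -> (item -> rating).
    static Map<String, Map<String, Double>> sampleRatings() {
        return Map.of(
            "John", Map.of("A", 5.0, "B", 3.0, "C", 2.0),
            "Mark", Map.of("A", 3.0, "B", 4.0),
            "Lucy", Map.of("B", 2.0, "C", 5.0));
    }

    public static void main(String[] args) {
        // Lucy has not rated item A, so we predict it from her B and C ratings.
        System.out.println(predict(sampleRatings(), "Lucy", "A"));
    }
}
```

The sketch recomputes the average differences on every call, which keeps it short; a real implementation would typically precompute the diff for every pair of items once and might weight each diff by the number of users behind it.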

Apache Mahout

“Scalable machine learning library” Mahout is a solid Java framework in the Artificial Intelligence area. It is a machine learning project by the Apache Software Foundation that aims to build intelligent algorithms that learn from input data. What is special about Mahout is that it is a scalable library, prepared to deal with huge datasets. Its algorithms are built on top of the Apache Hadoop project, so they work with distributed computing. Mahout offers algorithms in three major areas: Clustering, Categorization and Recommender Systems. This last part was incorporated on April 4th, 2008, from the previous Taste Recommender System project. Mahout currently implements a collaborative filtering engine that supports the user-based, item-based and Slope One recommender systems. Other algorithms available in the package are k-means, fuzzy k-means clustering, Canopy, Dirichlet and Mean-Shift. It also has Naive Bayes, Complement...