
Showing posts from 2010

MapReduce

"Easy distributed computing" MapReduce is a framework introduced by Google for processing larges amounts of data . The framework uses a simple idea derived from the commonly known map and reduce functions used in functional programming (ex: LISP). It divides the main problem into smaller sub-problems and distribute these to a cluster of computers . It then combines the answers to these sub-problems to obtain a final answer . MapReduce facilitates the process of distributed computing making possible that users with no knowledge on the subject create their own distributed applications. The framework hides all the details of parallelization , data distribution load balancing and fault tolerance and the user basically has only to specify the Map and the Reduce functions. In the process, the inp ut is divided into small independent chunks. The map function receives a piece of the input, processes it, and passes the input in the format key/value pair as answer. These k

Datasets

I have been talking about recommender systems and data mining algorithms, and a clear drawback in this area of research is the scarcity of datasets to work with. So here is a list of open datasets available on the internet to be used as test data. The links below contain different types of data, varying from implicit user web activity to explicit ratings that users have given to items. Note that I have simply gathered these links; I am providing them here to make access easier.

http://grouplens.org/datasets/movielens/ This is a well-known dataset provided by MovieLens. It is a set of explicit user ratings on items, and it also contains information about the users and the items. It provides three files in .dat format.

http://www.informatik.uni-freiburg.de/~cziegler/BX/ A dataset with implicit and explicit user ratings on books. It offers demographic information about the users as well. The files are provided as MySQL dumps.

http://webscope.sandbox.yahoo.com/ Various datasets provided by Yahoo!…
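As a quick start, the Java snippet below loads the MovieLens ratings file into memory. It assumes the "::"-separated layout (UserID::MovieID::Rating::Timestamp) used by the .dat releases; check the README that ships with the dataset you download, since the exact layout varies between versions.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    // Reads MovieLens-style ratings, assuming lines like "1::1193::5::978300760"
    // (UserID::MovieID::Rating::Timestamp). Verify against the dataset's README.
    public class MovieLensLoader {
        public static void main(String[] args) throws IOException {
            Map<Integer, Map<Integer, Double>> ratings = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get("ratings.dat"))) {
                String[] f = line.split("::");
                int user = Integer.parseInt(f[0]);
                int movie = Integer.parseInt(f[1]);
                double rating = Double.parseDouble(f[2]); // f[3] is the timestamp
                ratings.computeIfAbsent(user, u -> new HashMap<>()).put(movie, rating);
            }
            System.out.println("Loaded ratings for " + ratings.size() + " users");
        }
    }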

Slope One

Slope One is a simple and efficient type of recommender system. Introduced by Daniel Lemire and Anna Maclachlan in 2005, it is built on a simpler idea than the majority of other collaborative filtering implementations. While these usually calculate the similarity between item vectors using the cosine or Pearson methods, the Slope One approach recommends items to users based on the average difference between item ratings. The main idea of the algorithm is to relate item preferences through a linear function of the form f(x) = x + b. The name "Slope One" comes from the fact that the coefficient of x, the slope, is 1. The algorithm calculates the difference between the ratings of each pair of items for every user who has rated both, and from these it computes an average difference (diff) for every pair of items. To predict item A for user 1, for example, it takes each rating that user 1 has given to another item, adds the average difference between item A and that item, and averages the results.
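The Java sketch below shows the unweighted version of this scheme. The in-memory layout (a map from user to item ratings) and the toy data are my own choices for illustration, not from the paper, and a real implementation would precompute the diff values instead of recomputing them per query.

    import java.util.*;

    // A compact sketch of unweighted Slope One over in-memory ratings.
    public class SlopeOneSketch {

        // Average difference diff(a, b) over users who rated both items.
        static double avgDiff(Map<String, Map<String, Double>> ratings, String a, String b) {
            double sum = 0; int n = 0;
            for (Map<String, Double> userRatings : ratings.values()) {
                if (userRatings.containsKey(a) && userRatings.containsKey(b)) {
                    sum += userRatings.get(a) - userRatings.get(b);
                    n++;
                }
            }
            return n == 0 ? 0 : sum / n;
        }

        // Predict user's rating of target: average of rating(j) + diff(target, j)
        // over the other items j the user has rated.
        static double predict(Map<String, Map<String, Double>> ratings, String user, String target) {
            double sum = 0; int n = 0;
            for (Map.Entry<String, Double> e : ratings.get(user).entrySet()) {
                if (e.getKey().equals(target)) continue;
                sum += e.getValue() + avgDiff(ratings, target, e.getKey());
                n++;
            }
            return n == 0 ? 0 : sum / n;
        }

        public static void main(String[] args) {
            Map<String, Map<String, Double>> ratings = new HashMap<>();
            ratings.put("user1", Map.of("itemB", 3.0, "itemC", 4.0));
            ratings.put("user2", Map.of("itemA", 4.0, "itemB", 2.0, "itemC", 5.0));
            ratings.put("user3", Map.of("itemA", 3.5, "itemB", 4.0));
            System.out.println("Predicted itemA for user1: " + predict(ratings, "user1", "itemA"));
        }
    }

The weighted variant described in the paper additionally weights each term by the number of users who rated both items, so that differences supported by more users count for more.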

Apache Mahout

“Scalable machine learning library” Mahout is a solid Java framework in the artificial intelligence area. It is a machine learning project by the Apache Software Foundation that aims to build intelligent algorithms that learn from data. What is special about Mahout is that it is a scalable library, prepared to deal with huge datasets. Its algorithms are built on top of the Apache Hadoop project, so they work with distributed computing. Mahout offers algorithms in three major areas: Clustering, Categorization, and Recommender Systems. This last part was incorporated on April 4th, 2008, from the previous Taste Recommender System project. Mahout currently implements a collaborative filtering engine that supports user-based, item-based, and Slope One recommender systems. Other algorithms available in the package are k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift clustering. They also have Naive Bayes, Complementary Naive Bayes, and…
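To give a flavor of the recommender side, here is roughly what a user-based recommender looks like with the Taste API inside Mahout. Package names and constructors follow the Mahout 0.x line and may differ in other versions; "ratings.csv" is a placeholder for a file in Taste's userID,itemID,rating format.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteExample {
        public static void main(String[] args) throws Exception {
            // Each line of ratings.csv: userID,itemID,rating
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 5 recommendations for user 1
            List<RecommendedItem> items = recommender.recommend(1, 5);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }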

Collaborative Filtering

"If an user A has liked the movies "Matrix " and "The Lord of the Rings" and many other users that have liked these two movies also liked "Memento", then it is likely that "Memento" will be recommended to user A." Collaborative Filtering is a type of recommender system widely implemented, and it is known for giving more accurated predictions than other approaches. The basic idea of the algorithms in the collaborative filtering area is to provide recommendations based on what people with similar taste have liked in the past. These people, the neighbors, are selected by comparing the user's past preferences (usually presented as ratings on items). So, by measuring the ratings similarity its possible to recommend items liked by the neighborhood. There are two major techniques to compare ratings. User-Based Let us consider a user as an N-dimensional vector of ratings, where each cell represents the rating

Recommender Systems

"Suggest new items that fit the user’s preference."   Introduction The increasing amount of information in the web has promoted the advance of the recommender systems research area.  These systems help users by offering useful suggestions to them . The aim of Recommender Systems is to provide personalized recommendations, representing a fundamental role on e-commerce (widely used by companies such as Amazon , Netflix and Google ). They highlight items that the users have not yet seen and may appreciate. Such items include books, restaurants, webpages or even lifestyles. A suggestion is usually made based on the user's historical preferences. These preferences may be collected implicitly or explicitly . When a user is buying an item, or entering a web-page, for example, he is giving an implicit preference feedback. In the case of a user giving a rating to an article, he is providing an explicit feedback. A substantial challenge in this ar