Monday, December 6, 2010

Datasets

I have been talking about recommender systems and data mining algorithms, and a clear drawback in this area of research is the scarcity of datasets to work with. So here is a list of open datasets available on the internet that can be used as test data. The links below contain different types of data, ranging from implicit user web activity to explicit ratings that users have given to items. Note that I have simply gathered these datasets; I am listing them here only to make them easier to access.


This is a very well-known dataset provided by MovieLens. It is a set of explicit user ratings on items, and it also contains information about the users and the items.
It is distributed as 3 files in the .dat format.
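
As a rough sketch of how such a ratings file could be read: the code below assumes each line follows a `UserID::MovieID::Rating::Timestamp` layout, which is how I understand the MovieLens 1M release to be laid out (please check the README that ships with the download). The names `Rating` and `parse_ratings` and the sample lines are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Rating(NamedTuple):
    """One explicit rating given by a user to an item."""
    user_id: int
    item_id: int
    rating: float
    timestamp: int


def parse_ratings(lines: Iterable[str], sep: str = "::") -> List[Rating]:
    """Parse 'user<sep>item<sep>rating<sep>timestamp' lines, skipping blank ones."""
    ratings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item, score, ts = line.split(sep)
        ratings.append(Rating(int(user), int(item), float(score), int(ts)))
    return ratings


# Made-up sample lines in the assumed layout, not taken from the dataset.
sample = ["1::10::5::1000000000", "2::10::3::1000000100", ""]
parsed = parse_ratings(sample)
```

With a real download, an open file handle can be passed in place of `sample`, since the function accepts any iterable of lines.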

A dataset with implicit and explicit user ratings on books.
It offers demographic information about the users as well. The files are provided in MySQL format.

Various types of data provided by Yahoo.

Explicit ratings from an online joke recommender system. The file is in the .xls format.

Explicit user ratings from a dating agency.

Here we have web data from 3 sources.
• Microsoft: This dataset records which areas of www.microsoft.com each user visited in a one-week timeframe in February 1998.
• Msnbc.com: Page visits of users who visited msnbc.com on September 28, 1999.
• Syskill and Webert: This database contains the HTML source of web pages plus the ratings of a single user on these web pages.

This is real query log data from AOL. It is an implicit type of data.

Here we have 800,000 search queries from end-user internet search activity.

This set provides data records from a restaurant recommender system.

An implicit dataset with a day's worth of all HTTP requests to the EPA WWW server.

Here we are provided with an implicit dataset with two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server.
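
Web server logs like the two HTTP request datasets above usually store one request per line. Here is a minimal sketch of turning such lines into implicit feedback (requests counted per host), assuming the Common Log Format. The format, the name `parse_log_line`, and the sample lines are assumptions for illustration and are not copied from the datasets.

```python
import re
from collections import Counter
from typing import Dict, Optional

# Assumed layout: host ident user [timestamp] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Made-up sample lines in the assumed format.
sample = [
    'host1.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1024',
    'host1.example.com - - [01/Jul/1995:00:00:09 -0400] "GET /images/logo.gif HTTP/1.0" 200 -',
    'not a log line',
]

# Keep only the lines that matched, then count requests per host.
records = [r for r in map(parse_log_line, sample) if r is not None]
requests_per_host = Counter(r["host"] for r in records)
```

Lines that do not match the pattern are dropped rather than raising an error, which seems the safer choice for raw logs that may contain malformed entries.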

4 comments:

  1. Nice blog, it will definitely help others.

  2. Thanks, this is what I was looking for. I will try them and then come back to comment. I noticed this was published in 2010. Do you know of any newer datasets currently available?

    1. Hmm, I haven't been working with this lately, so unfortunately no. =/ But I will try to look for more datasets and update the post later.
