Posts

Showing posts with the label Data Mining

Error when using smooth.spline

When trying to interpolate a series of data points, a cubic spline is a great technique to use. I chose the smooth.spline function from the R stats package:

> smooth.spline(data$x, data$y)

Nevertheless, while running smooth.spline on a collection of datasets of different sizes, I got the following error:

Error in smooth.spline(data$x, data$y) : 'tol' must be strictly positive and finite

After digging a little, I discovered that the problem was that some datasets were really small and smooth.spline couldn't compute anything on them. Hence, make sure your dataset is big enough before applying smooth.spline to it:

> if(length(data$x) > 30) { smooth.spline(data$x, data$y) }

UPDATE: A more general solution would be:

> if(IQR(data$x) > 0) { smooth.spline(data$x, data$y) }
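Putting the pieces together, a defensive wrapper could look like the sketch below (the tryCatch fallback and the four-unique-values check are my additions, not from the original post):

# A minimal sketch: skip degenerate inputs (smooth.spline needs at least
# four unique x values) and catch any remaining error. The tryCatch
# wrapper is an assumption, not part of the original post.
fit_spline <- function(x, y) {
  if (length(unique(x)) < 4 || IQR(x) == 0) return(NULL)
  tryCatch(smooth.spline(x, y), error = function(e) NULL)
}

fit <- fit_spline(data$x, data$y)
if (!is.null(fit)) plot(predict(fit), type = "l")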

Working with Big Datasets in R

When dealing with a significant amount of data in R, there are some points to consider.

How do I know if my data is too big? Well, the term "BigData" can be thought of as data that is too big to fit in the available memory. As R works with the entire dataset in memory (unless you specify otherwise), the first thing is to check how large the dataset in question is, and whether it fits in memory. Remember that you should actually have at least double the memory of the size of your dataset. So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory.

If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately. You can use the command split to do this in Linux:

split -l 10000 file.txt new_file

This should create several new files (new_fileaa, new_fileab, etc.) with ten thousand lines each. Once you know your data will fit into memory, you can read it with th...
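As a quick illustration of the chunked approach, here is one way to process a large file piece by piece in R (the file name and the processing step are placeholders):

# A minimal sketch: process a big file in 10,000-line chunks instead of
# loading it all at once. "file.txt" and the chunk handling are placeholders.
con <- file("file.txt", open = "r")
repeat {
  chunk <- tryCatch(read.table(con, nrows = 10000),
                    error = function(e) NULL)  # read.table errors at end of input
  if (is.null(chunk) || nrow(chunk) == 0) break
  # ... process `chunk` here, e.g. accumulate partial summaries ...
}
close(con)

# To check how much memory an already-loaded object occupies:
# print(object.size(x), units = "MB")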

Removing Outliers to Plot Data

I am currently working a lot with R. One simple thing that helps me to better visualize data is to plot it excluding outliers. To do so, first read the data:

data = read.table("myfile.txt")

Then, you can check how the data is distributed:

quantile(data, c(.02, .05, .10, .50, .90, .95, .98))

An example output would be:

 2%   5%  10%  50%  90%  95%  98%
189  190  190  194  241  275  316

Now, to plot your data discarding the 1% lowest values and the 1% highest values, you could use:

x <- quantile(data, c(.01, .99))

And then:

plot(data, xlim=c(x[[1]], x[[2]]))
...
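Alternatively, instead of only zooming the axis, you can drop the extreme values before plotting. A minimal sketch, assuming "myfile.txt" holds a single numeric column:

# A minimal sketch: trim the bottom and top 1% before plotting.
values  <- read.table("myfile.txt")[[1]]          # first column as a numeric vector
cuts    <- quantile(values, c(.01, .99))
trimmed <- values[values >= cuts[[1]] & values <= cuts[[2]]]
plot(trimmed)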

SVM in Practice

Many Machine Learning articles and papers describe the wonders of the Support Vector Machine (SVM) algorithm. Nevertheless, when using it on real data to obtain a high-accuracy classification, I stumbled upon several issues. I will try to describe the steps I took to make the algorithm work in practice. This model was implemented using R and the library "e1071". To install and use it type:

> install.packages("e1071")
> library("e1071")

When you want to classify data into two categories, few algorithms are better than SVM. It usually divides the data into two different sets by finding a "line" that best separates the points. It is capable of classifying data linearly (putting a straight line between the sets) or doing a nonlinear classification (separating the sets with a curve). This "separator" is called a hyperplane.

Picture 1 - Linear hyperplane separator

Normalize Features

Before you even start running the algor...
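As a quick illustration of fitting an SVM with e1071 (shown here on R's built-in iris data rather than the dataset from the post):

library(e1071)

# A minimal sketch on the built-in iris data; note that svm() scales
# features by default (scale = TRUE), which covers the normalization step.
model <- svm(Species ~ ., data = iris, kernel = "radial")
pred  <- predict(model, iris)
table(predicted = pred, actual = iris$Species)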

Genetic Algorithm for Knapsack using Hadoop

Development of a Genetic Algorithm using the Apache Hadoop framework to solve optimization problems

Introduction

This project, which I developed during a course in my Master's, constructs a Genetic Algorithm to solve optimization problems, focusing on the Knapsack Problem. It builds on the distributed framework Apache Hadoop. The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-Complete optimization problems. Like knapsack, many such problems present a simple structure and converge to optimal solutions given a proper amount of computation.

Genetic Algorithm

The algorithm was developed based on a Genetic paradigm. It starts with an initial random population (random instances of the problem). Then, the best individuals are selected among the population (the instances that generate the best profits for the knapsack). A crossover phase was then implemented to generate new instances as combinations of the selected indi...
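To make the genetic operators concrete, here is a minimal single-machine sketch in R (the original project runs these steps as distributed MapReduce jobs; the weights, profits and capacity below are made-up illustration values):

# A minimal sketch of the selection/crossover/mutation loop described above.
set.seed(1)
n        <- 10                                    # number of items
weights  <- sample(1:10, n, replace = TRUE)       # made-up item weights
profits  <- sample(1:20, n, replace = TRUE)       # made-up item profits
capacity <- 25                                    # made-up knapsack capacity

fitness <- function(ind) {                        # total profit, 0 if over capacity
  if (sum(weights[ind == 1]) > capacity) 0 else sum(profits[ind == 1])
}

pop <- matrix(sample(0:1, 20 * n, replace = TRUE), nrow = 20)  # random population
for (gen in 1:50) {
  scores <- apply(pop, 1, fitness)
  best   <- pop[order(scores, decreasing = TRUE)[1:10], ]      # selection
  kids   <- t(apply(best, 1, function(parent) {                # crossover
    mate  <- best[sample(nrow(best), 1), ]
    cut   <- sample(1:(n - 1), 1)
    child <- c(parent[1:cut], mate[(cut + 1):n])
    if (runif(1) < 0.1) {                                      # mutation
      i <- sample(n, 1)
      child[i] <- 1 - child[i]
    }
    child
  }))
  pop <- rbind(best, kids)
}
max(apply(pop, 1, fitness))                       # profit of the best solution found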

BigData Free Course Online

Coursera offers several great online courses from the best universities around the world. The courses involve video lectures released weekly, work assignments for the student, and reading material suggestions. I had enrolled in this course about BigData a couple of months ago, and I confess I didn't find the time to start it until last week. Once I started the course I was pleased with the content presented. They talk about important Data Mining algorithms for dealing with large amounts of data, such as PageRank. MapReduce and Distributed File Systems are also two very well explained topics in this course. So, for those who want to know more about computing related to BigData, this course is certainly recommended!

https://www.coursera.org/course/bigdata

PS: The course has been offered since March, and its enrollment period must soon be over. But keep watching the course page, because they open new courses often.

Frequent Itemset problem for MapReduce

I have received many emails asking for tips for starting Hadoop projects with Data Mining. In this post I describe how the Apriori algorithm solves the frequent itemset problem, and how it can be applied to a MapReduce framework.

The Problem

The frequent itemset problem consists of mining a set of items to find a subset of items that have a strong connection between them. A simple example to illustrate the concept: given a set of baskets in a supermarket, a frequent itemset would be hamburgers and ketchup. These items appear frequently in the baskets, and very often together. In general, a set of items that appears in many baskets is said to be frequent. In the computer world, we could use this algorithm to recommend items for purchase to a user. If A and B form a frequent itemset, once a user buys A, B would certainly be a good recommendation. In this problem, the number of "baskets" is assumed to be very large, greater than what could fit in memory. The ...
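For a quick single-machine feel of Apriori before moving to MapReduce, the R package arules (an assumption here, not part of the post) implements it and ships with a small transactions dataset:

library(arules)                       # assumed available; implements Apriori
data("Groceries")                     # transactions bundled with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))  # min support / confidence
inspect(head(sort(rules, by = "lift"), 3))                   # the strongest rules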

Collaborative Filtering

"If an user A has liked the movies "Matrix " and "The Lord of the Rings" and many other users that have liked these two movies also liked "Memento", then it is likely that "Memento" will be recommended to user A." Collaborative Filtering is a type of recommender system widely implemented, and it is known for giving more accurated predictions than other approaches. The basic idea of the algorithms in the collaborative filtering area is to provide recommendations based on what people with similar taste have liked in the past. These people, the neighbors, are selected by comparing the user's past preferences (usually presented as ratings on items). So, by measuring the ratings similarity its possible to recommend items liked by the neighborhood. There are two major techniques to compare ratings. User-Based Let us consider a user as an N-dimensional vector of ratings, where each cell represents the rating...

Recommender Systems

"Suggest new items that fit the user’s preference."   Introduction The increasing amount of information in the web has promoted the advance of the recommender systems research area.  These systems help users by offering useful suggestions to them . The aim of Recommender Systems is to provide personalized recommendations, representing a fundamental role on e-commerce (widely used by companies such as Amazon , Netflix and Google ). They highlight items that the users have not yet seen and may appreciate. Such items include books, restaurants, webpages or even lifestyles. A suggestion is usually made based on the user's historical preferences. These preferences may be collected implicitly or explicitly . When a user is buying an item, or entering a web-page, for example, he is giving an implicit preference feedback. In the case of a user giving a rating to an article, he is providing an explicit feedback. A substantial challenge in this ar...