Posts

Showing posts from 2015

Working with Big Datasets in R

When dealing with a significant amount of data in R there are some points to consider. How do I know if my data is too big? Well, the term "Big Data" can be thought of as data that is too big to fit in the available memory. As R works with the entire dataset in memory (unless you specify it not to do so), the first thing is to check how large the dataset in question is, and whether it fits in memory. Remember that you should actually have at least double the size of your dataset in memory. So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory. If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately. You can use the command split to do this in Linux:

split -l 10000 file.txt new_file

This should create several new files (new_fileaa, new_fileab, etc.) with ten thousand lines each. Well, once you know your data will fit into memory, you can read it with th
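
As a quick sanity check before loading, you can compare the size of the file on disk with the memory the loaded object actually takes. A minimal sketch in R, assuming a hypothetical file.txt in the working directory:

# file.txt is a hypothetical example; check its size on disk in GB first
file.info("file.txt")$size / 1024^3

# If it looks small enough, read it and check how much memory the object uses
data <- read.table("file.txt", header = FALSE)
print(object.size(data), units = "GB")

# If it does not fit, read it in pieces, e.g. 10,000 rows at a time
chunk <- read.table("file.txt", nrows = 10000, skip = 0, header = FALSE)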

Removing Outliers to Plot Data

I am currently working a lot with R. One simple thing that helps me to better visualize data is to plot it excluding outliers. To do so, first read the data:

data = read.table("myfile.txt")

Then, you can check how the data is distributed:

quantile(data, c(.02, .05, .10, .50, .90, .95, .98))

An example output would be:

 2%   5%  10%  50%  90%  95%  98%
189  190  190  194  241  275  316

Now, to plot your data discarding the 1% lowest values and the 1% highest values, you could use:

x <- quantile(data, c(.01, .99))

And then:

plot(data, xlim=c(x[[1]], x[[2]]))
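
Putting it together, here is a minimal end-to-end sketch, assuming the hypothetical myfile.txt holds a single numeric column; note that quantile() expects a numeric vector, so the first column of the data frame returned by read.table is extracted:

# Read the file and take the first (numeric) column, since quantile() needs a vector
values <- read.table("myfile.txt")[, 1]

# Inspect the distribution
quantile(values, c(.02, .05, .10, .50, .90, .95, .98))

# Find the 1% and 99% quantiles and plot only the values between them
x <- quantile(values, c(.01, .99))
plot(values[values >= x[[1]] & values <= x[[2]]])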

SVM in Practice

Many Machine Learning articles and papers describe the wonders of the Support Vector Machine (SVM) algorithm. Nevertheless, when using it on real data trying to obtain a high-accuracy classification, I stumbled upon several issues. I will try to describe the steps I took to make the algorithm work in practice. This model was implemented using R and the library "e1071". To install and use it, type:

> install.packages("e1071")
> library("e1071")

When you want to classify data into two categories, few algorithms are better than SVM. It usually divides data into two different sets by finding a "line" that best separates the points. It is capable of classifying data linearly (putting a straight line to differentiate the sets) or doing a nonlinear classification (separating the sets with a curve). This "separator" is called a hyperplane.

Picture 1 - Linear hyperplane separator

Normalize Features
Before you even start running the algor
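
As an illustration of the basic workflow (a sketch of the idea, not the exact code from the full post), here is a minimal two-category SVM with e1071, using two species from the built-in iris dataset and letting svm() scale the features:

library(e1071)

# Keep only two classes to get a two-category problem (setosa vs. versicolor)
data(iris)
two_class <- droplevels(iris[iris$Species %in% c("setosa", "versicolor"), ])

# Train an SVM with a linear kernel; scale = TRUE normalizes the features
model <- svm(Species ~ ., data = two_class, kernel = "linear", scale = TRUE)

# Predict on the training data and check the accuracy
pred <- predict(model, two_class)
mean(pred == two_class$Species)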