Saturday, September 26, 2015

Working with Big Datasets in R

When dealing with a significant amount of data in R the are some points to consider.

How do I know if my data is too big?

Well, the term "BigData" can be thought of as a data that is too big to fit in the available memory.

As R works with the entire dataset in memory (unless you specify it not to do so), the first thing is to check how large is the dataset in question, and if it does fit in memory.

Remember that you actually should have at least double memory of the size of your dataset.
So for example if you dataset has a size of 2 GB, you should have at least 4 GB of memory.

If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately.

You can use the command split to do this in Linux:

split -l 10000 file.txt new_file

This should create several new files (new_filea, new_fileb, etc..) with ten thousand lines each.

Well, once you know your date will fit into memory, you can read it with the commands read.table or read.csv. The difference between them is that read.csv sets the parameter sep (from separator) as ",".

If your data does fit in memory, but even so, it occupies almost the entire available space, there are some parameter you can tune to make R faster.

We know that not all parameters are mandatory when calling the read.table command. When we leave some parameters blank, R is going to try to discover automatically what are those. Setting them previously will spare R some calculation, which for large datasets, can be a considerable time.
Some of these parameters are:

  • comment.char - define the comment character in your text. If there are none, you can set it to the empty string ""

  • colclasses - define the class of each column on your data.frame. If they are all numeric, for example, just put "numeric"

If colclasses is not specified, all columns are read as characters and then converted to the appropriated   class.

For more information:

1 comment:

  1. Hello,
    The Article on Working with Big Datasets in R is nice.It give detail information about it thanks for Sharing the information about Big Data data science consulting