Friday, November 10, 2017

BigData White Papers

I don't know about you, but I always like to read the white papers that gave rise to OpenSource projects (when available, of course :) ).

I have been working with BigData quite a lot lately and this area is mostly dominated by Apache OpenSource projects.

So, naturally (given the nerd that I am) I tried to investigate their history. I put together a list of the papers and companies behind most BigData Apache projects.

Here it is! Hope you guys find it interesting too. :)

Apache Hadoop 

Based on: Google MapReduce and GFS 

Apache Spark 

Created by: University of California, Berkeley 

Apache Hive 

Created by: Facebook

Apache Impala 

Based on: Google F1

Apache HBase

Based on: Google BigTable

Apache Drill 

Based on: Google Dremel

Apache Pig 

Created by: Yahoo!

Apache Oozie 

Created by: Yahoo!

Apache Sqoop 

Started as a module for Apache Hadoop, in an issue opened by Aaron Kimball.

Apache Flume


Friday, August 11, 2017

Deep Learning, TensorFlow and Tensor Core

I was lucky enough to win a ticket to Google I/O 2017 through Google Code Jam for Women (for the girls who don't know, Google runs a programming contest for women, and the top-ranked contestants win tickets to the conference).

One of the main topics of the conference was for sure its new Deep Learning library TensorFlow. TensorFlow is Google's OpenSource Machine Learning library that runs both on CPU and GPU.

Two very cool things were presented at Google I/O:

  •  TPU (Tensor Processing Unit) - a custom chip (an ASIC, not a GPU) that Google designed to accelerate machine learning workloads such as TensorFlow, and that can be used on Google Compute Engine
  •  TensorFlow Lite - a lightweight version of TensorFlow that runs on Android and makes developers' lives easier

Last week, at a BigData meetup in Chicago, I discovered that Nvidia also created dedicated hardware for Deep Learning: the Tensor Core, a specialized processing unit built into its Volta GPUs.

With all this infrastructure and these APIs being made available, Deep Learning can be done considerably more easily and quickly. At Google I/O, Sundar Pichai mentioned that at Google they have been using Machine Learning for almost everything, and even Deep Learning to train the Deep Learning networks!

TensorFlow's API is so high level that even someone with little technical background can develop something interesting with it. Sundar also shared the story of a high school student who used the library to help detect some types of cancer.

It seems that Data Science is becoming attainable.

Wednesday, August 2, 2017

Dummy errors when using neuralnet package in R

Ok, so you read a bunch of stuff on how to do Neural Networks, how many layers or nodes you should add, etc. But when you start to implement the actual Neural Network you face a ton of dummy errors that stop your beautiful, inspired programming.

This post talks about some errors you might face when using the neuralnet package in R.

First, remember, to use the package you should install it:

install.packages("neuralnet")

Then run:

library(neuralnet)

to load the package.

Error 1

One error that might happen when training your neural network is this:

nn <- neuralnet(formula1,data=new_data, hidden=c(5,3))

Error in terms.formula(formula) : invalid model formula in ExtractVars

This happens when the variable names in the formula "formula1" are not valid R names. For example, if you named your columns (or variables) with numbers you would get this error, because R reads them as numeric constants rather than as column names. So change your column names and re-run the model!


label ~ 1 + 2 + 3 + 4 + 5

Change to:

label ~ v1 + v2 + v3 + v4 + v5
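If you have many columns, renaming them by hand gets tedious. Here is a minimal sketch of how the renaming and the formula could be built programmatically in base R. The data frame and its values are made up for illustration; only the idea of prefixing numeric names with "v" comes from the fix above.

```r
# Hypothetical data frame whose covariate columns were named with numbers.
new_data <- data.frame(label = c(0, 1, 0, 1),
                       `1` = c(0.1, 0.9, 0.2, 0.8),
                       `2` = c(0.5, 0.4, 0.6, 0.3),
                       check.names = FALSE)

# Prefix every non-label column with "v" so the names are valid R names.
covariates <- setdiff(names(new_data), "label")
new_names <- paste0("v", covariates)
names(new_data)[names(new_data) %in% covariates] <- new_names

# Build the formula from the cleaned names instead of typing it by hand.
formula1 <- as.formula(paste("label ~", paste(new_names, collapse = " + ")))
print(formula1)
```

The resulting formula1 and new_data can then be passed to neuralnet as in the call above.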

Error 2

Another error you might get is the following:

nn <- neuralnet(f, data=train[,-1], hidden=c(3,3))

Warning message:  algorithm did not converge in 1 of 1 repetition(s) within the stepmax

To solve this, you can increase the value of the "stepmax" parameter:

nn <- neuralnet(f, data=train[,-1], hidden=c(3,3), stepmax=1e6)

If that doesn't work, you might have to change other parameters to make it converge. Try reducing the number of hidden nodes or layers, or changing the size of your training data.

Error 3

The third error I want to discuss happens when actually computing the output of the neural network:

net.compute <- compute(net, matrix.train2[,1:10])
Error in neurons[[i]] %*% weights[[i]] : non-conformable arguments

This error occurs when the number of columns in the data frame you pass to compute is different from the number of input columns used to train the neural network. The data passed to compute should contain exactly the input columns used in neuralnet (without the label column), with the same names!
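A quick way to catch this before calling compute is to compare the column names explicitly. The sketch below uses only base R; the data frames, the column names and the assumption that the response is called "label" are all hypothetical.

```r
# Hypothetical training and prediction data frames; "label" is the response.
train <- data.frame(label = c(0, 1, 1), v1 = c(0.1, 0.7, 0.9), v2 = c(0.3, 0.2, 0.8))
test  <- data.frame(v2 = c(0.4, 0.6), v1 = c(0.5, 0.1), extra = c(1, 2))

# Inputs the network was trained on (everything except the response).
covariates <- setdiff(names(train), "label")

# Stop early with a readable message if any training input is missing.
missing_cols <- setdiff(covariates, names(test))
if (length(missing_cols) > 0) {
  stop("Missing columns in prediction data: ", paste(missing_cols, collapse = ", "))
}

# Keep only the training inputs, in the training order.
test_aligned <- test[, covariates, drop = FALSE]
print(names(test_aligned))
```

The aligned data frame is what you would then hand to compute in place of the raw one.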

That is it! If you have faced any other dummy error with the neuralnet package, send it to me and I can add it to the post! Good luck! :D

Tuesday, November 8, 2016

Running k-Means Clustering on Spark with Cloudera in your Machine

Here are some steps to start using Spark. You can download VirtualBox and a Cloudera Hadoop distribution and start testing it locally on your machine.


Download the k-means example that uses MLlib, provided with Spark.

Create a kmeans_data.txt file that looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
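Before setting up the whole cluster, it can help to know what answer to expect. As a sanity check that is independent of Spark, the same six points can be clustered with base R's kmeans function (this is a stand-in for illustration, not the MLlib job itself). Choosing two clusters and the two starting points is my assumption, based on the two visible groups in the data.

```r
# The six points from kmeans_data.txt, one row per point.
pts <- matrix(c(0.0, 0.0, 0.0,
                0.1, 0.1, 0.1,
                0.2, 0.2, 0.2,
                9.0, 9.0, 9.0,
                9.1, 9.1, 9.1,
                9.2, 9.2, 9.2), ncol = 3, byrow = TRUE)

# Start from one point of each visible group so the run is deterministic.
fit <- kmeans(pts, centers = pts[c(1, 4), ])

# Each row is a cluster center; each center is the mean of its group.
print(fit$centers)
```

The Spark job should find centers close to these, one near the small values and one near the large ones.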

Download VirtualBox.

Download the Cloudera CDH5 trial version.

Open VirtualBox, import the downloaded Cloudera virtual machine and run it.

Inside VirtualBox:

1 - (needs internet access) Install python numpy library. In a terminal, type:

$ sudo yum install numpy

2 - Copy kmeans_data.txt and the example script to /home/cloudera/ (or wherever you want)

3 - Launch Cloudera Enterprise Trial by clicking on an icon on Cloudera's Desktop or run this command:

$ sudo cloudera-manager --force --enterprise

4 - Open Cloudera Manager Webinterface on your browser. Here are the credentials for that:

user: cloudera
password: cloudera

5 - Start HDFS on ClouderaManager Webinterface (on your browser)

6 - Start Spark on ClouderaManager Webinterface (on your browser)

7 - Put the kmeans_data.txt into HDFS. Run:

$ hadoop fs -put kmeans_data.txt

8 - Run the Spark job locally with 2 threads:

$ spark-submit --master local[2]

9 - Get the result from HDFS, and put it in your current directory:

$ hadoop fs -get KMeansModel/*

10 - The result will be stored in Parquet format. Read the result with parquet-tools:

$ parquet-tools cat KMeansModel/data/part-r-000..

Here is an example output of what this command should give:

Small note: While running these steps, errors might appear at some point in the process due to initialization timing issues. I know that is annoying advice, but if that happens just try running the command again in a couple of minutes. Also, you have to change the location of the kmeans_data.txt file inside the example script to point it to your data, and maybe also change where the output will be written (target/org/apache/spark/PythonKMeansExample/KMeansModel).

Thursday, August 11, 2016

Use SAS for Free

I recently needed to develop some basic SAS® programs for personal use. I decided to share my experience with you, because I think many people don't know that this free option of SAS is available to the public, and I found it quite capable and easy to use.

SAS is a statistical software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics. [1]

I have the feeling that it is kind of a mixture of R, SQL and Excel all in one. You can do fairly advanced analysis and data mining with it. It is quite easy to use, and even non-programmers can do nice data analysis with it. They provide several code snippets, built-in functions and online documentation. But for me its biggest advantage is the data visualization.

Data Visualization

You can download the SAS Studio University Edition for free from the SAS website.

To use it you have to download VMware (or Oracle VirtualBox), which they point you to on their website.
You do have to create a profile to download the software. The virtual machine is already configured for development. Once you "turn on" the SAS Virtual Machine you just have to go back to your real browser and enter the address


You should be able to see the screen below.

The interface is quite simple to use, so I won't dig into too many details on that. But here is a quick link to start to program in SAS Studio:

The downside is that to use it commercially you have to pay for it. Whether to use it in your business or not is a question of pros and cons. I believe it really depends on your business scenario. Nevertheless, I think it is good to know there is a free try-out version out there.

Also, if you are a developer willing to learn this tool, this is a great way of doing it. Just download it and have fun.

Let me know if you guys have any questions! :D

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Monday, May 16, 2016

Error when using smooth.spline

When trying to interpolate a series of data, the cubic spline is a great technique to use.
I chose to use the smooth.spline function, from the R stats package.

> smooth.spline(data$x, data$y)

Nevertheless, while running smooth.spline on a collection of datasets with different sizes I got the following error:

Error in smooth.spline(data$x, data$y),  :
  'tol' must be strictly positive and finite

After digging a little bit I discovered that the problem was that some datasets were really small and smooth.spline was not able to compute anything for them.
Hence, make sure your dataset is big enough before applying smooth.spline to it.

> if(length(data$x) > 30) { smooth.spline(data$x, data$y) }


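A more general solution is to check the interquartile range of x directly. By default smooth.spline sets tol to 1e-6 * IQR(x), so the error appears exactly when IQR(x) is zero: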

> if(IQR(data$x) > 0) { smooth.spline(data$x, data$y) }
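When looping over many datasets, it can be handy to wrap both checks in a small helper. This is only a sketch: the function name safe_smooth_spline is made up, and returning NULL for datasets that cannot be fitted is just one possible choice. The extra check on distinct values is there because smooth.spline also refuses to fit with fewer than four unique x values.

```r
# Hypothetical helper: fit only when smooth.spline's preconditions hold,
# otherwise return NULL so a loop over many datasets can keep going.
safe_smooth_spline <- function(x, y) {
  # The default tolerance is 1e-6 * IQR(x), so a zero IQR triggers the error.
  # smooth.spline also needs at least four distinct x values.
  if (IQR(x) > 0 && length(unique(x)) >= 4) {
    smooth.spline(x, y)
  } else {
    NULL
  }
}

x <- 1:20
fit_ok  <- safe_smooth_spline(x, sin(x / 3))                  # fitted spline
fit_bad <- safe_smooth_spline(c(1, 1, 1, 1, 2), c(3, 4, 5, 6, 7))  # IQR is 0
print(is.null(fit_bad))
```

In the loop you would then skip or log every dataset for which the helper returns NULL.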

Saturday, September 26, 2015

Working with Big Datasets in R

When dealing with a significant amount of data in R there are some points to consider.

How do I know if my data is too big?

Well, the term "BigData" can be thought of as data that is too big to fit in the available memory.

As R works with the entire dataset in memory (unless you tell it not to), the first thing to do is check how large the dataset in question is, and whether it fits in memory.

Remember that you should actually have at least twice as much memory as the size of your dataset.
So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory.
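If you only know the shape of the data, you can get a rough idea of its size in memory with a bit of arithmetic. The sketch below assumes every column is numeric (8 bytes per value in R); the function name and the example dimensions are made up, and real data frames carry some overhead on top of this.

```r
# Rough estimate: each numeric (double) value takes 8 bytes in R.
estimate_gb <- function(n_rows, n_cols) {
  n_rows * n_cols * 8 / 1024^3
}

# Hypothetical dataset: 10 million rows, 25 numeric columns.
data_gb <- estimate_gb(1e7, 25)
needed_gb <- 2 * data_gb  # the "twice the size" rule of thumb

cat(sprintf("data: %.2f GB, suggested memory: %.2f GB\n", data_gb, needed_gb))

# For an object that is already loaded, object.size reports its footprint.
print(object.size(numeric(1e6)), units = "MB")
```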

If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately.

You can use the command split to do this in Linux:

split -l 10000 file.txt new_file

This should create several new files (new_fileaa, new_fileab, etc.) with ten thousand lines each.
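If you prefer to stay inside R, a similar chunk-by-chunk approach can be sketched with a file connection and readLines. The file content, the chunk size and the per-chunk work (a simple sum) are placeholders for illustration.

```r
# Write a small hypothetical file so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:25), path)

chunk_size <- 10   # would be 10000 to mirror the split command
con <- file(path, open = "r")
chunk_sums <- c()
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached
  # Process one chunk at a time; here we just sum the numbers in it.
  chunk_sums <- c(chunk_sums, sum(as.numeric(lines)))
}
close(con)
print(chunk_sums)
```

Because the connection stays open between calls, each readLines picks up where the previous one stopped, so only one chunk is held in memory at a time.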

Well, once you know your data will fit into memory, you can read it with the commands read.table or read.csv. The main difference between them is in the defaults: read.csv sets the parameter sep (the separator) to "," and header to TRUE.

If your data does fit in memory but still occupies almost the entire available space, there are some parameters you can tune to make R faster.

Not all parameters are mandatory when calling the read.table command. When we leave some of them unset, R tries to work them out automatically. Setting them in advance spares R that work, which can save considerable time for large datasets.
Some of these parameters are:

  • comment.char - defines the comment character in your text. If there are no comments, you can set it to the empty string ""

  • colClasses - defines the class of each column in your data.frame. If they are all numeric, for example, just put "numeric"

If colClasses is not specified, all columns are read as characters and then converted to the appropriate class.
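Putting both parameters together, a call could look like the sketch below. The file, its columns and their classes are invented so that the example runs on its own; with real data you would replace them with your own.

```r
# Small hypothetical CSV so the sketch runs on its own.
path <- tempfile(fileext = ".csv")
writeLines(c("id,height,weight", "1,1.70,65.2", "2,1.82,80.1"), path)

# Declaring the column classes and disabling comment scanning up front
# saves R from inferring them while it reads.
df <- read.table(path,
                 header = TRUE,
                 sep = ",",
                 comment.char = "",
                 colClasses = c("integer", "numeric", "numeric"))
str(df)
```

Note that the parameter name is case-sensitive: it has to be written colClasses.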

For more information: