Tuesday, November 8, 2016

Running k-Means Clustering on Spark with Cloudera

While running these steps, errors might appear in some parts of the process due to initialization timing issues. I know that is annoying advice, but if that happens, just try running the command again in a couple of minutes. Also, you have to change the location of the kmeans_data.txt file inside kmeans.py to point to your data, and you may also want to change where the output will be written (target/org/apache/spark/PythonKMeansExample/KMeansModel).
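If you get tired of retrying by hand, the "wait a couple of minutes and run it again" advice can be sketched as a small Python helper. This is only a sketch: the function name run_with_retry, the number of attempts and the two-minute delay are my own assumptions, not something Cloudera or Spark provides.

```python
import subprocess
import time


def run_with_retry(command, attempts=3, delay_seconds=120):
    """Run a command, pausing and retrying if it exits with a non-zero status.

    Returns the 1-based attempt number on which the command succeeded and
    raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give the services time to finish initializing before retrying.
            time.sleep(delay_seconds)
    raise RuntimeError("command failed after %d attempts: %r" % (attempts, command))


# Hypothetical usage, with the command given as a list of arguments:
#   run_with_retry(["hadoop", "fs", "-put", "kmeans_data.txt"])
```

The helper only looks at the exit status, so it cannot tell a timing problem from a real mistake such as a wrong path. Keep the number of attempts small.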


Download the kmeans.py example, which uses MLlib and ships with Spark.

Create a kmeans_data.txt file that looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
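To get a feeling for what k-means should find in this file, here is a minimal plain-Python sketch of the textbook algorithm (Lloyd's iterations) on the same six points. It is not what kmeans.py or MLlib does internally: the function names are mine, and seeding the centers with the first k points is a simplifying assumption made only to keep the example deterministic.

```python
import math


def parse_points(text):
    """Parse whitespace-separated rows of numbers into a list of tuples."""
    return [tuple(float(v) for v in line.split())
            for line in text.splitlines() if line.strip()]


def closest(point, centers):
    """Index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k points (an assumption)."""
    centers = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers


DATA = """\
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
"""

centers = sorted(kmeans(parse_points(DATA), k=2))
print(centers)
```

With k=2 the two centers end up at roughly (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), one for each group of three points.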

Download VirtualBox.

Download Cloudera CDH5 trial version.
Open VirtualBox, import the downloaded Cloudera virtual machine and run it.

Inside VirtualBox:

1 - (needs internet access) Install the Python numpy library. In a terminal, type:

$ sudo yum install numpy

2 - Copy kmeans_data.txt and kmeans.py to /home/cloudera/ (or wherever you want)

3 - Launch Cloudera Enterprise Trial by clicking its icon on the Cloudera desktop, or by running this command:

$ sudo cloudera-manager --force --enterprise

4 - Open the Cloudera Manager web interface in your browser. Here are the credentials for that:

user: cloudera
password: cloudera

5 - Start HDFS in the Cloudera Manager web interface (in your browser)

6 - Start Spark in the Cloudera Manager web interface (in your browser)

7 - Put the kmeans_data.txt into HDFS. Run:

$ hadoop fs -put kmeans_data.txt

8 - Run the Spark job kmeans.py locally with 2 threads:

$ spark-submit --master local[2] kmeans.py

9 - Get the result from HDFS and put it in your current directory:

$ hadoop fs -get KMeansModel/*

10 - The result is stored in Parquet format. Read it with parquet-tools:

$ parquet-tools cat KMeansModel/data/part-r-000..

Here is an example output of what this command should give:

Thursday, August 11, 2016

Use SAS for Free

I recently needed to develop some basic SAS® software for personal use. I decided to share my experience with you, because I think many people don't know that this free version of SAS is available to the public, and I found it quite capable and easy to use.

SAS is a statistical software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics. [1]

I have the feeling that it is kind of a mixture of R, SQL and Excel all in one. You can do fairly advanced analysis and data mining with it. It is quite easy to use, and even non-programmers can do nice data analysis with it. They provide several code snippets, built-in functions and online documentation. But for me its biggest advantage is the data visualization.

Data Visualization

You can download the SAS Studio University Edition for free from http://www.sas.com/en_us/software/university-edition.html .

To use it you have to download VMware (or Oracle VirtualBox), as indicated on their website.
You do have to create a profile to download the software. The virtual machine is already configured for development. Once you "turn on" the SAS virtual machine, you just have to go back to your regular browser and enter the address


You should be able to see the screen below.

The interface is quite simple to use, so I won't go into too many details on that. But here is a quick link to get started programming in SAS Studio: http://support.sas.com/training/tutorial/studio/get-started.html.

The downside is that you have to pay to use it commercially. Whether to use it in your business is a question of pros and cons, and I believe it really depends on your business scenario. Nevertheless, I think it is good to know there is a free trial version out there.

Also, if you are a developer who wants to learn this tool, this is a great way of doing it. Just download it and have fun.

Let me know if you guys have any questions! :D

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Monday, May 16, 2016

Error when using smooth.spline

When you need to interpolate a series of data points, the cubic spline is a great technique to use.
I chose the smooth.spline function, from the R stats package.

> smooth.spline(data$x, data$y)

Nevertheless, while running smooth.spline on a collection of datasets with different sizes I got the following error:

Error in smooth.spline(data$x, data$y),  :
  'tol' must be strictly positive and finite

After digging a little, I discovered that the problem was that some datasets were really small, so smooth.spline could not compute anything. The default tolerance is tol = 1e-6 * IQR(x), which is zero when the x values have no spread, as with a single observation.
Hence, make sure your dataset is big enough before applying smooth.spline to it.

> if(length(data$x) > 30) { smooth.spline(data$x, data$y) }
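The size check above is a rule of thumb. To see why tiny or degenerate datasets trigger the error, here is a small Python sketch of the arithmetic behind it, based on the default documented for smooth.spline, tol = 1e-6 * IQR(x). The function names are mine, and the quantile rule is my reading of R's default (type 7, linear interpolation), so treat it as an illustration rather than a copy of the R source.

```python
import math


def quantile(sorted_values, p):
    """Linear-interpolation quantile (assumed to match R's default type 7)."""
    position = (len(sorted_values) - 1) * p
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def default_tol(x):
    """Mirror the documented default: tol = 1e-6 * IQR(x)."""
    values = sorted(x)
    return 1e-6 * (quantile(values, 0.75) - quantile(values, 0.25))


def tol_is_usable(x):
    """True when the default tolerance would be strictly positive and finite."""
    if not x:
        return False
    tol = default_tol(x)
    return math.isfinite(tol) and tol > 0


print(tol_is_usable([5.0]))                      # a single observation
print(tol_is_usable([1.0, 1.0, 1.0, 1.0, 9.0]))  # mostly identical x values
print(tol_is_usable([1.0, 2.0, 3.0, 4.0, 5.0]))  # well-spread x values
```

The first two calls print False and the last prints True. So a length test alone may not catch every bad dataset: one with many repeated x values can still have a zero interquartile range.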