Tuesday, December 27, 2016

Getting Started with Regression in R

Regressions are widely used to estimate relationships between variables or to predict future values of a dataset.

If you want to know how much a variable "x" influences a variable "y", you might want to run a regression on your data. If you have a bunch of data points in time and you want to know what your data is going to look like in the future, you also might want to run a regression.

I will try to describe the steps that helped me successfully build linear and non-linear regressions in R, using polynomials and splines. I am not going to go into too much detail on each method. I just want to give an overall step-by-step on how to do a general regression in R, so that you guys can go further on your own.

First Steps: get to know your data

The first thing you should do is see what your data looks like. Plot the data, maybe try to get some statistics out of it, and try to understand what type of relation there is between variables.

There might be a linear (line) or non-linear (curvy) relation between your data points. The data in question might depend on only one variable or on several variables.

Suppose you have the following dummy dataset:


Here we have one variable y that varies according to another variable x.
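Since the original table isn't reproduced here, a hypothetical stand-in dataset with the same kind of quadratic relation can be generated like this (the coefficients and noise level are made up):

```r
# Hypothetical stand-in for the dummy dataset: y follows a
# second degree polynomial of x, plus some random noise
set.seed(42)
x <- seq(1, 30)
y <- 2 * x^2 - 5 * x + 3 + rnorm(length(x), sd = 20)
head(data.frame(x, y))
```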

Plotting this data gives us the following graph:

> plot(x, y, col="blue", main="Example Graph")
> grid(nx = 12, ny = 12, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE)

Visually, we can say that this data seems to follow a non-linear pattern. Even further, the relation between y and x seems to be a second degree polynomial. 


The natural first thing to do is to check if a second degree polynomial would match our data. To create a model, we can use the lm() function, passing as a parameter the formula you think might suit your data. Here we are actually "guessing" which model best fits the data.

my_model <- lm(y ~ poly(x, 2))

In the example above we are creating a model where y is a second degree polynomial of x.

Of course, you might want to test other alternatives:

my_model_linear <- lm(y ~ poly(x, 1))

Or (why not?)

my_model_degree_20 <- lm(y ~ poly(x, 20))

For more complex datasets, splines are a nice method to use:

library(splines)
my_model_spline <- lm(y ~ bs(x))

Here, bs() is the B-spline basis function from the splines package (hence the library call above). Use the parameters knots and df to make the fitted curve smoother or curvier.

We were lucky in our example with the second degree polynomial, but the idea here is to mess around a little with these functions and parameters, trying to find the best model possible.
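One way to compare the candidates more objectively is to look at the adjusted R-squared of each model, which rewards fit but penalizes extra polynomial terms. A sketch on made-up quadratic data:

```r
# Sketch: comparing candidate models by adjusted R-squared,
# on made-up data that follows a quadratic pattern
set.seed(1)
x <- 1:30
y <- 2 * x^2 - 5 * x + 3 + rnorm(30, sd = 20)

m1 <- lm(y ~ poly(x, 1))   # linear
m2 <- lm(y ~ poly(x, 2))   # quadratic
m5 <- lm(y ~ poly(x, 5))   # likely overfitting

# Adjusted R-squared rewards fit but penalizes extra terms
sapply(list(linear = m1, quadratic = m2, degree_5 = m5),
       function(m) summary(m)$adj.r.squared)
```

For nested models, anova(m1, m2) is another standard way to test whether the extra term is worth it.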

Check Results

After you have created some models, to visually check how they fit your data, you can plot your x values against the fitted values of the model. Here we use the lines() function to do that:

lines(x, predict(my_model))

To further check how well your model fits your data, you can plot the model itself with plot(my_model).


This is going to give you a bunch of information, like the residuals against the fitted values.

You can also use t.test() to see if the two groups (real versus modeled values) are similar. This test compares their means, assuming both are normally distributed.

t.test(y, predict(my_model))


Predict New Data

Now, suppose you were able to find a good function to model your data. With that, we are able to predict future values for our small dataset.

One important thing about the predict() function in R is that it expects a data frame with the same column names and types as the one you used to build your model.

For example:

my_prediction <- predict(my_model, data.frame(column_name = c(value_to_be_predicted)))
If you had used dates in numeric form, for example, you would have:

my_date <- "2016-05-10"
date_df <- data.frame(x = as.numeric(as.Date(my_date)))
my_pred <- predict(cubic_model, date_df)

In our example we used generic numbers in a column named "x".

my_pred <- predict(my_model, data.frame(x = c(31.1)))



Note: if you only want to interpolate your data, creating a "line" between your data points, check the smooth.spline function. It will interpolate your data, and you don't have to keep guessing the relation between variables.
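A minimal smooth.spline() sketch, on made-up noisy data:

```r
# Minimal smooth.spline() sketch on made-up noisy data
set.seed(7)
x <- seq(0, 10, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.1)

fit <- smooth.spline(x, y)

# Interpolated values at new x positions
new_points <- predict(fit, x = c(2.5, 7.5))
new_points$y
```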

Note 2: if your data follows a complex function, if you are not able to model your dataset correctly, or if you are just willing to try new stuff, Neural Networks can be a very powerful way of learning your data.

Tuesday, November 8, 2016

Running k-Means Clustering on Spark with Cloudera

While running these steps, errors might appear in some parts of the process due to initialization timing issues. I know that is annoying advice, but if that happens, just try running the command again in a couple of minutes. Also, you have to change the location of the kmeans_data.txt file inside kmeans.py to point it to your data, and maybe also change where the output will be written (target/org/apache/spark/PythonKMeansExample/KMeansModel).


Download the kmeans.py example that uses MLlib, provided by Spark.

Create a kmeans_data.txt file that looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

Download VirtualBox.

Download Cloudera CDH5 trial version.
Open VirtualBox, import the downloaded Cloudera virtual machine and run it.

Inside VirtualBox:

1 - (needs internet access) Install python numpy library. In a terminal, type:

$ sudo yum install numpy

2 - Copy kmeans_data.txt and kmeans.py to /home/cloudera/ (or wherever you want)

3 - Launch Cloudera Enterprise Trial by clicking on an icon on Cloudera's Desktop or run this command:

$ sudo cloudera-manager --force --enterprise

4 - Open Cloudera Manager Webinterface on your browser. Here are the credentials for that:

user: cloudera
password: cloudera

5 - Start HDFS on ClouderaManager Webinterface (on your browser)

6 - Start Spark on ClouderaManager Webinterface (on your browser)

7 - Put the kmeans_data.txt into HDFS. Run:

$ hadoop fs -put kmeans_data.txt

8 - Run the Spark job kmeans.py locally with 2 threads:

$ spark-submit --master local[2] kmeans.py

9 - Get the result from HDFS, and put it in your current directory:

$ hadoop fs -get KMeansModel/*

10 - The result will be stored in Parquet format. Read the result with parquet-tools:

$ parquet-tools cat KMeansModel/data/part-r-000..

Here is an example output of what this command should give:

Thursday, August 11, 2016

Use SAS for Free

I recently had the need to develop some basic SAS® software for personal use. I decided to share my experience with you, because I think many people don't know this free option of SAS is available to the public, and I found it quite resourceful and easy to use.

SAS is a statistical software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics. [1]

I have the feeling that it is kind of a mixture of R, SQL and Excel all in one. You can do fairly advanced analysis and data mining with it. It is quite easy to use, and even non-programmers can do nice data analysis with it. They provide several code snippets, built-in functions and online documentation. But for me its biggest advantage is the data visualization.

Data Visualization

You can download the SAS Studio University Edition for free on http://www.sas.com/en_us/software/university-edition.html .

To use it you have to download VMware (or Oracle VirtualBox), which they indicate on their website.
You do have to create a profile to download the software. The virtual machine is already configured to start development. Once you "turn on" the SAS virtual machine, you just have to come back to your real browser and enter the address


You should be able to see the screen below.

The interface is quite simple to use, so I won't dig into too many details on that. But here is a quick link to start to program in SAS Studio: http://support.sas.com/training/tutorial/studio/get-started.html.

The downside is that to use it commercially you have to pay for it. Whether or not to use it in your business is a question of pros and cons. I believe it really depends on your business scenario. Nevertheless, I think it is good to know there is a free try-out version out there.

Also, if you are a developer willing to learn this tool, this is a great way of doing it. Just download it and have fun.

Let me know if you guys have any questions! :D

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Monday, May 16, 2016

Error when using smooth.spline

When trying to interpolate a series of data points, the cubic spline is a great technique.
I chose to use the smooth.spline function, from the R stats package.

> smooth.spline(data$x, data$y)

Nevertheless, while running smooth.spline on a collection of datasets with different sizes I got the following error:

Error in smooth.spline(data$x, data$y) :
  'tol' must be strictly positive and finite

After digging a little bit, I discovered that the problem was that some datasets were really small, and smooth.spline couldn't compute anything with them (it needs at least four distinct x values).
Hence, make sure your dataset is big enough before applying smooth.spline to it.

> if(length(data$x) > 30) { smooth.spline(data$x, data$y) }
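Alternatively, instead of guessing a safe size, you can wrap the call in tryCatch() so that a too-small dataset fails gracefully instead of stopping your script. A sketch:

```r
# Sketch: wrap smooth.spline in tryCatch so small datasets
# fail gracefully instead of stopping your script
safe_spline <- function(x, y) {
  tryCatch(smooth.spline(x, y),
           error = function(e) {
             message("smooth.spline failed: ", conditionMessage(e))
             NULL
           })
}

set.seed(5)
fit <- safe_spline(1:50, rnorm(50))  # enough points: returns a fit
bad <- safe_spline(1:3, rnorm(3))    # too few points: returns NULL
```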

Saturday, September 26, 2015

Working with Big Datasets in R

When dealing with a significant amount of data in R, there are some points to consider.

How do I know if my data is too big?

Well, the term "Big Data" can be thought of as data that is too big to fit in the available memory.

As R works with the entire dataset in memory (unless you specify otherwise), the first thing to do is to check how large the dataset in question is, and whether it fits in memory.

Remember that you actually should have at least double the memory of the size of your dataset.
So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory.

If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately.

You can use the command split to do this in Linux:

split -l 10000 file.txt new_file

This should create several new files (new_fileaa, new_fileab, etc.) with ten thousand lines each.

Well, once you know your data will fit into memory, you can read it with the commands read.table or read.csv. The difference between them is that read.csv sets the parameter sep (for separator) to ",".

If your data does fit in memory, but even so it occupies almost the entire available space, there are some parameters you can tune to make R faster.

We know that not all parameters are mandatory when calling the read.table command. When we leave some parameters blank, R is going to try to discover automatically what they are. Setting them beforehand will spare R some computation, which, for large datasets, can mean considerable time.
Some of these parameters are:

  • comment.char - define the comment character in your text. If there is none, you can set it to the empty string ""

  • colClasses - define the class of each column of your data.frame. If they are all numeric, for example, just put "numeric"

If colClasses is not specified, all columns are read as characters and then converted to the appropriate class.
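Putting these parameters together, a read.table() call could look like the sketch below. A small temporary file stands in for a large dataset, and the column layout (id, value) is made up:

```r
# Sketch: presetting comment.char and colClasses for read.table.
# A small temporary file stands in for a large dataset; the column
# layout (id, value) is made up.
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:5, value = runif(5)),
            tmp, row.names = FALSE)

big_data <- read.table(tmp,
                       header = TRUE,
                       comment.char = "",    # no comment characters to scan for
                       colClasses = c("integer", "numeric"),
                       nrows = 5)            # a known row count also helps
str(big_data)
```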


Friday, July 31, 2015

Removing Outliers to Plot Data

I am currently working a lot with R. One simple thing that helps me to better visualize data is to plot it excluding outliers.

To do so, first read the data

data <- read.table("myfile.txt")$V1

Here $V1 extracts the first column as a numeric vector, since quantile() expects a vector rather than a data frame. Then, you can check how the data is distributed

quantile(data, c(.02, .05, .10, .50, .90, .95, .98))                

An example output would be

  2%   5%  10%  50%  90%  95%  98% 
 189  190  190  194  241  275  316 

Now, to plot your data discarding the 1% lowest and 1% highest values, you could use

x <- quantile(data, c(.01, .99))                                   

And then

plot(data, ylim=c(x[[1]], x[[2]]))

(Since plot() on a single vector puts the values on the y axis, ylim is the limit to adjust here.)
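An equivalent approach is to filter the outliers out of the vector before plotting. A sketch on made-up data:

```r
# Sketch: filter out the extreme values before plotting;
# the data vector here is made up
set.seed(3)
data <- c(rnorm(500, mean = 200, sd = 10), 400, 450)  # two obvious outliers

x <- quantile(data, c(.01, .99))
trimmed <- data[data >= x[[1]] & data <= x[[2]]]

plot(trimmed)
```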

Wednesday, February 11, 2015

SVM in Practice

Many Machine Learning articles and papers describe the wonders of the Support Vector Machine (SVM) algorithm. Nevertheless, when using it on real data, trying to obtain a high-accuracy classification, I stumbled upon several issues.
I will try to describe the steps I took to make the algorithm work in practice.

This model was implemented using R and the library "e1071".
To install and use it type:

> install.packages("e1071")
> library("e1071")

When you want to classify data into two categories, few algorithms are better than SVM.
It usually divides the data into two different sets by finding the "line" that best separates the points. It is capable of classifying data linearly (putting a straight line to differentiate the sets) or doing a nonlinear classification (separating the sets with a curve). This "separator" is called a hyperplane.

Picture 1 - Linear hyperplane separator

Normalize Features

Before you even start running the algorithm, the first thing needed is to normalize your data features. SVM uses features to classify data, and these should be obtained by analyzing the dataset and seeing what best represents it (like what is done with SIFT and SURF for images). Remember: the better these features describe your data, the better your classification is going to be. You might want to use/combine the mean value, the derivative, the standard deviation or several other ones. When features are not normalized, the ones with greater absolute values have a greater effect on the hyperplane margin. This means that some features are going to influence your algorithm more than others. If that is not what you want, make sure all data features have the same value range.
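A simple way to normalize features in R is the scale() function, which centers each column and divides it by its standard deviation. A sketch (the feature values are made up; note f2's much larger range than f1's):

```r
# Sketch: normalizing feature columns with scale(), which centers
# each column and divides by its standard deviation
my_data <- data.frame(class = factor(c(0, 0, 1, 1)),
                      f1 = c(0.10, 0.11, 0.78, 0.82),
                      f2 = c(500, 490, 730, 790))

my_data[, -1] <- scale(my_data[, -1])
colMeans(my_data[, -1])   # each feature is now centered around 0
```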

Tune Parameters

Another important point is to check the SVM algorithm parameters. Like many Machine Learning algorithms, SVM has some parameters that have to be tuned for better performance. This is very important: SVM is very sensitive to the choice of parameters. Even close parameter values might lead to very different classification results. Really! In order to find the best ones for your problem, you might want to test several different values. A great tool to help with this job in R is the tune.svm() method. It can test several different values and return the ones that minimize the classification error under 10-fold cross validation.

Example of tune.svm() output:

> parameters <- tune.svm(class~., data = train_set, gamma = 10^(-5:-1), cost = 10^(-3:1))
> summary(parameters)

Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 gamma cost
   0.1    1

- best performance: 0.1409453 

- Detailed performance results:
   gamma cost     error  dispersion
1  1e-05  0.1 0.2549098 0.010693238
2  1e-04  0.1 0.2548908 0.010689828
3  1e-03  0.1 0.2546062 0.010685683
4  1e-02  0.1 0.2397427 0.010388229
5  1e-01  0.1 0.1776163 0.014591070
6  1e-05  1.0 0.2549043 0.010691266
7  1e-03  1.0 0.2524830 0.010660262
8  1e-02  1.0 0.2262167 0.010391502
9  1e-01  1.0 0.1409453 0.009898745
10 1e-05 10.0 0.2548687 0.010690819
11 1e-04 10.0 0.2545997 0.010686525
12 1e-03 10.0 0.2403118 0.010394169
13 1e-02 10.0 0.1932509 0.009984875
14 1e-01 10.0 0.1529182 0.013780632

The γ (gamma) parameter has to be tuned to better fit the hyperplane to the data. It is responsible for the linearity degree of the hyperplane, and because of that it is not present when using linear kernels. The smaller γ is, the more the hyperplane is going to look like a straight line. If γ is too large, the hyperplane will be more curvy and might delineate the data too well, leading to overfitting.

Picture 2 - Large value of γ

Another parameter to be tuned to help improve accuracy is C. It is responsible for the size of the "soft margin" of SVM. The soft margin is a "gray" area around the hyperplane; points inside it are allowed to violate the separation without being heavily penalized. The smaller the value of C, the greater the soft margin.

Picture 3 - Large value of C
Picture 4 - Small value of C

How to Prepare Data to SVM 

The svm() method in R expects a matrix or data frame with one column identifying the class of each row and several features that describe that data. The following table shows an example with two classes, 0 and 1, and some features. Each row is a data entry.

  class    f1    f2    f3
1     0 0.100 0.500 0.900
2     0 0.101 0.490 0.901
3     0 0.110 0.540 0.890
4     0 0.100 0.501 0.809
5     1 0.780 0.730 0.090
6     1 0.820 0.790 0.100
7     1 0.870 0.750 0.099
8     1 0.890 0.720 0.089

The input for the svm() method could be:

> svm(class ~., data = my_data,  kernel = "radial", gamma = 0.1, cost = 1)

Here "class" is the name of the column that describes the classes of your data, and "my_data" is obviously your dataset. The parameters should be the ones best suited to your problem.

Test Your Results

Always separate a part of your data to test. It is a common practice to get 2/3 of data as training set (to find your model) and 1/3 as test set. Your final error should be reported based on the test set, otherwise it can be biased.

You can divide your data in R like the following example:

> data_index <- 1:nrow(my_data)
> testindex <- sample(data_index, trunc(length(data_index)*30/100))
> test_set <- my_data[testindex,]
> train_set <- my_data[-testindex,]

So when you would actually run the svm() method you would do it:

> my_model <- svm(class ~., data = train_set,  kernel = "radial", gamma = 0.1, cost = 1)

And then to test the results on the test_set:

> my_prediction <- predict(my_model, test_set[,-1])

test_set[,-1] removes the first column (the class column), so the predictions are based only on the features of the data. You should always remove the column that labels your data.
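To turn the predictions into an error number, you can build a confusion matrix and compute the accuracy on the test set. A self-contained sketch on made-up two-class data:

```r
# Self-contained sketch: train/test split, SVM fit, and accuracy
# from a confusion matrix. The two-class dataset is made up.
library(e1071)

set.seed(10)
my_data <- data.frame(class = factor(rep(c(0, 1), each = 50)),
                      f1 = c(rnorm(50, 0.1, 0.1), rnorm(50, 0.8, 0.1)),
                      f2 = c(rnorm(50, 0.5, 0.1), rnorm(50, 0.7, 0.1)))

testindex <- sample(1:nrow(my_data), trunc(nrow(my_data) * 30 / 100))
test_set  <- my_data[testindex, ]
train_set <- my_data[-testindex, ]

my_model <- svm(class ~ ., data = train_set,
                kernel = "radial", gamma = 0.1, cost = 1)
my_prediction <- predict(my_model, test_set[, -1])

# Rows are true classes, columns are predicted classes
confusion <- table(truth = test_set$class, prediction = my_prediction)
accuracy  <- sum(diag(confusion)) / sum(confusion)
```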

Final Considerations

  1. The tune.svm() method might take a while to run, depending on your data size. Nevertheless, it is usually worth it.
  2. We usually use logarithmically spaced values for the SVM parameters, varying from 10^-6 to 10^6.  Here is some explanation: http://stats.stackexchange.com/questions/81537/gridsearch-for-svm-parameter-estimation
  3. If your label classes are numeric (as in our example, 0 and 1), your prediction results will probably be real numbers, because svm() defaults to regression for numeric labels. If you want to receive the original integer class values, set the parameter "type" to "C-classification" when calling the svm() method. Be aware that this is an SVM parameter, and changing it changes your classifier.
  4. If you try to run tune.svm() with a dataset of less than 10 rows, you will get this error:
     Error in tune("svm", train.x = x, data = data, ranges = ranges, ...) : 
     ‘cross’ must not exceed sampling size!

         So make sure your dataset has more rows than that before tuning.

More about SVM