# Girl in the World of Computer Science

## Monday, May 16, 2016

### Error when using smooth.spline

When interpolating a series of data points, the cubic spline is a great technique to use.

I chose to use the smooth.spline function from the R stats package.

> smooth.spline(data$x, data$y)

Nevertheless, while running smooth.spline on a collection of datasets of different sizes, I got the following error:

Error in smooth.spline(data$x, data$y) :
  'tol' must be strictly positive and finite

After digging a little, I discovered that the problem was that some datasets were really small, so smooth.spline wasn't able to compute anything for them (its default tolerance is derived from the interquartile range of x, which can degenerate to zero when there are too few distinct values).

Hence, make sure your dataset is big enough before applying smooth.spline to it.

> if(length(data$x) > 30) { smooth.spline(data$x, data$y) }
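Alternatively, if a hard size threshold feels arbitrary, a sketch that simply attempts the fit and falls back gracefully when it fails:

> fit <- tryCatch(smooth.spline(data$x, data$y), error = function(e) NULL)

> if (!is.null(fit)) lines(fit)  # use the spline only when the fit succeeded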


Labels:
Data Mining,
Machine Learning,
R

## Saturday, September 26, 2015

### Working with Big Datasets in R

When dealing with a significant amount of data in R, there are some points to consider.


**How do I know if my data is too big?** The term "big data" can be thought of as data that is too big to fit in the available memory.

As R works with the entire dataset in memory (unless you tell it not to), the first thing to do is **check how large the dataset in question is, and whether it fits in memory**. Remember that you should actually have at least double the memory of the size of your dataset.

So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory.

If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately.

You can use the command split to do this in Linux:

`split -l 10000 file.txt new_file`

This should create several new files (new_filea, new_fileb, etc.) with ten thousand lines each.
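If you then want to process those chunks in R, a minimal sketch (assuming the file names produced by the split command above):

for (f in list.files(pattern = "^new_file")) {
  chunk <- read.table(f)  # each chunk fits in memory on its own
  # ... process the chunk here, e.g., accumulate partial summaries ...
}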

Once you know your data will fit into memory, you can read it with the commands *read.table* or *read.csv*. The difference between them is that *read.csv* sets the separator parameter sep to ",".

If your data does fit in memory, but even so occupies almost the entire available space, **there are some parameters you can tune to make R faster**. Not all parameters are mandatory when calling read.table; when we leave some of them blank, R tries to figure them out automatically. Setting them in advance spares R that work, which for large datasets can save considerable time.

Some of these parameters are:

- comment.char - defines the comment character in your text. If there is none, you can set it to the empty string ""

- colClasses - defines the class of each column of your data.frame. If they are all numeric, for example, just put "numeric"

If *colClasses* is not specified, all columns are read as characters and then converted to the appropriate class.
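As a hedged example (the file name and row count below are made up), a tuned call might look like:

> data <- read.table("my_big_file.txt", comment.char = "", colClasses = "numeric", nrows = 2000000)

Here comment.char = "" disables comment scanning, colClasses = "numeric" skips the type-detection pass, and nrows (even a mild over-estimate) helps R allocate memory up front.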

For more information:

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

## Friday, July 31, 2015

### Removing Outliers to Plot Data

I am currently working a lot with R. One simple thing that helps me to better visualize data is to plot it excluding outliers.

To do so, first read the data

data <- read.table("myfile.txt")[[1]]  # take the first column as a numeric vector

The [[1]] extracts the column as a plain vector, so quantile() and plot() can work on it directly.

Then, you can check how the data is distributed

quantile(data, c(.02, .05, .10, .50, .90, .95, .98))

An example output would be

2% 5% 10% 50% 90% 95% 98%

189 190 190 194 241 275 316

Now, to plot your data discarding the 1% lowest and 1% highest values, you could use

x <- quantile(data, c(.01, .99))

And then

plot(data, xlim=c(x[[1]], x[[2]]))


## Wednesday, February 11, 2015

### SVM in Practice

Many Machine Learning articles and papers describe the wonders of the Support Vector Machine (SVM) algorithm. Nevertheless, when using it on real data trying to obtain a high accuracy classification, I stumbled upon several issues.


I will try to describe the steps I took to make the algorithm work in practice.

This model was implemented using R and the library "e1071".

To install and use it type:

> install.packages("e1071")

> library("e1071")

When you want to **classify data in two categories**, few algorithms are better than SVM. It usually divides data into two different sets by finding a "line" that best separates the points. It is capable of classifying data linearly (putting a straight line to differentiate the sets) or doing a nonlinear classification (separating the sets with a curve). This "separator" is called a hyperplane.

*Picture 1 - Linear hyperplane separator*

**Normalize Features**

Before you even start running the algorithm, the first thing needed is to **normalize your data features**. SVM uses features to classify data, and these should be obtained by analyzing the dataset and seeing what best represents it (like what is done with SIFT and SURF for images). Remember: **the better these features describe your data, the better your classification is going to be**. You might want to use/combine the mean value, the derivative, the standard deviation, or several other ones. When features are not normalized, **the ones with greater absolute values have a greater effect on the hyperplane margin**. This means that some features are going to influence your algorithm more than others. If that is not what you want, make sure all data features have the same value range, as in the sketch below.
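For instance, base R's scale() centers each column and divides it by its standard deviation ("features" below is a hypothetical data frame of your own features):

> features <- data.frame(f1 = c(0.1, 0.2, 0.9), f2 = c(100, 250, 900))

> normalized <- as.data.frame(scale(features))  # each column now has mean 0 and sd 1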

**Tune Parameters**

Another important point is **to check the SVM algorithm parameters**. Like many Machine Learning algorithms, SVM has parameters that have to be tuned to gain better performance. This is very important: **SVM is very sensitive to the choice of parameters**. Even close parameter values might lead to very different classification results. Really! In order to find the best ones for your problem, you might want to test some different values. A great tool to help with this job in R is the **tune.svm()** method. It can test several different values and return the ones that minimize the classification error under 10-fold cross validation.

Example of tune.svm() output:

> parameters <- tune.svm(class~., data = train_set, gamma = 10^(-5:-1), cost = 10^(-3:1))

> summary(parameters)

Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation

- best parameters:

gamma cost

0.1 1

- best performance: 0.1409453

- Detailed performance results:

gamma cost error dispersion

1 1e-05 0.1 0.2549098 0.010693238

2 1e-04 0.1 0.2548908 0.010689828

3 1e-03 0.1 0.2546062 0.010685683

4 1e-02 0.1 0.2397427 0.010388229

5 1e-01 0.1 0.1776163 0.014591070

6 1e-05 1.0 0.2549043 0.010691266

7 1e-03 1.0 0.2524830 0.010660262

8 1e-02 1.0 0.2262167 0.010391502

9 1e-01 1.0 0.1409453 0.009898745

10 1e-05 10.0 0.2548687 0.010690819

11 1e-04 10.0 0.2545997 0.010686525

12 1e-03 10.0 0.2403118 0.010394169

13 1e-02 10.0 0.1932509 0.009984875

14 1e-01 10.0 0.1529182 0.013780632

The γ (gamma) parameter has to be tuned to better fit the hyperplane to the data. It is responsible for the degree of linearity of the hyperplane, and for that reason it is not present when using linear kernels.

**The smaller γ is, the more the hyperplane is going to look like a straight line.** **If γ is too large, the hyperplane will be more curvy** and might delineate the data too well, leading to overfitting.

*Picture 2 - large value of γ*

Another parameter to be tuned to help improve accuracy is C. It is responsible for the size of the "soft margin" of SVM. The soft margin is a "gray" area around the hyperplane: points inside this soft margin are not classified as either of the two categories.

**The smaller the value of C, the greater the soft margin**.

*Picture 3 - Large values of C*

*Picture 4 - Small values of C*

**How to Prepare Data for SVM**

The **svm() method in R expects a matrix or data frame with one column identifying the class of each row and several features that describe that data**. The following table shows an example with two classes, 0 and 1, and three features. Each row is a data entry.

| | class | f1 | f2 | f3 |
|---|---|---|---|---|
| 1 | 0 | 0.100 | 0.500 | 0.900 |
| 2 | 0 | 0.101 | 0.490 | 0.901 |
| 3 | 0 | 0.110 | 0.540 | 0.890 |
| 4 | 0 | 0.100 | 0.501 | 0.809 |
| 5 | 1 | 0.780 | 0.730 | 0.090 |
| 6 | 1 | 0.820 | 0.790 | 0.100 |
| 7 | 1 | 0.870 | 0.750 | 0.099 |
| 8 | 1 | 0.890 | 0.720 | 0.089 |

The input for the svm() method could be:

> svm(class ~., data = my_data, kernel = "radial", gamma = 0.1, cost = 1)

Here "class" is the name of the column that describes the classes of your data and "my_data" is obviously your dataset. The parameters should be the ones best suitable for your problem.

**Test Your Results**

Always separate a part of your data for testing. It is common practice to use 2/3 of the data as the training set (to find your model) and 1/3 as the test set. Your final error should be reported based on the test set; otherwise it can be biased.

You can divide your data in R like the following example:

> data_index <- 1:nrow(my_data)

> testindex <- sample(data_index, trunc(length(data_index)*30/100))

> test_set <- my_data[testindex,]

> train_set <- my_data[-testindex,]

So when you actually run the svm() method, you would do it like this:

> my_model <- svm(class ~., data = train_set, kernel = "radial", gamma = 0.1, cost = 1)

And then to test the results on the test_set:

> my_prediction <- predict(my_model, test_set[,-1])

test_set[,-1] removes the first column (the class column), so that the predictions are based only on the features of the data. You should always remove the column that labels your data.
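To quantify how good the model is, a short sketch (assuming the class column is the first one and the model was built with type = "C-classification", so predictions come back as class labels):

> confusion <- table(predicted = my_prediction, actual = test_set[,1])

> sum(diag(confusion)) / sum(confusion)  # overall accuracy on the test set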

**Final Considerations**

- The tune.svm() method might take a while to run depending on your data size. Nevertheless, it is usually worth it.
- We usually use logarithmically spaced values for the SVM parameters, varying from 10^-6 to 10^6. Here is some explanation: http://stats.stackexchange.com/questions/81537/gridsearch-for-svm-parameter-estimation
- If your label classes are numeric (as in our example, 0 and 1), your prediction results will probably be real numbers indicating how close each test input is to one class or the other. If you want to receive the original integer class values, set the parameter "type" to "C-classification" when calling the svm() method. Be aware that this is an SVM parameter, and changing it will change your classifier.
- If you try to run tune.svm() with a dataset of less than 10 rows, you will get this error:

‘cross’ must not exceed sampling size!

So make sure your dataset has at least as many rows as cross-validation folds (10 by default).

**More about SVM**

https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/SVM

http://rischanlab.github.io/SVM.html

https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf

http://pyml.sourceforge.net/doc/howto.pdf

http://neerajkumar.org/writings/svm/

Labels:
Algorithms,
Data Mining,
Machine Learning,
R

## Tuesday, August 5, 2014

### Lecture on Recommender Systems

Great lecture on Recommender Systems by Xavier Amatriain, researcher at Netflix.

https://www.youtube.com/watch?v=bLhq63ygoU8

Labels:
Data Mining,
Recommender Systems

## Wednesday, July 16, 2014

### Genetic Algorithm for Knapsack using Hadoop

*Development of Genetic Algorithm using Apache Hadoop framework to solve optimization problems*

**Introduction**

This project, which I developed during a course in my Master's, constructs a Genetic Algorithm to solve optimization problems, focusing on the Knapsack Problem. It uses the distributed framework Apache Hadoop as its base.

The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-Complete optimization problems. Like knapsack, many problems present a simple structure and converge to optimal solutions given a proper amount of computation.

**Genetic Algorithm**

The algorithm was developed based on a Genetic paradigm. It starts with an *initial random population* (random instances of the problem). Then, the best *individuals are selected* from the population (the instances that generate the best profits for the knapsack). A *crossover* phase then generates new instances as combinations of the selected individuals; in this phase, parts of each selected individual are randomly selected to form a new instance.

The next step of the process is the *mutation* phase. Here, "genes" are also selected randomly and changed. In the knapsack context, this means that random items are substituted in the solutions found.

This cycle can be repeated several times, until good individuals are generated.

The steps are reproduced below:

**Initial Population → Selection → Crossover → Mutation**
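To make these four steps concrete, here is a minimal single-machine sketch in R. It is only an illustration of the genetic steps, not the Hadoop implementation; the values and weights are the P01 instance as commonly published (capacity 165, optimum 309):

v <- c(92, 57, 49, 68, 60, 43, 67, 84, 87, 72)  # P01 profits
w <- c(23, 31, 29, 44, 53, 38, 63, 85, 89, 82)  # P01 weights
W <- 165                                        # knapsack capacity
n <- length(v)
fitness <- function(x) if (sum(w * x) <= W) sum(v * x) else 0  # infeasible -> 0

set.seed(1)
pop <- matrix(sample(0:1, 50 * n, replace = TRUE), nrow = 50)  # initial random population
for (gen in 1:200) {
  fit <- apply(pop, 1, fitness)
  parents <- pop[order(fit, decreasing = TRUE)[1:20], ]        # selection: keep the best
  children <- t(replicate(30, {
    pair <- parents[sample(20, 2), ]                           # pick two parents
    cut <- sample(n - 1, 1)                                    # crossover: one random cut point
    child <- c(pair[1, 1:cut], pair[2, (cut + 1):n])
    flip <- sample(n, 1)                                       # mutation: flip one random gene
    child[flip] <- 1 - child[flip]
    child
  }))
  pop <- rbind(parents, children)
}
best <- pop[which.max(apply(pop, 1, fitness)), ]
sum(v * best)         # best profit found; should converge toward 309
which(best == 1) - 1  # items selected, 0-indexed as in the post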

**The Knapsack Problem**

The goal is to maximize the objective function, where vi represents the value of each element i, and variable xi represents whether element i is going to be taken or not. The maximization has to take into account the weight of each element, represented by the first constraint (see the formulation below).
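For reference, the formulation can be written as (with $w_i$ the weight of element $i$ and $W$ the knapsack capacity):

$$\max \sum_{i=1}^{n} v_i x_i \quad \text{subject to} \quad \sum_{i=1}^{n} w_i x_i \le W, \qquad x_i \in \{0, 1\}$$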

This problem was chosen because its solution only requires a sufficient amount of processing, and this processing can be done in parallel. The space of possible solutions can be seen as a set of local maxima that are going to be tested by the genetic algorithm. Hopefully, with enough computation, the global maximum can be achieved.

**Validation Results**

To demonstrate that the algorithm finds a correct result with high probability, it was tested against two validation datasets: one composed of 10 items and another of 15 items.

With the validation input P01: a set of 10 weights and profits for a knapsack of capacity 165.

The optimal result value is 309, obtained by selecting items 0, 1, 2, 3 and 5.

The algorithm finds the correct answer. Best 5 answers:

| Value | Items Selected |
|---|---|
| 270 | 0 1 2 9 |
| 276 | 0 2 3 6 |
| 277 | 0 1 3 4 |
| 284 | 0 1 3 6 |
| 309 | 0 1 2 3 5 |

The input P02 has optimal profit value 1458, and the algorithm also finds the correct answer.

These datasets were found on:

**Setting up the Hadoop Cluster on a Single Machine**

For testing purposes, here is how to get started with Hadoop and set it up on a single machine.

Download Hadoop 1.2.1 from:

http://hadoop.apache.org/releases.html

Install Java 6:

$ sudo apt-get install oracle-java6-installer

Inside the Hadoop home directory, update the configuration file conf/hadoop-env.sh:

$ vim conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-oracle/
export HADOOP_HOME=/home/renata/hadoop-1.2.1
export HADOOP_VERSION=1.2.1

Create an input directory, to place your input file:

$ mkdir input

Put some content in:

$ cp conf/*.xml input

To test if your Hadoop installation is working properly:

$ bin/hadoop jar hadoop-examples*.jar grep input output 'dfs[a-z.]+'

Don't forget to remove the output directory after running a job. Hadoop doesn't do this for you:

$ rm -rf output

Now, to test the Genetic jar file:

$ export HADOOP_HOME=/home/renata/hadoop-1.2.1
$ export HADOOP_VERSION=1.2.1

To compile the code, run:

$ javac -classpath ${HADOOP_HOME}/hadoop-core*.jar -d genetic_classes Genetic.java

And create the jar:

$ jar -cvf genetic.jar -C genetic_classes/ .

Now, to run the code once you put the input data on "input" directory:

$ bin/hadoop jar genetic.jar org.myorg.Genetic input output


*PS: A large dataset consisting of 10,000 items, with optimal profit value 7145 and a weight capacity of 431, can be found at: https://github.com/jaredtking/knapsack-lisp/blob/master/datasets/large01.*

Labels:
Algorithms,
Apache Hadoop,
Data Mining,
Map Reduce

## Friday, January 31, 2014

### Dealing with NP-Hard Problems: An Introduction to Approximation Algorithms

*This is just a quick overview on approximation algorithms. It is a broad topic to discuss. For more information go to References.*

The famous **NP-Complete class is known for its possible intractability.** NP means *non-deterministic polynomial*, and for a problem to be NP-Complete it has to be:

- NP (verifiable in polynomial time) and

- NP-Hard (as hard as any other problem in the NP class).

Among the several important problems that are NP-Complete or NP-Hard (in their optimization form) we can name the Knapsack, the Traveling Salesman, and the Set Cover problems.

Even though no efficient optimal solution may exist for NP-Complete problems, we still need to address them because of the number of practical problems that fall in this class.

Considering that even for medium volumes of data an exponential *brute-force* search is impractical, the option is to give up as little of the optimal solution as possible while pursuing an efficient algorithm.

**Approximation algorithms are a way to deal with NP-Hard problems**. They are **polynomial-time algorithms that return guaranteed near-optimal solutions** for any instance of the problem. The objective is to make the output as close to the optimal value as possible. We say that an *α-approximation* is an algorithm whose output is within a factor α of the optimal value.

### Measures of Approximation

Among the measures of approximation it is possible to highlight:

- Absolute approximation
- Approximation factor

An algorithm A is an **absolute approximation** if it stays within an absolute distance k of the optimum OPT:

|A(I) − OPT(I)| ≤ k ∀ instance I.

An algorithm has an **approximation factor** of α if:

A(I) ≤ α · OPT(I) ∀I, for minimization

A(I) ≥ α · OPT(I) ∀I, for maximization

with α > 1 for minimization and α < 1 for maximization problems.

### Approximation Techniques

**Greedy approaches** are highly intuitive and usually easy-to-implement approximation algorithms. They basically consist of making the "best" local choice at each step. For example, for the Set Cover problem, a greedy algorithm would be:

"At each round choose the set that minimizes the ratio of its weight to the number of currently uncovered elements it contains."

You can check a proof of the approximation guarantee in [1].
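To make the rule concrete, a minimal R sketch (the universe, sets, and weights below are made-up toy input):

universe <- 1:6
sets <- list(s1 = c(1, 2, 3), s2 = c(2, 4), s3 = c(3, 4, 5), s4 = c(5, 6))
weights <- c(s1 = 3, s2 = 1, s3 = 2, s4 = 1)

uncovered <- universe
chosen <- character(0)
while (length(uncovered) > 0) {
  # weight-to-newly-covered-elements ratio for each set
  ratio <- sapply(names(sets), function(s) {
    gain <- length(intersect(sets[[s]], uncovered))
    if (gain == 0) Inf else weights[[s]] / gain
  })
  pick <- names(which.min(ratio))               # the greedy choice of this round
  chosen <- c(chosen, pick)
  uncovered <- setdiff(uncovered, sets[[pick]])
}
chosen  # the sets forming the greedy cover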

Another way to create an approximation is by making a **linear relaxation** of an integer program. Remember that many problems can be written in a more mathematical form, where we have an *objective function* to maximize (or minimize), *constraints*, and *variables*.

The Knapsack is a good example:

$$\max \sum_{i=1}^{n} v_i x_i \quad \text{subject to} \quad \sum_{i=1}^{n} w_i x_i \le W, \qquad x_i \in \{0, 1\}$$

The goal here is to maximize the objective function, where $v_i$ represents the value of each element $i$, and variable $x_i$ represents whether element $i$ is going to be taken or not. The maximization has to take into account the weight $w_i$ of each element, represented by the first constraint.

Approximation through linear relaxation can be done in a very simple way. While integer programming is NP-Hard, linear programming is not. If we relax (substitute) the integrality constraint $x_i \in \{0, 1\}$ with $0 \le x_i \le 1$, here is what it would look like:

$$\max \sum_{i=1}^{n} v_i x_i \quad \text{subject to} \quad \sum_{i=1}^{n} w_i x_i \le W, \qquad 0 \le x_i \le 1$$
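As a quick illustration, the relaxed program can be handed to an off-the-shelf LP solver; a sketch using the lpSolve package (the values, weights, and capacity are made-up toy input):

library(lpSolve)

v <- c(92, 57, 49, 68, 60)  # toy values
w <- c(23, 31, 29, 44, 53)  # toy weights
W <- 100
n <- length(v)
mat <- rbind(w, diag(n))    # weight constraint plus x_i <= 1 for each i
relax <- lp("max", v, mat, rep("<=", n + 1), c(W, rep(1, n)))
relax$objval                # upper bound on the integral optimum
relax$solution              # fractional solution, each x_i in [0, 1]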

The linear relaxation serves as a *lower bound* of the optimal solution in minimization problems and as an *upper bound* in maximization problems.

There is also a very powerful method called the **Primal-Dual schema**.

It is well known that, in some cases, optimal solutions satisfy the complementary slackness conditions.

The Primal-Dual schema works with a relaxed version of the complementary slackness conditions. It starts with an integral feasible solution for the primal and a feasible solution for the dual, for example. Iteratively, it satisfies the slackness conditions, so that no problem constraint is violated in the process. When all conditions are satisfied, the program ends.

It can be a little complex, so I am going to write a post about it in the future. For now, to understand it in detail, please read [2].

### Classes of Approximation

Algorithms like the ones mentioned above can be classified by their approximation "degree", that is, how closely they can approximate the optimal solution. The classifications are listed below:

- PO - problems for which there are exact polynomial algorithms. This is an extension of the P class to optimization problems.

- PTAS (Polynomial Time Approximation Scheme) - there is an approximation scheme which is a (1 + *e*)OPT approximation for minimization and a (1 - *e*)OPT approximation for maximization, where *e* is a relative error (a rational number).

- FPTAS (Fully Polynomial Time Approximation Scheme) - like PTAS, FPTAS is a polynomial scheme, but the running time of an FPTAS is also polynomial in the inverse of the error (1/*e*).

- APX - there is at least one polynomial α-approximation (for some constant α).

- NPO - an extension of the NP class to optimization problems. An algorithm *A* is in the NP Optimization class if its instances are polynomial in size, solutions for the problem can be verified in polynomial time, and the value of the objective function can be computed in polynomial time.

PO ⊆ FPTAS ⊆ PTAS ⊆ APX ⊆ NPO

If an algorithm is in the FPTAS class, this usually means that it admits better approximations than one that is only in the APX class.

The knapsack problem, for example, has been proved to be in FPTAS [3]. If by any chance someone proved it to also be in PO, then P = NP.

### References

*[1] D. P. Williamson and D. B. Shmoys. The Design of Approximation Algorithms. 2010. Cambridge University Press*

*[2] V. Vazirani. Approximation Algorithms. 2001. Springer-Verlag.*

*[3] (PORTUGUESE) M.H. Carvalho, M.R. Cerioli, R. Dahab, P. Feofiloff, C.G. Fernandes, C.E. Ferreira, K.S. Guimarães, F.K. Miyazawa, J.C. Pina Jr., J. Soares, Y. Wakabayashi. Uma introdução sucinta a algoritmos de aproximação. 23º Colóquio Brasileiro de Matemática, IMPA, Rio de Janeiro.*

*[4] Giorgio Ausiello, Pierluigi Crescenzi, and Marco Protasi. Approximate Solution of NP Optimization Problems. Theor. Comput. Sci. 150(1):1-55 (1995).*