Posts

Working with Big Datasets in R

When dealing with a significant amount of data in R there are some points to consider. How do I know if my data is too big? Well, the term "Big Data" can be thought of as data that is too big to fit in the available memory. As R works with the entire dataset in memory (unless you specify it not to do so), the first thing is to check how large the dataset in question is, and whether it fits in memory. Remember that you should actually have at least twice as much memory as the size of your dataset. So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory. If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately. You can use the split command to do this in Linux:
split -l 10000 file.txt new_file
This will create several new files (new_fileaa, new_fileab, etc.) with ten thousand lines each. Well, once you know your data will fit into memory, you can read it with th...
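If the data does have to be processed in chunks, here is a minimal sketch of reading a file piece by piece in R (the file name is hypothetical, and it assumes whitespace-delimited columns):

con <- file("file.txt", open = "r")        # open a read connection
chunk_size <- 10000                        # lines per chunk
repeat {
  lines <- readLines(con, n = chunk_size)  # read the next chunk of lines
  if (length(lines) == 0) break            # stop at end of file
  chunk <- read.table(text = lines)        # parse this chunk as a table
  # ... process or aggregate the chunk here ...
}
close(con)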

Removing Outliers to Plot Data

I am currently working a lot with R. One simple thing that helps me to better visualize data is to plot it excluding outliers. To do so, first read the data:
data = read.table("myfile.txt")
Then, you can check how the data is distributed:
quantile(data, c(.02, .05, .10, .50, .90, .95, .98))
An example output would be:
 2%   5%  10%  50%  90%  95%  98%
189  190  190  194  241  275  316
Now, to plot your data discarding the 1% lowest and 1% highest values, you could use:
x <- quantile(data, c(.01, .99))
And then:
plot(data, xlim=c(x[[1]], x[[2]])) ...
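Note that read.table() returns a data frame, so quantile() needs a numeric vector to work on. A minimal self-contained sketch (hypothetical file name, assuming a single numeric column) that drops the outliers before plotting:

data <- read.table("myfile.txt")[[1]]    # extract the single column as a vector
bounds <- quantile(data, c(.01, .99))    # 1st and 99th percentiles
trimmed <- data[data >= bounds[[1]] & data <= bounds[[2]]]  # discard the extremes
plot(trimmed)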

SVM in Practice

Many Machine Learning articles and papers describe the wonders of the Support Vector Machine (SVM) algorithm. Nevertheless, when using it on real data trying to obtain a high-accuracy classification, I stumbled upon several issues. I will try to describe the steps I took to make the algorithm work in practice. This model was implemented using R and the library "e1071". To install and use it type:
> install.packages("e1071")
> library("e1071")
When you want to classify data into two categories, few algorithms are better than SVM. It usually divides the data into two different sets by finding a "line" that best separates the points. It is capable of classifying data linearly (putting a straight line to differentiate the sets) or doing a nonlinear classification (separating the sets with a curve). This "separator" is called a hyperplane.
Picture 1 - Linear hyperplane separator
Normalize Features
Before you even start running the algor...
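As a minimal sketch of a two-category classification with e1071's svm(), here is an example using the built-in iris data restricted to two species (a toy setup chosen for illustration; scale = TRUE, the default, normalizes the features):

library(e1071)
iris2 <- subset(iris, Species != "setosa")             # keep only two categories
iris2$Species <- droplevels(iris2$Species)             # drop the unused level
model <- svm(Species ~ ., data = iris2, scale = TRUE)  # features are normalized
pred <- predict(model, iris2)
table(pred, iris2$Species)                             # confusion matrix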

Lecture on Recommender Systems

Great lecture on Recommender Systems by Xavier Amatriain, Researcher at Netflix.
https://www.youtube.com/watch?v=bLhq63ygoU8
https://www.youtube.com/watch?v=mRToFXlNBpQ

Genetic Algorithm for Knapsack using Hadoop

Development of a Genetic Algorithm using the Apache Hadoop framework to solve optimization problems
Introduction
This project, which I developed during a course in my Master's program, constructs a Genetic Algorithm to solve optimization problems, focusing on the Knapsack Problem. It is built on the distributed framework Apache Hadoop. The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-Complete optimization problems. Like knapsack, many problems present a simple structure and converge to optimal solutions given a proper amount of computation.
Genetic Algorithm
The algorithm was developed based on the Genetic paradigm. It starts with an initial random population (random instances of the problem). Then, the best individuals are selected from the population (the instances that yield the best profits for the knapsack). A crossover phase was then implemented to generate new instances as combinations of the selected indi...
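To make the phases concrete, here is a minimal single-machine sketch in R of the genetic steps described (random population, selection, one-point crossover); the weights, profits, and sizes are made-up toy values, and the actual project runs these phases as Hadoop MapReduce jobs:

set.seed(1)
n <- 10                                      # number of items (toy value)
weights <- sample(1:10, n, replace = TRUE)   # made-up weights
profits <- sample(1:20, n, replace = TRUE)   # made-up profits
capacity <- 25

# fitness: total profit, or 0 if the knapsack capacity is exceeded
fitness <- function(ind) {
  if (sum(weights[ind == 1]) > capacity) 0 else sum(profits[ind == 1])
}

pop_size <- 20
pop <- matrix(sample(0:1, pop_size * n, replace = TRUE), nrow = pop_size)  # random population

for (gen in 1:50) {
  scores <- apply(pop, 1, fitness)
  best <- pop[order(scores, decreasing = TRUE)[1:(pop_size / 2)], ]  # selection
  children <- t(apply(best, 1, function(parent) {
    mate <- best[sample(nrow(best), 1), ]
    cut <- sample(1:(n - 1), 1)                       # one-point crossover
    c(parent[1:cut], mate[(cut + 1):n])
  }))
  pop <- rbind(best, children)
}
max(apply(pop, 1, fitness))                           # best profit found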

Dealing with NP-Hard Problems: An Introduction to Approximation Algorithms

This is just a quick overview of approximation algorithms. It is a broad topic to discuss. For more information go to References. The famous NP-Complete class is known for its possible intractability. NP means nondeterministic polynomial, and for a problem to be NP-Complete it has to be in NP (verifiable in polynomial time) and NP-Hard (at least as hard as any other problem in the NP class). Among the several important problems that are NP-Complete or NP-Hard (in their optimization form) we can name the Knapsack, the Traveling Salesman, and the Set Cover problems. Even though no efficient optimal solution might exist for NP-Complete problems, we still need to address them due to the number of practical problems that fall in the NP-Complete class. Considering that even for medium volumes of data exponential brute force is impractical, the option is to give up as little of the optimal solution as possible in exchange for an efficient algorithm. Approximation a...
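As a flavor of the idea, here is a minimal sketch of the classic greedy heuristic for Set Cover (which achieves a ln n approximation factor), run on a made-up toy instance:

universe <- 1:6
sets <- list(s1 = c(1, 2, 3), s2 = c(2, 4), s3 = c(3, 4, 5), s4 = c(5, 6))
uncovered <- universe
chosen <- character(0)
while (length(uncovered) > 0) {
  # pick the set covering the most still-uncovered elements
  gains <- sapply(sets, function(s) length(intersect(s, uncovered)))
  pick <- names(which.max(gains))
  chosen <- c(chosen, pick)
  uncovered <- setdiff(uncovered, sets[[pick]])
}
chosen   # e.g. "s1" "s3" "s4"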

Overview of Digital Cloning

Introduction
The growing availability of image processing and editing software has made it easy to manipulate digital images. With the amount of digital content being generated nowadays, developing techniques to verify the authenticity and integrity of digital content might be essential to provide trustworthy evidence in a forensics case. In this context, copy-move is a type of forgery in which a part of an image is copied and pasted somewhere else in the same image. This forgery might be particularly challenging to discover due to properties like illumination and noise matching between the source and the tampered regions. An example of copy-move forgery can be seen in Picture 1: first the original image, followed by the tampered one, and then a picture indicating the cloned areas. Several techniques have been proposed to solve this problem. Block-based methods [1] divide an image into blocks of pixels and compare them to find a forgery. Ke...
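To illustrate the block-based idea, here is a minimal sketch in R: slide fixed-size blocks over a grayscale image matrix and flag identical blocks at different positions. The image is a made-up toy matrix, and real methods compare robust block features (e.g. transform coefficients) rather than raw pixels, precisely to survive noise and illumination changes:

set.seed(1)
img <- matrix(sample(0:255, 64 * 64, replace = TRUE), nrow = 64)  # toy image
img[41:48, 41:48] <- img[1:8, 1:8]     # simulate a copy-move forgery
b <- 8                                 # block size
seen <- new.env()                      # hash table of blocks already seen
for (i in 1:(nrow(img) - b + 1)) {
  for (j in 1:(ncol(img) - b + 1)) {
    key <- paste(img[i:(i + b - 1), j:(j + b - 1)], collapse = ",")
    if (!is.null(seen[[key]])) {
      cat("possible clone between", seen[[key]], "and", i, j, "\n")
    } else {
      seen[[key]] <- paste(i, j)
    }
  }
}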