Posts

Removing Outliers to Plot Data

I am currently working a lot with R. One simple thing that helps me visualize data better is to plot it excluding outliers. To do so, first read the data:

data = read.table("myfile.txt")

Then you can check how the data is distributed:

quantile(data, c(.02, .05, .10, .50, .90, .95, .98))

An example output would be:

 2%   5%  10%  50%  90%  95%  98%
189  190  190  194  241  275  316

Now, to plot your data discarding the 1% lowest and 1% highest values, you could use:

x <- quantile(data, c(.01, .99))

And then:

plot(data, xlim=c(x[[1]], x[[2]])) ...
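For reference, here is a minimal end-to-end sketch of the same idea. It assumes "myfile.txt" holds a single numeric column (so the values are pulled out as a vector, since read.table alone returns a data frame), and it takes a slightly different route: filtering the vector directly instead of adjusting the axis limits.

# Read a single-column file of numbers; V1 is read.table's default column name
data <- read.table("myfile.txt")$V1

# Inspect the distribution at a few percentiles
quantile(data, c(.02, .05, .10, .50, .90, .95, .98))

# Cut points at the 1st and 99th percentiles
x <- quantile(data, c(.01, .99))

# Keep only the values between the cut points and plot them
trimmed <- data[data >= x[[1]] & data <= x[[2]]]
plot(trimmed)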

SVM in Practice

Many Machine Learning articles and papers describe the wonders of the Support Vector Machine (SVM) algorithm. Nevertheless, when using it on real data to obtain a high-accuracy classification, I stumbled upon several issues. I will try to describe the steps I took to make the algorithm work in practice. This model was implemented using R and the library "e1071". To install and use it, type:

> install.packages("e1071")
> library("e1071")

When you want to classify data into two categories, few algorithms are better than SVM. It usually divides the data into two sets by finding a "line" that best separates the points. It is capable of classifying data linearly (putting a straight line between the sets) or doing a nonlinear classification (separating the sets with a curve). This "separator" is called a hyperplane.

Picture 1 - Linear hyperplane separator

Normalize Features

Before you even start running the algor...
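As a quick illustration of the two-category case, here is a minimal sketch using e1071 on a two-class subset of the built-in iris data (the subset and feature choice are my own, not from the post). Note that svm() scales the features by default, which lines up with the normalization advice.

library(e1071)

# Two-class subset of iris, using two features for simplicity
two <- subset(iris, Species != "setosa")
two$Species <- factor(two$Species)  # drop the unused level

# Train a linear SVM; scale = TRUE (the default) normalizes each feature
model <- svm(Species ~ Petal.Length + Petal.Width, data = two,
             kernel = "linear", scale = TRUE)

# Predict on the training data and check accuracy
pred <- predict(model, two)
mean(pred == two$Species)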

Lecture on Recommender Systems

Great lecture on Recommender Systems by Xavier Amatriain, Researcher at Netflix:

https://www.youtube.com/watch?v=bLhq63ygoU8
https://www.youtube.com/watch?v=mRToFXlNBpQ

Genetic Algorithm for Knapsack using Hadoop

Development of a Genetic Algorithm using the Apache Hadoop framework to solve optimization problems

Introduction

I developed this project during a course in my Master's; it constructs a Genetic Algorithm to solve optimization problems, focusing on the Knapsack Problem. It is built on the distributed framework Apache Hadoop. The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-Complete optimization problems. Like knapsack, many such problems have a simple structure and converge to optimal solutions given a proper amount of computation.

Genetic Algorithm

The algorithm follows the genetic paradigm, as sketched below. It starts with an initial random population (random instances of the problem). Then the best individuals are selected from the population (the instances that yield the best profits for the knapsack). A crossover phase then generates new instances as combinations of the selected indi...
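To make the genetic loop concrete, here is a minimal single-machine sketch in R of the selection, crossover, and mutation steps for a small knapsack instance. It is illustrative only: the instance values, population size, and rates are my own, and it does not include the Hadoop MapReduce distribution the project describes.

# Knapsack instance (illustrative values)
weights <- c(12, 7, 11, 8, 9)
profits <- c(24, 13, 23, 15, 16)
capacity <- 26
pop_size <- 20
n <- length(weights)

# Fitness: total profit, or 0 if the weight limit is exceeded
fitness <- function(ind) {
  if (sum(weights[ind == 1]) > capacity) return(0)
  sum(profits[ind == 1])
}

# Random initial population: each row is a 0/1 selection of items
pop <- matrix(rbinom(pop_size * n, 1, 0.5), nrow = pop_size)

evolve <- function(pop) {
  scores <- apply(pop, 1, fitness)
  # Selection: keep the better half as parents
  parents <- pop[order(scores, decreasing = TRUE)[1:(pop_size / 2)], ]
  # Crossover: combine two random parents at a random cut point
  children <- t(replicate(pop_size / 2, {
    p <- parents[sample(nrow(parents), 2), ]
    cut <- sample(n - 1, 1)
    c(p[1, 1:cut], p[2, (cut + 1):n])
  }))
  # Mutation: flip each bit with small probability
  flip <- matrix(rbinom(length(children), 1, 0.05), nrow = nrow(children))
  children <- abs(children - flip)
  rbind(parents, children)
}

for (g in 1:50) pop <- evolve(pop)
max(apply(pop, 1, fitness))  # best profit found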

Dealing with NP-Hard Problems: An Introduction to Approximation Algorithms

This is just a quick overview of approximation algorithms. It is a broad topic; for more information, go to the References. The famous NP-Complete class is known for its likely intractability. NP stands for nondeterministic polynomial time, and for a problem to be NP-Complete it has to be in NP (verifiable in polynomial time) and NP-Hard (at least as hard as any other problem in NP). Among the many important problems that are NP-Complete or NP-Hard (in their optimization form) we can name the Knapsack, the Traveling Salesman, and the Set Cover problems. Even though no efficient optimal algorithm may exist for NP-Complete problems, we still need to address them because of the number of practical problems in the class. Considering that even for medium volumes of data exponential brute force is impractical, the option is to give up as little of the optimum as possible while pursuing an efficient algorithm. Approximation a...
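As a concrete taste of the idea (my own illustration, not from the post), here is the classic greedy heuristic for Set Cover in R: at each step it picks the set covering the most still-uncovered elements, a strategy known to stay within a ln(n) factor of the optimal cover.

# Universe and a family of sets (illustrative instance)
universe <- 1:6
sets <- list(s1 = c(1, 2, 3), s2 = c(2, 4), s3 = c(3, 4, 5), s4 = c(5, 6))

greedy_set_cover <- function(universe, sets) {
  uncovered <- universe
  chosen <- character(0)
  while (length(uncovered) > 0) {
    # Pick the set that covers the most uncovered elements
    gains <- sapply(sets, function(s) length(intersect(s, uncovered)))
    best <- names(which.max(gains))
    chosen <- c(chosen, best)
    uncovered <- setdiff(uncovered, sets[[best]])
  }
  chosen
}

greedy_set_cover(universe, sets)  # e.g. "s1" "s3" "s4"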

Overview of Digital Cloning

Introduction

The growing availability of image processing and editing software has made it easy to manipulate digital images. With the amount of digital content being generated nowadays, developing techniques to verify the authenticity and integrity of digital content is essential to provide truthful evidence in a forensics case. In this context, copy-move is a type of forgery in which a part of an image is copied and pasted somewhere else in the same image. This forgery can be particularly challenging to detect because properties like illumination and noise match between the source and the tampered regions. An example of copy-move forgery can be seen in Picture 1: first the original image, followed by the tampered one, and then a picture indicating the cloned areas. Several techniques have been proposed to solve this problem. Block-based methods [1] divide an image into blocks of pixels and compare them to find a forgery. Ke...
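To illustrate the block-based idea, here is a toy sketch of my own that assumes the image is already loaded as a grayscale matrix. It hashes every fixed-size block and reports positions whose pixel contents are exactly identical; real detectors compare robust block features (and tolerate noise) rather than raw pixels.

# Toy block-based duplicate finder on a grayscale matrix
find_duplicate_blocks <- function(img, block = 8) {
  h <- nrow(img) - block + 1
  w <- ncol(img) - block + 1
  seen <- new.env()  # hash map from block contents to first position
  matches <- list()
  for (i in seq_len(h)) {
    for (j in seq_len(w)) {
      key <- paste(img[i:(i + block - 1), j:(j + block - 1)], collapse = ",")
      prev <- seen[[key]]
      if (!is.null(prev)) {
        matches[[length(matches) + 1]] <- c(prev, i, j)  # (i1, j1, i2, j2)
      } else {
        seen[[key]] <- c(i, j)
      }
    }
  }
  matches
}

# Example: random image with an 8x8 region cloned
set.seed(1)
img <- matrix(runif(32 * 32), 32)
img[20:27, 20:27] <- img[1:8, 1:8]
find_duplicate_blocks(img)  # reports c(1, 1, 20, 20)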

Understanding Apache Hive

Introduction

BigData and Hive

Apache Hive is a software application created to facilitate data analysis on Apache Hadoop. It is a Java framework that helps extract knowledge from data placed on an HDFS cluster by providing a SQL-like interface to it. The Apache Hadoop platform is a major project on distributed computing and is commonly considered the leading approach for dealing with BigData challenges. It is now very well established that great volumes of data are produced every day. Whether from system logs or user purchases, the amount of information generated is such that previously existing database and data-warehouse solutions do not seem to scale well enough. The MapReduce programming paradigm was introduced in 2004 as a new approach to processing large datasets. In 2005 its open-source implementation, Hadoop, was created by Doug Cutting. Although Hadoop is not meant to substitute relational databases, it is a good solution for big...