Genetic Algorithm for Knapsack using Hadoop


Development of a Genetic Algorithm using the Apache Hadoop framework to solve optimization problems



Introduction


I developed this project during a course in my Master's program. It builds a Genetic Algorithm to solve optimization problems, focusing on the Knapsack Problem, on top of the distributed framework Apache Hadoop.
The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-Complete optimization problems. Like knapsack, many of these problems have a simple structure and converge to optimal solutions given a sufficient amount of computation.
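
To give a concrete sense of this mapping, here is a minimal sketch of one GA generation as a Hadoop 1.x job (class and method names are hypothetical and the fitness evaluation is stubbed; the actual Genetic.java may be organized differently). Mappers score individuals in parallel, and the reducer sees the whole scored population, so selection and breeding for the next generation can happen there:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Hypothetical skeleton: one MapReduce job corresponds to one GA generation.
public class GeneticSketch {

    // Map phase: score each individual (one text line per candidate solution).
    public static class FitnessMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private static final Text POPULATION = new Text("population");

        public void map(LongWritable offset, Text individual,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            int fitness = 0; // stub: evaluate the knapsack profit of 'individual'
            out.collect(POPULATION, new Text(fitness + "\t" + individual));
        }
    }

    // Reduce phase: the single key gathers the scored population, so selection,
    // crossover, and mutation can emit the next generation here.
    public static class BreedingReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, NullWritable> {
        public void reduce(Text key, Iterator<Text> scored,
                           OutputCollector<Text, NullWritable> out, Reporter reporter)
                throws IOException {
            while (scored.hasNext()) {
                // stub: select the fittest, breed, and emit the children as
                // input lines for the next generation's job
                out.collect(new Text(scored.next()), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(GeneticSketch.class);
        conf.setJobName("genetic-generation");
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        conf.setMapperClass(FitnessMapper.class);
        conf.setReducerClass(BreedingReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf); // one run per generation; chain runs to iterate
    }
}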



Genetic Algorithm


The algorithm follows the genetic paradigm. It starts with an initial random population (random candidate solutions to the problem). Then, the best individuals are selected from the population (the instances that yield the highest profits for the knapsack). A crossover phase then generates new instances by combining the selected individuals: parts of each selected individual are chosen at random to form a new instance.
The next step of the process is the mutation phase, in which "genes" are also selected at random and changed. In the knapsack context, this means that random items are swapped in and out of the solutions found.
This cycle can be repeated several times, until good individuals are generated.
The steps are reproduced below:

Initial Population → Selection → Crossover → Mutation
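
To make the cycle concrete, here is a minimal sketch of one generation in plain Java (hypothetical names and parameters; in the actual project this work is distributed across Hadoop jobs):

import java.util.*;

public class Generation {

    // A fitness function scores an individual; for knapsack, its total profit.
    interface Fitness { int of(boolean[] individual); }

    static final Random RNG = new Random();

    // One GA generation over bit-string individuals, where each boolean says
    // whether the corresponding item is placed in the knapsack.
    static List<boolean[]> evolve(List<boolean[]> population,
                                  final Fitness fitness,
                                  int survivors, double mutationRate) {
        // Selection: keep the individuals with the highest fitness.
        Collections.sort(population, new Comparator<boolean[]>() {
            public int compare(boolean[] a, boolean[] b) {
                return fitness.of(b) - fitness.of(a); // descending order
            }
        });
        List<boolean[]> next =
                new ArrayList<boolean[]>(population.subList(0, survivors));

        // Crossover: build children by combining random parts of two parents.
        while (next.size() < population.size()) {
            boolean[] a = next.get(RNG.nextInt(survivors));
            boolean[] b = next.get(RNG.nextInt(survivors));
            int cut = RNG.nextInt(a.length); // single-point crossover
            boolean[] child = new boolean[a.length];
            for (int i = 0; i < a.length; i++)
                child[i] = (i < cut) ? a[i] : b[i];
            next.add(child);
        }

        // Mutation: randomly flip "genes" of the offspring, i.e. swap items
        // in and out of the candidate solution.
        for (int k = survivors; k < next.size(); k++) {
            boolean[] child = next.get(k);
            for (int i = 0; i < child.length; i++)
                if (RNG.nextDouble() < mutationRate)
                    child[i] = !child[i];
        }
        return next;
    }
}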




The Knapsack Problem


The 0/1 Knapsack Problem is:

$$\max \sum_{i=1}^{n} v_i x_i$$

$$\text{subject to} \quad \sum_{i=1}^{n} w_i x_i \le W, \qquad x_i \in \{0, 1\}$$

The goal is to maximize the objective function, where $v_i$ represents the value of each element $i$ and the variable $x_i$ represents whether element $i$ is taken or not. The maximization has to take into account the weight $w_i$ of each element, represented by the first constraint, where $W$ is the capacity of the knapsack.

This problem was chosen because its solution only requires a sufficient amount of processing, and this processing can be done in parallel. The space of possible solutions can be seen as a set of local maxima that the genetic algorithm explores. Hopefully, with enough computation, the global maximum can be reached.
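
As an illustration, the objective and constraint can be folded into a single fitness function, as in this minimal sketch (hypothetical names; scoring infeasible solutions as zero is one common way to handle the weight constraint):

public class KnapsackFitness {

    // Fitness of an individual: the total profit of the selected items,
    // or zero when the total weight exceeds the knapsack capacity.
    static int fitness(boolean[] taken, int[] values, int[] weights, int capacity) {
        int profit = 0, weight = 0;
        for (int i = 0; i < taken.length; i++) {
            if (taken[i]) { // item i is placed in the knapsack
                profit += values[i];
                weight += weights[i];
            }
        }
        // Penalizing overweight solutions lets selection filter them out.
        return (weight <= capacity) ? profit : 0;
    }
}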




Validation Results


To demonstrate that the algorithm finds the correct result with high probability, it was tested against two validation datasets: one composed of 10 items and another of 15 items.

With the validation input P01, we have:

P01 is a set of 10 weights and profits for a knapsack of capacity 165.

Item  Weight  Profit
0     23      92
1     31      57
2     29      49
3     44      68
4     53      60
5     38      43
6     63      67
7     85      84
8     89      87
9     82      72

The optimal result value is 309, obtained by selecting items 0, 1, 2, 3, and 5.

The algorithm finds the correct answer:

Best 5 answers:

Value Items Selected
270 0 1 2 9
276 0 2 3 6
277 0 1 3 4
284 0 1 3 6
309 0 1 2 3 5

The input P02 has an optimal profit value of 1458, and the algorithm also finds the correct answer.

These datasets were found at:


Setting up the Hadoop Cluster on a Single Machine



For testing purposes, here is how to get started with Hadoop and set it up on a single machine.

Download Hadoop 1.2.1 from:

http://hadoop.apache.org/releases.html

Install Java 6:

$ sudo apt-get install oracle-java6-installer

Inside the Hadoop home directory, update the configuration file in conf/:

$ vim conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-oracle/
export HADOOP_HOME=/home/renata/hadoop-1.2.1
export HADOOP_VERSION=1.2.1

Create an input directory to hold your input files:

$ mkdir input

Put some content in it:

$ cp conf/*.xml input

To test whether your Hadoop installation works properly:

$ bin/hadoop jar hadoop-examples*.jar grep input output 'dfs[a-z.]+'

Don't forget to remove the output directory after running a job; Hadoop doesn't do this for you.

$ rm -rf output

Now, to test the Genetic jar file:

$ export HADOOP_HOME=/home/renata/hadoop-1.2.1
$ export HADOOP_VERSION=1.2.1

To compile the code, run (the target directory must exist first):

$ mkdir genetic_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-core*.jar -d genetic_classes Genetic.java

And create the jar:

$ jar -cvf genetic.jar -C genetic_classes/ .

Now, to run the code once you have placed the input data in the "input" directory:

$ bin/hadoop jar genetic.jar org.myorg.Genetic input output 


PS: A large dataset consisting of 10,000 items, with an optimal profit value of 7145 and a weight capacity of 431, can be found at: https://github.com/jaredtking/knapsack-lisp/blob/master/datasets/large01.
