Genetic Algorithm for Knapsack using Hadoop
Development of a Genetic Algorithm using the Apache Hadoop framework to solve optimization problems
Introduction
I developed this project during a course in my Master's program. It builds a genetic algorithm to solve optimization problems, focusing on the Knapsack Problem, and uses the distributed framework Apache Hadoop as its base.
The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-complete optimization problems. Like knapsack, many such problems have a simple structure and converge to optimal solutions given a proper amount of computation.
Genetic Algorithm
The algorithm follows the genetic paradigm. It starts with an initial random population (random candidate solutions to the problem). Then the best individuals are selected from the population (the solutions that yield the highest profit for the knapsack). A crossover phase then generates new individuals by combining the selected ones: parts of each selected individual are chosen at random to form a new solution.
The next step of the process is the mutation phase, in which "genes" are also selected at random and changed. In the knapsack context, this means that random items are swapped in and out of the solutions found.
This cycle can be repeated several times, until good individuals are generated.
The steps are reproduced below:
Initial Population → Selection → Crossover → Mutation
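The cycle above can be sketched in plain Java as follows. This is a simplified, single-machine sketch with made-up names (`GeneticCycle`, `run`, and so on), not the project's actual distributed implementation:

```java
import java.util.*;

// Minimal single-machine sketch of the genetic cycle described above
// (initial population -> selection -> crossover -> mutation).
public class GeneticCycle {
    static final Random RNG = new Random(42);

    // Fitness: total value of the selected items, or 0 if the total
    // weight exceeds the knapsack capacity.
    static int fitness(boolean[] ind, int[] v, int[] w, int cap) {
        int val = 0, wt = 0;
        for (int i = 0; i < ind.length; i++)
            if (ind[i]) { val += v[i]; wt += w[i]; }
        return wt <= cap ? val : 0;
    }

    static boolean[] randomIndividual(int n) {
        boolean[] ind = new boolean[n];
        for (int i = 0; i < n; i++) ind[i] = RNG.nextBoolean();
        return ind;
    }

    // One-point crossover: prefix from one parent, suffix from the other.
    static boolean[] crossover(boolean[] a, boolean[] b) {
        int cut = RNG.nextInt(a.length);
        boolean[] child = new boolean[a.length];
        for (int i = 0; i < a.length; i++) child[i] = i < cut ? a[i] : b[i];
        return child;
    }

    // Mutation: flip one randomly chosen "gene" (add or remove an item).
    static void mutate(boolean[] ind) {
        int g = RNG.nextInt(ind.length);
        ind[g] = !ind[g];
    }

    static boolean[] run(int[] v, int[] w, int cap, int popSize, int generations) {
        Comparator<boolean[]> byFitness =
                Comparator.comparingInt((boolean[] ind) -> -fitness(ind, v, w, cap));
        List<boolean[]> pop = new ArrayList<>();
        for (int i = 0; i < popSize; i++) pop.add(randomIndividual(v.length));
        for (int g = 0; g < generations; g++) {
            pop.sort(byFitness);                      // selection: keep the better half
            List<boolean[]> next = new ArrayList<>(pop.subList(0, popSize / 2));
            while (next.size() < popSize) {           // crossover + mutation to refill
                boolean[] child = crossover(next.get(RNG.nextInt(popSize / 2)),
                                            next.get(RNG.nextInt(popSize / 2)));
                mutate(child);
                next.add(child);
            }
            pop = next;
        }
        pop.sort(byFitness);
        return pop.get(0);
    }
}
```

Keeping the better half of each generation (a simple elitist selection) guarantees the best fitness found never decreases from one cycle to the next.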
The Knapsack Problem
The goal is to maximize the objective function Σᵢ vᵢxᵢ, where vᵢ represents the value of each element i, and the binary variable xᵢ represents whether element i is taken or not. The maximization has to take into account the weight of each element, represented by the first constraint: Σᵢ wᵢxᵢ ≤ W, where wᵢ is the weight of element i and W is the knapsack capacity.
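This objective and its constraint translate directly into a fitness evaluation. Here is a minimal sketch; the class name and the example data are hypothetical, not the project's actual code:

```java
// Sketch of a knapsack fitness function: sums the values of the
// selected items and rejects solutions that exceed the weight capacity.
public class KnapsackFitness {
    // Returns the total value of the selection, or 0 if it is
    // infeasible (total weight exceeds the capacity).
    static int fitness(boolean[] selected, int[] values, int[] weights, int capacity) {
        int totalValue = 0;
        int totalWeight = 0;
        for (int i = 0; i < selected.length; i++) {
            if (selected[i]) {
                totalValue += values[i];
                totalWeight += weights[i];
            }
        }
        return totalWeight <= capacity ? totalValue : 0;
    }

    public static void main(String[] args) {
        int[] values = {60, 100, 120};   // hypothetical example data
        int[] weights = {10, 20, 30};
        boolean[] pick = {true, true, false};
        System.out.println(fitness(pick, values, weights, 50)); // prints 160
    }
}
```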
This problem was chosen because solving it mainly requires a sufficient amount of processing, and that processing can be done in parallel. The space of possible solutions can be seen as a set of local maxima to be explored by the genetic algorithm. Hopefully, with enough computation, the global maximum can be reached.
Validation Results
To demonstrate that the algorithm finds a correct result with high probability, it was tested against two validation datasets: one composed of 10 items and another of 15 items.
With the validation input P01, we have:
P01 is a set of 10 weights and profits for a knapsack of capacity 165. The optimal result value is 309, obtained by selecting items 0, 1, 2, 3 and 5.
The algorithm finds the correct answer:
Best 5 answers:
Value  Items Selected
270    0 1 2 9
276    0 2 3 6
277    0 1 3 4
284    0 1 3 6
309    0 1 2 3 5
The input P02 has optimal profit value 1458, and the algorithm also finds the correct answer.
These datasets were found at:
Setting up the Hadoop Cluster on a Single Machine
For testing purposes, here is how to get started with Hadoop and set it up on a single machine.
Download Hadoop 1.2.1 from:
http://hadoop.apache.org/releases.html
Install Java 6:
$ sudo apt-get install oracle-java6-installer
Inside the Hadoop home directory, update the configuration file in conf/:
$ vim conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-oracle/
export HADOOP_HOME=/home/renata/hadoop-1.2.1
export HADOOP_VERSION=1.2.1
Create an input directory, to place your input file:
$ mkdir input
Put some content in it:
$ cp conf/*.xml input
To test whether your Hadoop install is working properly:
$ bin/hadoop jar hadoop-examples*.jar grep input output 'dfs[a-z.]+'
Don't forget to remove the output directory after running a job; Hadoop doesn't do this for you.
$ rm -rf output
Now, to test the Genetic jar file:
$ export HADOOP_HOME=/home/renata/hadoop-1.2.1
$ export HADOOP_VERSION=1.2.1
To compile the code, run:
$ javac -classpath ${HADOOP_HOME}/hadoop-core*.jar -d genetic_classes Genetic.java
And create the jar:
$ jar -cvf genetic.jar -C genetic_classes/ .
Now, to run the code once you have put the input data in the "input" directory:
$ bin/hadoop jar genetic.jar org.myorg.Genetic input output
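For intuition on how the genetic phases fit MapReduce, one generation can be sketched without Hadoop: a "map" step scores each individual in parallel, and a "reduce" step performs selection and breeding. The class names, the bitstring encoding, and this particular phase split are illustrative assumptions; the actual mapper and reducer in Genetic.java may be organized differently. The item values and weights below are the standard P01 instance (capacity 165, optimum 309):

```java
import java.util.*;
import java.util.stream.*;

// Conceptual, Hadoop-free sketch of one GA generation split across
// MapReduce: "map" evaluates fitness per individual, "reduce" does
// selection and breeding over the scored population.
public class MapReduceSketch {
    static final int[] VALUES  = {92, 57, 49, 68, 60, 43, 67, 84, 87, 72};
    static final int[] WEIGHTS = {23, 31, 29, 44, 53, 38, 63, 85, 89, 82};
    static final int CAPACITY = 165;
    static final Random RNG = new Random(7);

    // "Map": score one individual, encoded as a bitstring like "1110010000".
    static Map.Entry<Integer, String> map(String individual) {
        int val = 0, wt = 0;
        for (int i = 0; i < individual.length(); i++) {
            if (individual.charAt(i) == '1') { val += VALUES[i]; wt += WEIGHTS[i]; }
        }
        return Map.entry(wt <= CAPACITY ? val : 0, individual);
    }

    // "Reduce": given all scored individuals, keep the best half and
    // refill the population with mutated crossovers of the survivors.
    static List<String> reduce(List<Map.Entry<Integer, String>> scored) {
        scored.sort((a, b) -> b.getKey() - a.getKey());
        List<String> next = new ArrayList<>(scored.subList(0, scored.size() / 2).stream()
                .map(Map.Entry::getValue).collect(Collectors.toList()));
        int target = scored.size();
        while (next.size() < target) {
            String a = next.get(RNG.nextInt(next.size()));
            String b = next.get(RNG.nextInt(next.size()));
            int cut = RNG.nextInt(a.length());                   // one-point crossover
            char[] child = (a.substring(0, cut) + b.substring(cut)).toCharArray();
            int g = RNG.nextInt(child.length);                   // mutation: flip one gene
            child[g] = child[g] == '1' ? '0' : '1';
            next.add(new String(child));
        }
        return next;
    }
}
```

Because fitness evaluations are independent, the map phase parallelizes trivially across the cluster, which is the property that makes knapsack-style problems a good fit for Hadoop.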
PS: A large dataset consisting of 10,000 items, with optimal profit value of 7145 and a weight capacity of 431, can be found at:
https://github.com/jaredtking/knapsack-lisp/blob/master/datasets/large01.