Wednesday, January 5, 2011

Apache Hadoop for Beginners


Apache Hadoop is a framework for distributed computing applications, inspired by Google's MapReduce and GFS papers. It is open source software that enables the processing of massive amounts of data on commodity hardware. First introduced by Doug Cutting, who named the project after his son's toy (a yellow elephant), Hadoop is now one of the top Apache projects, with many contributors and users around the world such as Yahoo!, IBM, Facebook and many others.

 

The framework presents a master/worker, shared-nothing architecture. A Hadoop cluster is composed of a group of single nodes (computers), one of them acting as the master server and the others as workers. The master node usually runs the NameNode daemon, which keeps the filesystem metadata, and the JobTracker daemon, which manages the MapReduce tasks executed on the cluster. Task management and monitoring are handled by Hadoop itself, so the user basically has to specify the input/output locations and supply map and reduce functions implementing the appropriate interfaces.
The framework is composed mainly of the MapReduce project, Hadoop Common and the Hadoop Distributed Filesystem. The Hadoop Common package provides the necessary libraries and utilities for the other Hadoop subprojects. The distributed filesystem (HDFS) reliably stores large amounts of data on a computer cluster.
As for the MapReduce subproject, it is Java software for easily writing applications that process huge amounts of data. Directly derived from Google's MapReduce, this implementation was created so that people with no distributed-systems experience can create distributed applications. Both HDFS and MapReduce are designed so that node failures are automatically handled by the framework.
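To give an idea of the programming model, the map/shuffle/reduce phases can be sketched in a few lines of plain Python. This is only an illustration of the model on a single machine, not Hadoop code; in a real job the framework runs the map and reduce functions on many nodes and performs the shuffle over the network.

```python
from collections import defaultdict

# Toy word count illustrating the MapReduce model (not actual Hadoop code).

# Map phase: each input line yields (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: group all values by key, as the framework
# does between the map and reduce phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the values of each key into a final result.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In Hadoop the same two functions are written against the MapReduce Java interfaces, and the framework takes care of partitioning, scheduling and the shuffle.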
One of the project's major features is that the compute nodes and the storage nodes are typically the same, meaning that the MapReduce framework and HDFS run on the same set of computers. This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.


Hadoop Installation and Execution

Before you read this, I just have to note that there is an amazing Hadoop installation tutorial made by Michael Noll. The steps described below are what worked for me, but Michael's tutorial is much more complete.

The framework is mainly designed to run on commodity hardware, but it can also run on a single machine for testing purposes. When running on a single machine, there are two possible options: Standalone operation or Pseudo-Distributed operation. All modes were interesting to my research, so all of them were installed and tested. Standalone is just a matter of downloading the code and running a simple example. The Pseudo-Distributed mode was attractive to the research because it makes it possible to verify the correct operation of Hadoop; later, it could be compared to the distributed mode to measure the latter's true efficiency. The Fully-Distributed mode was installed as my main goal.

a. Pseudo-Distributed Operation

When using Hadoop, some daemons are started (for example the HDFS and MapReduce daemons). In this mode, each daemon runs in a separate Java process, simulating the distributed approach.

Configuration

Use the following configuration:

In conf/core-site.xml:

     
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
     


In conf/hdfs-site.xml:

     
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
     

In conf/mapred-site.xml: 
     
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
     

Setup passphraseless ssh

It is necessary to be able to access your own localhost without a password. Test if you are able to do so with:
$ ssh localhost

If you cannot connect with ssh to localhost without a passphrase, execute the following commands: 

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa                                                                                            
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys                                                                         

Hadoop Execution

Format a new distributed filesystem:
$ bin/hadoop namenode -format

Start the Hadoop daemons:

$ bin/start-all.sh

To check for errors, inspect the logs in the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

With the above configuration, it is possible to browse the web interface for the NameNode and the JobTracker:
  • NameNode - http://localhost:50070/
  • JobTracker - http://localhost:50030/
To start working with Hadoop, copy your input files into the distributed filesystem:

$ bin/hadoop fs -put conf input
 
Run some of the examples provided by the Hadoop package:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
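The grep example counts the occurrences of every string matching the given regular expression ('dfs[a-z.]+' above) in the input files. Its core logic can be sketched in plain Python; this is only an illustration, not the actual Hadoop implementation, which runs as a MapReduce job.

```python
import re
from collections import Counter

# Toy version of the Hadoop grep example: count every match of a
# regular expression across the input lines.
def grep_count(lines, pattern):
    counts = Counter()
    for line in lines:
        for match in re.findall(pattern, line):
            counts[match] += 1
    return counts

lines = ["dfs.replication is set in hdfs-site.xml",
         "dfs.name.dir and dfs.data.dir"]
print(grep_count(lines, r"dfs[a-z.]+"))
```

In the real job, the matches are emitted by the map tasks and the counts are summed by the reduce tasks, with the result written to the output directory in HDFS.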
 
You can obtain your output files by copying them to your local filesystem:

$ bin/hadoop fs -get output output
$ cat output/*
 
Now we have finished the MapReduce execution. Stop all the daemons with:

$ bin/stop-all.sh


b. Fully-Distributed Operation

This tutorial explains how to install and execute the Hadoop distributed framework. As an example, we used only two machines, one acting as both master and slave and the other as just a slave.

Networking
We first have to make sure that all machines are able to reach each other. The simplest solution is to put both machines in the same network, with compatible hardware and software configuration.
To simplify the explanation, we will assign the IP address 192.168.0.1 to the master machine and 192.168.0.2 to the slave machine. It is necessary to update /etc/hosts on both machines with the corresponding IP of each machine.
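With the addresses above, /etc/hosts on both machines would contain entries like the following (assuming the hostnames master and slave used throughout this tutorial):

```
192.168.0.1    master
192.168.0.2    slave
```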

SSH access

The hadoop user on the master (hadoop@master) must be able to connect to its own user account on the master. It must also be able to reach the hadoop user account on the slave (hadoop@slave) via a password-less SSH login. You just have to add hadoop@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hadoop@slave (in this user's $HOME/.ssh/authorized_keys). Use the following SSH command:

hadoop@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave
This command will prompt you for the login password for user hadoop on slave, then copy the public SSH key for you, creating the correct directory and fixing the permissions as necessary.
The final step is to test the SSH setup by connecting as user hadoop from the master to the user account hadoop on the slave. This step is also needed to save the slave's host key fingerprint to hadoop@master's known_hosts file.

 

Cluster Overview

Next, we will describe how to configure one computer as a master node and the other as a slave node. The master node will also act as a slave because we only have two machines available in our cluster but still want to spread data storage and processing to multiple machines.

Configuration

On the master machine only, update conf/masters so that it looks like this:
master
 
Again on the master machine only, update conf/slaves so that it looks like this:
master
slave   
Assuming you configured each machine as described in the Pseudo-Distributed tutorial, you will only have to change a few variables.
Important: You have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines on the cluster as follows.
In conf/core-site.xml, point fs.default.name at the master:


<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>The name of the default file system.  A URI whose
    scheme and authority determine the FileSystem implementation.  The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class.  The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>

Now, we have to change the mapred.job.tracker variable in conf/mapred-site.xml:


<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at.  If "local", then jobs are run in-process as a single map
    and reduce task.</description>
  </property>
</configuration>
  

Third, change the dfs.replication variable in conf/hdfs-site.xml (this value corresponds to the number of machines acting as DataNodes in your cluster, two in our case, and must not exceed it):


<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.</description>
  </property>
</configuration>
  

 
Formatting the Namenode
 
Formatting will erase all data in your Hadoop filesystem, so it is advisable to run this command only when first setting up the cluster. To format the filesystem, run the command:
hadoop@master:/usr/local/hadoop$ bin/hadoop namenode -format

Starting the multi-node cluster

Starting the cluster is done in two steps. First the HDFS daemons are started, and second the MapReduce daemons are started.

HDFS daemons

Run the command bin/start-dfs.sh on the master machine. It is possible to examine the log file on the slave to check that the operation succeeded (logs/hadoop-hadoop-datanode-slave.log).
With the jps command, you can inspect which Java processes are running on the computer. The following processes should be running on the master:

hadoop@master:/usr/local/hadoop$ jps  
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
hadoop@master:/usr/local/hadoop$

And the following processes on slave:
hadoop@slave:/usr/local/hadoop$ jps
15183 DataNode
15616 Jps
hadoop@slave:/usr/local/hadoop$

MapReduce daemons
 
Run the command bin/start-mapred.sh on the master machine. The following processes should be running on the master:
hadoop@master:/usr/local/hadoop$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode
hadoop@master:/usr/local/hadoop$

And the following on slave.
hadoop@slave:/usr/local/hadoop$ jps
15183 DataNode
15897 TaskTracker
16284 Jps
hadoop@slave:/usr/local/hadoop$

Stopping the multi-node cluster
 
Like starting the cluster, stopping it is done in two steps, but in the opposite order: first we stop the MapReduce daemons, and then the HDFS daemons.
Run the command bin/stop-mapred.sh on the master machine, and then run bin/stop-dfs.sh, also on the master.

Running Your Application
Once you have the Hadoop daemons running on the cluster (or in pseudo-distributed mode), you can run your application.
It is first necessary to package your application's files into a .jar. After you have done so, copy your input data to the Hadoop filesystem. Supposing your data is in a folder named input, you can copy it to a folder named input in HDFS with:
$hadoop/bin/hadoop dfs -copyFromLocal input/ input/

Then, to run your application, you have to call the hadoop software, passing the arguments that are expected by your main program and also specifying the main class to be executed.
$hadoop/bin/hadoop jar myapplication.jar myapplication.MyApplication

In this case, we were just calling the MyApplication class, without any arguments.
Note that the generated output is stored in HDFS, so it is necessary to copy it back to your local machine (for example with bin/hadoop fs -get, as shown earlier).