Monday, October 22, 2012

Introduction to Apache Hive

Hive is a distributed data warehouse that runs on top of Apache Hadoop and enables analysis of huge amounts of data.

It provides its own query language, HiveQL (similar to SQL), for querying data on a Hadoop cluster. It can manage data in HDFS and run jobs in MapReduce without translating the queries into Java. The mechanism is explained below:

"When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an
XML file representing the “job plan.” In other words, these generic modules function
like mini language interpreters and the “language” to drive the computation is encoded
in XML."

This text was extracted from Programming Hive.

Hive was initially developed by Facebook to make it easier to run MapReduce jobs on a Hadoop cluster, since writing Java programs can be challenging for non-Java developers (and for some Java developers as well). Its language, HiveQL, provides a more approachable way to create MapReduce jobs.

Hive Architecture Overview

Hive can be accessed through a command-line interface and a web user interface. You can also use Hive through the provided JDBC or ODBC APIs. The Thrift server exposes an API for executing HiveQL statements from several languages (PHP, Perl, Python and Java).

The Metastore component is a system catalogue that contains metadata regarding tables, partitions and databases.

Most of the core work happens in the Driver and Compiler components, which parse, optimize and execute queries.
At run time, the HiveQL statements are converted into a directed acyclic graph (DAG) of MapReduce jobs, which are then run on the Hadoop cluster.

For more information about the Hive architecture take a look at the Facebook article about it.


Hive has some dependencies that you should install before building it, such as Ant, svn and Java. It also depends on Hadoop, HBase and ZooKeeper, but these packages are downloaded automatically by Ivy. If you wish to change the Hadoop version it builds against, see the section "Building Hive with different versions of Hadoop" below.
  • Install Ant:
#yum install ant.x86_64

 (or use apt-get install if you are on a Debian-like system).
  • Download Hive:
$svn co hive
  • Set the Java environment:
$export JAVA_HOME=/usr/lib/jvm/java-1.6.0-ibm-
$export HIVE_HOME=/my_hive_home
$export PATH=$HIVE_HOME/bin:$PATH
  • Build Hive with Ant:
$ant clean package
  • Run hive:


Problem: ant java.lang.ClassNotFoundException:

Solution: install the ant-trax package.

Problem: is not on the classpath. Perhaps JAVA_HOME does not point to the JDK.

Ant must be able to find tools.jar when it runs javac.

Find the tools.jar library (you can use $locate tools.jar) and make sure it is under your JAVA_HOME directory, normally in its lib subdirectory.
It might also be the case that your JAVA_HOME points to a JRE and not to a JDK.

Building Hive with different versions of Hadoop

When you run the ant package command, Ant by default downloads Hadoop 0.20.0 and builds Hive against it (check on

If, like me, you want to use Hive with your own version of Hadoop, specifically a newer one,
you can pass the -Dhadoop.version flag or change the hadoop.version property in

Hive has an interface called Shims that exists exactly for this situation.
The Shims interface lets you create your own Hadoop-compatible class or use one that is already provided.
Hive provides a 0.20 class, a 0.20 Secure (0.20S) class and a 0.23 class.

To build Hive against Hadoop 1.x, use the 0.20S class. To build it against Hadoop 2.x, use the 0.23 class.

With the 0.20S class, you can build against Hadoop 1.0.0 or newer.

If you want to build only one interface, edit this line in the shims/build.xml file:

<property name="shims.include" value="0.20,0.20S,0.23"/>                      

and set this line instead:

<property name="shims.include" value="0.20S"/>                     

It is not necessary to exclude the undesired interfaces, because Hive chooses at run time which interface to use, depending on the Hadoop version present in your classpath.
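To make the run-time choice more concrete, here is a minimal sketch of the idea in Java. It is not Hive's actual code: the class ShimChooser and the method shimFor are hypothetical names I made up, and the mapping only encodes the rules stated above (Hadoop 1.x uses 0.20S, Hadoop 2.x uses 0.23), plus the assumption that 0.20.x and 0.23.x releases map to the shims named after them.

```java
// Hypothetical illustration only; this is not Hive's real shim-loading code.
class ShimChooser {
    // Maps a Hadoop version string to the name of the shim class family to use.
    static String shimFor(String hadoopVersion) {
        if (hadoopVersion.startsWith("2.") || hadoopVersion.startsWith("0.23.")) {
            return "0.23";   // Hadoop 2.x (assumption: 0.23.x as well)
        }
        if (hadoopVersion.startsWith("1.")) {
            return "0.20S";  // Hadoop 1.x uses the 0.20 Secure shim
        }
        if (hadoopVersion.startsWith("0.20.")) {
            return "0.20";   // assumption: plain 0.20.x uses the 0.20 shim
        }
        throw new IllegalArgumentException("No shim for Hadoop " + hadoopVersion);
    }

    public static void main(String[] args) {
        System.out.println(shimFor("1.0.0")); // prints 0.20S
        System.out.println(shimFor("2.0.0")); // prints 0.23
    }
}
```

The point is only that the decision depends on the Hadoop version found at run time, which is why building all three shims does no harm.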

Using Hive

To start using Hive, run the Hive shell (after configuring the environment as shown above):

 $ hive                                                         

As a response, you should see the Hive shell prompt:

hive>

Now, you should be able to create your first table:

hive> CREATE TABLE my_test (id INT, test STRING);                

Because no row format was specified, Hive assumes the default text format, in which each row is terminated by a newline ('\n').

hive> LOAD DATA LOCAL INPATH 'my_data' INTO TABLE my_test;

This command loads the file my_data from your local machine into the Hive table. You can now run queries over this data:

hive> SELECT * FROM my_test;                                      



Tuesday, October 9, 2012

Working and Testing with Ant

Recently I've been working with Hive and had some trouble working with Ant. For this reason, I bought the book "Ant in Action", and I'll share some things I've learned from it, along with some other experiences I've had working with Ant.

Ant is a tool written in Java for building Java projects. By build I mean compiling, running, packaging and even testing the code.
It is designed to help software teams develop big projects by automating tasks such as compiling code.

To run Ant in your project, you need a build.xml file that describes how the project should be built.
There should be one build file per project, unless the project is too big. In that case you might have subprojects with "sub build" files, which are coordinated by the main build file.

The goals of the build are listed as targets. You can have, for example, a target init that creates initial necessary directories.


You can also have a target compile that depends on the previous init target, and compiles the Java code.
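As a rough illustration of what "depends on" means here, the Java sketch below works out the order in which targets would run. This is a toy model, not Ant's real implementation: the class TargetOrder and its methods are names I invented, and the test target is only there to show a longer chain.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of target ordering; this is not Ant's real implementation.
class TargetOrder {
    // Returns the targets in the order they would run when `goal` is requested:
    // every dependency comes before the target that depends on it.
    static List<String> executionOrder(Map<String, List<String>> dependsOn, String goal) {
        Set<String> done = new LinkedHashSet<>();
        visit(goal, dependsOn, done, new LinkedHashSet<>());
        return new ArrayList<>(done);
    }

    private static void visit(String target, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress) {
        if (done.contains(target)) {
            return; // already scheduled; each target runs only once
        }
        if (!inProgress.add(target)) {
            throw new IllegalStateException("Circular dependency at target: " + target);
        }
        // Schedule all dependencies first, then the target itself.
        for (String dependency : dependsOn.getOrDefault(target, List.of())) {
            visit(dependency, dependsOn, done, inProgress);
        }
        inProgress.remove(target);
        done.add(target);
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("init", List.of());
        dependsOn.put("compile", List.of("init"));
        dependsOn.put("test", List.of("compile")); // hypothetical third target
        System.out.println(executionOrder(dependsOn, "test")); // prints [init, compile, test]
    }
}
```

So asking for compile also runs init first, without you having to request it explicitly.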


After installing Ant, if you run the command:

$ant -p

you will see all the available targets in your project. These are the goals you can ask Ant to carry out.

In build.xml, inside each target tag, you should see one or more tasks. In the compile target you will probably see a <javac> task. That is the action Ant executes to carry out the target.

To test, Ant uses JUnit, a unit-testing framework that verifies that your software components work individually.
JUnit is an API that makes it easier to write Java test cases whose execution can be fully automated. To use it, you just need to download a junit.jar file and put it on Ant's classpath. In my case, it was just a matter of adding it to the /usr/share/ant/lib/ directory.

Writing a test for JUnit involves three steps:

• Create a subclass of junit.framework.TestCase.
• Provide a constructor, accepting a single String name parameter, which calls super(name).
• Write some public no-argument void methods prefixed by the word test.

See the example:

import junit.framework.TestCase;

public class SimpleTest extends TestCase {
  public SimpleTest(String s) {
    super(s);  // pass the test name on to TestCase
  }
  public void testCreation() {
    Event event = new Event();  // the test fails if this throws
  }
}

The only part of this program that actually tests anything is the testCreation() method, which simply tries to create an Event.
Beware that methods without the "test" prefix are ignored. Also, your test methods must take no arguments and must return void.
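To see these naming rules in action without JUnit itself, here is a small reflection-based sketch. It is only an approximation of the rules described above, not JUnit's own code: TestMethodFinder, findTestMethods and the nested Sample class are hypothetical names, and the sketch only looks at methods declared directly in the class.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Approximates the JUnit 3 naming rules; this is not JUnit's implementation.
class TestMethodFinder {
    // Hypothetical class used only as input for the demonstration.
    static class Sample {
        public void testCreation() { }               // qualifies
        public void helper() { }                     // no "test" prefix
        public void testWithArg(int x) { }           // takes an argument
        public int testReturnsValue() { return 1; }  // does not return void
        void testNotPublic() { }                     // not public
    }

    // Collects the names of methods that look like JUnit 3 tests:
    // public, void, no arguments, name starting with "test".
    static List<String> findTestMethods(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            boolean isTest = Modifier.isPublic(m.getModifiers())
                    && m.getName().startsWith("test")
                    && m.getParameterCount() == 0
                    && m.getReturnType() == void.class;
            if (isTest) {
                names.add(m.getName());
            }
        }
        Collections.sort(names); // reflection order is unspecified, so sort
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findTestMethods(Sample.class)); // prints [testCreation]
    }
}
```

Only testCreation passes all four checks, which matches the behaviour described above for SimpleTest.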

Note that this applies to JUnit 3. There are several differences between JUnit 3 and JUnit 4. You can check some of them here.
Just so you know, if a test case fails, you will be presented with a junit.framework.AssertionFailedError.

To run your tests with Ant, you should add a test target containing a <junit> task.
A build.xml example would be:


It should work by just typing:

$ant test-basic