Introduction to Apache Hive

October 22, 2012

Hive is a distributed data warehouse that runs on top of Apache Hadoop and enables analysis of huge amounts of data.

It provides its own query language, HiveQL (similar to SQL), for querying data on a Hadoop cluster. It can manage data in HDFS and run jobs in MapReduce without translating the queries into Java. The mechanism is explained below:

"When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an
XML file representing the “job plan.” In other words, these generic modules function
like mini language interpreters and the “language” to drive the computation is encoded
in XML."

This text was extracted from Programming Hive.

Hive was initially developed by Facebook to make it easier to run MapReduce jobs on a Hadoop cluster, since writing Java programs can be challenging for non-Java developers (and for some Java developers as well). HiveQL, the language created for it, provides a more approachable way to write MapReduce jobs.

Hive Architecture Overview

Hive can be accessed through a command-line interface and a web user interface. You can also use Hive through the JDBC or ODBC APIs it provides. The Thrift server exposes an API for executing HiveQL statements from several languages (PHP, Perl, Python and Java).

The Metastore component is a system catalogue that contains metadata regarding tables, partitions and databases.

Most of the core operations happen in the Driver and Compiler components, which parse, optimize and execute queries.
HiveQL statements are converted at run time into a directed acyclic graph (DAG) of MapReduce jobs, which are then run on the Hadoop cluster.
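
One way to look at this job graph yourself is Hive's EXPLAIN statement, which prints the plan for a query instead of running it. The sketch below is only an illustration: the file name explain_plan.hql and the sample query are assumptions, and my_test is the example table created in the Using Hive section. It writes a small HiveQL script and runs it with hive -f only if a hive executable is on the PATH.

```shell
#!/bin/sh
# Hypothetical script name; any writable path works.
PLAN_SCRIPT="explain_plan.hql"

# Write a small HiveQL script. EXPLAIN prints the query plan
# instead of executing the query.
cat > "$PLAN_SCRIPT" <<'EOF'
CREATE TABLE IF NOT EXISTS my_test (id INT, test STRING);
EXPLAIN SELECT test, COUNT(*) FROM my_test GROUP BY test;
EOF

# Run the script only if a hive executable is available.
if command -v hive >/dev/null 2>&1; then
    hive -f "$PLAN_SCRIPT"
else
    echo "hive not found on PATH; wrote $PLAN_SCRIPT only"
fi
```

The output should list the stages of the plan and how they depend on each other; a GROUP BY query like this one normally needs a MapReduce stage.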

For more information about the Hive architecture, take a look at the Facebook article about it.

Building

Hive has some dependencies that you need to install before building it, such as ant, svn and Java. It also depends on Hadoop, HBase and ZooKeeper, but these packages are downloaded automatically by ivy. If you wish to change the Hadoop version it builds against, see the section "Building Hive with different versions of Hadoop" below.

Install ant:

# yum install ant.x86_64

(or apt-get install if you are using Debian-like systems).

Download Hive:

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive

Set the Java and Hive environment variables:

$ export JAVA_HOME=/usr/lib/jvm/java-1.6.0-ibm-1.6.0.11.0.x86_64/

$ export HIVE_HOME=/my_hive_home

$ export PATH=$HIVE_HOME/bin:$PATH

Build Hive with ant:

$ ant clean package

Run hive:

$ build/dist/bin/hive

Troubleshooting

Problem:

ant java.lang.ClassNotFoundException: org.apache.tools.ant.taskdefs.optional.TraXLiaison

Solution:

Install the ant-trax package.

Problem:

com.sun.tools.javac.Main is not on the classpath. Perhaps JAVA_HOME does not point to the JDK.

Ant needs to find tools.jar in order to run javac.

Solution:

Download or find the tools.jar library (you can use locate tools.jar to find it) and make sure it is under your JAVA_HOME directory (a JDK keeps it in $JAVA_HOME/lib).
It might also be that your JAVA_HOME points to a JRE rather than a JDK.
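
A quick way to tell the two cases apart is to check whether JAVA_HOME contains a compiler. The following is a minimal sketch, assuming a conventional JDK layout with bin/javac and lib/tools.jar; the function name is_jdk is made up for this example.

```shell
#!/bin/sh
# is_jdk DIR: succeed if DIR looks like a JDK home, i.e. it
# contains an executable bin/javac (a JRE does not ship javac).
is_jdk() {
    [ -n "$1" ] && [ -x "$1/bin/javac" ]
}

if is_jdk "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks like a JDK: $JAVA_HOME"
    # Assumption: the JDK is old enough to ship lib/tools.jar
    # (it was removed in JDK 9).
    [ -f "$JAVA_HOME/lib/tools.jar" ] || echo "note: lib/tools.jar not found"
else
    echo "JAVA_HOME is unset or does not contain javac: ${JAVA_HOME:-<unset>}"
fi
```

If the second message appears, point JAVA_HOME at a JDK installation and run the build again.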

Building Hive with different versions of Hadoop

When you run the ant package command, Ant by default downloads version 0.20.0 of Hadoop and builds Hive against it (see build.properties).

If you, like me, want to use Hive with your own version of Hadoop, specifically a newer one, you can pass the -Dhadoop.version flag to ant or change the hadoop.version property in build.properties.

Hive has an interface called Shims that was made exactly for this situation.

The Shims interface makes it possible to create your own Hadoop-compatible class, or to use one that is already provided.

Hive provides a 0.20 class, a 0.20 Secure (0.20S) class and a 0.23 class.

If you want to build Hive against Hadoop 1.x, use the 0.20S class. If you want to build it against Hadoop 2.x, use the 0.23 class.

With this 0.20S class, you can build against Hadoop 1.0.0 or newer.
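
The version-to-class mapping described above can be sketched as a small shell function. This is only an illustration of the rule in the text, not Hive's own selection logic: the function name shim_for_hadoop is made up, the default version 1.0.0 is an example value, and the sketch does not detect secure 0.20.x builds.

```shell
#!/bin/sh
# shim_for_hadoop VERSION: print the shim name that the text
# above associates with the given Hadoop version.
shim_for_hadoop() {
    case "$1" in
        0.20|0.20.*)      echo "0.20" ;;   # plain 0.20.x
        1.*)              echo "0.20S" ;;  # Hadoop 1.x
        0.23|0.23.*|2.*)  echo "0.23" ;;   # 0.23.x and Hadoop 2.x
        *)                echo "unknown"; return 1 ;;
    esac
}

# Example value; override it by exporting HADOOP_VERSION.
HADOOP_VERSION="${HADOOP_VERSION:-1.0.0}"

echo "Hadoop $HADOOP_VERSION -> shim $(shim_for_hadoop "$HADOOP_VERSION")"
echo "Build command: ant clean package -Dhadoop.version=$HADOOP_VERSION"
```

The script only prints the ant command; run that command from the Hive source directory to start the actual build.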

If you want to build only one interface, edit this line in the shims/build.xml file:

and set it to this instead:

It is not necessary to exclude the undesired interfaces, because Hive chooses at run time which interface to use, depending on the Hadoop version present in your classpath.

Using Hive

To start using Hive, run the Hive shell (after configuring the environment as shown above):

$ hive

In response, you should see the Hive shell prompt:

hive>

Now, you should be able to create your first table:

hive> CREATE TABLE my_test (id INT, test STRING);

Because no row format was specified, Hive assumes its default text format: one record per line, terminated by '\n', with fields separated by Ctrl-A ('\001').
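
To have something to load, you can create a small file in that default format. The sketch below is an example only: the three rows are invented, and the file name my_data is chosen to match the load command that follows.

```shell
#!/bin/sh
DATA_FILE="my_data"

# One record per line; the two fields (id, test) are separated
# by Ctrl-A (octal 001), Hive's default field delimiter.
printf '1\001first row\n2\001second row\n3\001third row\n' > "$DATA_FILE"

# Print the file with the delimiter made visible as '|'.
tr '\001' '|' < "$DATA_FILE"
```

Run it in the directory from which you start the Hive shell, so that the relative path 'my_data' resolves.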

hive> LOAD DATA LOCAL INPATH 'my_data' INTO TABLE my_test;

With this command, you loaded the file my_data from your local machine into Hive. You can now run queries over this data:

hive> SELECT * FROM my_test;