Monday, October 22, 2012

Introduction to Apache Hive

Hive is a distributed data warehouse system that runs on top of Apache Hadoop and enables analysis of huge amounts of data.


It provides its own query language, HiveQL (similar to SQL), for querying data on a Hadoop cluster. It can manage data in HDFS and run jobs in MapReduce without translating the queries into Java. The mechanism is explained below:

"When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an
XML file representing the “job plan.” In other words, these generic modules function
like mini language interpreters and the “language” to drive the computation is encoded
in XML."

This text was extracted from Programming Hive.

Hive was initially developed by Facebook to make it easier to run MapReduce jobs on a Hadoop cluster, since writing Java programs can be challenging for non-Java developers (and for some Java developers as well). Its language, HiveQL, provides a more approachable way to write MapReduce jobs.

Hive Architecture Overview

Hive can be accessed via a command-line interface and a web user interface. You can also use Hive through the provided JDBC or ODBC APIs. The Thrift server exposes an API for executing HiveQL statements from several languages (PHP, Perl, Python and Java).

The Metastore component is a system catalogue that contains metadata regarding tables, partitions and databases.

Most of the core operations happen in the Driver and Compiler components, which parse, optimize and execute queries.
At run time, the SQL statements are converted into a directed acyclic graph (DAG) of map/reduce jobs, and these jobs are run on the Hadoop cluster.
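
One way to look at the plan Hive produces is the HiveQL EXPLAIN statement, run through the hive command's -e option. The snippet below is only a sketch: it assumes a table named my_test already exists (like the one created in the "Using Hive" section), and it does nothing but print a message when hive is not on the PATH.

```shell
# Assumption: a table named my_test already exists in Hive.
query='EXPLAIN SELECT test, COUNT(*) FROM my_test GROUP BY test;'

if command -v hive >/dev/null 2>&1; then
  # Prints the stages of the plan instead of running the query.
  hive -e "$query"
else
  echo "hive not found on PATH; would run: $query" >&2
fi
```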

For more information about the Hive architecture take a look at the Facebook article about it.

Building

 Hive has some dependencies that you need to install before building it, such as ant, svn and Java. It also depends on Hadoop, HBase and ZooKeeper, but these packages are downloaded automatically by ivy. If you want to change the Hadoop version that Hive is built against, take a look at the last section of this post.
  • Install ant:
#yum install ant.x86_64

 (or use apt-get install ant if you are on a Debian-like system).
  • Download Hive:
$svn co http://svn.apache.org/repos/asf/hive/trunk hive
  • Set the environment variables (adjust the paths to your system):
$export JAVA_HOME=/usr/lib/jvm/java-1.6.0-ibm-1.6.0.11.0.x86_64/
$export HIVE_HOME=/my_hive_home
$export PATH=$HIVE_HOME/bin:$PATH
  • Build Hive with ant (from inside the hive checkout directory):
$ant clean package
  • Run hive:
$build/dist/bin/hive
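
Before starting the build, it can help to confirm that the tools mentioned in this section are actually installed. The following is a minimal sketch; missing_tools is a hypothetical helper name, not something shipped with Hive or Ant.

```shell
# Hypothetical helper: prints the name of every given command that is
# not found on the PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the build prerequisites named in this section.
missing=$(missing_tools ant svn java)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing build tools:" $missing >&2
else
  echo "All build tools found."
fi
```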


Troubleshooting


Problem:
ant java.lang.ClassNotFoundException: org.apache.tools.ant.taskdefs.optional.TraXLiaison

Solution:
Install the ant-trax package.

Problem:
 com.sun.tools.javac.Main is not on the classpath. Perhaps JAVA_HOME does not point to the JDK.

Ant needs to find tools.jar, which ships with the JDK, in order to run javac.

Solution:
Find the tools.jar library (you can use $locate tools.jar to find it) and make sure it is under your JAVA_HOME directory (in a JDK it lives in the lib subdirectory).
It may also be that your JAVA_HOME points to a JRE rather than a JDK.
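
The JAVA_HOME check described above can be sketched as a small shell test. looks_like_jdk is a hypothetical helper, and the rule it applies (a lib/tools.jar file or an executable bin/javac) is an assumption about what a JDK directory contains.

```shell
# Hypothetical helper: succeeds when the given directory looks like a JDK,
# i.e. it is non-empty and has lib/tools.jar or an executable bin/javac.
looks_like_jdk() {
  [ -n "$1" ] && { [ -f "$1/lib/tools.jar" ] || [ -x "$1/bin/javac" ]; }
}

if looks_like_jdk "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks like a JDK: ${JAVA_HOME:-}"
else
  echo "JAVA_HOME is unset or does not look like a JDK: ${JAVA_HOME:-<unset>}" >&2
fi
```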



Building Hive with different versions of Hadoop


When you run the ant package command, Ant by default downloads Hadoop version 0.20.0 and builds Hive against it (check build.properties).


If you, like me, want to use Hive with your own version of Hadoop, specifically a newer version,
you can pass the -Dhadoop.version flag or change the hadoop.version property in build.properties.
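
As a rough illustration, the snippet below reads the current default from build.properties and shows how the flag would be passed. get_property is a hypothetical helper that only handles plain key=value lines, and 1.0.3 is just an example version number.

```shell
# Hypothetical helper: prints the value of a key from a Java-style
# properties file (first match, plain "key=value" lines only).
get_property() {
  # Escape dots so that "hadoop.version" is matched literally.
  key=$(printf '%s' "$2" | sed 's/\./\\./g')
  sed -n "s/^$key=//p" "$1" | head -n 1
}

# Show the default, if build.properties is in the current directory.
if [ -f build.properties ]; then
  echo "default hadoop.version: $(get_property build.properties hadoop.version)"
fi

# To override the default for a single build, pass the flag to ant:
#   ant -Dhadoop.version=1.0.3 clean package
```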

You might want to know that Hive has an interface called Shims that was made exactly for this situation.
The Shims interface makes it possible for you to create your own Hadoop-compatible class, or to use one of those already provided.
Hive provides a 0.20 class, a 0.20 Secure (0.20S) class and a 0.23 class.

If you want to build Hive against Hadoop version 1.x, you should use the 0.20S class. If you want to build it against Hadoop 2.x, you should use the 0.23 class.

With the 0.20S class, you can build against Hadoop 1.0.0 or newer.

If you want to build only one interface, edit this line in the shims/build.xml file:

<property name="shims.include" value="0.20,0.20S,0.23"/>                      

and set this line instead:

<property name="shims.include" value="0.20S"/>                     


It is not necessary to exclude the undesired interfaces, because Hive chooses at run time which interface to use, depending on the Hadoop version present in your classpath.
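
If you still prefer to restrict the build, the edit to shims/build.xml described above can be scripted. This is only a sketch: set_shims_include is a hypothetical helper, and it assumes the property appears on a single line in exactly the form shown above.

```shell
# Hypothetical helper: rewrites the value of the shims.include property
# in the given build file, keeping a .bak copy of the original.
set_shims_include() {
  file=$1
  value=$2
  cp "$file" "$file.bak"
  sed "s|\(<property name=\"shims.include\" value=\"\)[^\"]*\"|\1$value\"|" \
    "$file.bak" > "$file"
}

# Apply it only when run from the top of a Hive source tree.
if [ -f shims/build.xml ]; then
  set_shims_include shims/build.xml 0.20S
fi
```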



Using Hive


To start using Hive, run the Hive shell (after configuring the environment as shown above):

 $ hive                                                         

In response, you should see the Hive shell prompt:

hive>                                                           

Now, you should be able to create your first table:

hive> CREATE TABLE my_test (id INT, test STRING);                

Because no row format was specified, the table is assumed to be in the default text format: one row per line, with each line terminated by '\n'. Next, load some data into it:

hive> LOAD DATA LOCAL INPATH 'my_data' INTO TABLE my_test;

This command loads the file my_data from your local machine into Hive. You can now run queries over this data:

hive> SELECT * FROM my_test;                                      
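
The same session can also be run non-interactively. The sketch below writes a small sample file and a script file, then passes the script to hive with the -f option. The file name my_test.hql and the sample rows are made up for illustration, and the rows use Ctrl-A (\001) as the field separator, which is Hive's default for text tables. The hive call is skipped when hive is not on the PATH.

```shell
# Write two sample rows; fields are separated by Ctrl-A (\001), rows by '\n'.
printf '1\001first row\n2\001second row\n' > my_data

# Put the statements from this section into a script file
# (my_test.hql is a hypothetical file name).
cat > my_test.hql <<'EOF'
CREATE TABLE my_test (id INT, test STRING);
LOAD DATA LOCAL INPATH 'my_data' INTO TABLE my_test;
SELECT * FROM my_test;
EOF

# Run the script only if the hive launcher is on the PATH.
if command -v hive >/dev/null 2>&1; then
  hive -f my_test.hql
else
  echo "hive not found on PATH; skipping" >&2
fi
```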

 

More Information

 

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide 

http://www.youtube.com/watch?v=U0r9s4iXwo0

http://vimeo.com/29732341

http://www.youtube.com/watch?v=Pn7Sp2-hUXE






