Understanding Apache Hive
Introduction
BigData and Hive
Apache Hive is a software application created to facilitate data
analysis on Apache Hadoop. It is a Java framework that helps extract
knowledge from data stored on an HDFS cluster by providing a SQL-like
interface to it.
The Apache Hadoop platform is a major project in distributed
computing and is widely regarded as one of the leading approaches to
BigData challenges.
It is now very well established that a great volume of data is
produced every day. Whether it comes from system logs or from user
purchases, the amount of information generated is such that existing
database and data warehouse solutions do not seem to scale well
enough.
The MapReduce programming paradigm was introduced by Google in 2004
as a new approach to processing large datasets. In 2005 its open
source counterpart, Hadoop, was created by Doug Cutting. Although
Hadoop is not meant to replace relational databases, it is a good
solution for big data analysis.
Hadoop facilitates large-scale data processing, but it still requires
skilled programmers to write the Map and Reduce functions that
analyze the data. Every analysis run through Hadoop has to be
condensed into these two functions. Applications of this kind can be
challenging to write and difficult to maintain, and data developers
coming from the relational world struggled to extract intelligence
from their data. Hive was created to overcome these issues.
Apache Hive
First introduced by Facebook and later
donated to the Apache Software Foundation, Hive is a data warehouse
interface for Hadoop. With Hive, users can write SQL-like statements
that are automatically converted into MapReduce jobs and run on a
Hadoop cluster.
Data can be inserted into or queried from the Hadoop cluster through
a command-line interface using statements from the Hive Query
Language, or HiveQL, such as SELECT, INSERT or CREATE TABLE. Users
can also create their own User Defined Functions (UDFs) by extending
the UDF class Hive already provides. Within these statements, tables
can be defined using primitive types such as integers, floating
points, strings, dates and booleans. Furthermore, new types can be
created by grouping these primitive types into maps and arrays.
Please check
https://cwiki.apache.org/Hive/languagemanual.html
for more information on HiveQL.
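As a rough sketch of what such statements look like (the table,
column and UDF names below are hypothetical, chosen only for
illustration), a table mixing primitive and complex types could be
created and queried like this:
-- hypothetical table combining primitive and complex types
CREATE TABLE page_views (
  user_id     INT,
  ip          STRING,
  visit_date  STRING,
  is_mobile   BOOLEAN,
  referrers   ARRAY<STRING>,
  properties  MAP<STRING, STRING>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- load a file from HDFS and run a simple aggregation
LOAD DATA INPATH '/user/hive/input/page_views.tsv' INTO TABLE page_views;
SELECT visit_date, COUNT(*) FROM page_views GROUP BY visit_date;

-- register a custom UDF packaged in a jar (the class name is made up)
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION normalize_ip AS 'com.example.hive.udf.NormalizeIp';
SELECT normalize_ip(ip) FROM page_views LIMIT 10;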
Although Hive presents a data warehouse interface for Hadoop, it is
still a batch-processing framework. Since Hive's data lives on
Hadoop, it is bound by Hadoop's constraints.
Hadoop does not index data, and it is not made for editing data in
place. There is no UPDATE in Hive, because that operation cannot be
performed on data stored in HDFS, and Hive does not support
transactions. If you need this kind of database on top of Hadoop, you
should look at options such as HBase. Check
http://wiki.apache.org/hadoop/HadoopIsNot
to read more about these Hadoop limitations.
Even so, Apache Hive made it possible for developers with basic SQL
knowledge to write complex, meaningful queries and quickly extract
value from big data.
Architecture
Users can start interacting with Hive through a Command Line
Interface (CLI), the Hive Web Interface (HWI), JDBC or ODBC.
The CLI is a command-line tool accessed through a terminal. It can be
started by calling the HIVE_HOME/bin/hive script inside the
downloaded Hive source code. Hive also provides a Hive server, so
that users can communicate with it over JDBC or ODBC.
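As a minimal illustration (the table name is the hypothetical one
used earlier), a CLI session could look like this:
$ HIVE_HOME/bin/hive
hive> SHOW TABLES;
hive> SELECT COUNT(*) FROM page_views;
hive> quit;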
When you type a query through the CLI, the HiveQL statement is
handled by the Driver component. The Driver coordinates a chain of
modules that transform the statement into MapReduce jobs to be run on
Hadoop. It is important to note that the query is not translated into
Java code in this process; it goes directly to MapReduce jobs. The
modules involved in this process are: Parser, Semantic Analyzer,
Logical Plan Generator, Optimizer, Physical Plan Generator and
Executor.
First, the Driver creates a session to keep track of details about
the process, such as dates and statistics.
Metadata (information about tables and columns) is collected and
stored in the Metastore as soon as the input tables are created. This
metadata is actually kept in a relational database and is later used
during semantic analysis.
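You can look at part of the metadata the Metastore keeps for a table
from the CLI with the DESCRIBE statement (again using the
hypothetical table from the earlier sketch); DESCRIBE EXTENDED prints
additional storage details:
hive> DESCRIBE page_views;
hive> DESCRIBE EXTENDED page_views;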
The Parser module uses ANTLR to parse the query. As in a compiler,
the statement is broken down into tokens and an Abstract Syntax Tree
(AST) is created.
The following HiveQL statement
FROM src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON
(src1.key + src2.key = src3.key)
INSERT OVERWRITE TABLE dest1 SELECT src1.key, src3.value
would become the following AST:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_JOIN (TOK_TABREF (TOK_TABNAME
src) src1) (TOK_TABREF (TOK_TABNAME src) src2) (= (.
(TOK_TABLE_OR_COL src1) key) (. (TOK_TABLE_OR_COL src2) key)))
(TOK_TABREF (TOK_TABNAME src) src3) (= (+ (. (TOK_TABLE_OR_COL src1)
key) (. (TOK_TABLE_OR_COL src2) key)) (. (TOK_TABLE_OR_COL src3)
key)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME dest1)))
(TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src1) key))
(TOK_SELEXPR (. (TOK_TABLE_OR_COL src3) value))))) null
The process continues with a semantic analysis of the generated AST.
The information provided in the query is checked for validity against
the schema information of the input tables, which is stored in the
Metastore component. Type checking is one example of the operations
performed by the Semantic Analyzer.
After this analysis, an operator tree is created by the Logical Plan
Generator, based on the AST and on the parsed information. This
operator tree is then passed to the Optimizer, which performs a set
of transformations to, not surprisingly, optimize the operations. The
improvements accomplished by the Optimizer include column pruning
(only the columns actually needed are fetched) and join reordering
(to make sure only small tables are kept in memory).
The Physical Plan Generator takes the optimized operator tree and
creates a Directed Acyclic Graph (DAG) of MapReduce jobs from it.
This physical plan is serialized to an XML file and finally handed to
the Executor to be run on the Hadoop cluster.
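You can inspect the plan Hive generates for a statement, without
actually running it, by prefixing the statement with the EXPLAIN
keyword; the output shows, roughly, the abstract syntax tree, the
stage dependencies and the plan of each stage. Applied to the join
query shown earlier:
EXPLAIN
FROM src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON
(src1.key + src2.key = src3.key)
INSERT OVERWRITE TABLE dest1 SELECT src1.key, src3.value;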
Hive and the different Hadoop versions
Hive can be built against Hadoop 1.x or Hadoop 2.x. It provides
interfaces for this purpose, and these interfaces are defined in the
Shims layer.
There are three Hadoop interfaces: 0.20, 0.20S and 0.23. 0.20 is
meant to work with Hadoop 1.x, 0.20S is for a secure version of
Hadoop 1.x and 0.23 is for building Hive against Hadoop 2.x.
You can prevent an interface from being built by editing the
shims.include property in HIVE_HOME/shims/build.xml:
<property name="shims.include" value="0.20,0.20S,0.23"/>
Hive uses a factory method to decide which Hadoop interface to use,
based on the version of Hadoop found on the classpath. This logic
lives in
HIVE_HOME/shims/src/common/java/org/apache/hadoop/hive/shims/ShimLoader.java,
while
HIVE_HOME/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java
encapsulates the interfaces.
The hadoop.version property is defined in HIVE_HOME/build.properties,
but you can override it with the -Dhadoop.version flag.
To build Hive with Hadoop 1.1.2, use
$ ant clean package -Dhadoop.version=1.1.2
To build Hive with Hadoop 2.0.4-alpha, use
$ ant clean package -Dhadoop.version=2.0.4-alpha
-Dhadoop-0.23.version=2.0.4-alpha -Dhadoop.mr.rev=2
Building
To use some Hive features, such as UDFs, or to adapt Hive's code to
your own needs, you might have to build Hive from source.
Before starting, make sure you have a Java JDK, Ant and Subversion
installed on your computer.
Then, start by downloading the latest stable release from the Hive
repository:
$ svn checkout
https://svn.opensource.ibm.com/svn/stg-hadoop/hive/0.11.0/trunk
hive-0.11.0
Enter your Hive home directory (which, from now on, we will call
HIVE_HOME):
$ cd hive-0.11.0
And finally build the code with:
$ ant package
This will automatically download and install all of the dependencies
Hive requires.
Hive depends on (or uses) other Hadoop-related components. As of
Hive 0.11, these are:
Apache Hadoop
Apache HBase
Apache Avro
Apache ZooKeeper
These components are automatically downloaded by Ant and Ivy when you
run the ant package command. You can check which version of each
component will be downloaded in
HIVE_HOME/ivy/libraries.properties
and, as explained in the previous section, the Hadoop version can be
checked here:
HIVE_HOME/build.properties
To list all of the Ant targets available for Hive, type:
$ ant -p
This shows how to build it, test it and even how to create a tar file
from the source. Testing is explained in more detail in the next
section.
Unit Tests
Hive provides several built-in unit tests to verify its own modules
and features. They are written with JUnit 4 and run queries (.q
files) already provided by the framework.
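A .q file is simply a script of HiveQL statements whose output is
compared against a stored expected result. A hypothetical minimal one
could contain:
-- hypothetical minimal .q test
CREATE TABLE t1 (key INT, value STRING);
DESCRIBE t1;
DROP TABLE t1;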
To create the JUnit classes execute:
$ ant package
To run the unit tests type:
$ ant test
To run a specific test run:
$ ant test -Dtestcase=TestCliDriver
To run a specific query inside one Unit Test run:
$ ant test -Dtestcase=TestCliDriver -Dqfile=alter5.q
The command described above produces an output that is compared with
Hive's expected output. It also generates an .xml log file, very
helpful for debugging purposes:
HIVE_HOME/build/ql/test/TEST-org.apache.hadoop.hive.cli.TestCliDriver.xml
If you are having trouble with a certain test case and are trying to
debug it, pay attention: some Java test files for Hive (all files
under the ql module) are generated at build time from Velocity
templates (.vm). If you want to modify these tests, you have to
change the .vm file, not the .java one.