Posts

Showing posts from October, 2012

Introduction to Apache Hive

Hive is a distributed data warehouse that runs on top of Apache Hadoop and enables analysis of huge amounts of data. It provides its own query language, HiveQL (similar to SQL), for querying data on a Hadoop cluster. It can manage data in HDFS and run jobs in MapReduce without translating the queries into Java. The mechanism is explained below: "When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs. Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the “job plan.” In other words, these generic modules function like mini language interpreters and the “language” to drive the computation is encoded in XML." This text was extracted from Programming Hive. Hive was initially developed by Facebook, with the intention of making it easier to run MapReduce jobs on a Hadoop cluster, since sometimes writing Java programs can be challenging for non-Java developers (and for some Java develope...
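To make the "mini language interpreter" idea from the quote a bit more concrete, here is a minimal Java sketch. It is not Hive's code: the class PlanDrivenMapper, the FilterStep type, and the idea of a plan made only of equality filters are all assumptions for illustration, and the plan is built in memory instead of being read from an XML file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names come from Hive.
class PlanDrivenMapper {

    // One step of a made-up "job plan": keep rows where column equals value.
    static final class FilterStep {
        final String column;
        final String value;

        FilterStep(String column, String value) {
            this.column = column;
            this.value = value;
        }
    }

    // A generic "mapper": it has no query-specific logic of its own and
    // only interprets the steps that the plan hands to it.
    static List<Map<String, String>> run(List<FilterStep> plan,
                                         List<Map<String, String>> rows) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : rows) {
            boolean keep = true;
            for (FilterStep step : plan) {
                if (!step.value.equals(row.get(step.column))) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

The point of the sketch is that changing the query means changing the plan passed to run, not writing or compiling a new Java class.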

Working and Testing with Ant

Recently I've been working with Hive and had some trouble working with Ant. For this reason, I bought the book "Ant in Action", and I'll share some things I've learned from it, along with some other experiences I had working with Ant. Ant is a tool written in Java for building Java projects. By build I mean compile, run, package, and even test the code. It is designed to help software teams develop big projects by automating tasks such as compiling code. To run Ant in your project you should have a build.xml file that describes how the project should be built. There should be one build file per project, unless the project is too big. In that case you might have subprojects with "sub build" files. These sub builds are coordinated by the main build file. The goals of the build are listed as targets. You can have, for example, a target init that creates the necessary initial directories. You can also have a target compile that depends on the previous ini...
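The way targets depend on each other can be sketched in a few lines of Java. This is only a model of the idea, not Ant's implementation: the class TargetOrder and its method executionOrder are hypothetical names, and the dependency map stands in for what a build.xml would declare. It resolves dependencies depth-first, so each target appears once, after everything it depends on.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of target ordering; not part of Ant.
class TargetOrder {

    // Returns the targets to run, in order, so that `target` comes last
    // and every dependency comes before the target that needs it.
    static List<String> executionOrder(Map<String, List<String>> dependsOn,
                                       String target) {
        Set<String> done = new LinkedHashSet<>();
        visit(dependsOn, target, done, new LinkedHashSet<>());
        return new ArrayList<>(done);
    }

    private static void visit(Map<String, List<String>> dependsOn,
                              String target,
                              Set<String> done,
                              Set<String> inProgress) {
        if (done.contains(target)) {
            return; // already scheduled, so it runs only once
        }
        if (!inProgress.add(target)) {
            throw new IllegalStateException(
                "Circular dependency at target: " + target);
        }
        for (String dep : dependsOn.getOrDefault(target, List.of())) {
            visit(dependsOn, dep, done, inProgress);
        }
        inProgress.remove(target);
        done.add(target);
    }
}
```

With a map saying that compile depends on init, asking for compile schedules init first and compile second, which matches the init and compile example above.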