Showing posts with the label Apache Hadoop

Apache Hadoop Admin Tricks and Tips

In this post I will share some tips I learned after using the Apache Hadoop environment for some years and giving many workshops and courses. The information here reflects Apache Hadoop around version 2.9, but it could probably be extended to other similar versions. These are considerations for when building or using a Hadoop cluster; some concern the Cloudera distribution specifically. Anyway, hope it helps!

- Don't use Hadoop for millions of small files. They overload the namenode and make it slower, and it is not difficult to overload the namenode, so always check capacity versus number of files. Files on Hadoop should usually be larger than 100 MB, and you should plan for roughly 1 GB of namenode memory for every million files (see the sketch after this list).
- Nodes usually fail after about 5 years, and node failures are among the most frequent problems in Hadoop. Big companies like Facebook and Google likely see node failures by the minute.
- The MySQL on Cloudera Manager does not have redunda...
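As a back-of-the-envelope check of the small-files rule above, here is a minimal Python sketch applying the rough 1-GB-per-million-objects heuristic; the block size, file counts, and file sizes below are made-up numbers for illustration, not measurements:

    # Rough namenode heap estimate from the "~1 GB per million objects" rule of thumb.
    # All numbers here are illustrative assumptions.
    BLOCK_SIZE_MB = 128  # a typical HDFS block size

    def namenode_heap_gb(num_files, avg_file_mb):
        """Estimate namenode heap (GB): each file plus each of its blocks is one object."""
        blocks_per_file = max(1, -(-avg_file_mb // BLOCK_SIZE_MB))  # ceiling division
        objects = num_files * (1 + blocks_per_file)
        return objects / 1_000_000  # ~1 GB of heap per million objects

    # The same ~10 TB of data as many tiny files vs. fewer block-sized files:
    print(namenode_heap_gb(num_files=100_000_000, avg_file_mb=0.1))  # ~200 GB of heap
    print(namenode_heap_gb(num_files=80_000, avg_file_mb=128))       # well under 1 GB

The point of the two calls: the same volume of data costs the namenode wildly different amounts of memory depending on file size, which is why small files hurt.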

BigData White Papers

I don't know about you, but I always like to read the white papers that originated OpenSource projects (when available, of course :) ). I have been working with BigData quite a lot lately, and this area is mostly dominated by Apache OpenSource projects. So, naturally (given the nerd that I am), I tried to investigate their history. I created a list of the articles and companies that originated most BigData Apache projects. Here it is! Hope you guys find it interesting too. :)

Apache Hadoop
Based on: Google MapReduce and GFS
Papers:
https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

Apache Spark
Created by: University of California, Berkeley
Papers:
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
http://peo...

Running k-Means Clustering on Spark with Cloudera in your Machine

Here are some steps to start using Spark. You can download VirtualBox and a Cloudera Hadoop distribution and start testing Spark locally on your machine.

Steps:
1 - Download the kmeans.py example that uses MLlib, shipped with Spark (a sketch of what it does follows these steps).
2 - Create a kmeans_data.txt file that looks like this:

    0.0 0.0 0.0
    0.1 0.1 0.1
    0.2 0.2 0.2
    9.0 9.0 9.0
    9.1 9.1 9.1
    9.2 9.2 9.2

3 - Download VirtualBox.
4 - Download the Cloudera CDH5 trial version.
5 - Open VirtualBox, import the downloaded Cloudera virtual machine, and run it.

Inside the virtual machine:
1 - (needs internet access) Install the Python numpy library. In a terminal, type:
    $ sudo yum install numpy
2 - Copy kmeans_data.txt and kmeans.py to /home/cloudera/ (or wherever you want).
3 - Launch the Cloudera Enterprise Trial by clicking its icon on the Cloudera desktop, or run this command:
    $ sudo cloudera-manager --force --enterprise
4 - Open the Cloudera Manager web interface in your browser. Here are the credentials for that: user:...
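For reference, the bundled kmeans.py is roughly along these lines. This is a minimal sketch of the RDD-based Spark MLlib API, with an assumed local path and k=2, not the exact file:

    # Minimal k-means on Spark MLlib, roughly what the bundled kmeans.py does.
    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="KMeansExample")

    # Each line of kmeans_data.txt is one point: "x y z"
    data = sc.textFile("file:///home/cloudera/kmeans_data.txt")
    points = data.map(lambda line: array([float(x) for x in line.split()]))

    # The sample data has two well-separated groups, so k=2
    model = KMeans.train(points, 2, maxIterations=10, initializationMode="random")
    print("Cluster centers: %s" % model.clusterCenters)

    sc.stop()

You would run it from a terminal inside the virtual machine with something like: $ spark-submit kmeans.py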

Genetic Algorithm for Knapsack using Hadoop

Development of a Genetic Algorithm using the Apache Hadoop framework to solve optimization problems

Introduction

This project, which I developed during a course in my Master's, constructs a genetic algorithm to solve optimization problems, focusing on the Knapsack Problem. It builds on the distributed framework Apache Hadoop. The idea is to show that the MapReduce paradigm implemented by Hadoop is a good fit for several NP-Complete optimization problems. Like knapsack, many such problems present a simple structure and converge to optimal solutions given a proper amount of computation.

Genetic Algorithm

The algorithm follows the genetic paradigm. It starts with an initial random population (random instances of the problem). Then, the best individuals are selected from the population (the instances that yield the best profits for the knapsack). A crossover phase was then implemented to generate new instances as combinations of the selected indi...
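To make these phases concrete, here is a toy single-machine Python sketch of the loop just described (random population, selection by knapsack profit, crossover, mutation). The weights, values, and parameters are made up, and unlike the actual project it does not distribute work over Hadoop:

    import random

    # Toy knapsack instance; an individual is a 0/1 list marking packed items.
    WEIGHTS = [12, 7, 11, 8, 9, 6, 5, 14]
    VALUES = [24, 13, 23, 15, 16, 9, 7, 30]
    CAPACITY = 35

    def fitness(individual):
        """Knapsack profit, or 0 when the packing exceeds capacity."""
        weight = sum(w for w, bit in zip(WEIGHTS, individual) if bit)
        value = sum(v for v, bit in zip(VALUES, individual) if bit)
        return value if weight <= CAPACITY else 0

    def crossover(a, b):
        """Single-point crossover producing one child."""
        point = random.randrange(1, len(a))
        return a[:point] + b[point:]

    def mutate(individual, rate=0.05):
        """Flip each bit with a small probability."""
        return [bit ^ (random.random() < rate) for bit in individual]

    population = [[random.randint(0, 1) for _ in WEIGHTS] for _ in range(30)]
    for generation in range(50):
        # Selection: keep the fittest half of the population
        population.sort(key=fitness, reverse=True)
        survivors = population[:len(population) // 2]
        # Crossover and mutation refill the population with new instances
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children

    best = max(population, key=fitness)
    print("Best packing:", best, "profit:", fitness(best))

A natural MapReduce split (not necessarily the exact one the project uses) is to evaluate fitness in map tasks and perform the selection in the reducer.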

Understanding Apache Hive

Introduction

BigData and Hive

Apache Hive is a software application created to facilitate data analysis on Apache Hadoop. It is a Java framework that helps extract knowledge from data stored on an HDFS cluster by providing a SQL-like interface to it. The Apache Hadoop platform is a major project in distributed computing, and it is commonly assumed to be the best approach when dealing with BigData challenges. It is by now very well established that great volumes of data are produced every day. Whether from system logs or from user purchases, so much information is generated that previously existing database and data-warehouse solutions don't seem to scale well enough. The MapReduce programming paradigm was introduced in 2004 as a new approach to processing large datasets. In 2005 its OpenSource version, Hadoop, was created by Doug Cutting. Although Hadoop is not meant to substitute relational databases, it is a good solution for big...
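As a small taste of that SQL-like interface: the query below is plain HiveQL, issued from Python through the third-party PyHive library. The host, port, and the purchases table are illustrative assumptions of this sketch; the hive CLI or beeline would work just as well:

    # Querying Hive from Python via the third-party PyHive library.
    # Host/port and the 'purchases' table are made-up assumptions.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)  # HiveServer2 defaults
    cursor = conn.cursor()

    # Plain SQL-like HiveQL; Hive turns this into jobs over data on HDFS.
    cursor.execute("""
        SELECT user_id, SUM(amount) AS total
        FROM purchases
        GROUP BY user_id
        ORDER BY total DESC
        LIMIT 10
    """)
    for row in cursor.fetchall():
        print(row)

    cursor.close()
    conn.close()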