Thursday, May 24, 2018

Apache Hadoop Admin Tricks and Tips

In this post I will share some tips I learned after using the Apache Hadoop environment for several years and teaching many workshops and courses. The information here targets Apache Hadoop around version 2.9, but most of it can definitely be extended to other similar versions.

These are considerations for building or using a Hadoop cluster. Some apply specifically to the Cloudera distribution. Anyway, hope it helps!

  • Don't use Hadoop for millions of small files. They overload the namenode and slow it down, and it is not difficult to overload the namenode. Always check namenode capacity vs. the number of files. Files on Hadoop should usually be larger than 100 MB.
  • Plan for about 1 GB of namenode memory for every ~1 million files.
  • Nodes usually fail after around 5 years. Node failure is one of the most frequent problems in Hadoop; big companies like Facebook and Google probably see node failures every minute.
  • The MySQL database behind Cloudera Manager does not have redundancy. This can be a single point of failure.
  • Information: the merging of the fsimage with the edit log (checkpointing) happens on the secondary namenode.
  • Hadoop can cache blocks to improve performance. By default, the cache size is 0.
  • You can set a parameter so that datanodes send an acknowledgment back to the client after only the first or second block replica has been written. That can make writes faster.
  • Hadoop has rack awareness: it knows which node is connected to which switch. Actually, it is the Hadoop admin who configures that.
  • Files are checked from time to time (usually every three weeks) to detect data corruption. This is possible because datanodes store file checksums.
  • Log files are kept for 7 days by default.
  • part-m-* files come from mappers and part-r-* files from reducers. The number at the end is the task index, which starts at 0, so a job that produced part-r-00008 ran at least 9 reducers.
  • You can raise the log level of mapper and reducer tasks to get more information, e.g. mapreduce.reduce.log.level=DEBUG.
  • The YARN web UI shows what Spark did; the Spark UI at localhost:4040 also shows what has been done.
  • It is important to choose carefully where to put the namenode fsimage file. You might want to replicate this file.
  • You should reserve a good chunk of disk space (around 25%) via dfs.datanode.du.reserved for the shuffle phase: the shuffle data is written to disk, so there must be space for it!
  • When you remove files, they stay in the .Trash directory for a while. The default retention is 1 day.
  • You can build a lambda architecture with Flume (consume data one way and also save it to disk, for example).
  • Regarding hardware, worker nodes need more cores for more processing. The master nodes don't process that much.
  • For the namenode you want better-quality disks and better hardware (like RAID; RAID makes no sense on worker nodes).
  • The rule of thumb is: to store 1 TB of data you need about 4 TB of raw disk space.
  • Hadoop applications are typically not CPU bound.
  • Virtualization might give you some benefits (easier management), but it hurts performance, usually adding between 5% and 30% of overhead.
  • Hadoop does not support IPv6, so you can disable IPv6. You can also disable SELinux inside the cluster. Both add overhead when enabled.
  • A good size for a starting cluster is around 6 nodes.
  • Sometimes, when the cluster is too full, you might have to remove a small file before you can remove a bigger one.
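Two of the rules of thumb above can be turned into a quick capacity check. This is just a back-of-the-envelope sketch (1 GB of namenode memory per ~1 million files; 3x replication plus ~25% reserved scratch space, which is roughly where the "4 TB raw per 1 TB of data" rule comes from); the function names are mine, not from any Hadoop tool:

```python
def namenode_heap_gb(files_in_millions):
    """Rule of thumb: ~1 GB of namenode heap per ~1 million files."""
    return 1.0 * files_in_millions

def raw_disk_tb(data_tb, replication=3, reserved_fraction=0.25):
    """Raw disk needed: replication factor plus reserved scratch space (shuffle, temp)."""
    return data_tb * replication * (1 + reserved_fraction)

print(namenode_heap_gb(50))  # 50 million files -> 50.0 GB of heap
print(raw_disk_tb(1))        # 1 TB of data -> 3.75 TB raw, close to the 4 TB rule
```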

That is it for now. I will try to write a part 2 soon. Let me know if there is anything I missed here!

Friday, November 10, 2017

BigData White Papers

I don't know about you, but I always like to read the white papers that originated OpenSource projects (when available, of course :) ).

I have been working with BigData quite a lot lately and this area is mostly dominated by Apache OpenSource projects.

 So, naturally (given the nerd that I am) I tried to investigate their history. I created a list of articles and companies that originated most BigData Apache projects.

Here it is! Hope you guys find it interesting too. :)

Apache Hadoop 

Based on: Google MapReduce and GFS 

Apache Spark 

Created by: University of California, Berkeley 

Apache Hive 

Created by: Facebook

Apache Impala 

Based on: Google F1

Apache HBase

Based on: Google BigTable

Apache Drill 

Based on: Google Dremel

Apache Pig 

Created by: Yahoo!

Apache Oozie 

Created by: Yahoo!

Apache Sqoop 

Started by Aaron Kimball as a contrib module for Apache Hadoop.

Apache Flume

Created by: Cloudera

Friday, August 11, 2017

Deep Learning, TensorFlow and Tensor Core

I was lucky enough to get a ticket to Google I/O 2017 through Google Code Jam for Women (for those who don't know, Google runs a programming contest for women, and the best-ranked participants win tickets to the conference).

One of the main topics of the conference was, for sure, its Deep Learning library TensorFlow. TensorFlow is Google's OpenSource Machine Learning library that runs on both CPU and GPU.

Two very cool things were presented at Google I/O:

  •  TPU (Tensor Processing Unit) - a processor optimized specifically for TensorFlow workloads that can be used on the Google Cloud
  •  TensorFlow Lite - a lightweight TensorFlow version to run on Android and make developers' lives easier

Last week, at a BigData meetup in Chicago, I discovered that Nvidia also created dedicated GPU hardware for Deep Learning processing: the Tensor Core.

With all this infrastructure and these APIs becoming available, Deep Learning gets considerably easier and faster. At Google I/O, Sundar Pichai mentioned that at Google they have been using Machine Learning for almost everything, and even Deep Learning to help train Deep Learning networks!

TensorFlow's API is so high level that even someone with little technical background can develop something interesting with it. Sundar also shared a story of a high school student who used the library to help detect some types of cancer.

It seems that Data Science is becoming attainable.

Wednesday, August 2, 2017

Dummy errors when using neuralnet package in R

Ok, so you read a bunch of stuff on how to build Neural Networks and how many layers or nodes you should add, etc. But when you start to implement an actual Neural Network you face a ton of dummy errors that stop your beautiful inspirational programming.

This post talks about some errors you might face when using the neuralnet package in R.

First, remember, to use the package you should install it:

install.packages("neuralnet")

and then run

library(neuralnet)

to load the package.

Error 1

One error that might happen training your neural network is this:

nn <- neuralnet(formula1,data=new_data, hidden=c(5,3))

Error in terms.formula(formula) : invalid model formula in ExtractVars

This happens when the names of the variables in formula "formula1" are in an unsupported format. For example, if you named your columns (or variables) as numbers you would get this error. So change your column names and re-run the model!


label ~ 1 + 2 + 3 + 4 + 5

Change to:

label ~ v1 + v2 + v3 + v4 + v5

Error 2

Another error you might get is the following:

nn <- neuralnet(f, data=train[,-1], hidden=c(3,3))

Warning message:  algorithm did not converge in 1 of 1 repetition(s) within the stepmax

To solve this, you can increase the "stepmax" parameter:

nn <- neuralnet(f, data=train[,-1], hidden=c(3,3), stepmax=1e6)

If that doesn't work, you might have to change other parameters to make it converge: try reducing the number of hidden nodes or layers, or changing your training data size.

Error 3

The third error I want to discuss happens when actually computing the output of the neural network:

net.compute <- compute(net, matrix.train2[,1:10])

Error in neurons[[i]] %*% weights[[i]] : non-conformable arguments

This error occurs when the number of columns in the data frame you are using to predict differs from the number of columns used to train the neural network. The data frames used in neuralnet and compute must have the same columns, with the same names!

That is it! If you faced any other dummy error with the neuralnet package, send it to me and I can add it to the post! Good luck! :D

Tuesday, November 8, 2016

Running k-Means Clustering on Spark with Cloudera in your Machine

Here are some steps to start using Spark. You can download VirtualBox and a Cloudera Hadoop distribution and start testing Spark locally on your machine.


Download the k-means example that uses MLlib, provided by Spark.

Create a kmeans_data.txt file that looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

Download VirtualBox.

Download Cloudera CDH5 trial version.
Open VirtualBox, import the downloaded Cloudera's Virtual Box and run it.

Inside VirtualBox:

1 - (needs internet access) Install python numpy library. In a terminal, type:

$ sudo yum install numpy

2 - Copy kmeans_data.txt to /home/cloudera/ (or wherever you want)

3 - Launch Cloudera Enterprise Trial by clicking on an icon on Cloudera's Desktop or run this command:

$ sudo cloudera-manager --force --enterprise

4 - Open Cloudera Manager Webinterface on your browser. Here are the credentials for that:

user: cloudera
password: cloudera

5 - Start HDFS on ClouderaManager Webinterface (on your browser)

6 - Start Spark on ClouderaManager Webinterface (on your browser)

7 - Put the kmeans_data.txt into HDFS. Run:

$ hadoop fs -put kmeans_data.txt

8 - Run the Spark job locally with 2 threads. Here kmeans.py is the MLlib example script downloaded earlier:

$ spark-submit --master local[2] kmeans.py

9 - Get the result from HDFS and put it in your current directory:

$ hadoop fs -get KMeansModel/*

10 - The result will be stored in Parquet format. Read it with parquet-tools:

$ parquet-tools cat KMeansModel/data/part-r-000..


Small note: while running these steps, errors might appear in some parts of the process due to initialization timing issues. I know that is annoying advice, but if that happens just try running the command again in a couple of minutes. Also, you have to change the location of the kmeans_data.txt file inside the script to point to your data, and maybe also change where the output will be written (target/org/apache/spark/PythonKMeansExample/KMeansModel).

Monday, May 16, 2016

Error when using smooth.spline

When trying to interpolate a series of data, the cubic spline is a great technique to use.
I chose the smooth.spline function, from the R stats package.

> smooth.spline(data$x, data$y)

Nevertheless, while running smooth.spline on a collection of datasets with different sizes I got the following error:

Error in smooth.spline(data$x, data$y) :
  'tol' must be strictly positive and finite

After digging a little I discovered that the problem was that some datasets were really small, so smooth.spline couldn't compute anything.
Hence, make sure your dataset is big enough before applying smooth.spline to it.

> if(length(data$x) > 30) { smooth.spline(data$x, data$y) }


A more general solution is to check that the interquartile range of x is positive, since smooth.spline derives its default 'tol' from IQR(data$x):

> if(IQR(data$x) > 0) { smooth.spline(data$x, data$y) }

Saturday, September 26, 2015

Working with Big Datasets in R

When dealing with a significant amount of data in R there are some points to consider.

How do I know if my data is too big?

Well, the term "BigData" can be thought of as data that is too big to fit in the available memory.

As R works with the entire dataset in memory (unless you specify it not to do so), the first thing is to check how large is the dataset in question, and if it does fit in memory.

Remember that you should actually have at least double the memory of the size of your dataset.
So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory.
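A quick back-of-the-envelope check helps here: a numeric (double) value in R takes 8 bytes, so you can estimate a dataset's in-memory size before loading it. The row and column counts below are just illustrative:

rows <- 2e6                          # 2 million rows
cols <- 100                          # 100 numeric columns
size_gb <- rows * cols * 8 / 2^30    # 8 bytes per double
size_gb                              # ~1.5 GB in memory
2 * size_gb                          # ~3 GB needed with the "double the memory" rule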

If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately.

You can use the command split to do this in Linux:

split -l 10000 file.txt new_file

This should create several new files (new_fileaa, new_fileab, etc.) with ten thousand lines each.

Well, once you know your data will fit into memory, you can read it with the commands read.table or read.csv. The difference between them is that read.csv sets the parameter sep (the separator) to ",".

If your data does fit in memory but occupies almost the entire available space, there are some parameters you can tune to make R faster.

We know that not all parameters are mandatory when calling the read.table command. When we leave some parameters blank, R tries to discover them automatically. Setting them up front spares R that computation, which for large datasets can save considerable time.
Some of these parameters are:

  • comment.char - define the comment character in your text. If there are none, you can set it to the empty string ""

  • colClasses - define the class of each column of your data.frame. If they are all numeric, for example, just put "numeric"

If colClasses is not specified, all columns are read as characters and then converted to the appropriate class.
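Putting these parameters together, a read.csv call for a hypothetical comma-separated file data.csv with all-numeric columns (the file name is just for illustration) could look like this:

df <- read.csv("data.csv",
               comment.char = "",       # no comment character to scan for
               colClasses = "numeric")  # every column is numeric, skip type detection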
