Showing posts from May, 2018

Apache Hadoop Admin Tricks and Tips

In this post I will share some tips I learned after using the Apache Hadoop environment for some years and teaching many workshops and courses. The information here assumes Apache Hadoop around version 2.9, but it probably applies to other similar versions as well. These are things to keep in mind when building or using a Hadoop cluster. Some apply specifically to the Cloudera distribution. Anyway, hope it helps!

Don't use Hadoop for millions of small files. They overload the NameNode and make it slower, and it is not difficult to overload the NameNode. Always check the NameNode's capacity against the number of files. Files on Hadoop should usually be larger than 100 MB. You need about 1 GB of NameNode memory for around 1 million files.

Nodes usually fail after 5 years. Node failures are one of the most frequent problems in Hadoop. Big companies like Facebook and Google probably see node failures every minute.

The MySQL on Cloudera Manager does not have redundancy. This could