Thursday, May 24, 2018

Apache Hadoop Admin Tricks and Tips

In this post I will share some tips I learned after using the Apache Hadoop environment for some years and running many workshops and courses. The information here applies to Apache Hadoop around version 2.9, but it could definitely be extended to other similar versions.

These are considerations for building or using a Hadoop cluster. Some are specific to the Cloudera distribution. Anyway, hope it helps!

  • Don't use Hadoop for millions of small files. Every file takes up namenode memory, so lots of small files overload the namenode and make it slower, and it is not difficult to overload it. Always check namenode capacity against the number of files. Files on Hadoop should usually be larger than 100 MB.
  • As a rule of thumb, the namenode needs about 1 GB of memory for every 1 million files.
  • Nodes usually fail after about 5 years. Node failures are one of the most frequent problems in Hadoop. Big companies like Facebook and Google probably see node failures every minute.
  • The MySQL on Cloudera Manager does not have redundancy. This could be a point of failure.
  • The merging of the fsimage file with the edit log (checkpointing) happens on the secondary namenode.
  • Hadoop can cache blocks in datanode memory to improve performance. By default the cache size is 0, so nothing is cached.
  • You can set a parameter so that the datanodes send the acknowledgment back to the namenode after only the first or second copy of a data block has been written, rather than after all replicas. That might make writing data faster.
  • Hadoop has rack awareness: it knows which node is connected to which switch. Actually, it is the Hadoop admin who configures that.
  • Stored data is checked from time to time for corruption (by default every three weeks). This is possible because datanodes store checksums for the data they hold.
  • Log files are kept for 7 days by default.
  • Output files named part-m-00000 come from mappers and files named part-r-00000 come from reducers. The number at the end is the index of the task that wrote the file, starting from 0. So a job whose last output file is part-r-00008 ran 9 reducers.
  • You can change the log level of mapper and reducer tasks to get more information, for example mapreduce.reduce.log.level=DEBUG (mapreduce.map.log.level is the mapper equivalent).
  • The YARN web UI shows what Spark did. While an application is running, the Spark UI at localhost:4040 also shows what has been done.
  • It is important to decide where the namenode stores its fsimage file. You might want to replicate this file to more than one location.
  • You have to reserve a lot of disk space (around 25%) for non-HDFS use with dfs.datanode.du.reserved, mainly for the shuffle phase. Shuffle data is written to local disk, so there needs to be space!
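  • When you remove files, they stay in the .Trash directory for a while before being deleted for good. On Cloudera the default time is 1 day (the setting is fs.trash.interval; plain Apache Hadoop ships with trash disabled).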
  • You can build a lambda architecture with Flume (for example, consume the data as a stream on one path and save it to disk on another).
  • Regarding hardware, worker nodes need more cores for more processing. The master nodes don't process that much.
  • For the namenode you want higher-quality disks and better hardware (like RAID; RAID makes no sense on worker nodes, since HDFS already replicates the blocks).
  • The rule of thumb is: if you want to store 1 TB of data, you need 4 TB of disk space (three replicas plus room for temporary data).
  • Hadoop applications are typically not CPU bound.
  • Virtualization might give you some benefits (easier to manage), but it hurts performance. It usually adds between 5% and 30% overhead.
  • Hadoop does not support IPv6, so you can disable it. You can also disable SELinux inside the cluster. Both add overhead.
  • A good size for a starting cluster is around 6 nodes.
  • Sometimes, when the cluster is too full, you might have to remove a small file before you can remove a bigger one.
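
The sizing rules of thumb from the list (about 1 GB of namenode memory per million files, and 4 TB of disk per 1 TB of data) can be turned into a quick back-of-the-envelope calculation. This is only a sketch: the function names are made up for illustration, the constants are the rough figures from this post rather than official Hadoop numbers, and the 128 MB target file size is an assumption based on the default HDFS block size in Hadoop 2.x.

```python
import math

# Rules of thumb from this post (rough figures, not official Hadoop numbers).
GB_HEAP_PER_MILLION_FILES = 1.0   # namenode memory per ~1 million files
RAW_TB_PER_DATA_TB = 4.0          # raw disk needed per 1 TB of data

# Assumption: pack small files into files of about one HDFS block (128 MB).
TARGET_FILE_BYTES = 128 * 1024**2


def namenode_heap_gb(num_files):
    """Estimate namenode memory (GB) for a given number of files."""
    return num_files / 1_000_000 * GB_HEAP_PER_MILLION_FILES


def raw_storage_tb(data_tb):
    """Estimate raw cluster disk (TB) needed to store data_tb of data."""
    return data_tb * RAW_TB_PER_DATA_TB


def files_if_packed(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Number of files the same data would need if packed into larger files."""
    return math.ceil(total_bytes / target_file_bytes)


if __name__ == "__main__":
    print("Heap for 50 million files (GB):", namenode_heap_gb(50_000_000))
    print("Raw disk for 10 TB of data (TB):", raw_storage_tb(10))
    print("Files needed for 1 TiB when packed:", files_if_packed(1024**4))
```

Comparing the file count before and after packing gives a feeling for how much namenode memory a pile of small files is costing you.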
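
The naming of MapReduce output files (part-m-NNNNN for mappers, part-r-NNNNN for reducers, numbered from 0) makes it possible to work out how many tasks wrote output just from a directory listing. Here is a small sketch of that idea. The function name is made up for illustration, and it assumes that every task wrote an output file, which may not hold for every output format.

```python
import re

# Matches MapReduce output names such as part-m-00000 or part-r-00008.
PART_FILE = re.compile(r"^part-(?P<kind>[mr])-(?P<index>\d+)$")


def count_tasks(filenames):
    """Return (mappers, reducers) inferred from the highest task index seen."""
    highest = {"m": -1, "r": -1}
    for name in filenames:
        match = PART_FILE.match(name)
        if match:
            kind = match.group("kind")
            highest[kind] = max(highest[kind], int(match.group("index")))
    # Indices start at 0, so the task count is the highest index plus one.
    return highest["m"] + 1, highest["r"] + 1


if __name__ == "__main__":
    listing = ["_SUCCESS", "part-r-00000", "part-r-00008"]
    mappers, reducers = count_tasks(listing)
    print("mapper outputs:", mappers, "reducer outputs:", reducers)
```

Files that do not match the pattern, such as _SUCCESS, are simply ignored.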
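
Rack awareness is typically configured by pointing the net.topology.script.file.name property at a script that receives node addresses as arguments and prints one rack path per address. A minimal sketch of such a script follows. The IP addresses and rack names are made-up examples, not values from any real cluster, and a real script would more likely read the mapping from a file.

```python
import sys

# Hypothetical mapping from datanode IP address to rack path.
RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
}

# Rack used for any node that is not listed above.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more addresses as command-line arguments.
    print(" ".join(resolve(sys.argv[1:])))
```

Unknown nodes fall back to a single default rack, so a forgotten entry does not break the script, it just loses the rack information for that node.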

That is it for now. I will try to write a part 2 soon. Let me know if there is anything I missed here!


  1. Great Post, Renata! I'm not into Hadoop yet, but it seems like these insights could save me from some painful experiences in the future. I'll keep this one bookmarked.
