Apache Hadoop Admin Tricks and Tips

In this post I will share some tips I learned after using the Apache Hadoop environment for some years and running many workshops and courses. The information here targets Apache Hadoop around version 2.9, but most of it probably extends to other similar versions.

These are considerations for building or operating a Hadoop cluster; some are specific to the Cloudera distribution. Anyway, hope it helps!

  • Don't use Hadoop for millions of small files. They overload the namenode and make it slower, and it is not difficult to overload the namenode. Always check capacity vs. number of files; files on HDFS should usually be larger than 100 MB (see the file-count check after this list).
  • Plan for roughly 1 GB of namenode memory per 1 million files.
  • Nodes usually fail after about 5 years. Node failure is one of the most frequent problems in Hadoop; big companies like Facebook and Google probably see node failures every minute.
  • The MySQL database behind Cloudera Manager has no redundancy by default, so it can be a single point of failure.
  • Information: the merging of the fsimage and edit log (checkpointing) happens on the secondary namenode.
  • Hadoop can cache blocks in memory to improve read performance, but by default it caches nothing (the cache size is 0). See the caching example after this list.
  • You can set a parameter so that a write is acknowledged after only the first or second replica of a block has been copied to the datanodes. That might make writing data faster.
  • Hadoop has rack awareness: it knows which node is connected to which switch. Actually, it is the Hadoop admin who configures that (see the topology script sketch after this list).
  • Files are checked from time to time to verify whether there was any data corruption (usually every three weeks). This is possible because datanodes store file checksums.
  • Log files are kept for 7 days by default.
  • part-m-* files come from map tasks and part-r-* files from reduce tasks. The number at the end is the task index, so a job whose last output file is part-r-00008 ran 9 reducers (numbering starts from 0).
  • You can change the log level of mapper and reducer tasks to get more information (see the job example after this list), for example:
  • mapreduce.reduce.log.level=DEBUG
  • The YARN ResourceManager UI shows what Spark did; the Spark UI at localhost:4040 also shows what has been executed while the application is running.
  • It is important to check where to put the namenode fsimage file. You might want to replicate this file (see the dfs.namenode.name.dir example after this list).
  • You should reserve a good amount of disk space (around 25%) via dfs.datanode.du.reserved for non-HDFS use such as the shuffle phase. The shuffle output is written to local disk, so there needs to be space (see the example after this list).
  • When you remove files, they stay in the .Trash directory for a while; the default retention is 1 day (see the trash example after this list).
  • You can build a lambda architecture with Flume (consume data one way and also persist it to disk, for example).
  • Regarding hardware, worker nodes need more cores because they do most of the processing; the master nodes don't process that much.
  • For the namenode you want higher-quality disks and better hardware (like RAID; RAID makes no sense on worker nodes because HDFS already replicates blocks across them).
  • The rule of thumb is: to store 1 TB of data you need about 4 TB of raw space (3x replication plus room for temporary and intermediate data).
  • Hadoop applications are typically not CPU bound.
  • Virtualization might give you some benefits (easier to manage), but it hurts performance, usually adding between 5% and 30% overhead.
  • Hadoop does not support IPv6, so you can disable IPv6. You can also disable SELinux inside the cluster. Both add overhead.
  • A good size for a starting cluster is around 6 nodes.
  • Sometimes, when the cluster is too full, you might have to remove a small file before you can remove a bigger one.
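
A quick way to sanity-check the small-files tip is to compare file counts against total size per directory. A minimal sketch; the /data path is only an example:

    # Directory count, file count and total size for a path (path is illustrative)
    hdfs dfs -count -h /data

    # Overall namespace summary, including total files and blocks
    hdfs fsck / | tail -n 20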
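For the block-caching tip, HDFS centralized cache management only works if the datanodes are allowed to lock some memory and a cache directive is added. A minimal sketch; the pool name, path and 1 GB value are assumptions:

    <!-- hdfs-site.xml: let each datanode lock up to 1 GB of RAM for caching -->
    <!-- (the datanode user's memlock ulimit must also allow this) -->
    <property>
      <name>dfs.datanode.max.locked.memory</name>
      <value>1073741824</value>
    </property>

    # Create a cache pool and cache a hot dataset (names are illustrative)
    hdfs cacheadmin -addPool hot-data
    hdfs cacheadmin -addDirective -path /data/hot -pool hot-data
    hdfs cacheadmin -listDirectives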
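Rack awareness is configured by pointing Hadoop to a topology script in core-site.xml. A minimal sketch; the script path, IP ranges and rack names are assumptions:

    <!-- core-site.xml -->
    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>

    #!/bin/bash
    # topology.sh: print one rack path for each host/IP passed as an argument
    for node in "$@"; do
      case "$node" in
        10.0.1.*) echo "/rack1" ;;
        10.0.2.*) echo "/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done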
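The log-level properties can also be passed per job on the command line, assuming the driver uses ToolRunner/GenericOptionsParser; the jar, class and application id below are placeholders:

    # Run one job with DEBUG logging on map and reduce tasks
    hadoop jar my-job.jar com.example.MyDriver \
      -D mapreduce.map.log.level=DEBUG \
      -D mapreduce.reduce.log.level=DEBUG \
      input/ output/

    # Afterwards, fetch the aggregated task logs from YARN
    yarn logs -applicationId application_1234567890123_0001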
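One common way to replicate the namenode metadata (fsimage and edits) is to list more than one directory, ideally including an NFS mount, in dfs.namenode.name.dir. The paths below are just examples:

    <!-- hdfs-site.xml: the namenode writes its metadata to every listed directory -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/data/1/dfs/nn,/nfsmount/dfs/nn</value>
    </property>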
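The reserved-space tip maps to dfs.datanode.du.reserved, set in bytes per datanode volume; the value below (about 1 TB, i.e. 25% of a hypothetical 4 TB disk) is only an illustration:

    <!-- hdfs-site.xml: keep ~1 TB per volume free for non-HDFS data such as shuffle output -->
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>1099511627776</value>
    </property>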
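Trash retention is controlled by fs.trash.interval (in minutes). A sketch of the usual delete flow, assuming trash is enabled and using made-up file names:

    <!-- core-site.xml: keep deleted files in .Trash for 1 day (1440 minutes) -->
    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
    </property>

    hdfs dfs -rm /data/old-report.csv          # moves the file to the user's .Trash
    hdfs dfs -rm -skipTrash /data/huge-file    # deletes immediately, bypassing .Trash
    hdfs dfs -expunge                          # forces a cleanup of expired .Trash entries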



That is it for now. I will try to write a part 2 soon. Let me know if there is anything I missed here!

Comments

  1. Great Post, Renata! I'm not into Hadoop yet, but it seems like these insights could save me from some painful experiences in the future. I'll keep this one bookmarked.
