Apache Hadoop Admin Tricks and Tips
In this post I will share some tips I learned after working with the Apache Hadoop environment for some years and running many, many workshops and courses. The information here is based on Apache Hadoop around version 2.9, but it can probably be extended to other similar versions.
These are considerations for building or operating a Hadoop cluster; some of them are specific to the Cloudera distribution. Anyway, I hope it helps!
- Don't use Hadoop for millions of small files: they overload the namenode and slow it down, and it is not difficult to overload the namenode. Always check capacity versus number of files. Files on Hadoop should usually be larger than 100 MB.
- Plan for about 1 GB of namenode memory for every million files.
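
  A quick way to sanity-check this is to count the objects in HDFS and compare against the namenode heap. A sketch (the path is an example; the ~150 bytes per object figure is the usual rule of thumb, not an exact number):

      # Prints: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
      hdfs dfs -count /

      # Rough estimate: (files + directories + blocks) x ~150 bytes of heap each
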
- Nodes usually fail after about 5 years of use, and node failure is one of the most frequent problems in Hadoop. Big companies like Facebook and Google probably see node failures every minute.
- The MySQL database behind Cloudera Manager has no redundancy, so it can be a single point of failure.
- Information: the merging of the fsimage file with the edit log (checkpointing) happens on the secondary namenode.
- Hadoop can cache blocks to improve performance; by default it caches nothing (the cache size is 0).
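
  If you want to try it, HDFS centralized cache management lets you pin hot paths into datanode memory. A minimal sketch (the pool name and path are made up, and dfs.datanode.max.locked.memory must first be raised above its default of 0 in hdfs-site.xml):

      # Create a cache pool and pin a hot dataset into it
      hdfs cacheadmin -addPool hot-data
      hdfs cacheadmin -addDirective -path /data/lookup-tables -pool hot-data

      # Verify what is cached
      hdfs cacheadmin -listDirectives
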
- You can set a parameter so that an acknowledgment is sent back after only the first or second replica of a block has been written to a datanode, instead of waiting for all of them. That might make writing data faster.
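
  If I remember correctly, the knob involved is the minimum replication a write needs before it is considered successful; the remaining replicas are completed in the background. A hedged hdfs-site.xml sketch (treat the property semantics as an assumption and check the docs for your version):

      <property>
        <name>dfs.namenode.replication.min</name>
        <!-- how many replicas must be written before the write succeeds; default 1 -->
        <value>1</value>
      </property>
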
- Hadoop has rack awareness: it knows which node is connected to which switch. Actually, it is the Hadoop admin who configures that.
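
  The configuration is a topology script you write yourself and point to from core-site.xml. A minimal sketch (the script path, subnets, and rack names are all examples):

      <!-- core-site.xml -->
      <property>
        <name>net.topology.script.file.name</name>
        <value>/etc/hadoop/conf/topology.sh</value>
      </property>

  The script receives one or more IPs/hostnames and must print one rack per input:

      #!/bin/bash
      # topology.sh: map each datanode address to a rack
      for node in "$@"; do
        case "$node" in
          10.0.1.*) echo /rack1 ;;
          10.0.2.*) echo /rack2 ;;
          *)        echo /default-rack ;;
        esac
      done
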
- Files are checked from time to time for data corruption (by default, roughly every three weeks). This is possible because datanodes store a checksum for each block.
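
  The interval is configurable; if I recall correctly the property is dfs.datanode.scan.period.hours, and its default of 504 hours is exactly the three weeks mentioned above:

      <!-- hdfs-site.xml -->
      <property>
        <name>dfs.datanode.scan.period.hours</name>
        <value>504</value>
      </property>
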
- Log files are kept for 7 days by default.
- Output files named part-m-NNNNN come from mappers and part-r-NNNNN come from reducers. The number at the end is the zero-based task index, so if the highest output file is part-r-00008, nine reducers ran for that job.
- You can change the log level of mapper and reducer tasks to get more information, e.g.:
  mapreduce.reduce.log.level=DEBUG
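
  If the job's driver uses ToolRunner/GenericOptionsParser (an assumption; myjob.jar, MyJob, and the paths are placeholders), you can set this per job from the command line:

      hadoop jar myjob.jar MyJob \
        -D mapreduce.map.log.level=DEBUG \
        -D mapreduce.reduce.log.level=DEBUG \
        input/ output/
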
- The YARN web UI shows what Spark did; the Spark UI at localhost:4040 also shows what a running application has done.
- It is important to think about where to put the namenode fsimage file. You might want to replicate this file.
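
  A common way to do that is to list more than one directory in dfs.namenode.name.dir, ideally including an NFS mount, so the metadata is written to all of them. A sketch (the paths are examples):

      <!-- hdfs-site.xml: the namenode writes its fsimage and edits to every directory listed -->
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/1/dfs/nn,/nfsmount/dfs/nn</value>
      </property>
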
- Reserve a good chunk of disk space (around 25%) on each datanode via dfs.datanode.du.reserved for non-HDFS use, such as the shuffle phase. The shuffle output is written to local disk, so there needs to be space!
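
  The property takes a value in bytes. A sketch reserving 100 GiB per disk (adjust it to roughly 25% of your disk size):

      <!-- hdfs-site.xml: space HDFS will NOT use, left for shuffle and other non-HDFS data -->
      <property>
        <name>dfs.datanode.du.reserved</name>
        <value>107374182400</value>
      </property>
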
- When you remove files, they stay in the .Trash directory for a while before being deleted permanently. The default retention time is 1 day.
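
  The retention is controlled by fs.trash.interval in core-site.xml, in minutes (1440 = 1 day). Two related commands worth knowing (the path is an example):

      # Delete bypassing the trash entirely (careful!)
      hdfs dfs -rm -r -skipTrash /tmp/bigdir

      # Empty your own trash immediately
      hdfs dfs -expunge
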
- You can build a lambda architecture with Flume (for example, consuming data one way while also saving it to disk).
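
  The usual trick is a replicating channel selector: one source fans out to two channels, each drained by a different sink. A minimal flume.conf sketch (the agent, channel, and sink names are made up, and the source and HDFS path are examples):

      a1.sources = src
      a1.channels = speedCh batchCh
      a1.sinks = speedSink hdfsSink

      # One source, replicated into both channels
      a1.sources.src.type = netcat
      a1.sources.src.bind = 0.0.0.0
      a1.sources.src.port = 44444
      a1.sources.src.channels = speedCh batchCh
      a1.sources.src.selector.type = replicating

      a1.channels.speedCh.type = memory
      a1.channels.batchCh.type = file

      # Fast path: just log events; slow path: persist to HDFS
      a1.sinks.speedSink.type = logger
      a1.sinks.speedSink.channel = speedCh

      a1.sinks.hdfsSink.type = hdfs
      a1.sinks.hdfsSink.channel = batchCh
      a1.sinks.hdfsSink.hdfs.path = /flume/events/%Y-%m-%d
      a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
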
- Regarding hardware, worker nodes need more cores because they do the heavy processing; the master nodes don't process that much.
- For the namenode you want higher-quality disks and better hardware (e.g., RAID; RAID makes no sense on worker nodes, since HDFS already replicates data across them).
- The rule of thumb is: to store 1 TB of data you need about 4 TB of raw space (3x for replication, plus headroom for temporary and intermediate data).
- Hadoop applications are typically not CPU-bound; they are usually I/O-bound.
- Virtualization might give you some benefits (it is easier to manage), but it hurts performance, usually adding between 5% and 30% overhead.
- Hadoop does not support IPv6, so you can disable IPv6; you can also disable SELinux inside the cluster. Both add overhead when left on.
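
  A sketch of the usual steps (the sysctl and SELinux commands assume a Red Hat-style Linux):

      # hadoop-env.sh: make the JVM prefer the IPv4 stack
      export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"

      # Disable IPv6 at the kernel level (add to /etc/sysctl.conf to persist)
      sysctl -w net.ipv6.conf.all.disable_ipv6=1

      # Disable SELinux until reboot; set SELINUX=disabled in /etc/selinux/config to persist
      setenforce 0
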
- A good size for a starting cluster is around 6 nodes.
- Sometimes, when the cluster is too full, you might have to remove a small file before you can remove a bigger one.
That is it for now. I will try to write a part 2 soon. Let me know if there is anything I missed here!