Apache Hadoop Admin Tricks and Tips

May 24, 2018

In this post I will share some tips I learned after using the Apache Hadoop environment for some years and doing many workshops and courses. The information here is based on Apache Hadoop around version 2.9, but it probably applies to other similar versions as well.

These are considerations for building or using a Hadoop cluster. Some are specific to the Cloudera distribution. Anyway, hope it helps!

Don't use Hadoop for millions of small files. Every file takes up memory on the namenode, so lots of small files overload it and make it slower, and it is not difficult to get there. Always check the namenode capacity against the number of files. Files on Hadoop should usually be larger than 100 MB.

You need around 1 GB of namenode memory for every 1 million files.

Nodes usually fail after about 5 years. Node failures are one of the most frequent problems in Hadoop. Big companies like Facebook and Google probably see node failures by the minute.
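To make the rule of thumb concrete, here is a minimal sketch that turns a file count into a rough heap figure. The function name and the `headroom` multiplier are assumptions made up for this example, and the result is only as good as the 1 GB per million files estimate itself.

```python
import math

# Rule of thumb from this post: about 1 GB of namenode memory per 1 million files.
GB_PER_MILLION_FILES = 1.0


def estimate_namenode_heap_gb(num_files, headroom=1.0):
    """Return a rough namenode memory estimate in GB, rounded up.

    headroom is a hypothetical safety multiplier (1.0 means no extra margin).
    """
    if num_files < 0:
        raise ValueError("num_files must be non-negative")
    return math.ceil(num_files / 1_000_000 * GB_PER_MILLION_FILES * headroom)


# Example: a cluster holding 25 million files.
print(estimate_namenode_heap_gb(25_000_000))
```

Treat the number as a starting point for sizing, not as a guarantee.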

The MySQL database behind Cloudera Manager has no redundancy. This can be a single point of failure.

Information: the merging of the fsimage file with the edit log (checkpointing) happens on the secondary namenode.

Hadoop can cache blocks in datanode memory to improve performance. By default the cache size (dfs.datanode.max.locked.memory) is 0, so nothing is cached.

You can set a parameter that makes the datanodes send the acknowledgment back to the namenode after only the first or second copy of a data block has been written, instead of waiting for every replica. That might make writing data faster.

Hadoop has rack awareness: it knows which node is connected to which switch. Actually, it is the Hadoop admin who configures that.
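One common way for the admin to configure this is a topology script, referenced by the net.topology.script.file.name property: Hadoop calls the script with one or more node addresses and reads back one rack path per address. Below is a minimal sketch of such a script. The IP addresses and rack names are made up for illustration, so replace them with your own layout.

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping from datanode address to rack path; use your own layout.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Rack reported for addresses that are not in the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return the rack path for each host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Addresses arrive as command-line arguments; print one rack per line.
    print("\n".join(resolve(sys.argv[1:])))
```

Keeping the mapping in a dictionary is the simplest option for a small cluster. For a bigger one you would probably load it from a file instead.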

Blocks are checked from time to time to verify that there was no data corruption (by default every three weeks). This is possible because datanodes store checksums for the data they hold.

Log files are kept for 7 days by default.
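To illustrate the idea behind checksum verification, here is a small sketch that stores one checksum per chunk of data and later reports which chunks no longer match. It is not how the datanode implements it: `zlib.crc32` is used as a stand-in checksum, and the 512-byte chunk size and function names are assumptions for this example.

```python
import zlib

# Number of bytes covered by each checksum (assumed for this sketch).
CHUNK = 512


def checksums(data):
    """Compute one CRC32 value per CHUNK-sized slice of data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def corrupted_chunks(data, stored):
    """Return the indexes of chunks whose checksum differs from the stored one."""
    current = checksums(data)
    return [i for i, (now, then) in enumerate(zip(current, stored)) if now != then]


# Write time: keep the checksums next to the data.
block = bytes(range(256)) * 8
stored = checksums(block)

# Later: flip one byte to simulate disk corruption, then scan.
damaged = bytearray(block)
damaged[600] ^= 0xFF
print(corrupted_chunks(bytes(damaged), stored))
```

When a scan like this finds a bad chunk, the fix in HDFS is to replace the block with a healthy replica from another datanode.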

Output files named part-m-00000 come from mappers and files named part-r-00000 come from reducers. The number at the end is the index of the task that wrote the file, counting from 0. So if the last file is part-r-00008, the job ran 9 reducers.

You can change the log level of mapper and reducer tasks to get more information.
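As a quick illustration of the naming scheme, this sketch infers how many mapper and reducer tasks wrote output by looking at the highest index in a directory listing. The function name is made up for this example, and it assumes the standard part-m-NNNNN and part-r-NNNNN file names.

```python
import re

# Matches names like part-m-00003 or part-r-00008.
PART_RE = re.compile(r"^part-([mr])-(\d+)$")


def count_tasks(filenames):
    """Return (mappers, reducers) inferred from the highest part index of each kind."""
    highest = {"m": -1, "r": -1}
    for name in filenames:
        match = PART_RE.match(name)
        if match:
            kind, index = match.group(1), int(match.group(2))
            highest[kind] = max(highest[kind], index)
    # Indexes start at 0, so the task count is the highest index plus one.
    return highest["m"] + 1, highest["r"] + 1


listing = ["_SUCCESS", "part-r-00000", "part-r-00004", "part-r-00008"]
print(count_tasks(listing))
```

Files that do not follow the pattern, such as _SUCCESS, are simply ignored.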

mapreduce.reduce.log.level=DEBUG

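The YARN web UI shows what your Spark applications did. While an application is running, the Spark UI at localhost:4040 also shows what has been done.

It is important to decide where the namenode stores its fsimage file. You might want to keep copies of this file in more than one place.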

You have to reserve a lot of disk space (around 25%) with dfs.datanode.du.reserved for the shuffle phase.

The output of this phase is written to disk, so there needs to be space!
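The dfs.datanode.du.reserved property takes a value in bytes per volume, not a percentage, so the 25% has to be converted first. A minimal sketch of that arithmetic, with a made-up helper name and a 4 TB disk as the example:

```python
# One tebibyte in bytes.
TB = 1024 ** 4


def reserved_bytes(volume_capacity_bytes, fraction=0.25):
    """Return how many bytes to reserve on a volume for non-HDFS use."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return int(volume_capacity_bytes * fraction)


# Example: reserve 25% of a 4 TB data disk.
print(reserved_bytes(4 * TB))
```

The printed number is what would go into the property value for a disk of that size.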

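When you remove files, they stay in the .Trash directory for a while before they are deleted for good. The default time is 1 day on Cloudera. Plain Apache Hadoop ships with fs.trash.interval set to 0, which disables the trash.

You can build a lambda architecture with Flume (for example, consume the data in one path and also save it on disk).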

Regarding hardware, worker nodes need more cores because they do the processing. The master nodes don't process that much.

For the namenode you want better quality disks and better hardware, like RAID. RAID makes no sense on worker nodes, because HDFS already replicates the data.

The rule of thumb is: if you want to store 1 TB of data, you need 4 TB of raw disk space. HDFS keeps 3 copies of each block by default, and you still need room for temporary data on top of that.

Hadoop applications are typically not CPU bound.

Virtualization might give you some benefits (easier to manage), but it hurts performance. It usually adds between 5% and 30% of overhead.

Hadoop does not support IPv6, so you can disable it. You can also disable SELinux inside the cluster. Both add overhead.
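One way to arrive at the 4x figure, assuming the default replication factor of 3 and about 25% of each disk kept free for non-HDFS data, is the small calculation below. The function name and parameters are made up for this sketch, and your own numbers may differ.

```python
def raw_capacity_tb(data_tb, replication=3, reserved_fraction=0.25):
    """Return the raw disk space in TB needed to store data_tb of data.

    Assumes every block is stored `replication` times and that
    `reserved_fraction` of each disk is kept free for non-HDFS use.
    """
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    return data_tb * replication / (1 - reserved_fraction)


# Example: 1 TB of data, and a larger 50 TB data set.
print(raw_capacity_tb(1))
print(raw_capacity_tb(50))
```

Changing the replication factor or the reserved fraction shows quickly how sensitive the total is to those two settings.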

A good size for a starting cluster is around 6 nodes.

Sometimes, when the cluster is too full, you might have to remove a small file first in order to remove a bigger one.

That is it for now. I will try to write a part 2 soon. Let me know if there is anything I missed here!

Comments

Ricardo, June 13, 2018 at 3:33 PM
Great Post, Renata! I'm not into Hadoop yet, but it seems like these insights could save me from some painful experiences in the future. I'll keep this one bookmarked.
Anonymous, July 2, 2018 at 7:46 PM
Thanks! ;)