
Hadoop Sizing – A Basic Capacity Approach

From time to time, I have the opportunity to work with customers setting up their initial Hadoop clusters, and we are often asked what should be a simple question:

“How many nodes do I need in my initial Hadoop cluster?”

In my time in the field, I've struggled to give an easy answer to that question, even though it struck me as something that should be pretty straightforward. Well, after a recent conversation with some senior field engineers at one of the big Hadoop shops, I finally got a simple, elegant answer based on initial storage requirements. So it goes like this:
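Here's a minimal sketch of that math, assuming the usual HDFS rules of thumb: 3x replication, roughly a quarter of your capacity kept free for temp/intermediate data, and some fixed amount of usable disk per DataNode (the 48 TB per node below is purely illustrative, not a recommendation):

```python
import math

initial_data_tb = 100                       # example: 100 TB of data landing in HDFS
raw_tb = initial_data_tb * 3 / (1 - 0.25)   # 3x replication, ~25% free for temp -> 400 TB raw
datanodes = math.ceil(raw_tb / 48)          # 48 TB usable per node (assumption) -> 9 nodes
```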

Pretty easy, right?

If you are interested in how we came up with this, here is the guidance:
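In a nutshell: every HDFS block gets written three times (the default replication factor), you want roughly 25% of raw capacity free for shuffle and intermediate data, and compression (if you use it) shrinks the footprint before replication multiplies it. A parameterized sketch of the same math, where every default is a common planning assumption rather than a hard requirement:

```python
import math

def hdfs_datanodes(initial_data_tb,
                   replication=3,         # HDFS default replication factor
                   temp_headroom=0.25,    # fraction of capacity reserved for shuffle/temp
                   compression=1.0,       # 1.0 = no compression; e.g. 0.5 for 2:1
                   disk_per_node_tb=48):  # usable disk per DataNode (illustrative)
    """Back-of-the-napkin DataNode count from an initial storage requirement."""
    raw_tb = (initial_data_tb * compression * replication) / (1 - temp_headroom)
    return math.ceil(raw_tb / disk_per_node_tb)

print(hdfs_datanodes(100))                   # -> 9 nodes
print(hdfs_datanodes(100, compression=0.5))  # 2:1 compression -> 5 nodes
```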

Extrapolated further for Isilon sizing, I came up with this:
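For Isilon, the math changes shape: OneFS protects data with erasure coding rather than 3x replication, and MapReduce temp space typically lives on the compute nodes' local disks, so the Isilon side only has to hold the data itself. A sketch, assuming a common ~80% usable-to-raw planning figure (your actual OneFS protection level will move this number):

```python
import math

def isilon_capacity_tb(initial_data_tb,
                       compression=1.0,
                       efficiency=0.8):  # OneFS erasure-coding efficiency (assumption)
    """Raw Isilon capacity for the same data set: no 3x replication, and
    shuffle/temp space lives on compute-local disk instead of the array."""
    return math.ceil(initial_data_tb * compression / efficiency)

print(isilon_capacity_tb(100))  # -> 125 TB on Isilon vs ~400 TB of DAS in the HDFS sketch
```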

Clearly, this is a super simplified approach, but dang if it isn't handy?!? Now, I am well aware of many cases where this number and the configuration of a Hadoop cluster depend on more factors than capacity…like, say, whether you are planning to use Spark, Spark Streaming, HAWQ, Impala, Tez, and on and on, but it's a handy place to start. And to top it off, when you deploy Hadoop on Isilon, you have a ton of flexibility to scale compute and storage independently, which lets you address infrastructure constraints based on workload requirements.

So here’s to making Hadoop as simple as possible…one cluster at a time.

-your bearded friend
