
Hadoop Cluster Capacity Planning of Data Node for batch and in-memory processes


Here I am sharing my experience of setting up a Hadoop cluster for processing approx. 100 TB of data in a year. The cluster was set up for 30% real-time and 70% batch processing. Though there were nodes set up for NiFi, Kafka, Spark and MapReduce, in this blog I am covering capacity planning for the Data Nodes only. In the next blog, I will explain capacity planning for the Name Node and YARN. Here is how we started, by gathering the cluster requirements.
While setting up the cluster, we need to know the below parameters initially (approximately): the volume of data ingested per year, the retention period, the split between real-time and batch processing, the storage formats used, and the expected compression ratio.
With the above parameters in hand, we can plan the commodity machines required for the cluster. (These might not be exactly what is required, but after installation we can fine-tune the environment by scaling the cluster up or down.) The number of nodes required depends on the data to be stored and analyzed.
By default, the Hadoop ecosystem creates 3 replicas of data. So if we go with the default value of 3, we need 100 TB * 3 = 300 TB of storage for one year of data. As we have a retention policy of 2 years, the storage required will be:
1 year of data * retention period = 300 * 2 = 600 TB
We also assumed that 30% of the data stays in container storage and 70% is stored in Snappy-compressed Parquet format. From various studies, we found that Parquet with Snappy compresses data by 70-80%; we have taken it as 70%. So here is the storage requirement calculation:
total storage required for data = total storage * % in container storage + total storage * % in compressed format * (1 - compression ratio)
600*.30 + 600*.70*(1-.70) = 180 + 126 = 306 TB
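As a quick sanity check, here is a minimal Python sketch of the storage arithmetic above (the variable names are mine; the 100 TB/year volume, 3x replication, 2-year retention, 30/70 split and 70% compression figures are the assumptions stated in this post):

```python
# Storage needed for the data itself, using the assumptions stated above.
yearly_data_tb  = 100   # data ingested per year (TB)
replication     = 3     # HDFS default replication factor
retention_years = 2     # retention policy

total_storage = yearly_data_tb * replication * retention_years   # 600 TB

container_share   = 0.30   # share kept as-is in container storage
parquet_share     = 0.70   # share stored as Snappy-compressed Parquet
compression_ratio = 0.70   # Parquet + Snappy assumed to shrink data by ~70%

storage_for_data = (total_storage * container_share
                    + total_storage * parquet_share * (1 - compression_ratio))
print(round(storage_for_data, 1))   # 306.0 TB
```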
In addition to the data, we need space for processing/computing the data plus some other tasks, so we need to decide how much share should go to this extra space. Our other assumption was that on an average day only 10% of the data is being processed, and a data process creates about 3 times its input as temporary data, so we need around 10% * 3 = 30% of the total storage as extra storage.
Hence, total storage required for data and other activities = 306 + 306*.30 = 397.8 TB.
As JBOD is recommended for Data Nodes, we need to allow about 20% of the data storage for the JBOD filesystem overhead, so the storage requirement goes up by 20%. The final figure we arrived at is 397.8 * (1 + .20) = 477.36 ~ 478 TB.
Now we need to calculate the number of Data Nodes required for 478 TB of storage. Suppose we have a JBOD of 12 disks, each disk of 4 TB; a Data Node's capacity will then be 48 TB.
Hence the number of required Data Nodes = 478/48 ~ 10.
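Continuing the same sketch, the working space, filesystem overhead and node count can be derived the same way (again, the 10% daily processing share, 3x temporary data, 20% overhead and 12 x 4 TB disks per node are the assumptions from this post):

```python
import math

storage_for_data = 306.0           # TB, from the previous step

# ~10% of the data is processed on an average day, and processing creates
# ~3x temporary data, so reserve ~30% of the data storage as working space.
extra_share = 0.10 * 3
storage_with_scratch = storage_for_data * (1 + extra_share)       # 397.8 TB

# Add ~20% for the JBOD filesystem overhead on the Data Node disks.
fs_overhead = 0.20
total_storage_tb = storage_with_scratch * (1 + fs_overhead)       # ~477.4 TB

# Each Data Node: a JBOD of 12 disks x 4 TB = 48 TB.
node_capacity_tb = 12 * 4
data_nodes = math.ceil(total_storage_tb / node_capacity_tb)

print(round(total_storage_tb, 2), data_nodes)   # 477.36 10
```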
As per our assumption, 70% of the data needs to be processed in batch mode with Hive, MapReduce, etc., so 10*.70 = 7 nodes are assigned for batch processing and the remaining 3 nodes are for in-memory processing with Spark, Storm, etc.
For batch processing, 2 * 6-core processors (hyper-threaded) were chosen, and for in-memory processing, 2 * 8-core processors. For the batch processing nodes, while 1 core is counted for a CPU-heavy process, 0.7 core can be assumed for a medium CPU-intensive process. So a batch processing node can handle
12*.30/1 + 12*.70/.7 = 3.6 + 12 = 15.6 ~ 15 tasks per node.
As hyper-threading is enabled and each core provides 2 threads, we can assume ~30 tasks per node.
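A small sketch of the per-node task estimate, under the 30/70 split between CPU-heavy and medium tasks and the 1 / 0.7 core weights assumed above:

```python
cores = 2 * 6                # two hyper-threaded 6-core processors per batch node

heavy_share,  heavy_cores_per_task  = 0.30, 1.0   # CPU-heavy tasks
medium_share, medium_cores_per_task = 0.70, 0.7   # medium CPU-intensive tasks

tasks = (cores * heavy_share / heavy_cores_per_task
         + cores * medium_share / medium_cores_per_task)
print(round(tasks, 1))       # 15.6 -> ~15 tasks per node

# With hyper-threading (2 hardware threads per core), roughly twice as many
# lightweight tasks can be scheduled: ~30 tasks per node.
print(round(tasks * 2))      # ~31
```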
For the in-memory processing nodes, we assumed spark.task.cpus=2 and spark.cores.max=8*2=16. With these assumptions, we can concurrently execute 16/2 = 8 Spark tasks.
Again, as hyper-threading is enabled, the number of concurrent tasks can be calculated as:
total concurrent tasks = threads per core * 8 = 2 * 8 = 16
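And a sketch of the in-memory node concurrency, assuming the spark.task.cpus=2 and spark.cores.max=16 settings mentioned above:

```python
physical_cores   = 2 * 8        # two 8-core processors per in-memory node
threads_per_core = 2            # hyper-threading enabled

spark_cores_max = physical_cores    # spark.cores.max = 16
spark_task_cpus = 2                 # spark.task.cpus = 2

# Without counting hyper-threads: 16 / 2 = 8 concurrent Spark tasks.
concurrent_tasks = spark_cores_max // spark_task_cpus

# Counting hyper-threads: 2 threads per core * 8 = 16 concurrent Spark tasks.
concurrent_tasks_ht = (physical_cores * threads_per_core) // spark_task_cpus

print(concurrent_tasks, concurrent_tasks_ht)   # 8 16
```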
Now let us calculate the RAM required per Data Node. The RAM requirement depends on the below parameters:
RAM required = DataNode process memory + DataNode TaskTracker memory + OS memory + number of CPU cores * memory per CPU core
At the starting stage, we allocated 4 GB of memory for each parameter (and 4 GB per core), which can be scaled up as required. Therefore the RAM required will be:
for batch processing Data Nodes, RAM = 4 + 4 + 4 + 12*4 = 60 GB
for in-memory processing Data Nodes, RAM = 4 + 4 + 4 + 16*4 = 76 GB
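The same RAM rule of thumb as a small sketch, assuming 4 GB for each fixed component and 4 GB per core as above (the helper function is only illustrative):

```python
def data_node_ram_gb(cores, datanode_gb=4, tasktracker_gb=4, os_gb=4, gb_per_core=4):
    """RAM estimate per Data Node: fixed daemon/OS allocations plus memory per CPU core."""
    return datanode_gb + tasktracker_gb + os_gb + cores * gb_per_core

print(data_node_ram_gb(cores=12))   # 60 GB for a batch-processing node
print(data_node_ram_gb(cores=16))   # 76 GB for an in-memory processing node
```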
The steps defined above give us a fair understanding of the resources required for setting up the Data Nodes in a Hadoop cluster, which can be further fine-tuned. In the next blog, I will focus on capacity planning for the Name Node and the YARN configuration.
Hope you enjoyed the blog!!
Mamta Chawla
