Does Hadoop Use The Concept Of Parallelism Or Serialism To Upload The Split Data To The Cluster?

Anurag Sharma
4 min read · Nov 5, 2020

Let’s look into an interesting myth the world has about Hadoop: does Hadoop use the concept of parallelism to upload the split data among the DataNodes?


So, let’s begin to find out the actual truth behind the concept of parallelism in Hadoop.

Tools that we’re going to use:

  1. tcpdump — a command-line network packet analyzer. It allows the user to display TCP/IP and other packets being transmitted or received over a network to which the computer is attached.
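For example, a minimal capture that prints the first few packets on the default interface looks like this (the interface name eth0 is an assumption and may differ on your instances):

# tcpdump -i eth0 -n -c 10

Here -i selects the interface, -n disables DNS name resolution and -c stops the capture after 10 packets.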

Let’s go through the following steps to check whether the above-mentioned claim is true or not:

  1. Create an account on AWS (Amazon Web Services)
  2. Launch four EC2 instances on AWS using the free tier.
  3. Configure one instance as the NameNode, one as the Client, and the remaining two as DataNodes.
  4. Install the JDK and the Hadoop software on all instances.
  5. Configure the “hdfs-site.xml” and “core-site.xml” files on both DataNodes and on the NameNode. (Remember, there is no need to configure the “hdfs-site.xml” file on the Hadoop client; only configure “core-site.xml”.) A minimal sample of both files is sketched after the tcpdump commands below.
  6. Format the NameNode.
  7. Start the Hadoop daemon services on both DataNodes and the NameNode, and verify them with the “jps” command.
  8. Check the DataNodes available to the Hadoop cluster with the command “# hadoop dfsadmin -report”.
  9. From the Hadoop client, upload a file to the Hadoop cluster with the command: # hadoop fs -put <file_name> /
  10. Check the file in the Hadoop cluster with the command: # hadoop fs -ls /
  11. While the file is uploading, run the tcpdump command on the NameNode and on both DataNodes:
  • For this, first install tcpdump with # yum install tcpdump
  • Run tcpdump to check the packets transferred between the client, the master and the slaves:

# tcpdump -i eth0 -n -x

# tcpdump -i eth0 tcp port 22 -n
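Before looking at the capture output, here is a minimal sketch of the two configuration files mentioned in step 5. The NameNode address, the port 9000 and the directory paths /nn and /dn are assumptions, so replace them with your own values. (These are the classic Hadoop 1.x property names; newer releases use fs.defaultFS, dfs.namenode.name.dir and dfs.datanode.data.dir instead.)

core-site.xml (on the NameNode, the DataNodes and the client):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<NameNode-IP>:9000</value>
  </property>
</configuration>

hdfs-site.xml on the NameNode:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>

hdfs-site.xml on each DataNode:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>

With this configuration, the client’s request/reply traffic with the NameNode would appear on port 9000, so the NameNode capture can also be filtered on that port instead of capturing everything.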

  • Looking at the captures: on the NameNode you will find that the client first requests the NameNode (the master) for the IP addresses of the DataNodes, because the client is the one that uploads the data directly to the DataNodes, and the NameNode replies with network packets that contain those IP addresses.
  • To trace the DATA PACKETS (the actual data flow), use port no. 50010 and run the command:

# tcpdump -i eth0 port 50010 -n -x

  • While running this command on both DataNodes and on the NameNode, you will see data packets arriving at the DataNodes. They arrive in the following manner: first, some packets are received by DataNode1 and then the stream stops; after that, some packets are received by DataNode2 and then it stops; then packets go to DN1 again, and when that stops they go to DN2 again, and so on. This process continues until the whole file has been uploaded to the Hadoop cluster, and the data still gets uploaded quickly. You can also check the timestamps of the captures on both slaves during the upload; they clearly differ, because the data is not being transferred in parallel (a small sketch for comparing these timestamps is given after the screenshots below).
[Screenshot: DataNode receiving data packets]
[Screenshot: Client node uploading the data]
  • The screenshot directly above shows the client node while it uploads the data to the DataNodes.
[Screenshot: NameNode capture]
  • As we can see in the NameNode screenshot above, the NameNode does not receive any data packets from the client side. So the NameNode has nothing to do with the data packets; they go directly from the client to the DataNodes.
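As promised above, here is a small sketch for comparing the timestamps seen on the two DataNodes. It assumes tcpdump is run with the -tt option (epoch timestamps), that the captures are saved to files named dn1.txt and dn2.txt, and that both files are later copied to one machine; these names are assumptions, not part of the original experiment.

On each DataNode, capture the data traffic with epoch timestamps while the client uploads the file (use dn2.txt on DataNode2):

# tcpdump -i eth0 port 50010 -n -tt > dn1.txt

Then, on one machine, tag each line with its node and merge the two captures in time order:

# awk '{print $1, "DN1"}' dn1.txt > merged.txt
# awk '{print $1, "DN2"}' dn2.txt >> merged.txt
# sort -n merged.txt | less

If the client were uploading to both DataNodes in parallel, the DN1 and DN2 entries would be interleaved throughout the merged output; with the behaviour described above, you should instead see long runs of DN1 followed by long runs of DN2.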

Thus, the flow of data packets from the CLIENT to the DATANODES is clearly visible, and it happens in serial order. So we can say that Hadoop uses the concept of “serialism” to upload the split data while still addressing the Velocity problem.

HENCE PROVED !!!

Thank You for Reading 😊
