What is the Architecture of Hadoop?

Hadoop is an open-source framework from the Apache Software Foundation used to store and process large, mostly unstructured, datasets in a distributed environment. Data is first distributed across the nodes of a cluster and then processed.
Hadoop's biggest strength is its scalability: it can run on anything from a single node to thousands of nodes without any problem. The framework is written in Java and runs applications with the help of MapReduce, which performs parallel processing and statistical analysis on large datasets. The Apache Hadoop software library distributes large datasets across the cluster using simple programming models.
Organizations are now adopting Hadoop to reduce the cost of data storage, which enables analytics at an economical cost and maximizes business profitability.
A good Hadoop architectural design needs to take into consideration computing power, storage, and networking.

Hadoop Architecture

The Hadoop ecosystem consists of various components such as the Hadoop Distributed File System (HDFS), Hadoop MapReduce, Hadoop Common, HBase, YARN, Pig, Hive, and others. The Hadoop components that play a vital role in its architecture are:
A. Hadoop Distributed File System (HDFS)
B. Hadoop MapReduce
Hadoop uses a master/slave architecture for both distributed storage and distributed computation. In distributed storage, the NameNode is the master and the DataNodes are the slaves. In distributed computation, the JobTracker is the master and the TaskTrackers are the slaves. The slave nodes are the ones that store the data and perform the computations. Every slave node runs a TaskTracker daemon and a DataNode daemon, which synchronize with the JobTracker and the NameNode respectively. In a Hadoop setup, the master and slave systems can be deployed in the cloud or on premises.
[Figure: Hadoop ecosystem]
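As a rough sketch of how a client locates the two masters in a classic (Hadoop 1.x) deployment, the snippet below sets the NameNode and JobTracker addresses through configuration; the host name and ports are placeholder assumptions, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;

// Minimal sketch (Hadoop 1.x style): a client finds the storage master
// (NameNode) and the computation master (JobTracker) through configuration.
// "master-host" and both ports are hypothetical placeholders.
public class ClientConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://master-host:9000");  // NameNode address
        conf.set("mapred.job.tracker", "master-host:9001");      // JobTracker address

        System.out.println("NameNode:   " + conf.get("fs.default.name"));
        System.out.println("JobTracker: " + conf.get("mapred.job.tracker"));
    }
}
```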

Role of HDFS in Hadoop Architecture

HDFS splits files into multiple blocks, and each block is replicated when the file is stored in the Hadoop cluster. The default block size is 64 MB (128 MB from Hadoop 2.x onward), but it can be configured to larger values such as 256 MB as required.
HDFS stores the application data and the file system metadata on separate servers: the NameNode stores the file system metadata, while the application data is stored by the DataNodes. To keep data reliability and availability as high as possible, HDFS replicates the file content multiple times (three copies by default). The NameNode and DataNodes communicate with each other using TCP-based protocols. Hadoop performance depends on hard-drive throughput and the network speed available for data transfer.
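To make the block size and replication settings concrete, here is a minimal sketch using the HDFS Java client; the file path, the replication factor of 3, and the 128 MB block size are illustrative choices, not requirements.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write a file to HDFS with an explicit replication factor and block
// size. The path and the values below are illustrative assumptions.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/example.txt");    // hypothetical path
        short replication = 3;                       // HDFS default is 3 copies
        long blockSize = 128L * 1024 * 1024;         // 128 MB blocks

        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }
        // A 512 MB file written this way would be split into four 128 MB
        // blocks, with each block stored on three different DataNodes.
    }
}
```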

NameNode

The HDFS namespace is stored in the NameNode as inodes, which record attributes such as permissions, disk space, namespace quota, modification time, and access time. The NameNode keeps the entire file system mapping in memory. Two files, fsimage and edits, are used for persistence across restarts.
The fsimage file contains the inodes and their metadata definitions. The edits file records the modifications that have been performed since the fsimage was written; instead of creating a new fsimage on every change, the changes are appended to the edits log.
When the NameNode starts, the fsimage file is loaded and the contents of the edits file are applied to recover the previous state of the file system. Over time, the edits file grows large and consumes disk space, which slows down this process. The problem is solved by the Secondary NameNode, which periodically merges the fsimage and edits files and copies the new fsimage back to the primary NameNode. It also updates the fstime file to record when the last checkpoint was taken.
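The toy sketch below only illustrates this startup-and-checkpoint idea; it is not the real NameNode code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: the namespace is rebuilt from the last fsimage
// plus the edits logged since, and a checkpoint merges the two so that the
// edits log can be truncated.
public class CheckpointSketch {
    static List<String> namespace = new ArrayList<>();  // in-memory metadata

    static void loadFsimage(List<String> fsimage) {     // snapshot from last checkpoint
        namespace.clear();
        namespace.addAll(fsimage);
    }

    static void replayEdits(List<String> edits) {       // changes made after the snapshot
        namespace.addAll(edits);
    }

    public static void main(String[] args) {
        List<String> fsimage = new ArrayList<>(List.of("/data/a.txt", "/data/b.txt"));
        List<String> edits   = new ArrayList<>(List.of("/data/c.txt"));

        // On NameNode startup: load the fsimage, then apply the edits log.
        loadFsimage(fsimage);
        replayEdits(edits);

        // Checkpoint (the Secondary NameNode's job): write the merged state
        // out as the new fsimage and start a fresh, empty edits log.
        fsimage = new ArrayList<>(namespace);
        edits.clear();

        System.out.println("Namespace after checkpoint: " + namespace);
    }
}
```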

DataNode

The DataNode stores blocks of data and retrieves them when needed. The NameNode receives periodic block reports from the DataNodes.
When the system starts up, each DataNode connects to the NameNode and performs a handshake that verifies the namespace ID and the DataNode's software version. If either of them does not match, the DataNode automatically shuts down. The NameNode then receives a block report from the DataNode as verification of its block replicas. After registration and the first block report, the DataNode sends a heartbeat every 3 seconds to confirm that it is operating properly and that its block replicas are available.
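The snippet below is a hypothetical illustration of this heartbeat pattern, not the actual DataNode implementation; it simply reports liveness to standard output every 3 seconds.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a periodic heartbeat: a worker reports liveness every
// 3 seconds so the master knows its block replicas are still available.
public class HeartbeatSketch {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Runnable heartbeat = () ->
            System.out.println("heartbeat: DataNode alive, block replicas available");

        // Send a heartbeat every 3 seconds, matching the interval in the text.
        scheduler.scheduleAtFixedRate(heartbeat, 0, 3, TimeUnit.SECONDS);

        // Let the demo run for 10 seconds, then stop.
        TimeUnit.SECONDS.sleep(10);
        scheduler.shutdownNow();
    }
}
```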

Role of MapReduce in Hadoop Architecture

MapReduce is a framework for processing large datasets in a distributed environment. A MapReduce job is based on three operations: map the input data set into key-value pairs, shuffle the resulting data, and then reduce all pairs that share the same key. The job is the top-level unit of MapReduce work, and each job contains one or more Map and Reduce tasks.
Execution of a job starts when it is submitted to the MapReduce JobTracker, specifying the map, combine, and reduce functions along with the locations of the input and output data. When the job is received, the JobTracker determines the input splits from the input path and selects TaskTrackers based on their network proximity to the data sources.
The TaskTracker extracts records from its splits as processing begins in the Map phase. Records are parsed by the InputFormat, and key-value pairs are generated in a memory buffer as the Map function is invoked. The optional Combine function performs local aggregation on the buffered map output. After a map task completes, the TaskTracker notifies the JobTracker, which then instructs selected TaskTrackers to start the reduce phase. Each of those TaskTrackers reads the map outputs and sorts the key-value pairs by key; finally, the Reduce function is invoked and the values for each key are collected into the output file.
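The classic word-count job below illustrates this map, shuffle, and reduce flow using the Hadoop Java API; the input and output paths are supplied on the command line and are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the map phase emits (word, 1) pairs, the shuffle groups pairs
// by word, and the reduce phase sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);       // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();               // sum all counts for this word
            }
            result.set(sum);
            context.write(key, result);         // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submitting the compiled jar with, for example, `hadoop jar wordcount.jar WordCount /input /output` (the paths are examples) hands the job to the framework, which then performs the splitting, shuffling, and task scheduling described above.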

Bottom Line

The best way for any organization to determine whether the Hadoop architecture suits its business is to estimate the cost of storing and processing data with Hadoop and compare it with the cost of its existing data management process. If you aspire to build a career in Big Data Hadoop, the Best Big Data Online Course/s may help you learn and get certified in Hadoop.
Keep learning to have a bright career ahead!
