The Hadoop ecosystem is in continuous evolution, and its processing frameworks are evolving at full speed along with it. Hadoop 2.0 moves past the limitation of Hadoop 1.0's batch-oriented MapReduce processing framework toward specialized and interactive processing models.
Apache Hadoop was introduced in 2005 and took industry by storm with its capability of doing distributed processing of large data using the MapReduce engine. Over time Hadoop has gone through modifications that have made it a better, more advanced framework, one that supports various other distributed processing models alongside the traditional MapReduce model.
Data moguls such as Facebook, Google, and Yahoo adopted Apache Hadoop to reach new heights, combining Hadoop HDFS with its resource management environment and MapReduce processing. But over time, these and other Hadoop users found issues with the Hadoop 1.0 architecture: the batch-processing model of MapReduce was unable to keep up with all the information flooding in during data collection.
Introduction of YARN (Hadoop 2.0)
YARN is an acronym for Yet Another Resource Negotiator, the resource management layer in Hadoop. It was introduced in 2013 in the Hadoop 2.0 architecture to overcome the limitations of MapReduce. YARN supports the various other distributed computing paradigms that are deployed on Hadoop.
Yahoo rewrote the Hadoop code to separate resource management from job scheduling, and the result was YARN. This improved Hadoop, as the standalone component can now be used with other software such as Apache Spark, or we can create our own applications that code against YARN. Applications created using YARN can run on different distributed architectures.
Limitations of MapReduce that paved the path for YARN (Hadoop 2.0)
Hadoop MapReduce was used for Big Data processing, but its architecture had drawbacks that came to light when dealing with huge datasets.
Limitations of MapReduce (Hadoop 1.0)
The JobTracker did the job scheduling and kept track of jobs. If the JobTracker failed for any reason, all running jobs had to be restarted: the architecture had a single point of failure.
The JobTracker performed many duties, such as job scheduling, task scheduling, resource management, and monitoring. With all of these duties, the JobTracker could not focus fully on job scheduling, so the different nodes were not utilized to the fullest, and the system's scalability was limited.
The MapReduce engine devoted the nodes of the cluster to a single processing model. Even as the size of a Hadoop cluster increased, the cluster could not be employed to work with different models.
Problems with real-time processing
MapReduce is batch driven: processing and analysis are done in batches, and results arrive only after several hours. If we need real-time analysis, for example in fraud detection, those batch results are of no use.
There are several other issues with MapReduce (Hadoop 1.0) that need to be taken into consideration, such as running ad-hoc queries, cascading failures, inefficient utilization of resources, problems with message passing, and problems running non-MapReduce applications.
Components of YARN
The YARN framework consists of a master daemon called the "Resource Manager", a slave daemon called the "Node Manager", and an "Application Master" per application.
The Resource Manager is the rack-aware master daemon in YARN. It is responsible for arbitrating system resources and assigning them to applications: competing applications get resources only as the Resource Manager adjudges.
The Resource Manager has two components:
The Scheduler's job is to assign resources to the running applications. It is a pure scheduler: it does no tracking or monitoring of applications, so it can do nothing about failures due to hardware or application errors, and it does not restart failed tasks.
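The scheduler implementation is pluggable. As a hedged sketch (exact defaults vary by Hadoop version), a yarn-site.xml fragment along these lines selects the CapacityScheduler:

```xml
<!-- yarn-site.xml (fragment): choose the pluggable scheduler implementation. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- CapacityScheduler is the usual default; the FairScheduler class
       (…scheduler.fair.FairScheduler) can be substituted here instead. -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```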
The Application Manager is responsible for monitoring the Application Master and restarting it in case of failure of its node.
The Node Manager is the slave daemon in YARN. As a per-node agent, the Node Manager oversees the lifecycles of the containers on its node, monitors container resource usage, and communicates with the Resource Manager periodically. The Node Manager's role is much like the Task Tracker's. However, Task Trackers had a fixed number of map and reduce slots for scheduling, whereas Node Managers create Resource Containers dynamically, and those containers can be of any size. A Resource Container can be used for map tasks, reduce tasks, or tasks from another framework.
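To make the contrast with fixed slots concrete, here is a minimal toy sketch in Python (not the real YARN API; the class and attribute names are invented for illustration) of a node granting variably sized containers out of its free capacity:

```python
# Toy illustration of dynamic containers vs. MRv1's fixed map/reduce slots.
# Names (Node, Container, allocate) are invented for this sketch.
from dataclasses import dataclass

@dataclass
class Container:
    memory_mb: int
    vcores: int

class Node:
    def __init__(self, memory_mb, vcores):
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores
        self.containers = []

    def allocate(self, memory_mb, vcores):
        # Grant a container only if the node still has enough free capacity;
        # unlike a fixed slot, the requested size is arbitrary.
        if memory_mb <= self.free_memory_mb and vcores <= self.free_vcores:
            self.free_memory_mb -= memory_mb
            self.free_vcores -= vcores
            c = Container(memory_mb, vcores)
            self.containers.append(c)
            return c
        return None

node = Node(memory_mb=8192, vcores=4)
small = node.allocate(1024, 1)    # e.g. a map task
big = node.allocate(4096, 2)      # e.g. a task from another framework
too_big = node.allocate(8192, 4)  # exceeds remaining capacity -> refused
```

The same node serves a small and a large request side by side, which a fixed map/reduce slot layout could not do.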
A dedicated Application Master instance is created for every application running on Hadoop. The instance lives in its own container on one of the nodes in the cluster. Each application instance sends heartbeat messages to the Resource Manager and, if needed, requests additional resources. The Resource Manager assigns additional resources through Container Resource leases, which also serve as reservations for containers on the Node Managers.
The Application Master oversees the full lifespan of an application, from additional container requests to the Resource Manager through to release requests to the Node Manager.
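The heartbeat-and-lease cycle above can be sketched as a toy Python simulation (again invented names, not the real YARN protocol): the Application Master heartbeats to the Resource Manager, receives container leases, and later releases them.

```python
# Toy sketch of the AM <-> RM heartbeat cycle; names are invented.
class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers
        self.leases = []              # container ids currently leased out

    def heartbeat(self, requested):
        # On each AM heartbeat, grant as many requested containers as are free.
        granted = min(requested, self.free)
        self.free -= granted
        lease = list(range(len(self.leases), len(self.leases) + granted))
        self.leases.extend(lease)
        return lease

    def release(self, lease):
        # The AM hands containers back when its tasks finish.
        for c in lease:
            self.leases.remove(c)
        self.free += len(lease)

rm = ResourceManager(total_containers=10)
lease1 = rm.heartbeat(requested=4)   # AM asks for 4 containers -> gets 4
lease2 = rm.heartbeat(requested=8)   # only 6 remain free -> gets 6
rm.release(lease1)                   # finished tasks return their containers
```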
Resource Manager Restart
The Resource Manager works as the central authority for managing resources and scheduling the applications running on YARN. There are two types of Resource Manager restart:
Non-work-preserving Resource Manager Restart
This restart enhances the RM to persist application/attempt state in a pluggable state-store. On restart, the Resource Manager loads the same information back from the state-store and re-kicks the previously running applications. Users are not required to re-submit their applications.
During RM downtime, Node Managers and clients keep polling the RM until it comes back up. Once the RM is up, it re-syncs with the Node Managers and Application Masters via their heartbeat messages.
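As a hedged sketch (property names as in Hadoop 2.x), recovery is switched on and pointed at a pluggable state-store in yarn-site.xml along these lines:

```xml
<!-- yarn-site.xml (fragment): enable RM restart and pick a state-store. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <!-- ZooKeeper-backed store; FileSystemRMStateStore is another option. -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
```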
Work-preserving Resource Manager Restart
This restart focuses on reconstructing the running state of the RM by combining the container requests from Application Masters with the container statuses from Node Managers on restart. The key difference is that previously running applications are not stopped when the master restarts, so there is no loss of in-flight processing.
The RM recovers its previous running state from the container statuses the Node Managers send. When a Node Manager re-syncs with the restarted RM via heartbeat, it does not kill its containers; it keeps managing them and sends their statuses across to the RM when it registers again.
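Assuming Hadoop 2.6 or later, where the flag exists, work-preserving recovery is controlled by a single additional yarn-site.xml property (used together with the recovery settings above):

```xml
<!-- yarn-site.xml (fragment): keep running containers across an RM restart. -->
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
```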
YARN has completely changed the game for implementing and running distributed applications on a cluster of commodity servers. YARN overcomes the limitations of MapReduce and is more flexible, scalable, and efficient by comparison. Companies are migrating from MRv1 to YARN, and there is no real reason not to.
The best Big Data online courses cover everything you should know about YARN. Get familiar with the various concepts of YARN in Hadoop and take a step toward a bright Big Data Hadoop career!