Introduction to Hadoop YARN
Hadoop YARN (Yet Another Resource Negotiator) is a key component of the Hadoop ecosystem, responsible for resource management and job scheduling. It was introduced in Hadoop 2.0 to address the limitations of resource management in Hadoop MapReduce v1 (also known as MRv1 or Classic MapReduce), where a single JobTracker daemon handled both resource management and job scheduling/monitoring.
YARN introduces a two-layer architecture that separates these functions: a cluster-wide Resource Manager handles resource management, while a per-application Application Master handles job scheduling and monitoring. This separation of concerns allows YARN to support a wide range of distributed processing frameworks, including batch processing, interactive processing, real-time processing, and machine learning.
The main components of Hadoop YARN are:
- Resource Manager (RM): The central authority that allocates resources to various applications running in the Hadoop cluster.
- Node Manager (NM): The per-node agent that is responsible for launching and monitoring containers, as well as reporting resource usage and status to the Resource Manager.
- Application Master (AM): The per-application master responsible for negotiating resources from the Resource Manager and working with the Node Managers to execute and monitor the application's tasks.
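On a running cluster, the two long-lived components above show up as Java processes named ResourceManager and NodeManager in the output of the JDK's jps tool, while an Application Master exists only for the lifetime of its application. The following is a minimal sketch of a check built on that observation. The function name missing_daemons and the EXPECTED_DAEMONS variable are hypothetical helpers introduced here for illustration, not part of Hadoop.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which long-lived YARN daemons are absent.
# Assumption: input is jps-style output, one "<pid> <MainClass>" per line.

# Daemons expected on a node that runs both YARN roles (single-node setup).
EXPECTED_DAEMONS="ResourceManager NodeManager"

missing_daemons() {
  local running daemon
  # Keep only the class-name column of the jps output read from stdin.
  running=$(awk '{print $2}')
  for daemon in $EXPECTED_DAEMONS; do
    # -x requires a whole-line match, so partial names do not count.
    if ! printf '%s\n' "$running" | grep -qx "$daemon"; then
      echo "$daemon"
    fi
  done
}
```

With a JDK on the PATH, running `jps | missing_daemons` prints nothing when both daemons are up and otherwise prints the name of each missing daemon.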
YARN provides several key benefits over the previous Hadoop MapReduce framework:
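- Scalability: Because per-application scheduling and monitoring is handled by individual Application Masters rather than one central daemon, YARN can support a larger number of concurrent applications and tasks.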
- Flexibility: YARN can support a wide range of distributed processing frameworks, not just MapReduce.
- Efficiency: YARN allocates resources as generic containers rather than MRv1's fixed map and reduce slots, so multiple applications running concurrently can make fuller use of cluster resources.
To demonstrate the basic setup of a Hadoop YARN cluster, let's consider the following example using Ubuntu 22.04:
## Install Hadoop
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
## Older releases such as 3.3.4 are served from archive.apache.org
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzf hadoop-3.3.4.tar.gz
cd hadoop-3.3.4
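## Configure Hadoop environment
export HADOOP_HOME="$(pwd)"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
## Hadoop also needs JAVA_HOME; this path assumes OpenJDK 8 on amd64
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
echo "export JAVA_HOME=$JAVA_HOME" >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"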
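## Start the Hadoop YARN cluster
## (the start scripts need passwordless SSH to localhost and fs.defaultFS set in etc/hadoop/core-site.xml)
hdfs namenode -format
start-dfs.sh
start-yarn.sh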
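The start scripts above generally expect a pseudo-distributed configuration to be in place before they are run. As a minimal sketch, the helper below writes the two smallest configuration files such a setup commonly uses. The function name write_single_node_conf is hypothetical, and the values (hdfs://localhost:9000 and the mapreduce_shuffle auxiliary service) are assumptions taken from typical single-node setups rather than from this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a minimal single-node configuration.
# Assumption: overwriting core-site.xml and yarn-site.xml in the target
# directory is acceptable (a fresh Hadoop download ships them empty).

write_single_node_conf() {
  local conf_dir="$1"
  mkdir -p "$conf_dir"

  # Point the default filesystem at a NameNode on this machine.
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

  # Enable the shuffle service that MapReduce jobs on YARN rely on.
  cat > "$conf_dir/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}
```

A call such as `write_single_node_conf "$HADOOP_HOME/etc/hadoop"` would be made before formatting the NameNode. Because the helper replaces any existing files of the same name, it is safest to try it on a scratch directory first.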
This setup will start a basic Hadoop YARN cluster with a single node. You can then use the yarn command to interact with the cluster and submit applications for execution.
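As a first interaction with the cluster, it can help to ask the Resource Manager what it currently knows about. The sketch below wraps two standard yarn subcommands in a small function. The wrapper name inspect_yarn_cluster is hypothetical, and the sketch assumes the yarn binary is on the PATH as configured earlier.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around two standard yarn CLI subcommands.
# Assumption: $HADOOP_HOME/bin is on the PATH and the cluster is running.

inspect_yarn_cluster() {
  if ! command -v yarn >/dev/null 2>&1; then
    echo "yarn not found on PATH; check HADOOP_HOME and PATH" >&2
    return 1
  fi
  # Node Managers registered with the Resource Manager, in any state.
  yarn node -list -all
  # Applications known to the Resource Manager, in any state.
  yarn application -list -appStates ALL
}
```

Calling `inspect_yarn_cluster` on the single-node setup should list one node. The application list stays empty until something has been submitted.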
graph TD
A[Client] -->|submits application| B[Resource Manager]
B -->|allocates container| C[Node Manager]
C -->|launches| D[Application Master]
D -->|negotiates resources| B
D -->|runs tasks via Node Managers| E[Task Containers]