Hadoop YARN Basic Setup


Introduction

In a futuristic robot factory, where cutting-edge technology meets precision engineering, you take on the role of a robot maintenance technician. Your primary goal is to ensure the efficient allocation and management of computing resources within the factory's intricate network. This network powers the robots' cognitive functions, enabling them to perform complex tasks with unparalleled accuracy and speed.

The factory's computing infrastructure relies on the Hadoop ecosystem, specifically the YARN (Yet Another Resource Negotiator) component. Your objective is to master the basic setup of Hadoop YARN, allowing you to seamlessly distribute and manage the factory's computational workloads across multiple nodes, ensuring optimal performance and resource utilization.



Explore the YARN Architecture

In this step, we will explore the YARN architecture and its key components, laying the foundation for understanding how it manages and allocates resources within the Hadoop ecosystem.

The YARN architecture consists of two main components:

  1. ResourceManager (RM): The ResourceManager acts as the central authority that arbitrates and allocates available resources (CPU, memory, etc.) across the cluster. It consists of two components:

    • Scheduler: Responsible for allocating resources to the various running applications based on predefined scheduling policies.
    • ApplicationsManager: Responsible for accepting job submissions, negotiating the first resource container for executing the ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
  2. NodeManager (NM): The NodeManager runs on each node in the cluster and is responsible for managing the node's resources and monitoring the containers running on that node.

To better understand the YARN architecture, let's navigate to the Hadoop configuration directory and examine the relevant configuration files.

First, we need to switch to the hadoop user:

su - hadoop

Navigate to the Hadoop configuration directory:

cd /home/hadoop/hadoop/etc/hadoop/

Open the yarn-site.xml file with vim:

vim yarn-site.xml

You should see the following configuration:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

In this configuration file, we can see the mapreduce_shuffle auxiliary service is enabled for the NodeManager. This service is responsible for managing the shuffle operations in MapReduce jobs, ensuring efficient data transfer between the map and reduce phases.
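As a quick sanity check, you can extract this setting from the file without opening an editor. The snippet below writes a sample copy of the configuration to a temporary file so it is self-contained; on a real cluster you would point the grep pipeline at /home/hadoop/hadoop/etc/hadoop/yarn-site.xml instead:

```shell
# Create a sample yarn-site.xml (stand-in for the real config file)
cat > /tmp/yarn-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

# Pull out the value of the yarn.nodemanager.aux-services property:
# grep the <name> line plus the line after it, then strip the <value> tags
grep -A1 '<name>yarn.nodemanager.aux-services</name>' /tmp/yarn-site-sample.xml \
  | grep -o '<value>[^<]*</value>' \
  | sed 's/<value>//; s/<\/value>//'
```

The pipeline prints `mapreduce_shuffle`, confirming the auxiliary service is configured.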

Start the YARN Services

Now that we have explored the YARN architecture and its configuration, let's start the YARN services on our Hadoop cluster.

Firstly, start the YARN services using the following command:

start-yarn.sh

This script will start the ResourceManager and NodeManager daemons on the appropriate nodes in the cluster.

View the process of the YARN services using the following command:

jps

The NodeManager and ResourceManager services should be visible in the output.

You can check the status of the YARN services using the following command:

yarn node -list

This command will display a list of active NodeManagers in the cluster, along with their status and available resources.

2024-03-17 19:27:30,108 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
Total Nodes:1
         Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
iZj6cdofomqja8ye7wk8kzZ:43689	        RUNNING	iZj6cdofomqja8ye7wk8kzZ:8042	                           0

In the output above, we can see that there is one active NodeManager running.
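If you want to script against this report, for example to alert when a node drops out, you can count the entries in the RUNNING state with awk. The sample report below is embedded in a here-doc (with a hypothetical hostname, node1) so the snippet runs standalone; in practice you would pipe `yarn node -list` straight into the awk command:

```shell
# Sample node report (in practice: yarn node -list | awk ...)
cat > /tmp/node-report.txt <<'EOF'
Total Nodes:1
         Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
node1:43689	        RUNNING	node1:8042	                           0
EOF

# Count NodeManagers whose second column (Node-State) is RUNNING
awk '$2 == "RUNNING" { count++ } END { print count + 0 }' /tmp/node-report.txt
```

Here the command prints `1`, matching the single active NodeManager shown above.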

Submit a YARN Job

With the YARN services up and running, let's submit a sample job to test the resource allocation and scheduling capabilities of YARN.

First, prepare an input text file called input.txt containing the text to be counted, and upload it to the Hadoop file system:

echo -e "Hello World\nHello Hadoop\nYARN is cool" > input.txt
hadoop fs -put input.txt /input.txt

The JAR file for the example programs can be found in the Hadoop installation directory, typically at $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar. Use this JAR file to run the Word Count program:

yarn jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /input.txt /output

This command will submit the MapReduce job to the YARN ResourceManager, which will allocate resources and schedule the job across the available NodeManagers.

Once the job is complete, you can view the output in the /output directory:

hdfs dfs -cat /output/part-r-00000

This should display the word count output:

Hadoop	1
Hello	2
World	1
YARN	1
cool	1
is	1
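You can cross-check this result locally with standard Unix tools. The pipeline below recreates the same input file and reproduces the per-word counts (uniq -c emits counts sorted by word, in the same key order as the MapReduce output):

```shell
# Recreate the input locally (same content as the file uploaded to HDFS)
printf 'Hello World\nHello Hadoop\nYARN is cool\n' > input.txt

# Split into one word per line, sort, and count occurrences of each word.
# "Hello" appears twice; every other word appears once.
tr ' ' '\n' < input.txt | sort | uniq -c
```

If the counts here disagree with the job output in HDFS, that usually means the input file was changed between the upload and the job run.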

Summary

In this lab, we explored the YARN architecture and its key components, learned how to configure and start the YARN services, and submitted a sample MapReduce job to the YARN cluster. By completing this lab, you have gained hands-on experience with the basic setup and operation of Hadoop YARN, enabling you to manage and allocate computing resources efficiently in a distributed environment.

The lab not only provided a practical understanding of YARN but also highlighted the importance of resource management and scheduling in modern computing infrastructures. As a robot maintenance technician in a futuristic factory, mastering these skills will empower you to optimize the performance and efficiency of the factory's computing resources, ensuring smooth and reliable operations for the robots.
