Data Mastery in Hive Arena


Introduction

In a futuristic world where data reigns supreme, a grand tournament is held to determine the ultimate master of data manipulation. Contestants from all corners of the galaxy gather to showcase their skills in the art of data wrangling, with the most formidable challenge lying in the realm of Hadoop Hive Shell.

You, a promising data engineer, have been chosen to represent your home planet in this prestigious event. As you step into the arena, the imposing figure of the tournament's Head Referee greets you, "Welcome, challenger! In this decisive battle, you must conquer the depths of Hive Shell, a powerful tool that allows you to interact with Hadoop's data warehouse system. Prove your mettle by executing a series of tasks that will test your proficiency in data querying, manipulation, and analysis."

With a determined nod, you ready yourself for the challenges that lie ahead, eager to showcase your skills and bring honor to your planet.


Skills Graph

%%{init: {'theme':'neutral'}}%%
flowchart RL
    hadoop(("`Hadoop`")) -.-> hadoop/HadoopHiveGroup(["`Hadoop Hive`"])
    hadoop/HadoopHiveGroup -.-> hadoop/hive_shell("`Hive Shell`")
    subgraph Lab Skills
    hadoop/hive_shell -.-> lab-288978{{"`Data Mastery in Hive Arena`"}}
    end

Accessing the Hive Shell

In this step, you will learn how to access the Hive Shell, a command-line interface that allows you to interact with the Hive data warehouse system.

To access the Hive Shell, follow these steps:

  1. Launch a terminal window.
  2. Switch to the hadoop user by executing the following command:
su - hadoop
  3. Once you're in the hadoop user's environment, start the Hive Shell by running the following command:
hive

You should see the Hive Shell prompt (hive>), indicating that you have successfully entered the Hive Shell.

Here's an example of what you should see in your terminal:

[hadoop@localhost ~]$ hive
Hive Session ID = 2bd4d0ce-9a9a-4c4e-a8f4-c5a6c7b9a8b9

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true

hive>

Now you're ready to start executing Hive queries and commands within the Hive Shell.
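As a quick sanity check (assuming a default Hive installation), you can list the databases visible to your session and then leave the shell when you are done:

```sql
-- A fresh installation should show at least the built-in `default` database
SHOW DATABASES;

-- Exit the Hive Shell and return to the terminal
quit;
```

Both `quit;` and `exit;` end the session; you will use `quit;` again at the end of the lab.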

Creating a Database and Table

In this step, you will learn how to create a new database and table within the Hive Shell.

  1. First, create a new database named tournament_db by executing the following command in the Hive Shell:
CREATE DATABASE tournament_db;
  2. Next, switch to the newly created database:
USE tournament_db;
  3. Create a new table named contestants with the following schema:
CREATE TABLE contestants (
  id INT,
  name STRING,
  planet STRING,
  score INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

This table will store information about the tournament contestants, including their ID, name, home planet, and score.

  4. To verify that the table was created successfully, you can run the following command to describe the table schema:
DESCRIBE contestants;

You should see the table schema displayed in the Hive Shell output.

Here's an example of what you should see:

hive> DESCRIBE contestants;
OK
id                   int                 None
name                 string               None
planet               string               None
score                int                  None
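Besides DESCRIBE, Hive offers a few other commands for inspecting what you just created; a short sketch (the LIKE pattern is illustrative):

```sql
-- Confirm the database exists
SHOW DATABASES LIKE 'tournament_db';

-- List the tables in the current database
SHOW TABLES;

-- Show extended metadata: storage location, file format, and more
DESCRIBE FORMATTED contestants;
```

DESCRIBE FORMATTED is especially useful later for finding where Hive stores the table's data in HDFS.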

Loading Data Into the Table

Now that you have created the contestants table, it's time to populate it with data.

  1. Create a new file named contestants.csv in the /home/hadoop/ directory with the following content:
1,Zara,Andromeda,85
2,Brock,Celestia,92
3,Kira,Nebula,78
4,Tara,Vega,96
5,Zion,Altair,83
  2. Back in the Hive Shell, make sure you are using the tournament_db database:
USE tournament_db;
  3. Load the data from the contestants.csv file into the contestants table using the following command in the Hive Shell:
LOAD DATA LOCAL INPATH '/home/hadoop/contestants.csv' OVERWRITE INTO TABLE contestants;

This command will load the data from the local file contestants.csv into the contestants table, overwriting any existing data.

  4. To verify that the data was loaded correctly, you can run a SELECT query:
SELECT * FROM contestants;

You should see the following output:

hive> SELECT * FROM contestants;
OK
1 Zara Andromeda 85
2 Brock Celestia 92
3 Kira Nebula 78
4 Tara Vega 96
5 Zion Altair 83
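Step 1 above is done from a regular terminal, not the Hive prompt. A minimal sketch using a heredoc (assuming you are logged in as the hadoop user, so `$HOME` is /home/hadoop):

```shell
# Write the five contestant records to contestants.csv
# in the current user's home directory
cat > "$HOME/contestants.csv" <<'EOF'
1,Zara,Andromeda,85
2,Brock,Celestia,92
3,Kira,Nebula,78
4,Tara,Vega,96
5,Zion,Altair,83
EOF
```

You could equally create the file with a text editor such as nano or vim; only the resulting comma-separated content matters to Hive.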

Querying the Data

With the data loaded into the contestants table, you can now perform various queries to analyze and manipulate the data.

  1. To display the names and scores of all contestants from the planet 'Andromeda', run the following query:
SELECT name, score FROM contestants WHERE planet = 'Andromeda';

You should see the following output:

hive> SELECT name, score FROM contestants WHERE planet = 'Andromeda';
OK
Zara 85
  2. To calculate the average score of all contestants, run the following query:
SELECT AVG(score) AS average_score FROM contestants;

You should see the following output:

hive> SELECT AVG(score) AS average_score FROM contestants;
OK
86.8
  3. To find the contestant with the highest score, sort the rows by score in descending order and keep only the first one:
SELECT name, score FROM contestants ORDER BY score DESC LIMIT 1;

You should see the following output:

hive> SELECT name, score FROM contestants ORDER BY score DESC LIMIT 1;
OK
Tara    96

Feel free to experiment with more queries and explore the data in the contestants table using various clauses and functions provided by Hive.
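For instance, you could rank all contestants by score or aggregate by planet; two hedged sketches (the column aliases are illustrative):

```sql
-- Rank contestants from highest to lowest score
SELECT name, planet, score
FROM contestants
ORDER BY score DESC;

-- Count contestants and compute the average score per planet
SELECT planet, COUNT(*) AS num_contestants, AVG(score) AS avg_score
FROM contestants
GROUP BY planet;
```

With this sample data each planet has one contestant, so the per-planet averages equal the individual scores, but the same query scales to larger datasets.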

Summary

In this lab, you embarked on a journey into the world of Hadoop Hive Shell, a powerful tool for data manipulation and analysis. Through a series of immersive challenges, you mastered the art of accessing the Hive Shell, creating databases and tables, loading data, and querying the data using various clauses and functions.

The lab's design, inspired by a futuristic tournament, aimed to create an engaging and motivating learning experience. By assuming the role of a contestant representing your home planet, you were driven to excel in each challenge, honing your skills and demonstrating your prowess in data wrangling.

Throughout the lab, you encountered hands-on tasks that required you to execute commands, write queries, and analyze the results. The step-by-step instructions, accompanied by code examples and detailed explanations, ensured a smooth learning curve, allowing you to progressively build your knowledge and confidence.

The inclusion of checker scripts at the end of each step provided an opportunity for self-assessment, enabling you to validate your understanding and ensure that you had successfully completed the tasks. These checkers played a crucial role in reinforcing your learning and ensuring that you were on the right track.

Overall, this lab not only equipped you with the essential skills needed to work with Hadoop Hive Shell but also fostered an appreciation for the power of data manipulation and analysis in an engaging and immersive manner.
