Introduction
Welcome to the Intergalactic Trade Station, a bustling hub where merchants and travelers from across the galaxy converge to exchange goods and services. As a skilled Space Station Mechanic, your expertise is in high demand to keep the station's systems running smoothly. Today, you've been tasked with analyzing and optimizing the station's resource allocation by sorting data based on usage patterns.
Your goal is to develop a Hadoop-based solution that processes and sorts large datasets, so that the station's resources are allocated efficiently as the demands of its diverse visitors change.
Set up the Environment
In this step, we'll set up the environment for our Hadoop project and create a sample dataset.
- Open a terminal and switch to the hadoop user by running the following command:
su - hadoop
- Create a new directory called sorting_lab in the /home/hadoop directory:
mkdir /home/hadoop/sorting_lab
- Navigate to the sorting_lab directory:
cd /home/hadoop/sorting_lab
- Create a sample dataset by running the following command:
echo -e "apple\t5\nbanana\t3\norange\t7\ngrape\t2\nstrawberry\t6" > fruit_sales.txt
This command creates a file named fruit_sales.txt with the following contents:
apple 5
banana 3
orange 7
grape 2
strawberry 6
Each line in the file represents a fruit and its sales count, separated by a tab character.
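Because the Hive table in the next step relies on the tab delimiter, it can help to check the file before loading it. The following is a minimal sketch, not part of the original lab: it recreates the sample data in a scratch directory (the workdir and bad_lines names are illustrative assumptions) and counts the lines that do not have exactly two tab-separated fields.

```shell
# Recreate the sample file in a scratch directory (an assumed location,
# chosen so the lab's own copy in /home/hadoop/sorting_lab is left untouched).
workdir="$(mktemp -d)"
printf 'apple\t5\nbanana\t3\norange\t7\ngrape\t2\nstrawberry\t6\n' > "$workdir/fruit_sales.txt"

# Count the lines that do NOT have exactly two tab-separated fields.
# "n + 0" forces a numeric 0 when no malformed line was seen.
bad_lines="$(awk -F'\t' 'NF != 2 { n++ } END { print n + 0 }' "$workdir/fruit_sales.txt")"
echo "malformed lines: $bad_lines"
```

For the sample data this reports zero malformed lines. To check the lab's own copy, point awk at /home/hadoop/sorting_lab/fruit_sales.txt instead.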
Load Data into Hive
In this step, we'll create a Hive table and load the sample dataset into it.
- Start the Hive shell by running the following command:
hive
- Create a new database called sorting_db:
CREATE DATABASE sorting_db;
- Use the sorting_db database:
USE sorting_db;
- Create a new table called fruit_sales with two columns, fruit (string) and count (int):
CREATE TABLE fruit_sales (fruit STRING, count INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
- Load the fruit_sales.txt file into the fruit_sales table:
LOAD DATA LOCAL INPATH '/home/hadoop/sorting_lab/fruit_sales.txt' OVERWRITE INTO TABLE fruit_sales;
- Verify that the data was loaded correctly by running a SELECT query:
SELECT * FROM fruit_sales;
This should output:
apple 5
banana 3
orange 7
grape 2
strawberry 6
- Exit the Hive shell by running the following command:
quit;
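The statements in this step can also be run non-interactively, which is convenient when the load has to be repeated. The sketch below is one possible approach, not the lab's prescribed method: it writes the statements to a script file and passes it to the Hive CLI with the -f option. The file name load_fruit_sales.hql and the IF NOT EXISTS guards are assumptions added here so the script can be rerun; they do not appear in the lab.

```shell
# Write the statements from this step to a script file in a scratch directory.
# The file name load_fruit_sales.hql is an assumed, illustrative choice.
workdir="$(mktemp -d)"
cat > "$workdir/load_fruit_sales.hql" <<'EOF'
CREATE DATABASE IF NOT EXISTS sorting_db;
USE sorting_db;
CREATE TABLE IF NOT EXISTS fruit_sales (fruit STRING, count INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/sorting_lab/fruit_sales.txt'
  OVERWRITE INTO TABLE fruit_sales;
SELECT * FROM fruit_sales;
EOF

# Run the script only when the hive CLI is on the PATH; otherwise just report
# where the script was written.
if command -v hive >/dev/null 2>&1; then
  hive -f "$workdir/load_fruit_sales.hql"
else
  echo "hive not found; script written to $workdir/load_fruit_sales.hql"
fi
```

Because the heredoc delimiter is quoted ('EOF'), the shell leaves the '\t' in the script untouched, so Hive receives it exactly as typed in the interactive session.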
Sort Data by Usage
In this step, we'll sort the fruit_sales table by the count column in descending order using Hive's ORDER BY clause.
- Start the Hive shell by running the following command:
hive
- Use the sorting_db database:
USE sorting_db;
- Run the following query to sort the fruit_sales table by the count column in descending order:
CREATE TABLE result AS
SELECT * FROM fruit_sales ORDER BY count DESC;
SELECT * FROM result;
This should output:
orange 7
strawberry 6
apple 5
banana 3
grape 2
- Exit the Hive shell by running the following command:
quit;
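To cross-check the expected ordering without Hive, the same descending sort can be reproduced on the raw file with the standard sort utility. This is a sketch for comparison only, assuming the five sample rows from the setup step. It does not show how Hive executes ORDER BY, and the workdir, tab, and sorted names are illustrative.

```shell
# Recreate the sample data in a scratch directory (an assumed location).
workdir="$(mktemp -d)"
printf 'apple\t5\nbanana\t3\norange\t7\ngrape\t2\nstrawberry\t6\n' > "$workdir/fruit_sales.txt"

# -t sets the tab as the field separator, -k2,2 restricts the sort key to the
# second field, n compares numerically, and r reverses to descending order.
tab="$(printf '\t')"
sorted="$(sort -t "$tab" -k2,2nr "$workdir/fruit_sales.txt")"
printf '%s\n' "$sorted"
```

The rows come out in the same order as the Hive result above: orange, strawberry, apple, banana, grape.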
Summary
In this lab, we sorted data by usage with Apache Hive on Hadoop. We started by setting up the environment and creating a sample dataset. Then, we loaded the data into a Hive table and sorted the table by the count column in descending order using the ORDER BY clause.
The lab provided hands-on experience in working with Hive and demonstrated how to sort data based on usage patterns. By mastering this skill, you can efficiently analyze and optimize resource allocation in various scenarios, such as the Intergalactic Trade Station.
Throughout the lab, we also used checkers to verify the successful completion of each step, ensuring that you have gained the necessary knowledge and practical experience to tackle similar challenges in the future.



