Gem Analysis with Hadoop

HadoopHadoopBeginner
Practice Now

Introduction

In the enchanting realm of the Royal Academy of Magic, a team of skilled gemologists dedicated themselves to unraveling the mysteries of the mystical gems that powered the wizarding world. Among them was a prodigious researcher, Aria, whose unwavering curiosity and keen intellect made her a force to be reckoned with.

Recently, the academy acquired a vast trove of data pertaining to the properties and applications of these gems, a wealth of knowledge that promised to unlock new avenues of magical innovation. However, the sheer volume of information presented a daunting challenge, one that required the aid of powerful analytical tools.

Aria, undeterred, turned to the venerable Hadoop ecosystem, recognizing its potential to tame the unruly expanse of data. Her goal was twofold: to delve deeper into the intricacies of the gems' behavior and to uncover hidden patterns that could revolutionize the way magic was harnessed and utilized.


Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL hadoop(("`Hadoop`")) -.-> hadoop/HadoopHiveGroup(["`Hadoop Hive`"]) hadoop/HadoopHiveGroup -.-> hadoop/distribute_by("`distribute by Usage`") subgraph Lab Skills hadoop/distribute_by -.-> lab-288967{{"`Gem Analysis with Hadoop`"}} end

Exploring the Gem Dataset

In this step, we will familiarize ourselves with the gem dataset and its structure, laying the groundwork for our subsequent analysis.
First, ensure you are logged in as the hadoop user by running the following command in the terminal:

su - hadoop

Now let's start by creating an example. Copy the following command line into the terminal to create our sample file.

mkdir -p hadoop/gemstone_data
cd hadoop/gemstone_data
echo "gem_id,gem_name,color,hardness,density,refractive_index" > gem_properties.csv
echo "1,Ruby    ,Red      ,9.0 ,4.0,1.77" >> gem_properties.csv
echo "2,Emerald ,Green    ,8.0 ,3.1,1.58" >> gem_properties.csv
echo "3,Sapphire,Blue     ,9.0 ,4.0,1.76" >> gem_properties.csv
echo "4,Diamond ,Colorless,10.0,3.5,2.42" >> gem_properties.csv
echo "5,Amethyst,Purple   ,7.0 ,2.6,1.54" >> gem_properties.csv
echo "6,Topaz   ,Yellow   ,8.0 ,3.5,1.63" >> gem_properties.csv
echo "7,Pearl   ,White    ,2.5 ,2.7,1.53" >> gem_properties.csv
echo "8,Agate   ,Multi    ,7.0 ,2.6,1.53" >> gem_properties.csv
echo "9,Rose    ,Pink     ,7.0 ,2.7,1.54" >> gem_properties.csv
echo "10,CatsEye,Green    ,6.5 ,3.2,1.54" >> gem_properties.csv
echo "gem_id,application" > gem_applications.csv
echo "1,Fire Magic       " >> gem_applications.csv
echo "2,Earth Magic      " >> gem_applications.csv
echo "3,Water Magic      " >> gem_applications.csv
echo "4,Enhancement Magic" >> gem_applications.csv
echo "5,Psychic Magic    " >> gem_applications.csv
echo "6,Lightning Magic  " >> gem_applications.csv
echo "7,Illusion Magic   " >> gem_applications.csv
echo "8,Strength Magic   " >> gem_applications.csv
echo "9,Love Magic       " >> gem_applications.csv
echo "10,Stealth Magic   " >> gem_applications.csv

Now we are already in the directory gemstone_data, let's take a moment to review the contents of this directory:

ls

As you navigate through the directory, you'll see these two files, each dedicated to a distinct aspect of gemstone data. gem_properties.csv delves into the physical characteristics of gems, whereas gem_applications.csv provides insights into their varied magical uses.

To gain deeper insights into our dataset, let's have a look at the first few lines of one of these files:

head -n 5 gem_properties.csv

The result should be like as follows:

gem_id,gem_name,color,hardness,density,refractive_index
1,Ruby    ,Red      ,9.0 ,4.0,1.77
2,Emerald ,Green    ,8.0 ,3.1,1.58
3,Sapphire,Blue     ,9.0 ,4.0,1.76
4,Diamond ,Colorless,10.0,3.5,2.42

This command displayed the first five lines of the gem_properties.csv file, giving you a glimpse into its structure and contents.

Creating a Hive Table

Now that we have familiarized ourselves with the dataset, let's move on to creating a Hive table to store and query the gem properties data.

First, launch the Hive CLI:

hive

Once inside the Hive CLI, create a new database called gemstone_analysis to store our gemstone data:

CREATE DATABASE gemstone_analysis;

Next, switch to the newly created database:

USE gemstone_analysis;

With the database set, let's proceed to create a table named gem_properties to store the data from the gem_properties.csv file. This table will contain information about gem ID, name, color, hardness, density, and refractive index:

CREATE TABLE gem_properties (
  gem_id INT COMMENT 'Unique ID for the gem',
  gem_name STRING COMMENT 'Name of the gem',
  color STRING COMMENT 'Color of the gem',
  hardness FLOAT COMMENT 'Hardness of the gem',
  density FLOAT COMMENT 'Density of the gem',
  refractive_index FLOAT COMMENT 'Refractive index of the gem'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');

This statement defines the structure of the gem_properties table, specifying the column names and data types. The ROW FORMAT clause indicates that the data is delimited by commas, and the STORED AS TEXTFILE clause specifies that the data should be stored as a plain text file.

Now, let's load the data from the gem_properties.csv file into the newly created table gem_properties:

LOAD DATA LOCAL INPATH '/home/hadoop/hadoop/gemstone_data/gem_properties.csv'
INTO TABLE gem_properties;

This command loads the data from the specified file path into the gem_properties table.

Once the data is loaded, let's verify that it was loaded correctly by running a simple SELECT query to display the first five rows of the gem_properties table:

SELECT * FROM gem_properties LIMIT 5;

This query will display the first five rows of the gem_properties table, allowing us to confirm that the data was loaded correctly. The result should be as follows:

OK
1	Ruby    	Red      	9.0	4.0	1.77
2	Emerald 	Green    	8.0	3.1	1.58
3	Sapphire	Blue     	9.0	4.0	1.76
4	Diamond 	Colorless	10.0	3.5	2.42
5	Amethyst	Purple   	7.0	2.6	1.54
Time taken: 1.234 seconds, Fetched: 5 row(s)

Analyzing Gem Properties

In this step, we will harness the power of the DISTRIBUTE BY clause in Hive to efficiently analyze gem properties data.

To begin, let's create a new table named distributed_gem_properties to partition data by the gem_name column:

(It may take a little time please be patient)

CREATE TABLE distributed_gem_properties
AS
SELECT * FROM gem_properties
DISTRIBUTE BY gem_id;

This table mirrors the structure of the original gem_properties table but incorporates additional clauses to organize and distribute data effectively:

AS SELECT * FROM gem_properties: This part of the statement selects all columns and rows from the existing gem_properties table. We're essentially copying the data from the original table to our new distributed table.

DISTRIBUTE BY gem_id: This clause specifies the distribution of data across reducers based on the gem_id column. By distributing the data in this way, we can leverage parallel processing to analyze gem properties more efficiently.

Now that we have our distributed table set up, we can proceed to perform various analyses on gem properties data with improved efficiency.

Let's execute a query to find gemstones with hardness between 8.0 and 9.0, copy and paste this code into the terminal to run.

SELECT * FROM distributed_gem_properties WHERE hardness BETWEEN 8.0 AND 9.0;

This query will retrieve gemstone data from the distributed table where the hardness falls within the specified range. The result should be as follows:

OK
6	Topaz   	Yellow   	8.0	3.5	1.63
3	Sapphire	Blue     	9.0	4.0	1.76
2	Emerald 	Green    	8.0	3.1	1.58
1	Ruby    	Red      	9.0	4.0	1.77
Time taken: 0.388 seconds, Fetched: 4 row(s)

Next, let's try out the following code snippet, which utilizes GROUP BY.

SELECT color, COUNT(*) AS count FROM distributed_gem_properties GROUP BY color;

This code is a query statement used to count the number of gemstones for each color in the distributed table distributed_gem_properties. The result should be as follows:

OK
Blue     	1
Colorless	1
Green    	2
Multi    	1
Pink     	1
Purple   	1
Red      	1
White    	1
Yellow   	1
Time taken: 18.566 seconds, Fetched: 9 row(s)

Here's an another example query using PARTITION BY:

SELECT gem_name, color, COUNT(*) OVER(PARTITION BY color) AS color_count
FROM distributed_gem_properties;

This query selects the gem_name and color from the distributed_gem_properties table and counts the number of gemstones for each color. The PARTITION BY clause specifies which column to partition the count by, in this case, the color column. This will generate a separate count for each color of gemstones. The result should be as follows:

OK
Sapphire	Blue     	1
Diamond 	Colorless	1
Emerald 	Green    	2
CatsEye	Green    	2
Agate   	Multi    	1
Rose    	Pink     	1
Amethyst	Purple   	1
Ruby    	Red      	1
Pearl   	White    	1
Topaz   	Yellow   	1
Time taken: 20.865 seconds, Fetched: 10 row(s)

These are just two simple examples to help us gain a preliminary understanding of the analysis and application of tables distributed by DISTRIBUTE BY, incorporating commands such as PARTITION BY and GROUP BY. This concludes this lesson, and I believe you now have a certain understanding.

Summary

In this lab, we embarked on a captivating journey into the realm of gemstone analysis, guided by the wizarding world's talented gemologist, Aria. By harnessing the power of Hadoop and Hive, we delved into a vast trove of data, uncovering the secrets hidden within the mystical gems that fuel the magic of this enchanting realm.

Through a series of carefully crafted steps, we explored the intricacies of the gemstone dataset, created a robust Hive table to store and query the data, and ultimately leveraged the DISTRIBUTE BY clause to optimize our analysis. This powerful technique enabled us to partition and distribute the data efficiently, paving the way for faster and more focused queries, unlocking invaluable insights into the properties and applications of these remarkable gems.

Throughout this lab, we not only gained practical experience with Hadoop and Hive but also witnessed the transformative potential of big data analytics in unraveling the mysteries of the wizarding world. By mastering these tools, we have equipped ourselves with the knowledge and skills necessary to embark on new adventures, where data holds the key to unlocking boundless possibilities and pushing the boundaries of magical innovation.

Other Hadoop Tutorials you may like