Exploring the Gem Dataset
In this step, we will familiarize ourselves with the gem dataset and its structure, laying the groundwork for our subsequent analysis.
First, ensure you are logged in as the hadoop user by running the following command in the terminal:
su - hadoop
Now let's create the sample data. Copy the following commands into the terminal to create a working directory and two sample CSV files.
mkdir -p hadoop/gemstone_data
cd hadoop/gemstone_data
echo "gem_id,gem_name,color,hardness,density,refractive_index" > gem_properties.csv
echo "1,Ruby ,Red ,9.0 ,4.0,1.77" >> gem_properties.csv
echo "2,Emerald ,Green ,8.0 ,3.1,1.58" >> gem_properties.csv
echo "3,Sapphire,Blue ,9.0 ,4.0,1.76" >> gem_properties.csv
echo "4,Diamond ,Colorless,10.0,3.5,2.42" >> gem_properties.csv
echo "5,Amethyst,Purple ,7.0 ,2.6,1.54" >> gem_properties.csv
echo "6,Topaz ,Yellow ,8.0 ,3.5,1.63" >> gem_properties.csv
echo "7,Pearl ,White ,2.5 ,2.7,1.53" >> gem_properties.csv
echo "8,Agate ,Multi ,7.0 ,2.6,1.53" >> gem_properties.csv
echo "9,Rose ,Pink ,7.0 ,2.7,1.54" >> gem_properties.csv
echo "10,CatsEye,Green ,6.5 ,3.2,1.54" >> gem_properties.csv
echo "gem_id,application" > gem_applications.csv
echo "1,Fire Magic " >> gem_applications.csv
echo "2,Earth Magic " >> gem_applications.csv
echo "3,Water Magic " >> gem_applications.csv
echo "4,Enhancement Magic" >> gem_applications.csv
echo "5,Psychic Magic " >> gem_applications.csv
echo "6,Lightning Magic " >> gem_applications.csv
echo "7,Illusion Magic " >> gem_applications.csv
echo "8,Strength Magic " >> gem_applications.csv
echo "9,Love Magic " >> gem_applications.csv
echo "10,Stealth Magic " >> gem_applications.csv
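As a quick sanity check that every echo command ran, the data rows in each file can be counted. The sketch below assumes it is run from the gemstone_data directory; count_data_rows is a hypothetical helper name introduced here, not part of the lab.

```shell
# Hypothetical helper: print the number of data rows (lines after the header) in a CSV file.
count_data_rows() {
    # tail -n +2 skips the header line; wc -l counts what is left.
    # tr removes the padding some wc implementations add around the number.
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Report on each file from this step, if it exists in the current directory.
for f in gem_properties.csv gem_applications.csv; do
    if [ -f "$f" ]; then
        echo "$f: $(count_data_rows "$f") data rows"
    else
        echo "$f: not found (run the commands above first)"
    fi
done
```

Each file is written with one header line followed by ten records, so both should report 10 data rows.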
Since we are already in the gemstone_data directory, let's take a moment to review its contents:
ls
The listing shows two files, each covering a distinct aspect of the gemstone data: gem_properties.csv describes the physical characteristics of the gems, whereas gem_applications.csv records their magical uses.
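To see what each file holds without reading the data itself, the header line can be split into one column name per line. This is a minimal sketch that assumes the current directory is gemstone_data; list_columns is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: print each column name of a CSV file on its own line.
list_columns() {
    # head -n 1 keeps only the header; tr turns the commas into newlines.
    head -n 1 "$1" | tr ',' '\n'
}

# Show the columns of both files from this step, if they exist.
for f in gem_properties.csv gem_applications.csv; do
    if [ -f "$f" ]; then
        echo "== $f =="
        list_columns "$f"
    fi
done
```

Both headers start with gem_id, which is the column the two files have in common.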
To gain deeper insights into our dataset, let's have a look at the first few lines of one of these files:
head -n 5 gem_properties.csv
The output should look like this:
gem_id,gem_name,color,hardness,density,refractive_index
1,Ruby ,Red ,9.0 ,4.0,1.77
2,Emerald ,Green ,8.0 ,3.1,1.58
3,Sapphire,Blue ,9.0 ,4.0,1.76
4,Diamond ,Colorless,10.0,3.5,2.42
This command displays the first five lines of the gem_properties.csv file (the header plus four data rows), giving you a glimpse of its structure and contents.
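The rows as written above carry padding spaces inside some fields (for example, a space after Ruby and after 9.0). If later steps compare or sort on these values, the padding may matter. The following is a minimal sketch for stripping it with awk, where trim_csv_fields is a hypothetical helper name and the simple comma split assumes no quoted fields that contain commas.

```shell
# Hypothetical helper: strip leading and trailing spaces from every
# comma-separated field read from standard input.
trim_csv_fields() {
    awk 'BEGIN { FS = OFS = "," }
         {
             # Remove spaces at the start or end of each field in turn.
             for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
             print
         }'
}

# Preview the cleaned-up header and first two records, if the file exists.
if [ -f gem_properties.csv ]; then
    head -n 3 gem_properties.csv | trim_csv_fields
fi
```

Spaces inside a value, such as the one in Fire Magic, are left alone; only the padding next to the commas and at the line ends is removed.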