Discovering Ancient Treasures With Hadoop

Introduction

Imagine a scenario where you are a marine archaeologist exploring the depths of the "Sunken Treasure Bay." This mystical bay is rumored to hold secrets of ancient civilizations, their treasures, and artifacts lost beneath the waves. Your goal is to uncover these treasures and piece together the historical puzzle that lies hidden beneath the ocean's surface.

As you embark on your underwater expedition, you'll need to analyze the various artifacts and relics you discover, carefully studying their characteristics and patterns. This is where the power of Hadoop Window Functions comes into play, allowing you to perform complex data analysis and unravel the mysteries of the past.


Setting Up the Data

In this step, we'll create a dataset that simulates the artifacts and relics you've discovered during your underwater exploration. This dataset will serve as the foundation for our analysis using Hadoop Window Functions.

First, ensure you are logged in as the hadoop user by running the following command in the terminal:

su - hadoop

Then, let's create a new directory called artifacts inside the /home/hadoop directory:

mkdir /home/hadoop/artifacts
cd artifacts/

Next, create a new file named artifacts.txt inside the artifacts directory with the following content:

1,Gold,2500,15,A1
2,Silver,1800,20,B2
3,Bronze,3000,12,A1
4,Ceramic,1200,18,C3
5,Gold,2800,22,B2
6,Silver,2000,16,A1
7,Bronze,2500,25,C3
8,Ceramic,1500,14,A1
9,Gold,3200,21,B2
10,Silver,1700,19,C3

This file represents a collection of artifacts, one per line, with a unique artifact ID, the material, the age, the depth at which it was found, and a location coordinate.
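
If you'd rather create the file from the terminal than from a text editor, a heredoc works too (a minimal sketch, assuming a bash shell):

cat > /home/hadoop/artifacts/artifacts.txt << 'EOF'
1,Gold,2500,15,A1
2,Silver,1800,20,B2
3,Bronze,3000,12,A1
4,Ceramic,1200,18,C3
5,Gold,2800,22,B2
6,Silver,2000,16,A1
7,Bronze,2500,25,C3
8,Ceramic,1500,14,A1
9,Gold,3200,21,B2
10,Silver,1700,19,C3
EOF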

Exploring the Data With Window Functions

Now that we have our dataset ready, let's explore it using Hadoop Window Functions. We'll use Apache Hive, data warehouse software that facilitates querying and managing large datasets residing in distributed storage.

First, start the Hive shell by executing the following command in your terminal:

hive

Next, create a new table named artifacts to store our dataset:

CREATE TABLE artifacts (
  artifact_id INT,
  material STRING,
  age INT,
  depth INT,
  location STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Load the data from the artifacts.txt file into the artifacts table:

LOAD DATA LOCAL INPATH '/home/hadoop/artifacts/artifacts.txt' INTO TABLE artifacts;
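
Before moving on, it's worth confirming the load succeeded; with the file above, the table should contain ten rows:

SELECT COUNT(*) FROM artifacts;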

Let's start by exploring the age column using the RANK() and DENSE_RANK() window functions. These functions will help us identify the relative age ranks of the artifacts:

SELECT
  artifact_id,
  material,
  age,
  RANK() OVER (ORDER BY age DESC) AS rank_age,
  DENSE_RANK() OVER (ORDER BY age DESC) AS dense_rank_age
FROM
  artifacts;

This query will output the artifact details along with their age ranks. The difference between the two functions shows up at ties: RANK() assigns tied rows the same rank and then skips ahead, while DENSE_RANK() assigns the same rank but continues without gaps.
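
On this dataset, the output should resemble the following (the two artifacts tied at age 2500 may appear in either order). Note how RANK() jumps from 4 to 6 after the tie, while DENSE_RANK() continues with 5:

9	Gold	3200	1	1
3	Bronze	3000	2	2
5	Gold	2800	3	3
1	Gold	2500	4	4
7	Bronze	2500	4	4
6	Silver	2000	6	5
2	Silver	1800	7	6
10	Silver	1700	8	7
8	Ceramic	1500	9	8
4	Ceramic	1200	10	9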

Next, let's use the LAG() and LEAD() functions to compare the ages of consecutive artifacts:

SELECT
  artifact_id,
  material,
  age,
  LAG(age, 1) OVER (ORDER BY age DESC) AS prev_age,
  LEAD(age, 1) OVER (ORDER BY age DESC) AS next_age
FROM
  artifacts;

This query will display each artifact along with the ages of the previous and next artifacts in the ordered sequence. For the first row prev_age is NULL, and for the last row next_age is NULL, since those neighbors don't exist.
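
If a NULL is inconvenient, both functions accept an optional third argument that supplies a default value instead; here's a small sketch using 0 as an illustrative default:

SELECT
  artifact_id,
  age,
  LAG(age, 1, 0) OVER (ORDER BY age DESC) AS prev_age_or_zero
FROM
  artifacts;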

Analyzing Artifact Patterns

In this step, we'll use window functions to analyze patterns and identify significant artifacts based on various criteria.

Let's start by finding the oldest and newest artifacts for each material type:

CREATE TABLE result_1
AS
SELECT
  material,
  artifact_id,
  age,
  FIRST_VALUE(artifact_id) OVER (PARTITION BY material ORDER BY age DESC) AS oldest_artifact,
  LAST_VALUE(artifact_id) OVER (PARTITION BY material ORDER BY age DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS newest_artifact
FROM
  artifacts;

SELECT * FROM result_1;

This query uses the FIRST_VALUE() and LAST_VALUE() functions to identify the oldest and newest artifacts for each material type, respectively. Both windows order by age descending, so the first row in each partition is the oldest artifact and the last row is the newest. The explicit RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame matters: the default frame ends at the current row, so without it LAST_VALUE() would just return a value from the current row rather than the last row of the partition.
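
You can see that pitfall for yourself by dropping the explicit frame (a sketch; the alias last_in_default_frame is just illustrative). With the default frame, each row's result collapses to its own artifact_id:

SELECT
  material,
  artifact_id,
  age,
  LAST_VALUE(artifact_id) OVER (PARTITION BY material ORDER BY age DESC) AS last_in_default_frame
FROM
  artifacts;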

Next, let's find the artifacts that are significantly older or younger compared to the average age for their material type:

CREATE TABLE result_2
AS
SELECT
  artifact_id,
  material,
  age,
  AVG(age) OVER (PARTITION BY material) AS avg_age_by_material,
  CASE
    WHEN age > AVG(age) OVER (PARTITION BY material) + 200 THEN 'Significantly Older'
    WHEN age < AVG(age) OVER (PARTITION BY material) - 200 THEN 'Significantly Younger'
    ELSE 'Average'
  END AS age_category
FROM
  artifacts;

SELECT * FROM result_2;

This query calculates the average age for each material type using the AVG() window function, then categorizes each artifact: more than 200 years above the material's average is 'Significantly Older', more than 200 years below it is 'Significantly Younger', and anything in between is 'Average'.
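
For the Gold artifacts, for instance, the average age works out to roughly 2833.3, so you should see rows similar to:

1	Gold	2500	2833.33	Significantly Younger
5	Gold	2800	2833.33	Average
9	Gold	3200	2833.33	Significantly Older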

Summary

In this lab, we explored the power of Hadoop Window Functions in the context of a marine archaeology expedition. By simulating a dataset of artifacts and relics, we learned how to leverage various window functions like RANK(), DENSE_RANK(), LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), and AVG() to analyze and uncover patterns within the data.

Through hands-on exercises, we gained insights into identifying age ranks, comparing consecutive artifact ages, finding the oldest and newest artifacts for each material type, and categorizing artifacts based on their age relative to the material's average. This lab provided a practical and engaging experience in using Hadoop Window Functions to unravel the mysteries of the "Sunken Treasure Bay."

The lab's design pairs this captivating scenario with practical coding exercises, helping learners build a deeper understanding of Hadoop Window Functions while exploring a fictional adventure; grounding the skills in a vivid context makes the learning experience more memorable.