Rebel Data Restructuring with Hadoop

HadoopHadoopBeginner
Practice Now

Introduction

In a distant galaxy, the tyrannical Galactic Empire has subjugated countless planets, exploiting their resources and oppressing their populations. The rebel leader, Princess Leia, has been coordinating a rebellion against the Empire, but her forces are vastly outnumbered and outgunned. To turn the tide of the war, Leia must analyze vast amounts of intelligence data to identify the Empire's weaknesses and plan strategic strikes.

The goal of this lab is to help Princess Leia and the Rebel Alliance utilize the power of Hadoop Hive to manage and manipulate large datasets, specifically by learning how to alter existing Hive tables. By mastering this skill, Leia can efficiently restructure her intelligence databases, enabling her to uncover crucial insights and gain a strategic advantage against the Empire.


Skills Graph

%%%%{init: {'theme':'neutral'}}%%%% flowchart RL hadoop(("`Hadoop`")) -.-> hadoop/HadoopHiveGroup(["`Hadoop Hive`"]) hadoop/HadoopHiveGroup -.-> hadoop/alter_tables("`Altering Tables`") subgraph Lab Skills hadoop/alter_tables -.-> lab-288956{{"`Rebel Data Restructuring with Hadoop`"}} end

Create a Sample Table

In this step, we will create a sample Hive table to work with. This table will store information about various planets in the galaxy, including their names, populations, and resources.

First, switch to the hadoop user by running the following command in your terminal:

su - hadoop

Now, start the Hive CLI by running the following command:

hive

Inside the Hive CLI, create a new database called rebel_intel:

CREATE DATABASE rebel_intel;

Use the rebel_intel database:

USE rebel_intel;

Create a new table called planets with the following schema:

CREATE TABLE planets (
    planet_id INT,
    planet_name STRING,
    population BIGINT,
    resources ARRAY<STRING>
)
CLUSTERED BY (planet_id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

This table has four columns: planet_id (an integer for the planet's ID), planet_name (a string for the planet's name), population (a big integer for the planet's population), and resources (an array of strings representing the planet's resources).

Exit the Hive CLI by typing exit;.

Add Data to the Table

In this step, we will populate the planets table with some sample data. This data represents the intelligence gathered by the Rebel Alliance about various planets in the galaxy.

Start the Hive CLI again:

hive

Use the rebel_intel database:

USE rebel_intel;

Insert some sample data into the planets table:

INSERT INTO planets VALUES
    (1, 'Alderaan', 2000000000, array('Agriculture', 'Energy')),
    (2, 'Tatooine', 200000, array('Mining')),
    (3, 'Hoth', 0, array('None')),
    (4, 'Endor', 30000, array('Forestry', 'Hunting')),
    (5, 'Bespin', 6000000, array('Gas Mining', 'Tourism'));

This inserts five rows of data into the planets table, representing different planets with their respective planet IDs, names, populations, and resources.

Exit the Hive CLI by typing exit;.

Alter the Table Structure

Now that we have a table with some data, let's say Princess Leia needs to add another column to store information about each planet's climate. In this step, we will learn how to alter the structure of an existing Hive table.

Start the Hive CLI:

hive

Use the rebel_intel database:

USE rebel_intel;

Alter the planets table to add a new column called climate:

ALTER TABLE planets ADD COLUMNS (climate STRING);

This command adds a new column named climate of type STRING to the planets table.

Verify the updated table structure:

DESCRIBE planets;

You should see the new climate column in the table schema.

Exit the Hive CLI by typing exit;.

Update Table Data

With the new climate column added to the planets table, we can now update the existing data to include climate information for each planet.

Start the Hive CLI:

hive

Use the rebel_intel database:

USE rebel_intel;

Update the climate column for each planet:

UPDATE planets SET climate = 'Temperate' WHERE planet_name = 'Alderaan';
UPDATE planets SET climate = 'Arid' WHERE planet_name = 'Tatooine';
UPDATE planets SET climate = 'Frozen' WHERE planet_name = 'Hoth';
UPDATE planets SET climate = 'Tropical' WHERE planet_name = 'Endor';
UPDATE planets SET climate = 'Artificial' WHERE planet_name = 'Bespin';

These UPDATE statements set the climate value for each planet based on their respective names.

Verify the updated data:

SELECT * FROM planets;

You should see the updated climate values for each planet.

Exit the Hive CLI by typing exit;.

Summary

In this lab, we explored the concept of altering tables in Hadoop Hive, which is a crucial skill for managing and restructuring large datasets. By creating a sample planets table, adding data, and then altering the table structure to include a new climate column, we learned how to adapt our data models to meet evolving requirements.

The design of this lab was inspired by the iconic Star Wars universe, where the rebel leader, Princess Leia, needs to analyze intelligence data to identify weaknesses in the Galactic Empire's defenses. By mastering Hive's ability to alter tables, Leia can efficiently reshape her intelligence databases, uncovering valuable insights and gaining a strategic advantage in the rebellion against the oppressive Empire.

Through this hands-on experience, I gained a deeper understanding of how to create, modify, and update Hive tables, which are essential skills for working with large datasets in Hadoop environments. The use of checkers also ensured that I could verify my progress at each step, reinforcing my learning and ensuring that I successfully completed the lab objectives.

Other Hadoop Tutorials you may like