Historical Data Harmony Quest


Introduction

In a medieval city, known for its vibrant culture and rich history, a wandering minstrel named Alaric found himself captivated by the tales and songs of the land. As he roamed the streets, strumming his lute, he realized that the city's archives were in dire need of organization. Countless parchments and scrolls lay scattered, filled with stories and records of the past, but the task of sorting and managing them seemed daunting.

Alaric's goal was to create a harmonious system, where the city's historical records could be preserved and accessed with ease. With his love for storytelling and his keen eye for organization, he set out on a quest to harness the power of Hadoop Hive, a tool that would allow him to efficiently manage and manipulate the vast troves of data.


Skills Graph

This lab covers the Hadoop Hive skill: Deleting and Truncating Data.

Exploring the City's Archives

In this step, we will delve into the city's archives, where countless parchments and scrolls lie scattered, awaiting organization. Our goal is to familiarize ourselves with the existing data and understand the challenges faced in managing such a vast collection.

First, ensure you are logged in as the hadoop user by running the following command in the terminal:

su - hadoop
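You can verify that the switch succeeded with the standard whoami command, which should print hadoop:

whoami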

The archives live in the /home/hadoop/archives directory on HDFS, which holds a collection of files containing various records and tales from the city's past. To get an overview of the available data, run the following command:

hdfs dfs -ls /home/hadoop/archives

This command will list the files and directories within the /home/hadoop/archives directory on the Hadoop Distributed File System (HDFS).
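If the archives contain nested subdirectories, the -R flag of hdfs dfs -ls walks the tree recursively, giving a fuller picture of how the records are organized:

hdfs dfs -ls -R /home/hadoop/archives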

Next, let's explore the contents of one of the files. We'll use the hdfs dfs -cat command to view the file's contents:

hdfs dfs -cat /home/hadoop/archives/chronicles/chapter_1.txt

This command will display the contents of the chapter_1.txt file located in the chronicles subdirectory.
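For longer scrolls, piping the output through head keeps the preview manageable. For example, to view only the first five lines:

hdfs dfs -cat /home/hadoop/archives/chronicles/chapter_1.txt | head -n 5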

As you browse through the files, you'll notice that some contain irrelevant or outdated information that needs to be removed or truncated. This is where the power of Hadoop Hive comes into play, allowing us to efficiently manage and manipulate the data.

Setting Up Hive and Exploring Data

In this step, we will set up Hive, a powerful data warehouse system built on top of Hadoop, and explore the existing data in our archives.

First, we'll open the Hive CLI:

hive

Once inside the Hive CLI, we can create a new database to store our city's archives:

CREATE DATABASE city_archives;

Now, let's switch to the newly created database:

USE city_archives;
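To confirm that the database exists and is now active, you can list all databases and print the current one (current_database() is a built-in Hive function):

SHOW DATABASES;
SELECT current_database();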

To explore the existing data, we'll create two tables: first an external staging table that points at our HDFS archives directory, and then a transactional table to hold a managed copy of the data:

CREATE EXTERNAL TABLE tmp_chronicles (
  chapter STRING,
  content STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/home/hadoop/archives/chronicles';

This code will create an external table named tmp_chronicles with two columns: chapter and content. Because the table is external, Hive reads the data in place from the /home/hadoop/archives/chronicles directory on HDFS rather than copying it, and the fields are delimited by tab characters.
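To confirm that Hive sees the expected schema, you can describe the staging table:

DESCRIBE tmp_chronicles;

Next, we create the transactional table that will hold a managed copy of the data: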

CREATE TABLE chronicles (
  chapter STRING,
  content STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

This code will create a managed table named chronicles with the same two columns: chapter and content. The STORED AS ORC clause specifies that the data will be stored in the ORC file format, and the TBLPROPERTIES clause marks the table as transactional, meaning it supports ACID operations such as DELETE and UPDATE.
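Depending on how your environment is configured, ACID operations on transactional tables may also require the transaction manager to be enabled for the session. If a later statement complains about transactions not being supported, these standard Hive settings are a common fix:

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

With both tables in place, we can copy the staged data into the transactional table: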

INSERT INTO TABLE chronicles SELECT * FROM tmp_chronicles;

This code will insert all the data from the temporary table tmp_chronicles into the chronicles table.
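As a quick sanity check, you can compare row counts between the two tables; both queries should return the same number:

SELECT COUNT(*) FROM tmp_chronicles;
SELECT COUNT(*) FROM chronicles;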

Now, we can query the chronicles table to see its contents:

SELECT * FROM chronicles LIMIT 5;

This command will display the first 5 rows of the chronicles table, giving us a glimpse of the data we'll be working with.

Deleting and Truncating Data

In this step, we will learn how to delete and truncate data from our Hive tables, allowing us to manage and maintain the city's archives efficiently.

Sometimes, we may need to remove outdated or irrelevant data from our tables. In Hive, we can use the DELETE statement to remove specific rows that match a given condition. Note that DELETE works only on transactional tables, which is exactly why we created chronicles as an ACID-enabled ORC table.

For example, let's say we want to remove all chapters that contain the word "outdated" from the chronicles table:

DELETE FROM chronicles WHERE content LIKE '%outdated%';

This command will delete all rows from the chronicles table where the content column contains the word "outdated".
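To verify that the deletion worked, count the rows that still match the condition; the result should be 0:

SELECT COUNT(*) FROM chronicles WHERE content LIKE '%outdated%';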

However, if we want to remove all data from a table, the TRUNCATE statement is far more efficient than deleting rows one by one. Keep in mind that TRUNCATE applies to managed tables such as chronicles; an external table like tmp_chronicles cannot be truncated this way, since Hive does not own its data.

TRUNCATE TABLE chronicles;

This command will remove all data from the chronicles table, leaving it empty.
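Truncation empties the table but keeps its definition, which you can confirm with two quick checks: the table still appears in the listing, and it contains no rows:

SHOW TABLES;
SELECT COUNT(*) FROM chronicles;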

By mastering these deletion and truncation techniques, we can maintain the integrity and relevance of our city's archives, ensuring that only the most valuable and up-to-date information is preserved.

Summary

In this lab, we embarked on a journey to organize and maintain the city's archives using Hadoop Hive. Through the eyes of Alaric, the wandering minstrel, we explored the challenges of managing vast collections of historical records and learned how to harness the power of Hive to efficiently delete and truncate data.

By delving into the archives directory and setting up Hive, we gained hands-on experience in creating databases, tables, and loading data into Hive. We then mastered the art of deleting specific rows and truncating entire tables, enabling us to remove outdated or irrelevant information from the city's archives.

Throughout this lab, we not only acquired practical skills in data management but also discovered the beauty of combining storytelling with technology. Alaric's quest to preserve the city's rich cultural heritage serves as a reminder that data is more than just numbers and figures; it is a tapestry of stories waiting to be woven and shared.
