Setting Up Hive and Exploring Data
In this step, we will set up Hive, a powerful data warehouse system built on top of Hadoop, and explore the existing data in our archives.
First, we'll open the Hive CLI:
hive
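When the hive> prompt appears, a quick way to confirm the session is working is to list the databases that already exist (a default database ships with every Hive install, so the result should never be empty):

SHOW DATABASES;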
Once inside the Hive CLI, we can create a new database to store our city's archives:
CREATE DATABASE city_archives;
Now, let's switch to the newly created database:
USE city_archives;
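To confirm the switch took effect, we can ask Hive for the current database (current_database() is a built-in function in recent Hive versions):

SELECT current_database();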
To explore the existing data, we'll create an external table over our HDFS archives directory:
CREATE EXTERNAL TABLE tmp_chronicles (
  chapter STRING,
  content STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/home/hadoop/archives/chronicles';
This statement creates an external table named tmp_chronicles with two columns, chapter and content. Because the table is external, Hive reads the data in place from the /home/hadoop/archives/chronicles directory on HDFS, treating each tab character as a field delimiter.
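Before going further, it's worth checking that the external table actually sees the files. The statements below are a quick sanity check, assuming the archive directory is populated; the count will simply be 0 if it isn't:

DESCRIBE FORMATTED tmp_chronicles;
SELECT COUNT(*) FROM tmp_chronicles;

With the raw data visible to Hive, we can now create a permanent, transactional table for the archives: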
CREATE TABLE chronicles (
  chapter STRING,
  content STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
This statement creates a managed table named chronicles with the same two columns, chapter and content. The STORED AS ORC clause stores the data in the ORC columnar file format, and the TBLPROPERTIES clause marks the table as transactional, meaning it supports ACID operations.
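For example, once rows are loaded, a transactional table accepts in-place modifications that a plain Hive table would reject. The statements below are a sketch: the chapter values are placeholders, and your cluster must have ACID support enabled in hive-site.xml (typically hive.support.concurrency set to true and the DbTxnManager transaction manager):

UPDATE chronicles SET content = 'revised text' WHERE chapter = 'prologue';
DELETE FROM chronicles WHERE chapter = 'draft';

Now we can copy the staged data into the new table: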
INSERT INTO TABLE chronicles SELECT * FROM tmp_chronicles;
This statement copies every row from the staging table tmp_chronicles into the transactional chronicles table.
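To confirm the copy completed, count the rows in chronicles; the result should match the count we saw for tmp_chronicles earlier:

SELECT COUNT(*) FROM chronicles;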
Now, we can query the chronicles table to see its contents:
SELECT * FROM chronicles LIMIT 5;
This command displays the first 5 rows of the chronicles table, giving us a glimpse of the data we'll be working with.
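From here, ordinary HiveQL queries work as you'd expect. For instance, to pull the text of a single chapter (the chapter value below is a placeholder; substitute one that actually appears in your data):

SELECT content FROM chronicles WHERE chapter = 'chapter_1';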