Quick Start to Hive


Introduction

This lab focuses on Hive's configuration and basic operations. It is designed for students with some SQL background, to help them understand Hive's architecture, basic deployment, and usage.

Type in all the sample code in this document yourself; avoid simply copying and pasting it. Only by doing so will you become more familiar with the code. If you run into problems, review the documentation carefully, or go to the forum for help and discussion.



Introduction to Hive

Apache Hive is a data warehouse software package built on top of Hadoop for querying and analyzing large datasets using SQL-like queries. It simplifies data aggregation, ad hoc queries, and analysis of big data by providing a familiar SQL interface for users. Here is an overview of Hive's architecture and key features:

Hive Architecture

  • Components: Hive includes HCatalog, for table and storage management, and WebHCat, a REST API for running MapReduce tasks and Pig queries or interacting with Hive metadata.
  • Features:
    • Supports SQL queries for data warehousing tasks like ETL, data analysis, and reporting.
    • Data is stored in storage systems like HDFS or HBase; query operations are executed by calculation engines like Tez, Spark, or MapReduce.
    • Supports a procedural language (HPL-SQL) and sub-second query retrieval via Hive LLAP, YARN, and Slider.

Basic Data Units in Hive

  • Database: Namespace to avoid naming conflicts and enhance security.
  • Table: Homogeneous data units with the same schema (e.g., page_views table with columns like timestamp, userid, page_url).
  • Partition: Divides data into partitions based on keys for efficient data retrieval.
  • Bucket: Divides partitioned data into buckets based on hash function values for improved query efficiency (see the DDL sketch below).
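
To make these units concrete, the statements below sketch how they map onto HiveQL DDL. This is a minimal, hypothetical example; the names analytics_demo and demo_events are only for illustration and are not used later in this lab:

-- A database is a namespace that avoids table-name conflicts
CREATE DATABASE IF NOT EXISTS analytics_demo;
USE analytics_demo;

-- A table whose data is split into partitions by date
-- and hashed into buckets on userid
CREATE TABLE demo_events(userid BIGINT, page_url STRING)
PARTITIONED BY(dt STRING)
CLUSTERED BY(userid) INTO 8 BUCKETS
STORED AS SEQUENCEFILE;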

Hive Data Types

Hive is designed for data warehousing tasks rather than online transaction processing (OLTP), focusing on scalability, performance, reliability, and fault tolerance. It supports various input formats and can be extended with connectors in different formats. Understanding Hive's architecture, data units, and data types is essential for leveraging its capabilities in big data analytics.
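
As a quick reference for this point, the HiveQL snippet below lists some of the commonly used primitive and complex column types. The table name typed_demo is hypothetical and not used elsewhere in this lab:

-- Primitive types: integers, floating point, strings, booleans, timestamps
-- Complex types: ARRAY, MAP, STRUCT
CREATE TABLE typed_demo(
  id         BIGINT,
  price      DOUBLE,
  name       STRING,
  active     BOOLEAN,
  created_at TIMESTAMP,
  tags       ARRAY<STRING>,
  properties MAP<STRING, STRING>,
  address    STRUCT<city:STRING, zip:STRING>
);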

Installation and Deployment of Hive

This section covers downloading Hive and setting up its directory, configuring environment variables, configuring the metastore, setting Hive system parameters, and initializing the metastore.

First, you need to switch to the hadoop user to perform the subsequent operations. Double-click to open the Xfce terminal on your desktop and enter the following command:

su - hadoop

Tip: the password for the hadoop user is 'hadoop'.

Then use the wget command to download the Hive installation package. We use version 2.3.3 of Hive here:

sudo wget https://labexfile.oss-us-west-1.aliyuncs.com/courses/144/apache-hive-2.3.3-bin.tar.gz

Next you need to extract the downloaded installation package. Enter the following command in the terminal:

tar -zxvf apache-hive-2.3.3-bin.tar.gz

Use sudo to move the extracted directory to the /opt directory. Without sudo privileges, you may not be able to write to /opt:

sudo mv /home/hadoop/apache-hive-2.3.3-bin /opt/hive-2.3.3

After moving the directory, change the owner of the Hive directory to the hadoop user and its group. Please enter the following command in the terminal:

sudo chown -R hadoop:hadoop /opt/hive-2.3.3

Tip: you can open another terminal to run the above commands with sudo privileges.

labex:~/ $ ls -al /opt/
total 24
drwxrwxr-x 10 hadoop hadoop 4096 Mar 3 12:01 hive-2.3.3
...

Environment Variable Settings

After the directory has been set, you need to set the Hive environment variables so that HIVE_HOME points to its installation directory.

First, open the .bashrc file of the hadoop user with the vim editor. Enter the following command in the terminal:

vim /home/hadoop/.bashrc

Add the following content at the end of the file. Adjust the PATH according to your environment; the $PATH expression refers to the existing value, so :$HIVE_HOME/bin is simply appended to the end of the PATH environment variable:

export HIVE_HOME=/opt/hive-2.3.3
export PATH=$PATH:$HIVE_HOME/bin

Save the file and exit the editor when editing is complete. Then use the source command to activate the above environment variables.
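
For reference, activating the variables in a normal installation would look like this (the echo line is just a quick sanity check):

source /home/hadoop/.bashrc
echo $HIVE_HOME   # should print /opt/hive-2.3.3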

In this lab, Hive is pre-installed, so you do not need to source the environment variables here; in a real installation, perform this step as appropriate.

Set Configuration

This step walks through the Hive configuration process. You can simply read along without typing the commands, because a Hive installation has already been pre-configured in this lab (you can locate it with the which hive command). However, the configuration file below still needs to be created, because it is what we use to check whether you pass the lab.

Metastore Configuration

Before you can officially use Hive, you need to set up its metadata store. By default, Hive stores metadata in an embedded Derby database, and its storage location on disk is determined by the configuration item javax.jdo.option.ConnectionURL in the Hive configuration file conf/hive-default.xml. By default, this location is ./metastore_db.
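
For reference, the default embedded-Derby setting looks roughly like the property below; you do not need to add it, because we override it with MySQL in the next step:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>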

But in this lab we will use MySQL as the storage for metadata. Therefore, you need to modify the Hive configuration file.

Create and open this configuration file with the vim editor:

vim /opt/hive-2.3.3/conf/hive-site.xml

Add the following content to the file:

<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive_metastore1?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser1</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123qweQWE...</value>
</property>
</configuration>

After editing the configuration items above, save the file and exit the editor.

Create MySQL Database

Enter the MySQL shell with sudo mysql and create the hive_metastore1 database together with its user:

CREATE DATABASE hive_metastore1;
CREATE USER 'hiveuser1'@'localhost' IDENTIFIED BY '123qweQWE...';
GRANT ALL PRIVILEGES ON hive_metastore1.* TO 'hiveuser1'@'localhost';
FLUSH PRIVILEGES;
exit;

Then copy the MySQL JDBC driver into the /opt/hive-2.3.3/lib directory. In this environment it is already available under the pre-installed Hive at /usr/local/hive/lib:

cp /usr/local/hive/lib/mysql-connector-j-8.3.0.jar /opt/hive-2.3.3/lib
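
Optionally, you can confirm that the driver jar is in place (a quick sanity check, not required by the lab):

ls /opt/hive-2.3.3/lib/ | grep mysql-connector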

Setting Hive System Parameters

After the metastore has been configured, you need to set Hive's internal configuration items so that it can locate the Hadoop installation, its internal configuration file paths, and so on.

The first thing to do is to use the cp command to make a copy of the settings template so that it takes effect.

Please enter the following command in the terminal:

cp /opt/hive-2.3.3/conf/hive-env.sh.template /opt/hive-2.3.3/conf/hive-env.sh

Then open the hive-env.sh file with vim editor:

vim /opt/hive-2.3.3/conf/hive-env.sh

On line 48, set HADOOP_HOME to the installation path of Hadoop (uncomment the line if necessary):

HADOOP_HOME=/home/hadoop/hadoop

After editing the configuration items above, save the file and exit the editor.

Initializing the Metastore

Once all the configuration work is complete, you can start initializing the metastore.

Since the data used later will be stored on HDFS, you need to start HDFS in advance. Please enter the following commands in the terminal to start HDFS and YARN:

start-dfs.sh
start-yarn.sh

Then type jps to check the service status:

hadoop:~/ $ jps                                                      [17:31:43]
8960 Jps
3153 NodeManager
2823 SecondaryNameNode
3017 ResourceManager
2570 DataNode
2428 NameNode

If you are using Hadoop for the first time, you need to delete any existing Hadoop data and initialize HDFS with hdfs namenode -format.

In this lab, the pre-installed Hive has already been initialized against its metastore database. If you want to initialize the Hive you just downloaded, use the following command:

/opt/hive-2.3.3/bin/schematool -initSchema -dbType mysql

If initialization succeeds, the output ends with lines like:

Initialization script completed
schemaTool completed

Note that if you initialize and use the Hive you just downloaded, you must also change HIVE_HOME in '~/.bashrc' to point to that installation and source the file. Our advice is not to do this, because the subsequent Hive tests in this lab are all based on the pre-installed Hive.

When the prompt message indicates that initialization has been completed, you can use the hive command to enter its command line. Please enter the following command in the terminal:

hive

Tip: typing hive opens the pre-installed Hive shell; to enter the Hive shell you just set up, run it by its absolute path (/opt/hive-2.3.3/bin/hive) instead.

Hive Session ID = 3eee2693-175d-4452-82d2-47f1b639d9d5

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 68d0d345-92c3-4eb1-8393-7ee34f56363f
hive>

Basic Operations of Hive

Next we will learn some basic operations in Hive. In addition to some of the operational statements mentioned in this section, all Hive statements can be found in its Language Manual. You can try more statements after finishing this section.

Creating a Table

First you need to get some data. The data used in this section is a log file simulating an Nginx server, which can be used to analyze the page views of a website.

Please reopen a terminal and enter the following commands in the terminal to download the sample data:

su hadoop
cd ~
wget http://labfile.oss-cn-hangzhou.aliyuncs.com/courses/567/log_example.csv

Then upload the file to HDFS:

hdfs dfs -mkdir -p /user/data/staging/page_view
hdfs dfs -put /home/hadoop/log_example.csv /user/data/staging/page_view
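
Optionally, you can verify that the file has landed in HDFS (a quick check, not required by the lab):

hdfs dfs -ls /user/data/staging/page_view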

The first thing after having the data is to create the table. If you want to create a page_view table for the above data, you need to enter the following statement on the Hive command line:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
Page_url STRING, referrer_url STRING,
Ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

In this example, each column in the table is assigned an appropriate type, and comments can be attached at both the column and the table level. In addition, the PARTITIONED BY clause defines partition columns, which are distinct from the data columns and are not stored with them. When a table is specified in this way, a newline character is assumed as the separator for each row.

If the data is not in the above format, you can parameterize the field separator as shown in the following example:

The following statement is only for demonstration purposes and does not have to be entered in the hive shell.

CREATE TABLE page_view1(viewTime INT, userid BIGINT,
Page_url STRING, referrer_url STRING,
Ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;

Since the line separator is determined by the separator in the Hadoop system rather than by Hive, we cannot manually change the line separator.

It is usually a good idea to bucket a table on certain columns so that efficient sampling queries can be run against the data set. Without bucketing, the table can still be sampled randomly, but not efficiently, because the query has to scan all of the data. The following example shows how to enable bucket storage for the page_view table on the userid column.

The following statement is only for demonstration purposes and does not have to be entered in the hive shell.

CREATE TABLE page_view2(viewTime INT, userid BIGINT,
Page_url STRING, referrer_url STRING,
Ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
COLLECTION ITEMS TERMINATED BY '2'
MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;

In the example above, the userid column in the table is aggregated into 32 buckets by a hash function. In each bucket, the data is sorted in ascending order of viewTime. This method of organizing data allows users to effectively sample aggregated columns (here the userid column), and the sorting feature allows data managers to evaluate queries more efficiently through better data structures.
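
As an illustration of why bucketing helps, a bucketed table like this one can be sampled bucket by bucket instead of with a full scan. The query below is a minimal sketch for a table defined as above and is not part of this lab's checks:

-- Read roughly 1/32 of the data by sampling a single bucket on userid
SELECT * FROM page_view2 TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) s
LIMIT 10;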

The following statement is only for demonstration purposes and does not have to be entered in the hive shell.

CREATE TABLE page_view3(viewTime INT, userid BIGINT,
Page_url STRING, referrer_url STRING,
Friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
Ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
COLLECTION ITEMS TERMINATED BY '2'
MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;

In the example above, the format of each row in the table is defined in terms of field names and types, quite similar to type definitions in other programming languages. Note that table and column names are not case sensitive. The 'IP Address of the User' comment on the Ip column and the COMMENT clause that follows show that comments can be added at both the column and the table level. In addition, the PARTITIONED BY clause defines partition columns, which, as mentioned earlier, are distinct from the data columns and are not stored with them. The CLUSTERED BY clause specifies which column to bucket on and how many buckets to create. The ROW FORMAT DELIMITED clause specifies how rows are stored in the Hive table: for the delimited format, it defines the terminator for fields, the terminator for items in a collection (array or map), and the terminator for map keys, where the numbers are ASCII codes. STORED AS SEQUENCEFILE indicates that the data is stored on HDFS in a binary format (specifically, Hadoop's SequenceFile format). The settings shown in the ROW FORMAT and STORED AS clauses are the defaults Hive currently uses, which is why we did not write them explicitly in the table-creation statement at the beginning.

Browsing Tables and Partitions

If you want to list existing tables in your warehouse, you can use the following statement:

SHOW TABLES;

If there are a lot of tables, the above statement will generate a lot of return information. You can narrow the scope down by specifying a prefix. For example, if you want to list a table with the prefix page, you can use the following statement:

SHOW TABLES 'page.*';

The matching rules in this statement follow regular-expression syntax: the period (.) matches any single character and the asterisk (*) means zero or more repetitions, so 'page.*' matches all table names beginning with page.

If you want to list the partitions of a table, use the following statement. If the table is not a partitioned table, no information is returned:

SHOW PARTITIONS page_view;

If you want to list the columns and the column types of a table, you can use the DESCRIBE statement:

DESCRIBE page_view;

If you want to list the columns of the table and all other properties, you need to add the EXTENDED keyword. This will print a lot of information, which is usually used for debugging:

DESCRIBE EXTENDED page_view;

Modifying a Table

If you want to rename an existing table, use the ALTER TABLE statement with RENAME TO. If a table with the new name already exists, an error is returned:

ALTER TABLE page_view RENAME TO new_page_view;

View the result:

hive> ALTER TABLE page_view RENAME TO new_page_view;
OK
Time taken: 0.124 seconds
hive> show tables;
OK
new_page_view
...
Time taken: 0.021 seconds, Fetched: 4 row(s)

We can also rename the columns of an existing table. However, it is important to note that you must keep the same column types and include an entry for each pre-existing column:

ALTER TABLE new_page_view REPLACE COLUMNS (viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ipaddr STRING COMMENT 'IP Address of the User');

In addition, you can add new columns to an existing table:

ALTER TABLE new_page_view ADD COLUMNS (new_column INT COMMENT 'a new int column');

Note that changes to the schema (such as adding a column) preserve the old schema for the table's existing partitions, in case it is a partitioned table. Queries that access the new columns and run over old partitions implicitly return null values (or the specified default values) for those columns.

Deleting Tables and Partitions

Since Hive is designed as a data warehouse, destroying data is always undesirable, so deleting a table should be done with care. Note that using the DROP command on a table implicitly deletes any indexes built on that table.

You can use the following commands to delete a table:

DROP TABLE new_page_view;

Loading and Querying Data

In Hive, loading data and querying data are two different operations. This section describes how to load data into a Hive table.

Loading Data

There are several ways to load data into a Hive table. One is to create an external table that points to a specific location in HDFS: the user copies the data file to that location with the HDFS put or copy commands and then creates a table pointing to it, together with all the relevant row-format information.

Once such a table is created, users can transform the data and insert it into any other Hive table. At the beginning we uploaded the log_example.csv file to the /user/data/staging/page_view directory on HDFS. If we want to load it into the corresponding partition of the page_view table, we can do so with the following statements.

First create an external table and associate it with the specified file:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                Page_url STRING, referrer_url STRING,
                Ip STRING COMMENT 'IP Address of the User',
                Country STRING COMMENT 'country of origin')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

Then create a page_view table for the final storage of the data:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                Page_url STRING, referrer_url STRING,
                Ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Finally, insert the data from the external table into the page_view table:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION(dt='2015-05-17', country='us')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, pvs.ip
WHERE pvs.country = 'us';

It can take a few minutes for the data to be loaded into HDFS and then into the page_view table.

...
Loading data to table default.page_view partition (dt=2015-05-17, country=us)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.31 sec   HDFS Read: 369021 HDFS Write: 22943 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 310 msec
OK
Time taken: 25.304 seconds

Simple Query

After inserting the data, we can do a simple query. This is similar to a common SQL statement. Enter the following statement in the Hive command-line interface:

SELECT * FROM page_view WHERE userid = 0 LIMIT 10;

The information returned is the set of matching records. For example, a plain SELECT * FROM page_view; query returns rows like these:

hive> select * from page_view;
OK
NULL	490	/downloads/product_1	/archives/webroot	217.168.17.5	2015-05-17	us
NULL	490	/downloads/product_2	/archives/webroot	217.168.17.5	2015-05-17	us
NULL	3316	/downloads/product_2	/archives/webroot	217.168.17.5	2015-05-17	us
...
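
Beyond row lookups, typical analytical queries also work here. The query below is a small illustrative example (its output is not shown in this lab, and it runs as a MapReduce job):

-- Count page views per URL for the loaded partition
SELECT page_url, COUNT(*) AS views
FROM page_view
WHERE dt = '2015-05-17' AND country = 'us'
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;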

Close Services

If you want to learn more Hive statements, you can explore the Hive Language Manual mentioned earlier.

When you are done using it, you can use the quit command to exit the Hive command-line interface:

quit;

Of course, don't forget to turn off the YARN and HDFS services. Enter the following commands in the terminal:

stop-yarn.sh
stop-dfs.sh

Summary

This lab introduced Hive's architecture, installation and deployment, and basic HQL statements. We also practiced importing data using the sample data set.

The main points mentioned are:

  • Hive architecture
  • Basic data unit of Hive
  • How to deploy Hive
  • Hive HQL language

In general, as a data warehouse software package, Hive's functions need to be explored further. Please maintain the habit of actively reviewing technical materials, and continue to learn the following courses.
