Basic Operations of Hive
Next we will learn some basic operations in Hive. In addition to some of the operational statements mentioned in this section, all Hive statements can be found in its Language Manual. You can try more statements after finishing this section.
Creating a Table
First you need some data. The data used in this section of the experiment is a log file simulating an Nginx server access log, which can be used to analyze the page views of a website.
Please reopen a terminal and enter the following commands in the terminal to download the sample data:
su hadoop
cd ~
wget http://labfile.oss-cn-hangzhou.aliyuncs.com/courses/567/log_example.csv
Then upload the file to HDFS:
hdfs dfs -mkdir -p /user/data/staging/page_view
hdfs dfs -put /home/hadoop/log_example.csv /user/data/staging/page_view
The first thing to do after obtaining the data is to create a table. To create a page_view table for the above data, enter the following statement on the Hive command line:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
In this example, each column in the table is assigned an appropriate type, and we can attach comments at both the column and table levels. In addition, the PARTITIONED BY clause defines partition columns that are distinct from the data columns; partition columns are not stored with the data columns. When a table is defined this way, a newline character is used as the separator for each row by default.
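To show how partition columns are actually used, here is a sketch (not part of this lab's required steps; the partition values and the staging_page_view table name are illustrative assumptions, not objects created above). A partition is populated and queried by naming the partition columns explicitly:

```sql
-- Sketch only: populate one partition from a hypothetical staging table
-- (staging_page_view is an assumed name, not created in this lab).
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT viewTime, userid, page_url, referrer_url, ip
FROM staging_page_view;

-- Partition columns can then be used in queries like ordinary columns;
-- filtering on them prunes whole partitions instead of scanning all data.
SELECT page_url, COUNT(*)
FROM page_view
WHERE dt = '2008-06-08' AND country = 'US'
GROUP BY page_url;
```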
If the data is not in the above format, you can parameterize the field separator as shown in the following example:
The following statement is for demonstration purposes only and does not need to be entered in the hive shell.
CREATE TABLE page_view1(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;
Since the row separator is determined by Hadoop's file formats rather than by Hive, the row separator cannot be changed manually.
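As a concrete illustration of a parameterized field separator, the log_example.csv file downloaded earlier is comma-separated, so a text-format staging table for it could use a comma delimiter. This is a sketch under the assumption that the CSV columns match the page_view schema (which this lab does not verify), and page_view_staging is a hypothetical name:

```sql
-- Hypothetical text-format staging table for the comma-separated log file.
CREATE TABLE page_view_staging(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```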
Tables are usually bucketed on a column whose values distribute the data well, which enables efficient sampling of the data set during queries. Without buckets, random sampling of the table is still possible, but it is inefficient because the query must scan the whole data set. The following example shows how to enable bucketed storage for the page_view table on the userid column.
The following statement is for demonstration purposes only and does not need to be entered in the hive shell.
CREATE TABLE page_view2(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
COLLECTION ITEMS TERMINATED BY '2'
MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
In the example above, the userid column of the table is hashed into 32 buckets. Within each bucket, the data is sorted in ascending order of viewTime. Organizing data this way lets users sample efficiently on the clustered column (here, the userid column), and the sort order lets the query engine evaluate queries more efficiently thanks to the better data structure.
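Bucketing is what makes the TABLESAMPLE clause efficient. As a sketch (not part of this lab's required steps), the following query reads only the first of the 32 buckets instead of scanning the whole table:

```sql
-- Sample one bucket out of 32 on the bucketed column userid.
-- Because the table is bucketed on userid, Hive can read just that
-- bucket's files rather than scanning and filtering the full data set.
SELECT userid, viewTime
FROM page_view2 TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);
```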
The next example extends the table with complex column types: an ARRAY of friends and a MAP of properties. The following statement is for demonstration purposes only and does not need to be entered in the hive shell.
CREATE TABLE page_view3(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
COLLECTION ITEMS TERMINATED BY '2'
MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
In the example above, the format of each row in the table is defined in terms of field names and types, quite similar to type definitions in other programming languages. Note that table names and column names are not case sensitive. The 'IP Address of the User' comment and the table-level COMMENT clause show again that comments can be attached at both the column and table levels. The PARTITIONED BY clause defines partition columns that are distinct from the data columns; as mentioned earlier, they do not store any data themselves. The CLUSTERED BY clause specifies which column to bucket on and how many buckets to create. The ROW FORMAT DELIMITED clause specifies how rows are stored in the Hive table: the terminator for fields, the terminator for items in a collection (array or map), and the terminator for map keys, where the numbers are ASCII codes. STORED AS SEQUENCEFILE indicates that the data is stored on HDFS in a binary format (specifically, Hadoop's SequenceFile type). The delimiter settings in the ROW FORMAT clause match the defaults Hive currently uses, which is why we did not write them explicitly in the CREATE TABLE statement at the beginning.
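Once a table has ARRAY and MAP columns like page_view3 above, elements are addressed with index and key syntax. A sketch follows; the map key 'browser' is an assumed, illustrative key, not something present in the sample data:

```sql
-- friends[0] reads the first array element (array indexes start at 0);
-- properties['browser'] looks up a map value by key and returns NULL
-- if the key is absent. size() returns the number of elements.
SELECT userid, friends[0], properties['browser']
FROM page_view3
WHERE size(friends) > 0;
```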
Browsing Tables and Partitions
If you want to list existing tables in your warehouse, you can use the following statement:
SHOW TABLES;
If there are a lot of tables, the above statement will generate a lot of return information. You can narrow the scope down by specifying a prefix. For example, if you want to list the tables with the prefix page, you can use the following statement:
SHOW TABLES 'page.*';
The matching rules in this statement are the same as the regular expression syntax, and the period (.
) represents a wildcard.
If you want to list the partitions of a table, use the following statement. If the table is not a partitioned table, no information is returned:
SHOW PARTITIONS page_view;
If you want to list the columns and the column types of a table, you can use the DESCRIBE statement:
DESCRIBE page_view;
If you want to list the columns of the table and all other properties, you need to add the EXTENDED
keyword. This will print a lot of information, which is usually used for debugging:
DESCRIBE EXTENDED page_view;
Modifying a Table
If you want to rename an existing table, use the ALTER TABLE statement with RENAME TO. If a table with the new name already exists, an error is returned:
ALTER TABLE page_view RENAME TO new_page_view;
View the result:
hive> ALTER TABLE page_view RENAME TO new_page_view;
OK
Time taken: 0.124 seconds
hive> show tables;
OK
new_page_view
...
Time taken: 0.021 seconds, Fetched: 4 row(s)
We can also rename the columns of an existing table with REPLACE COLUMNS. Note, however, that this replaces the entire column list, so you must keep the same column types and include an entry for each of the pre-existing columns:
ALTER TABLE new_page_view REPLACE COLUMNS (viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ipaddr STRING COMMENT 'IP Address of the User');
In addition, you can add new columns to an existing table:
ALTER TABLE new_page_view ADD COLUMNS (new_column INT COMMENT 'a new int column');
Note that a change in the schema (such as adding a column) preserves the schema for the old partitions of the table, in case it is a partitioned table. All queries that access these columns and run on the old partitions implicitly return a null value or the specified default values for these columns.
Deleting Tables and Partitions
Since Hive is designed as a data warehouse, destroying data is always undesirable, so the operation of deleting a table is deliberately a bit cumbersome. Using the DROP command on a table implicitly deletes any indexes built on the table.
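Individual partitions can also be dropped without deleting the whole table. A sketch, using illustrative partition values rather than values from the sample data:

```sql
-- Drop a single partition; the table and its other partitions remain.
-- IF EXISTS avoids an error when the named partition is not present.
ALTER TABLE page_view DROP IF EXISTS PARTITION(dt='2008-06-08', country='US');
```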
You can use the following commands to delete a table:
DROP TABLE new_page_view;