How to create an external Hive table for HDFS data?

Introduction

This tutorial will guide you through the process of creating an external Hive table to access data stored in the Hadoop Distributed File System (HDFS). You will learn how to set up the external table and query the data within it, providing a seamless way to work with Hadoop data using the Hive SQL interface.


Introduction to Hive and HDFS

What is Hive?

Hive is a data warehouse infrastructure built on top of Hadoop that provides a SQL-like interface for querying and managing data stored in the Hadoop Distributed File System (HDFS). Users write queries in a SQL-like language called HiveQL, which Hive translates into MapReduce jobs that run on the Hadoop cluster.
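
You can see this translation for yourself with Hive's EXPLAIN statement, which prints the execution plan instead of running the query. A minimal sketch, using a hypothetical sales table purely for illustration:

-- Ask Hive to show the plan it would run, rather than executing the query
EXPLAIN SELECT product, SUM(quantity) FROM sales GROUP BY product;

On a classic MapReduce execution engine, the output lists the map and reduce stages Hive would submit to the cluster.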

What is HDFS?

HDFS (Hadoop Distributed File System) is the primary data storage system used by Hadoop applications. It is designed to store large datasets reliably and efficiently across multiple machines in a Hadoop cluster. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
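
As a quick illustration, the hdfs dfs command-line client is the usual way to interact with HDFS. A minimal sketch, assuming a running cluster and the Hadoop client on your PATH (local_file.csv is a placeholder name):

# Create a directory in HDFS (-p creates parent directories as needed)
hdfs dfs -mkdir -p /user/hive/data
# Copy a local file into HDFS
hdfs dfs -put local_file.csv /user/hive/data/
# List the directory to confirm the upload
hdfs dfs -ls /user/hive/data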

Hive and HDFS Integration

Hive is tightly integrated with HDFS, allowing users to store and query data stored in HDFS. Hive uses HDFS as its primary data storage system, and it provides a way to create tables that map to data stored in HDFS.

Query flow: Hive → HiveQL → MapReduce → HDFS

Use Cases of Hive and HDFS

Hive and HDFS are commonly used in the following scenarios:

  • Big Data Analytics: Hive and HDFS are used to store and analyze large datasets, such as web logs, sensor data, and social media data.
  • Data Warehousing: Hive provides a SQL-like interface to query and manage data stored in HDFS, making it a popular choice for building data warehouses.
  • Batch Processing: Hive can be used to perform batch processing of data stored in HDFS, such as generating reports, performing ETL (Extract, Transform, Load) operations, and running machine learning algorithms.

Creating an External Hive Table

Understanding External Hive Tables

In Hive, an external table is a table that points to data stored in an external location, such as an HDFS directory. With a managed (internal) table, Hive owns the data: the files live under Hive's warehouse directory, and dropping the table deletes them. An external table only records metadata about where the data lives; dropping it removes the table definition but leaves the underlying files untouched.
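
A minimal sketch of the difference, using illustrative table names (demo_managed and demo_external):

-- Managed table: Hive owns the files under its warehouse directory
CREATE TABLE demo_managed (id INT);
DROP TABLE demo_managed;    -- deletes metadata AND the data files

-- External table: Hive only records where the data lives
CREATE EXTERNAL TABLE demo_external (id INT)
LOCATION '/user/hive/data/demo';
DROP TABLE demo_external;   -- deletes metadata only; files stay in HDFS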

Creating an External Hive Table

To create an external Hive table, you can use the following SQL statement:

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  col1 data_type,
  col2 data_type,
  ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/hdfs/directory';

Let's break down the different parts of this statement:

  • CREATE EXTERNAL TABLE IF NOT EXISTS table_name: This creates a new external table with the specified name, or skips the creation if the table already exists.
  • (col1 data_type, col2 data_type, ...): This defines the columns of the table, along with their data types.
  • ROW FORMAT DELIMITED: This declares that fields within each row are separated by a delimiter character, which is defined by the clause that follows.
  • FIELDS TERMINATED BY ',': This sets the field delimiter to a comma.
  • STORED AS TEXTFILE: This tells Hive that the data is stored in a text file format.
  • LOCATION '/path/to/hdfs/directory': This specifies the HDFS directory where the data is stored.
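
Once the table exists, you can verify that Hive registered it as external and picked up the intended location:

-- Table Type should read EXTERNAL_TABLE, and Location should match
-- the HDFS path given in the LOCATION clause
DESCRIBE FORMATTED table_name;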

Example: Creating an External Hive Table

Assuming you have CSV data stored in HDFS under the directory /user/hive/data/sales (for example, a file /user/hive/data/sales/sales.csv), you can create an external Hive table as follows. Note that LOCATION must point to a directory rather than an individual file; Hive reads every file in that directory:

CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  product STRING,
  quantity INT,
  price DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/data/sales';

This creates an external Hive table named sales with three columns: product, quantity, and price. The data is expected to be in CSV format, with fields separated by commas.
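
For queries to return rows, the directory needs files in the expected layout. A sample file might look like the sketch below (the contents are purely illustrative), uploaded into the table's directory with hdfs dfs -put:

sales.csv (illustrative contents, one row per line: product,quantity,price):

Widget,5,9.99
Gadget,12,24.50
Widget,3,9.99

Stage the file into the table's HDFS location:

hdfs dfs -mkdir -p /user/hive/data/sales
hdfs dfs -put sales.csv /user/hive/data/sales/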

Querying Data in the External Table

Accessing Data in the External Table

Once you have created an external Hive table, you can query the data in the associated HDFS directory just as you would any other Hive table. There is no separate load step: Hive reads the files directly from HDFS each time you run a query against the table.

Example Queries

Let's assume you have created an external Hive table named sales as shown in the previous section. Here are some example queries you can run:

-- Select all rows from the sales table
SELECT * FROM sales;

-- Select specific columns from the sales table
SELECT product, quantity, price FROM sales;

-- Filter the data based on a condition
SELECT * FROM sales WHERE quantity > 10;

-- Perform aggregations on the data
SELECT product, SUM(quantity) AS total_quantity, AVG(price) AS avg_price
FROM sales
GROUP BY product;

These queries will fetch data from the HDFS directory associated with the sales table and return the results to the user.

Advantages of External Tables

Using external tables in Hive offers several advantages:

  1. Data Independence: External tables decouple the data's lifecycle from the table definition. Dropping the table removes only the metadata; the underlying files in HDFS are left intact and can be managed or modified independently.
  2. Flexibility: External tables can be used to access data stored in various formats (e.g., CSV, Parquet, ORC) and locations (e.g., HDFS, Amazon S3, Google Cloud Storage).
  3. No Data Duplication: Hive reads the files in place, so there is no need to copy data into Hive's warehouse directory just to make it queryable.
  4. Easier Data Ingestion: New files added to the table's HDFS directory become queryable immediately, without reloading or redefining the table (see the sketch below).
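
A minimal sketch of that ingestion pattern, assuming the unpartitioned sales table from earlier and a hypothetical new file sales_february.csv:

# Drop a new file into the table's HDFS directory...
hdfs dfs -put sales_february.csv /user/hive/data/sales/

-- ...and the next Hive query over the table sees the new rows,
-- with no LOAD DATA or ALTER TABLE required:
SELECT COUNT(*) FROM sales;

Note that a partitioned external table would additionally need its new partitions registered (for example with MSCK REPAIR TABLE); the unpartitioned table here picks up new files automatically.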

By understanding how to create and query external Hive tables, you can effectively leverage the power of Hive and HDFS to manage and analyze your big data.

Summary

In this Hadoop-focused tutorial, you have learned how to create an external Hive table to access data stored in HDFS. By setting up the external table, you can now query and manipulate the data using the familiar Hive SQL syntax, without the need to directly interact with the underlying HDFS file system. This approach simplifies the process of working with Hadoop data and enables you to leverage the powerful capabilities of the Hive ecosystem.
