In this step, you will explore Hadoop storage formats, comparing their strengths, weaknesses, and suitability for different data types and workloads.
First, ensure you are logged in as the hadoop user by running the following command in the terminal:
su - hadoop
Then, let's create a directory to hold our data files:
mkdir /home/hadoop/data
Next, we'll generate some sample data files to work with:
echo "Alice,25,New York" >> /home/hadoop/data/people.csv
echo "Bob,32,Los Angeles" >> /home/hadoop/data/people.csv
echo "Charlie,19,Chicago" >> /home/hadoop/data/people.csv
Now, let's explore different storage formats and their use cases:
- Text Files: Text files are the simplest and most human-readable format. They work well for small datasets and prototyping but can be inefficient for large datasets due to their lack of compression and schema enforcement.
- Sequence Files: Sequence files are flat files consisting of binary key-value pairs. They are compressed and splittable, making them efficient for large datasets with relatively small records. However, they lack schema enforcement and can be challenging to work with for complex data types.
- Avro Files: Apache Avro is a row-based data serialization format that supports schema enforcement and efficient compression. It is well-suited for large datasets with complex data types and provides excellent interoperability between different programming languages.
- Parquet Files: Apache Parquet is a column-oriented storage format that offers excellent compression and efficient data skipping. It is particularly well-suited for analytical workloads involving large datasets with complex schemas and many columns.
- ORC Files: The Optimized Row Columnar (ORC) format is another column-oriented storage format optimized for large datasets with complex schemas. It provides excellent compression, data skipping capabilities, and efficient reads for analytical workloads.
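In Hive, which we will use below, each of these formats maps to a STORED AS clause in the table definition. As a quick reference (note that the AVRO keyword assumes Hive 0.14 or later):
STORED AS TEXTFILE      -- plain text
STORED AS SEQUENCEFILE  -- binary key-value pairs
STORED AS AVRO          -- row-based, schema-enforced
STORED AS PARQUET       -- column-oriented
STORED AS ORC           -- column-oriented, optimized reads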
To explore these formats further, you can use Hadoop's built-in tools or frameworks such as Apache Hive or Apache Spark. For example, to create a Hive table using the Text format:
Launch the Hive shell by executing the following command:
hive
Create a Hive table using the Text format:
CREATE TABLE people (
  name STRING,
  age INT,
  city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Load data into the table:
LOAD DATA LOCAL INPATH '/home/hadoop/data/people.csv' INTO TABLE people;
This will create a Hive table named people with the specified schema and store the data in the Text format.
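To confirm the load worked, query the table from the same Hive shell:
SELECT * FROM people;
From here, one way to experiment with the other formats is a CREATE TABLE ... AS SELECT (CTAS) statement, which copies the data into a new table stored in the target format. A minimal sketch for Parquet (the table name people_parquet is just an illustration):
CREATE TABLE people_parquet
STORED AS PARQUET          -- write the copied rows as Parquet files
AS SELECT * FROM people;
Swapping PARQUET for ORC, AVRO, or SEQUENCEFILE produces the same data in each of the other formats, letting you compare their file sizes and query behavior directly in HDFS.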