# Structured Data: CSV and TSV
Comma-Separated Values (CSV) and Tab-Separated Values (TSV) are two of the most common structured data formats used in Hadoop. These formats are simple, human-readable, and widely supported by various tools and applications.
To read and process CSV/TSV data in Hadoop, you can use the built-in TextInputFormat and write custom MapReduce code to parse each record, as sketched below. Alternatively, you can leverage higher-level frameworks like Apache Spark or Apache Hive, which provide built-in support for these data formats.
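As a minimal sketch of the first approach, the hypothetical Hadoop Streaming mapper below parses tab-separated records that TextInputFormat feeds to it line by line on standard input; the (name, age, city) field layout is assumed purely for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop Streaming mapper (mapper.py) for TSV input."""
import sys

for line in sys.stdin:
    # TextInputFormat delivers one text line per record.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue  # skip malformed records
    name, age, city = fields[0], fields[1], fields[2]
    # Emit city as the key and 1 as the value, e.g. to count records per city.
    print(f"{city}\t1")
```

A reducer written in the same style would then sum the counts for each city.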
## Example: Reading CSV data using Spark
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("CSVExample").getOrCreate()

# Define the schema explicitly rather than relying on schema inference.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

df = spark.read.csv("hdfs://path/to/data.csv", schema=schema, header=True)
df.show()
```
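Reading TSV data works the same way; assuming the same schema and a hypothetical file path, only the field separator changes:

```python
# TSV is just CSV with a tab delimiter; reuse the schema defined above.
df_tsv = spark.read.csv("hdfs://path/to/data.tsv", schema=schema, header=True, sep="\t")
df_tsv.show()
```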
# Structured Data: Parquet and ORC
Parquet and ORC are columnar data formats optimized for efficient storage and processing in Hadoop. Because data is laid out by column, queries can read only the columns they need, and similar values compress well together; the result is better compression, faster analytical queries, and lower storage requirements than row-based formats like CSV and TSV.
Parquet and ORC can be used with various Hadoop ecosystem components, such as Apache Spark, Apache Hive, and Apache Impala. They are particularly useful for analytical workloads and data warehousing scenarios.
## Example: Reading Parquet data using Spark
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Parquet files embed their schema, so no schema definition is needed here.
df = spark.read.parquet("hdfs://path/to/data.parquet")
df.show()
```
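ORC works the same way through Spark's built-in ORC source. The sketch below (paths are hypothetical) writes the DataFrame read above out as ORC and reads it back:

```python
# Write the DataFrame in ORC format.
df.write.mode("overwrite").orc("hdfs://path/to/data_orc")

# Read it back with the built-in ORC reader.
df_orc = spark.read.orc("hdfs://path/to/data_orc")
df_orc.show()
```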
# Semi-structured Data: JSON and Avro
JSON (JavaScript Object Notation) and Avro are popular semi-structured data formats used in Hadoop. Both are self-describing: JSON carries its structure within each record, and Avro stores its schema alongside the data, which allows more flexible data modeling and easier schema evolution than rigid, fixed-column formats like CSV and TSV.
Handling JSON and Avro data in Hadoop often involves using higher-level frameworks like Apache Spark or Apache Hive, which provide built-in support for these data formats.
## Example: Reading JSON data using Spark
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JSONExample").getOrCreate()

# Spark infers the schema from the JSON records (one JSON object per line by default).
df = spark.read.json("hdfs://path/to/data.json")
df.show()
```
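Avro can be read in much the same way, but Spark's Avro support lives in the external spark-avro module, which has to be supplied on the classpath (for example with `--packages org.apache.spark:spark-avro_<scala-version>:<spark-version>`). A minimal sketch with a hypothetical path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AvroExample").getOrCreate()

# Requires the external spark-avro module on the classpath.
df = spark.read.format("avro").load("hdfs://path/to/data.avro")
df.show()
```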
In the next section, we will explore advanced techniques for processing diverse data formats in Hadoop.