Understanding HDFS Basics
What is HDFS?
HDFS (Hadoop Distributed File System) is the primary storage system used by Apache Hadoop applications. It is designed to store very large datasets across a cluster of machines and to make them available to distributed processing frameworks running in that environment. HDFS is highly fault-tolerant and is designed to run on commodity hardware, making it a cost-effective foundation for big data processing.
Key Features of HDFS
- Scalability: HDFS can scale to store and process petabytes of data by adding more nodes to the cluster.
- Fault Tolerance: HDFS automatically replicates data across multiple nodes, ensuring that data is not lost even if a node fails.
- High Throughput: HDFS is optimized for high-throughput access to data, making it suitable for large-scale data processing applications.
- Streaming Data Access: HDFS follows a write-once, read-many access model and is optimized for streaming reads of large files in batch jobs rather than low-latency random access.
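To see these properties on a live cluster, you can ask the NameNode for a summary of capacity and DataNode state. A minimal sketch, assuming a running cluster with the hdfs command on your PATH:

# Report configured/used capacity and the state of every DataNode
hdfs dfsadmin -report

# Restrict the report to live DataNodes (handy after adding or removing nodes)
hdfs dfsadmin -report -live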
HDFS Architecture
HDFS follows a master-slave architecture: a single master node, the NameNode, manages the file system namespace and metadata (the directory tree and the mapping of files to blocks), while the slave nodes, called DataNodes, store the actual data blocks and serve read and write requests from clients.
graph TD
  NameNode --> DataNode1
  NameNode --> DataNode2
  NameNode --> DataNode3
  DataNode1 --> Blocks1[Data blocks]
  DataNode2 --> Blocks2[Data blocks]
  DataNode3 --> Blocks3[Data blocks]
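One way to see this division of labour is to ask the NameNode how a particular file is laid out across the cluster. A minimal sketch, where /user/alice/events.log is a hypothetical existing HDFS file:

# Ask the NameNode for the block map of a file: which blocks it consists of
# and which DataNodes hold each replica
hdfs fsck /user/alice/events.log -files -blocks -locations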
HDFS File Operations
HDFS supports the usual file operations through the hadoop fs command-line interface, including the following (a short worked session appears after the list):
- Uploading a local file into HDFS:
hadoop fs -put <local_file> <hdfs_file_path>
- Listing the contents of a directory:
hadoop fs -ls <hdfs_directory_path>
- Deleting a file:
hadoop fs -rm <hdfs_file_path>
- Downloading a file from HDFS to the local filesystem:
hadoop fs -get <hdfs_file_path> <local_file_path>
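Putting these together, a typical round trip looks like the sketch below (paths such as /user/alice/data and the file names are hypothetical placeholders):

# Create an HDFS directory and upload a local file into it
hadoop fs -mkdir -p /user/alice/data
hadoop fs -put sales.csv /user/alice/data/

# Confirm the file arrived
hadoop fs -ls /user/alice/data

# Copy it back to the local filesystem under a new name
hadoop fs -get /user/alice/data/sales.csv ./sales_copy.csv

# Remove it from HDFS once it is no longer needed
hadoop fs -rm /user/alice/data/sales.csv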
HDFS Replication and Block Size
Files in HDFS are split into blocks, and by default each block is replicated three times on different DataNodes, so the data remains available even when a disk or an entire node fails. The default block size is 128 MB; both the replication factor and the block size are configurable cluster-wide (dfs.replication and dfs.blocksize in hdfs-site.xml) and can be overridden per file.
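Both settings can also be inspected and overridden per file from the command line. A sketch, reusing the hypothetical /user/alice/data/sales.csv path from the earlier example:

# Print the replication factor (%r) and block size in bytes (%o) of a file
hadoop fs -stat "replication=%r blocksize=%o" /user/alice/data/sales.csv

# Lower the replication factor of an existing file to 2 and wait for it to take effect
hdfs dfs -setrep -w 2 /user/alice/data/sales.csv

# Upload a file with a non-default block size (256 MB expressed in bytes) and replication factor
hadoop fs -D dfs.blocksize=268435456 -D dfs.replication=2 -put big_input.csv /user/alice/data/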