How to list files and directories in HDFS using the FS Shell?


Introduction

Hadoop is a widely adopted framework for storing and processing large datasets, and the Hadoop Distributed File System (HDFS) is its core storage component. In this tutorial, you will learn how to use the HDFS FS Shell to list files and directories in your Hadoop environment, enabling you to effectively manage and navigate your big data storage.



Introduction to HDFS

Hadoop Distributed File System (HDFS) is a distributed file system designed to store and process large datasets across a cluster of commodity hardware. It is a core component of the Hadoop ecosystem and is widely used in big data applications.

HDFS is designed to be highly fault-tolerant and scalable, allowing it to handle petabytes of data and thousands of nodes. It achieves this by replicating data across multiple nodes, ensuring that data is available even if one or more nodes fail.

Some key features of HDFS include:

Distributed Storage

HDFS divides files into blocks and stores them across multiple nodes in the cluster. This allows for parallel processing of data, improving performance and scalability.

Fault Tolerance

HDFS automatically replicates data across multiple nodes, ensuring that data is available even if one or more nodes fail.

Scalability

HDFS can easily scale to handle large amounts of data and a growing number of nodes in the cluster.

High Throughput

HDFS is optimized for high-throughput access to data, making it well-suited for batch processing applications.
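To see distributed storage and fault tolerance in action, you can inspect how HDFS has split a particular file into blocks and where the replicas are stored. Here is a minimal example using the hdfs fsck tool (the path is illustrative; use any file already stored in your cluster):

# Show the blocks and replica locations for a file (example path)
$ hdfs fsck /user/example/local_file.txt -files -blocks -locations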

To interact with HDFS, users can use the HDFS FS Shell, which provides a set of command-line tools for managing files and directories in the HDFS file system.

Using the HDFS FS Shell

The HDFS FS Shell is a command-line interface that allows users to interact with the Hadoop Distributed File System (HDFS). It provides a set of commands for managing files and directories in HDFS.

To use the HDFS FS Shell, you need to have a Hadoop cluster set up and running. Once you have access to the cluster, you can use the hdfs dfs command to execute various operations on HDFS.

Here's an example of how to use the HDFS FS Shell on an Ubuntu 22.04 system:

# Connect to the Hadoop cluster
$ ssh user@hadoop-cluster

# Navigate to the Hadoop bin directory
$ cd /usr/local/hadoop/bin

# List the available HDFS FS Shell commands
$ ./hdfs dfs
Usage: hdfs dfs [generic options]
...

# List the contents of the HDFS root directory
$ ./hdfs dfs -ls /
Found 2 items
drwxr-xr-x - user supergroup 0 2023-04-12 12:34 /user
drwxr-xr-x - user supergroup 0 2023-04-12 12:34 /tmp

# Create a new directory in HDFS
$ ./hdfs dfs -mkdir /user/example

# Upload a local file to HDFS
$ ./hdfs dfs -put local_file.txt /user/example/

# Download a file from HDFS to the local filesystem
$ ./hdfs dfs -get /user/example/local_file.txt .
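Note that the ./ prefix is only needed because these commands are run from inside the Hadoop bin directory. If you add that directory to your PATH (the path below matches the example above; adjust it to your installation), you can run hdfs from anywhere:

# Make the Hadoop commands available in the current shell session
$ export PATH=$PATH:/usr/local/hadoop/bin

# Now the command works from any directory
$ hdfs dfs -ls /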

The HDFS FS Shell provides a wide range of commands for managing files and directories in HDFS, including -ls, -mkdir, -put, -get, -rm, and more. You can find a complete list of available commands by running hdfs dfs without any arguments.
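You can also get detailed usage information for any single command by passing its name to -help. For example, to see all the options supported by ls:

$ hdfs dfs -help ls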

Listing Files and Directories in HDFS

One of the most common operations in HDFS is listing the files and directories in the file system. The HDFS FS Shell provides several commands for this purpose, allowing you to view the contents of HDFS directories and retrieve information about files and directories.

Listing the Root Directory

To list the contents of the HDFS root directory, you can use the following command:

$ hdfs dfs -ls /

This will display a list of all the files and directories in the root directory, including their permissions, owner, group, size, and modification time.
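Reading one entry of that output from left to right (the values below are illustrative), each line shows the permissions, the replication factor (displayed as - for directories), the owner, the group, the size in bytes, the modification date and time, and the full path:

drwxr-xr-x - user supergroup 0 2023-04-12 12:34 /user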

Listing a Specific Directory

To list the contents of a specific directory in HDFS, you can use the following command:

$ hdfs dfs -ls /user/example

This will display the contents of the /user/example directory.
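If the directory contains large files, adding the -h flag prints sizes in a human-readable form (such as KB, MB, or GB) instead of raw byte counts:

$ hdfs dfs -ls -h /user/example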

Recursive Listing

If you want to list the contents of a directory and its subdirectories recursively, you can use the -R option:

$ hdfs dfs -ls -R /user/example

This will display the contents of the /user/example directory and all its subdirectories.
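A recursive listing can get long for a deep directory tree. If you only need a summary, the -count command prints the number of subdirectories, the number of files, the total content size in bytes, and the path:

$ hdfs dfs -count /user/example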

Displaying File and Directory Details

To display specific metadata for a file or directory in HDFS, you can use the -stat command. By default it prints only the modification time, but you can supply a format string to show other attributes, such as the size in bytes (%b), the replication factor (%r), and the block size (%o):

$ hdfs dfs -stat "%b %r %o %y" /user/example/file.txt

This will display the size, replication factor, block size, and modification time of the file.txt file in the /user/example directory.
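To see how much space each entry in a directory occupies, you can use the -du command; combining it with -h again gives human-readable sizes:

$ hdfs dfs -du -h /user/example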

By using these HDFS FS Shell commands, you can effectively list and explore the contents of your HDFS file system, which is an essential skill for working with Hadoop and big data applications.

Summary

This tutorial has provided a comprehensive guide on how to use the HDFS FS Shell to list files and directories in your Hadoop environment. By understanding the basic HDFS commands, you can now efficiently navigate and manage your big data storage, a crucial skill for any Hadoop developer or administrator.
