How to recover deleted files from Trash in Hadoop HDFS?


Introduction

This tutorial will guide you through the process of recovering deleted files from the Trash directory in Hadoop's Distributed File System (HDFS). Whether you accidentally deleted an important file or need to restore data, this article will provide you with the necessary steps to retrieve your lost information and maintain the integrity of your Hadoop cluster.



Introduction to Hadoop HDFS

Hadoop Distributed File System (HDFS) is the primary storage system used by the Hadoop framework for big data processing. HDFS is designed to provide reliable, scalable, and fault-tolerant storage for large datasets.

What is HDFS?

HDFS is a distributed file system that runs on commodity hardware. It is designed to provide high-throughput access to application data and is suitable for applications that have large data sets. HDFS follows the master-slave architecture, where a single NameNode manages the file system namespace and regulates access to files by clients, while multiple DataNodes store and retrieve data.

Key Features of HDFS

  1. Scalability: HDFS can scale to hundreds of petabytes of storage and thousands of client nodes.
  2. Fault Tolerance: HDFS provides automatic data replication and recovery, ensuring that data is not lost even in the event of hardware failures.
  3. High Throughput: HDFS is optimized for high-throughput access to application data and is well-suited for large data sets.
  4. Compatibility: HDFS is compatible with a wide range of applications and tools, making it a versatile storage solution for big data processing.

HDFS Architecture

The HDFS architecture consists of a NameNode and multiple DataNodes. The NameNode is responsible for managing the file system namespace, while the DataNodes store and retrieve data blocks.

graph TD
    NameNode --> DataNode1
    NameNode --> DataNode2
    NameNode --> DataNode3
    DataNode1 --> DataBlocks["Data Blocks"]
    DataNode2 --> DataBlocks
    DataNode3 --> DataBlocks

HDFS Commands

HDFS provides a set of command-line tools for interacting with the file system. Some common HDFS commands include:

- hdfs dfs -ls: List the contents of a directory
- hdfs dfs -put: Copy files from the local file system to HDFS
- hdfs dfs -get: Copy files from HDFS to the local file system
- hdfs dfs -rm: Remove files or directories from HDFS
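As a quick illustration, the commands above can be combined into a short session. This is only a sketch: it assumes a running HDFS cluster, the paths such as /tmp/hdfs_demo are hypothetical, and the hdfs commands are guarded so they run only when the CLI is actually installed.

```shell
# Illustrative HDFS session; paths such as /tmp/hdfs_demo are hypothetical.
# The hdfs commands only run when the CLI is actually installed.
if command -v hdfs >/dev/null 2>&1; then
  echo "demo" > /tmp/demo.txt
  hdfs dfs -mkdir -p /tmp/hdfs_demo            # create a working directory
  hdfs dfs -put /tmp/demo.txt /tmp/hdfs_demo   # local -> HDFS
  hdfs dfs -ls /tmp/hdfs_demo                  # list its contents
  hdfs dfs -get /tmp/hdfs_demo/demo.txt /tmp/demo_copy.txt  # HDFS -> local
  hdfs dfs -rm /tmp/hdfs_demo/demo.txt         # delete (moved to Trash if enabled)
fi
demo_status=$?
```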

Trash Management in HDFS

HDFS provides a Trash feature to help users recover accidentally deleted files. When a file is deleted in HDFS, it is first moved to the Trash directory instead of being permanently removed.

Enabling Trash

The Trash feature in HDFS is disabled by default. To enable it, you need to modify the core-site.xml configuration file and set the following properties:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>0</value>
</property>

The fs.trash.interval property specifies the number of minutes a deleted file is retained in Trash before being permanently deleted; a value of 0 disables the Trash feature entirely. The fs.trash.checkpoint.interval property sets how often the Trash checkpointing process runs; a value of 0 makes it default to the value of fs.trash.interval.
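For example, the value 1440 used above corresponds to a 24-hour retention window, since the interval is expressed in minutes:

```shell
# fs.trash.interval is expressed in minutes; 1440 minutes = 24 hours = 1 day
interval_minutes=1440
interval_hours=$((interval_minutes / 60))
interval_days=$((interval_hours / 24))
echo "${interval_minutes} minutes = ${interval_hours} hours = ${interval_days} day(s)"
# → 1440 minutes = 24 hours = 1 day(s)
```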

Deleting Files and Using Trash

When a file is deleted in HDFS, it is first moved to the Trash directory. You can use the following command to delete a file and move it to Trash:

hdfs dfs -rm /path/to/file

The deleted file will now be available in the Trash directory, which is located at /user/<username>/.Trash/.
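Because the Trash location follows a predictable pattern, you can compute where a deleted file will land. The trash_path function below is a hypothetical convenience helper (not part of Hadoop) that builds the path from a username and the file's original location:

```shell
# Hypothetical helper: where HDFS moves a deleted file,
# i.e. /user/<username>/.Trash/Current<original path>
trash_path() {
  user="$1"
  original="$2"
  echo "/user/${user}/.Trash/Current${original}"
}

trash_path alice /data/reports/q1.csv
# → /user/alice/.Trash/Current/data/reports/q1.csv
```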

Emptying Trash

To permanently delete the contents of the Trash directory, you can use the following command:

hdfs dfs -expunge

This will remove all the files from the Trash directory, and they will no longer be recoverable.

Restoring Deleted Files from Trash

If you need to restore a file that was accidentally deleted, you can move it back out of the Trash directory:

hdfs dfs -mv /user/<username>/.Trash/Current/path/to/file /path/to/restore

This will move the file from the Trash directory back to its original location.

Recovering Deleted Files from Trash

When a file is deleted in HDFS, it is first moved to the Trash directory, where it is stored for a specified period of time before being permanently deleted. This provides a way for users to recover accidentally deleted files.

Locating Deleted Files in Trash

To locate a deleted file in the Trash directory, you can use the following command:

hdfs dfs -ls /user/<username>/.Trash/Current/

This will list all the files and directories that are currently in the Trash.

Restoring Deleted Files

To restore a deleted file, move it out of the Trash directory and back to its original (or any other) location:

hdfs dfs -mv /user/<username>/.Trash/Current/path/to/file /path/to/restore

Note that the parent directory of the destination must already exist; recreate it with hdfs dfs -mkdir -p if necessary.
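Going the other way, the original location can be derived from a Trash path by stripping the /user/<username>/.Trash/Current prefix. The restore_target function below is a sketch of that string manipulation, assuming the default Trash layout:

```shell
# Hypothetical helper: recover the original path of a file in Trash
# by stripping the /user/<username>/.Trash/Current prefix.
restore_target() {
  user="$1"
  trash_file="$2"
  echo "${trash_file#/user/${user}/.Trash/Current}"
}

restore_target alice /user/alice/.Trash/Current/data/reports/q1.csv
# → /data/reports/q1.csv
# The actual restore would then be:
#   hdfs dfs -mv /user/alice/.Trash/Current/data/reports/q1.csv /data/reports/q1.csv
```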


Configuring Trash Retention

The retention period for deleted files is controlled by the fs.trash.interval and fs.trash.checkpoint.interval properties in core-site.xml, as shown in the Enabling Trash section above. Increase fs.trash.interval to keep deleted files recoverable for longer, or decrease it to reclaim storage sooner; checkpoints older than the interval are permanently removed the next time the checkpointing process runs.

By understanding and utilizing the Trash feature in HDFS, you can effectively recover accidentally deleted files and maintain data integrity in your Hadoop cluster.

Summary

By following the instructions in this Hadoop tutorial, you will learn how to effectively manage the Trash directory, understand the process of recovering deleted files, and ensure the safety and reliability of your Hadoop HDFS data. This knowledge will empower you to maintain a robust and well-managed Hadoop ecosystem, enabling you to confidently handle data recovery scenarios and safeguard your valuable information.
