Introduction
In the vast expanse of a desert wasteland, a lone merchant embarks on a perilous journey, seeking to unravel the mysteries hidden beneath the scorching sands. The merchant's goal is to uncover ancient relics and artifacts, unlocking the secrets of a long-forgotten civilization. However, the sheer volume of data buried within the desert poses a formidable challenge, requiring the power of Hadoop MapReduce to process and analyze the information effectively.
Implementing the Mapper
In this step, we will create a Mapper class to process the raw data obtained from the desert excavations. Our objective is to extract relevant information from the data and prepare it for further analysis by the Reducer.
Use the su - hadoop command to switch to the hadoop user; the shell drops you into the /home/hadoop home directory automatically. From there, run ls . to confirm that the data file data*.txt is present:
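su - hadoop
ls .

Then create and populate the ArtifactMapper.java file in that directory with the following code: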
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class ArtifactMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line into words
        String[] tokens = value.toString().split("\\s+");
        // Emit each word with a count of 1
        for (String token : tokens) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}
In the ArtifactMapper class, we extend the Mapper class provided by Hadoop. The map method is overridden to process each input key-value pair.
- The input key is a LongWritable representing the byte offset of the input line, and the input value is a Text object containing the line of text from the input file.
- The map method splits the input line into individual words using the split method with the regular expression "\\s+", which matches one or more whitespace characters.
- For each word, the map method sets it on a Text object and emits that object as the key, along with a constant LongWritable value of 1 as the value, representing a single occurrence of the word.
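To make the Mapper's behavior concrete, here is a minimal plain-Java sketch that mimics the tokenize-and-emit logic outside of Hadoop. The sample line is hypothetical; it is not taken from the lab's data files:
public class MapperSketch {
    public static void main(String[] args) {
        // Hypothetical input line, standing in for one line of data*.txt
        String line = "golden idol buried golden";
        // Same tokenization as ArtifactMapper.map
        for (String token : line.split("\\s+")) {
            // Each token would be emitted as (word, 1)
            System.out.println("(" + token + ", 1)");
        }
    }
}
Running this prints (golden, 1), (idol, 1), (buried, 1), (golden, 1); in the real job, the framework then groups these pairs by key before they reach the Reducer.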
Implementing the Reducer
In this step, we will create a Reducer class to aggregate the data emitted by the Mapper. The Reducer will count the occurrences of each word and produce the final output.
Create and populate the ArtifactReducer.java file in the /home/hadoop directory with the following code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ArtifactReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts emitted for this word
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        // Emit the word with its total count
        context.write(key, new LongWritable(sum));
    }
}
In the ArtifactReducer class, we extend the Reducer class provided by Hadoop. The reduce method is overridden to aggregate the values associated with each key.
- The input key is a Text object representing the word, and the input values are an Iterable of LongWritable objects representing the counts of that word emitted by the Mappers.
- The reduce method iterates over the values and calculates the sum of all the counts for the given word.
- The reduce method then emits the word as the key and the total count as the value, using context.write.
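As an illustration of that aggregation, here is a minimal plain-Java sketch with hypothetical values, run outside Hadoop:
import java.util.Arrays;
import java.util.List;

public class ReducerSketch {
    public static void main(String[] args) {
        // Hypothetical counts emitted by the Mappers for the word "golden"
        List<Long> values = Arrays.asList(1L, 1L, 1L);
        long sum = 0;
        for (long value : values) {
            sum += value; // same accumulation as ArtifactReducer.reduce
        }
        // Mirrors context.write(key, new LongWritable(sum)): key<TAB>total
        System.out.println("golden\t" + sum);
    }
}
For the hypothetical input above this prints "golden	3", which is exactly the word-count line the real Reducer writes to the job's output.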
Creating the Driver
In this step, we will create a Driver class to orchestrate the MapReduce job. The Driver will configure the job, specify the input and output paths, and submit the job to the Hadoop cluster.
Create and populate the ArtifactDriver.java file in the /home/hadoop directory with the following code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class ArtifactDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Artifact Word Count");
        // Specify the job's jar file by class
        job.setJarByClass(ArtifactDriver.class);
        // Set the Mapper and Reducer classes
        job.setMapperClass(ArtifactMapper.class);
        job.setReducerClass(ArtifactReducer.class);
        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
In the ArtifactDriver class, we create a MapReduce job and configure it to run our ArtifactMapper and ArtifactReducer classes.
- The main method creates a new Configuration object and a Job object with the custom name "Artifact Word Count".
- The setMapperClass and setReducerClass methods specify the Mapper and Reducer classes to be used in the job.
- The setOutputKeyClass and setOutputValueClass methods specify the output key and value types for the job.
- The FileInputFormat.addInputPath method specifies the input path for the job, taken from the first command-line argument.
- The FileOutputFormat.setOutputPath method specifies the output path for the job, taken from the second command-line argument.
- The job.waitForCompletion method submits the job and waits for its completion. The program exits with a status code of 0 if the job succeeds, or 1 if it fails.
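A common optional optimization, not used in this lab, is to reuse ArtifactReducer as a combiner so that partial counts are summed on the map side before the shuffle; this is safe here because summing counts is associative and commutative. A sketch of such a driver variant (the class name ArtifactDriverWithCombiner is hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ArtifactDriverWithCombiner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Artifact Word Count");
        job.setJarByClass(ArtifactDriverWithCombiner.class);
        job.setMapperClass(ArtifactMapper.class);
        // Reuse the Reducer as a combiner to pre-aggregate map output
        job.setCombinerClass(ArtifactReducer.class);
        job.setReducerClass(ArtifactReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}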
Compiling and Running the Job
In this step, we will compile the Java classes and run the MapReduce job on the Hadoop cluster.
First, we need to compile the Java classes:
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. *.java
This command compiles the Java classes and places the resulting .class files in the current directory. The -classpath option includes the Hadoop library paths needed to compile code that uses Hadoop classes. The -source and -target options pin the Java source and bytecode versions so the compiled classes match the Java version used by the Hadoop cluster.
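If compilation succeeds, one .class file per class should appear alongside the sources; you can confirm that with:
ls *.class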
Next, package the compiled class files into a JAR with the jar command:
jar -cvf Artifact.jar *.class
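To double-check what was packaged, you can optionally list the archive's contents:
jar -tf Artifact.jar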
Finally, we can run the MapReduce job. All the desert data is already stored in the /input directory in HDFS:
hadoop jar Artifact.jar ArtifactDriver /input /output
After executing the command, you should see logs indicating the progress of the MapReduce job. Once the job is complete, you can find the output files in the /output HDFS directory. Use the following commands to list and view the results:
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
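Each line of part-r-00000 holds one word and its total count, separated by a tab, with the words in sorted key order. The actual words and numbers depend entirely on the contents of the data files; a purely illustrative example:
amulet	3
golden	7
idol	2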
Summary
Congratulations! You have successfully explored the process of coding Mappers and Reducers for a Hadoop MapReduce job. Guided by a scenario involving a desert merchant seeking ancient relics, you harnessed the power of Hadoop MapReduce to analyze a large volume of desert data. The ArtifactMapper class extracted the relevant data, the ArtifactReducer class aggregated the Mapper's output, and the ArtifactDriver class orchestrated the job from configuration through completion. Along the way, the emphasis was on complete code examples and verification checks, and this hands-on experience deepened your grasp of Hadoop MapReduce.