Ghostly Data Transformation Journey


Introduction

In this lab, you will learn how to customize input and output formats in Hadoop MapReduce to process data effectively. With guided instruction from the Ghost Tutor, you'll gain the skills to work with different types of data and unlock the full potential of the Hadoop ecosystem. Get ready to embark on an exciting journey to master the art of computing in the supernatural realm!



Writing the Mapper and Reducer

In this step, we'll dive into the heart of Hadoop MapReduce and create our own Mapper and Reducer classes.

Explore the Data File

First, switch to the hadoop user with the su - hadoop command. The data file data.txt, which contains the transcript of several people's conversations, is stored in the /user/hadoop/input directory of HDFS. Use the following command to view its contents:

hdfs dfs -cat /user/hadoop/input/data.txt
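
The exact contents depend on your environment, but since the custom input format you'll build later splits each line on a colon, the file presumably holds one utterance per line in a speaker: sentence form, along these lines (illustrative names and text only):

Alice: the old mansion felt colder than usual last night
Bob: I heard whispering coming from the empty hall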

Custom Mapper

Next, we'll create a custom Mapper class called WordCountMapper that extends Mapper<LongWritable, Text, Text, IntWritable>. This Mapper processes the input data and emits key-value pairs; note that the data it processes is the content of each conversation line, not the speakers' names. Complete the map method in WordCountMapper.java under /home/hadoop/ using the following example.

@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Convert the Text object to a String
    String line = value.toString();
    // Tokenize the string using StringTokenizer
    StringTokenizer tokenizer = new StringTokenizer(line);
    // Iterate through each token and write it to the context
    while (tokenizer.hasMoreTokens()) {
        // Set the current word
        word.set(tokenizer.nextToken().trim());
        // Write the word and its count to the context
        context.write(word, one);
    }
}
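
For reference, here is a minimal sketch of the surrounding class, assuming the word and one fields used above are declared as reusable Writable objects (a common word-count convention; the lab's skeleton file may already declare them):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reusable Writable objects referenced by the map method
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    // ... map method as completed above ...
}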

Then compile the code against Java 8 with the following command:

javac -source 8 -target 8 -cp $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. WordCountMapper.java

Custom Reducer

Finally, we'll create a custom Reducer class called WordCountReducer that extends Reducer<Text, IntWritable, Text, IntWritable>. This Reducer aggregates the values for each key and emits the final result. Complete the reduce method in WordCountReducer.java under /home/hadoop/ using the following example.

@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Initialize a variable to store the sum of counts for each word
    int sum = 0;
    // Iterate through all counts for the current word and calculate the total sum
    for (IntWritable value : values) {
        sum += value.get();
    }
    // Set the final count for the current word
    result.set(sum);
    // Write the word and its final count to the context
    context.write(key, result);
}
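
As with the Mapper, a minimal sketch of the surrounding class, assuming result is declared as a reusable IntWritable field (again, the skeleton file may already declare it):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reusable Writable object referenced by the reduce method
    private final IntWritable result = new IntWritable();

    // ... reduce method as completed above ...
}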

Then compile the code against Java 8 with the following command:

javac -source 8 -target 8 -cp $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. WordCountReducer.java

Writing Custom Input and Output Formats

In this step, we will specify the input and output formats for the MapReduce job.

Custom Input Format

First, let's create a custom input format to read data from a specific source. We'll define a class called PersonInputFormat that extends TextInputFormat and overrides the getCurrentValue method of its record reader, so that only the conversation text after the speaker's name is returned. Complete the getCurrentValue method in PersonInputFormat.java under /home/hadoop/ using the following example.

@Override
public synchronized Text getCurrentValue() {
    // Return the value of the current record, split it according to ":", remove the first and last blanks and set it to the Text object.
    Text value = new Text();
    Text line = super.getCurrentValue();
    String[] parts = line.toString().split(":");
    if (parts.length == 2) {
        value.set(parts[1].trim());
    }
    return value;
}
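
Strictly speaking, getCurrentValue belongs to the record reader rather than to the input format itself, so the override above lives on the LineRecordReader that the format returns. A plausible skeleton for the whole class, under that assumption, looks like this:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PersonInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // Return a line reader whose getCurrentValue strips the speaker's
        // name, keeping only the text after the ":" separator
        return new LineRecordReader() {
            // ... getCurrentValue method as completed above ...
        };
    }
}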

Then compile the code against Java 8 with the following command:

javac -source 8 -target 8 -cp $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. PersonInputFormat.java

Custom Output Format

Next, let's create a custom output format. We'll define a class named CSVOutputFormat that extends TextOutputFormat<Text, IntWritable> and writes each key-value pair as a comma-separated line. Complete the write method in CSVOutputFormat.java under /home/hadoop/ using the following example.

// Write the key-value pair to the output stream in CSV format
@Override
public void write(Text key, IntWritable value) throws IOException, InterruptedException {
    out.writeBytes(key.toString() + "," + value.toString() + "\n");
}
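
For context, here is one way the surrounding class could be structured. This sketch assumes the record writer is created in getRecordWriter and writes to a file named part.csv in the task's work directory (which would explain the part.csv file inspected at the end of the lab); the lab's skeleton may construct the path differently:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CSVOutputFormat extends TextOutputFormat<Text, IntWritable> {
    @Override
    public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext job) throws IOException {
        // Create part.csv in the task's work directory; the committer
        // promotes it to the final output directory on success
        Path workPath = ((FileOutputCommitter) getOutputCommitter(job)).getWorkPath();
        FSDataOutputStream out = workPath.getFileSystem(job.getConfiguration())
                .create(new Path(workPath, "part.csv"), false);

        return new RecordWriter<Text, IntWritable>() {
            @Override
            public void write(Text key, IntWritable value) throws IOException {
                // The write method completed above
                out.writeBytes(key.toString() + "," + value.toString() + "\n");
            }

            @Override
            public void close(TaskAttemptContext context) throws IOException {
                out.close();
            }
        };
    }
}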

Then compile the code against Java 8 with the following command:

javac -source 8 -target 8 -cp $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. CSVOutputFormat.java

Integrating With The Driver

In this final step, we'll modify the WordCountDriver class to utilize the custom input and output formats we created earlier.

Custom Driver

Complete the main method of WordCountDriver.java under /home/hadoop/ using the following example.

// Set Mapper and Reducer classes
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);

// Set input format class to custom input format PersonInputFormat
job.setInputFormatClass(PersonInputFormat.class);

// Set output format class to custom output format CSVOutputFormat
job.setOutputFormatClass(CSVOutputFormat.class);
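
Putting it together, a minimal main method might look like the sketch below. The job name and the output key/value declarations are assumptions based on the standard word-count pattern, so your skeleton file may differ slightly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "ghostly word count");
        job.setJarByClass(WordCountDriver.class);

        // Set Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Declare the key/value types the job emits
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Plug in the custom input and output formats
        job.setInputFormatClass(PersonInputFormat.class);
        job.setOutputFormatClass(CSVOutputFormat.class);

        // Input and output paths come from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}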

Executing the Job

To execute the Hadoop MapReduce job with the custom input and output formats, follow these steps:

  1. Compile WordCountDriver.java using the appropriate Hadoop dependencies.

    javac -source 8 -target 8 -cp $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. WordCountDriver.java
  2. Create a JAR file containing the compiled classes.

    jar -cvf mywordcount.jar *.class
  3. Run the WordCountDriver class with the appropriate input and output paths. Make sure the output path does not already exist.

    hadoop jar ./mywordcount.jar WordCountDriver /user/hadoop/input /output

This command will execute the Hadoop MapReduce job using the custom input and output formats we defined.

View Output Results

Use the following commands to check whether the result file was generated successfully:

hdfs dfs -ls /output
hdfs dfs -cat /output/part.csv
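
If everything worked, part.csv should contain one word per line followed by its count, separated by a comma, matching the format produced by CSVOutputFormat's write method.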

Summary

Congratulations! In this lab, you explored the intricacies of Hadoop MapReduce and mastered input and output format customization under the guidance of the Ghost Tutor. From writing Mapper and Reducer classes to defining custom input and output formats and wiring them into a driver, you've gained valuable hands-on experience. These exercises have deepened your understanding of Hadoop MapReduce's capabilities, and you are now equipped to tackle data challenges with confidence. This journey has not only enhanced your technical skills but also broadened your perspective on the potential of Hadoop MapReduce for big data processing.
