Implement the Driver
In this step, we will create a Driver class to configure and run the MapReduce job.
First, create a Java file for the Driver class:
touch /home/hadoop/WordLengthDriver.java
Then, add the following code to the WordLengthDriver.java file:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordLengthDriver {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordLengthDriver <input> <output>");
            System.exit(1);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Word Length");

        // Configure the job: the JAR to ship, the mapper/reducer classes,
        // and the key/value types they produce.
        job.setJarByClass(WordLengthDriver.class);
        job.setMapperClass(WordLengthMapper.class);
        job.setReducerClass(WordLengthReducer.class);
        job.setOutputKeyClass(CompositeKey.class);
        job.setOutputValueClass(Text.class);

        // Input and output paths come from the command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and block until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
In the above code, we create a WordLengthDriver class that serves as the entry point for our MapReduce job. The main method takes two command-line arguments: the input path and the output path for the job.
Inside the main method, we create a Configuration object and a Job object. We then configure the job by setting the mapper and reducer classes, the output key and value classes, and the input and output paths.
Finally, we submit the job and wait for its completion. If the job completes successfully, we exit with a status code of 0; otherwise, we exit with a status code of 1.
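As an optional variation, the same configuration can be wrapped in Hadoop's Tool/ToolRunner helpers so that generic options such as -D property=value are parsed from the command line for you. The sketch below is illustrative only; the class name WordLengthTool is made up for this example, and the WordLengthDriver above is all you need for the rest of this step.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical alternative driver: same job setup, but run through ToolRunner.
public class WordLengthTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordLengthTool <input> <output>");
            return 1;
        }
        // getConf() returns the Configuration that ToolRunner has already
        // populated with any -D options from the command line.
        Job job = Job.getInstance(getConf(), "Word Length");
        job.setJarByClass(WordLengthTool.class);
        job.setMapperClass(WordLengthMapper.class);
        job.setReducerClass(WordLengthReducer.class);
        job.setOutputKeyClass(CompositeKey.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WordLengthTool(), args));
    }
}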
To compile the classes, package them into a JAR, and run the job, use the following commands:
javac -source 8 -target 8 -classpath "/home/hadoop/:/home/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/common/lib/*" -d /home/hadoop /home/hadoop/WordLengthMapper.java /home/hadoop/CompositeKey.java /home/hadoop/WordLengthReducer.java /home/hadoop/WordLengthDriver.java
jar cvf word-length.jar *.class
hadoop jar word-length.jar WordLengthDriver /input /output
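The first command compiles the classes against the Hadoop client libraries, the second packages them into word-length.jar, and the third submits the job, reading from /input and writing to /output in HDFS. If /input does not exist in your environment yet, you can create it and upload a local text file first (the file name words.txt below is only a placeholder):
hadoop fs -mkdir -p /input
hadoop fs -put words.txt /input
Also note that the job will fail if the /output directory already exists; remove it with hadoop fs -rm -r /output before re-running.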
Finally, we can check the results by running the following command:
hadoop fs -cat /output/*
Example output:
A:3 Amr
A:6 AADzCv
A:10 AlGyQumgIl
...
h:7 hgQUIhA
h:8 hyrjMGbY, hSElGKux
h:10 hmfHJjCkwB
...
z:6 zkpRCN
z:8 zfMHRbtk
z:9 zXyUuLHma