Hadoop Olympiad Partitioning


Introduction

In the ancient Greek Olympiad, athletes from across the land would gather to showcase their prowess and compete in various athletic events. One such athlete, Alexios, had trained relentlessly for the upcoming games, determined to bring glory to his city-state.

The organizers' objective was to sort the participants into groups based on their events, ensuring a fair and efficient competition. With hundreds of athletes vying for glory, however, partitioning them into their respective events was a daunting task.



Implement the Mapper

In this step, we will create a Mapper class that reads input data and generates key-value pairs for the Partitioner to process.

First, switch to the hadoop user (the login shell also places you in that user's home directory):

su - hadoop

Then, create a Java file for the Mapper class:

touch /home/hadoop/OlympicMapper.java

Add the following code to the OlympicMapper.java file:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OlympicMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        String athlete = fields[0];
        String event = fields[1];
        context.write(new Text(event), new Text(athlete));
    }
}

In the OlympicMapper class, we define the input key as LongWritable (representing the line offset) and the input value as Text (representing a line of text from the input file). The output key is a Text object representing the event, and the output value is a Text object representing the athlete's name.

The map method splits each line of input data by the comma delimiter, extracts the athlete's name and event, and emits a key-value pair with the event as the key and the athlete's name as the value.
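
For example, assuming the input file uses a simple two-column CSV layout of athlete,event (an assumption about the sample data, not something the Mapper enforces), a line would be transformed like this:

Athlete_17,Event_1   →   key: Event_1, value: Athlete_17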

Implement the Partitioner

In this step, we will create a custom Partitioner class that partitions the key-value pairs based on the event.

First, create a Java file for the Partitioner class:

touch /home/hadoop/OlympicPartitioner.java

Then, add the following code to the OlympicPartitioner.java file:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class OlympicPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }
}

The OlympicPartitioner class extends the Partitioner class provided by Hadoop. It overrides the getPartition method, which takes the key, value, and the number of partitions as input.

The getPartition method calculates a hash code for the event (key) and returns the partition number by taking the absolute value of the hash code modulo the number of partitions. This ensures that all records with the same event are sent to the same partition for processing by the Reducer.
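
Note that the numPartitions argument equals the number of reduce tasks configured for the job. As a worked example with a made-up hash value: if key.hashCode() returned 1935487 for an event and the job ran with 5 reducers, the record would go to partition Math.abs(1935487 % 5) = 2, and every other record with that same event key would land in partition 2 as well.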

Implement the Reducer

In this step, we will create a Reducer class that processes the partitioned data and generates the final output.

First, create a Java file for the Reducer class:

touch /home/hadoop/OlympicReducer.java

Then, add the following code to the OlympicReducer.java file:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class OlympicReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder athletes = new StringBuilder();
        for (Text value : values) {
            athletes.append(value.toString()).append(",");
        }
        if (athletes.length() > 0) {
            athletes.deleteCharAt(athletes.length() - 1);
        }
        context.write(key, new Text(athletes.toString()));
    }
}

The OlympicReducer class extends the Reducer class provided by Hadoop. It defines the input key as Text (representing the event), the input value as Text (representing the athlete's name), and the output key and value as Text objects.

The reduce method is called for each unique event key, with an iterator over the athlete names associated with that event. It builds a comma-separated list of athletes for each event and emits a key-value pair with the event as the key and the list of athletes as the value.
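
To illustrate with hypothetical values: if the Reducer receives the key Event_1 with the values [Athlete_17, Athlete_18], it writes a single line in which the key and value are separated by a tab, the default separator used by TextOutputFormat:

Event_1	Athlete_17,Athlete_18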

Write the Driver

In this step, we will create a Driver class that ties together the Mapper, Partitioner, and Reducer classes and runs the MapReduce job.

First, create a Java file for the Driver class:

touch /home/hadoop/OlympicDriver.java

Then, add the following code to the OlympicDriver.java file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OlympicDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Olympic Partitioner");

        job.setJarByClass(OlympicDriver.class);
        job.setMapperClass(OlympicMapper.class);
        job.setPartitionerClass(OlympicPartitioner.class);
        job.setReducerClass(OlympicReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The OlympicDriver class is the entry point for the MapReduce job. It sets up the job configuration, specifies the Mapper, Partitioner, and Reducer classes, and configures the input and output paths.

In the main method, we create a new Configuration object and a Job instance with the job name "Olympic Partitioner". We set the Mapper, Partitioner, and Reducer classes using the corresponding setter methods.

We also set the output key and value classes to Text. The input and output paths are specified using command-line arguments passed to the driver.

Finally, we call the waitForCompletion method on the Job instance to run the MapReduce job and exit with an appropriate status code (0 for success, 1 for failure).
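
Two details are worth noting as optional additions rather than required lab code: the driver relies on Hadoop's default TextInputFormat, which supplies the LongWritable offset and Text line that OlympicMapper expects, and it never calls setNumReduceTasks, so the framework's default reducer count is used. A minimal sketch of making both explicit inside main, assuming you wanted five reducers, might look like this:

// Optional lines inside main(), after Job.getInstance(...):
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class); // already the default
job.setNumReduceTasks(5); // hypothetical choice: one reducer per event in the sample data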

To run the job, compile the Java classes, package them into a jar file, and then execute the jar with the following commands:

javac -source 8 -target 8 -classpath "/home/hadoop/:/home/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/common/lib/*" -d /home/hadoop /home/hadoop/OlympicMapper.java /home/hadoop/OlympicPartitioner.java /home/hadoop/OlympicReducer.java /home/hadoop/OlympicDriver.java
jar cvf olympic.jar *.class
hadoop jar olympic.jar OlympicDriver /input /output

Finally, we can check the results by running the following command:

hadoop fs -cat /output/*

Example output:

Event_1 Athlete_17,Athlete_18,Athlete_79,Athlete_71,Athlete_77,Athlete_75,Athlete_19,Athlete_24,Athlete_31,Athlete_32,Athlete_39,Athlete_89,Athlete_88,Athlete_87,Athlete_100,Athlete_13,Athlete_52,Athlete_53,Athlete_58
Event_2 Athlete_1,Athlete_97,Athlete_96,Athlete_85,Athlete_81,Athlete_80,Athlete_72,Athlete_68,Athlete_64,Athlete_61,Athlete_54,Athlete_48,Athlete_47,Athlete_43,Athlete_28,Athlete_23,Athlete_21,Athlete_15,Athlete_12,Athlete_3
Event_3 Athlete_11,Athlete_55,Athlete_8,Athlete_46,Athlete_42,Athlete_41,Athlete_40,Athlete_38,Athlete_33,Athlete_92,Athlete_29,Athlete_27,Athlete_25,Athlete_93,Athlete_22,Athlete_20,Athlete_98,Athlete_14,Athlete_69,Athlete_99,Athlete_66,Athlete_65
Event_4 Athlete_90,Athlete_50,Athlete_37,Athlete_36,Athlete_91,Athlete_74,Athlete_73,Athlete_63,Athlete_26,Athlete_78,Athlete_5,Athlete_62,Athlete_60,Athlete_59,Athlete_82,Athlete_4,Athlete_51,Athlete_86,Athlete_2,Athlete_94,Athlete_7,Athlete_95
Event_5 Athlete_34,Athlete_76,Athlete_57,Athlete_56,Athlete_30,Athlete_16,Athlete_6,Athlete_10,Athlete_83,Athlete_84,Athlete_70,Athlete_45,Athlete_44,Athlete_49,Athlete_9,Athlete_67,Athlete_35

Summary

In this lab, we explored the concept of the Hadoop Shuffle Partitioner by designing a scenario inspired by the ancient Greek Olympiad. We implemented a Mapper class to read input data and generate key-value pairs, a custom Partitioner class to partition the data based on the event, and a Reducer class to process the partitioned data and generate the final output.

Through this lab, we gained hands-on experience with the MapReduce programming model and learned how to use a custom Partitioner to distribute data across partitions efficiently. The ancient Greek Olympiad scenario provided an engaging context for understanding the practical applications of the Shuffle Partitioner.
