Introduction
In the vast expanse of a desert wasteland, a lone merchant embarks on a perilous journey, seeking to unravel the mysteries hidden beneath the scorching sands. The merchant's goal is to uncover ancient relics and artifacts, unlocking the secrets of a long-forgotten civilization. However, the sheer volume of data buried within the desert poses a formidable challenge, requiring the power of Hadoop MapReduce to process and analyze the information effectively.
Implementing the Mapper
In this step, we will create a Mapper class to process the raw data obtained from the desert excavations. Our objective is to extract relevant information from the data and prepare it for further analysis by the Reducer.
Use the su - hadoop command to switch to the hadoop user; the shell drops you into the /home/hadoop home directory automatically. From there, run ls . to confirm that the data file data*.txt is present:
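su - hadoop
ls .

Then create and populate the ArtifactMapper.java file in that directory with the following code: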
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class ArtifactMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line into words
        String[] tokens = value.toString().split("\\s+");
        // Emit each word with a count of 1
        for (String token : tokens) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}
In the ArtifactMapper class, we extend the Mapper class provided by Hadoop. The map method is overridden to process each input key-value pair.
- The input key is a LongWritable representing the byte offset of the input line, and the input value is a Text object containing the line of text from the input file.
- The map method splits the input line into individual words using the split method with the regular expression "\\s+", which matches one or more whitespace characters.
- For each word, the map method sets it on a Text object and emits that object as the key, along with a constant LongWritable value of 1 as the value, representing a single occurrence of the word.
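To make the Mapper's behavior concrete, here is a minimal plain-Java sketch that mimics the tokenize-and-emit logic outside of Hadoop. The sample line is hypothetical; it is not taken from the lab's data files:
public class MapperSketch {
    public static void main(String[] args) {
        // Hypothetical input line, standing in for one line of data*.txt
        String line = "golden idol buried golden";
        // Same tokenization as ArtifactMapper.map
        for (String token : line.split("\\s+")) {
            // Each token would be emitted as (word, 1)
            System.out.println("(" + token + ", 1)");
        }
    }
}
Running this prints (golden, 1), (idol, 1), (buried, 1), (golden, 1); in the real job, the framework then groups these pairs by key before they reach the Reducer.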
Implementing the Reducer
In this step, we will create a Reducer class to aggregate the data emitted by the Mapper. The Reducer will count the occurrences of each word and produce the final output.
Create and populate the ArtifactReducer.java file in the /home/hadoop directory with the following code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ArtifactReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts emitted for this word
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        // Emit the word with its total count
        context.write(key, new LongWritable(sum));
    }
}
In the ArtifactReducer class, we extend the Reducer class provided by Hadoop. The reduce method is overridden to aggregate the values associated with each key.
- The input key is a Text object representing the word, and the input values are an Iterable of LongWritable objects representing the counts of that word emitted by the Mappers.
- The reduce method iterates over the values and calculates the sum of all the counts for the given word.
- The reduce method then emits the word as the key and the total count as the value, using context.write.
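As an illustration of that aggregation, here is a minimal plain-Java sketch with hypothetical values, run outside Hadoop:
import java.util.Arrays;
import java.util.List;

public class ReducerSketch {
    public static void main(String[] args) {
        // Hypothetical counts emitted by the Mappers for the word "golden"
        List<Long> values = Arrays.asList(1L, 1L, 1L);
        long sum = 0;
        for (long value : values) {
            sum += value; // same accumulation as ArtifactReducer.reduce
        }
        // Mirrors context.write(key, new LongWritable(sum)): key<TAB>total
        System.out.println("golden\t" + sum);
    }
}
For the hypothetical input above this prints "golden	3", which is exactly the word-count line the real Reducer writes to the job's output.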
Creating the Driver
In this step, we will create a Driver class to orchestrate the MapReduce job. The Driver will configure the job, specify the input and output paths, and submit the job to the Hadoop cluster.
Create and populate the ArtifactDriver.java file in the /home/hadoop directory with the following code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class ArtifactDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Artifact Word Count");
        // Specify the job's jar file by class
        job.setJarByClass(ArtifactDriver.class);
        // Set the Mapper and Reducer classes
        job.setMapperClass(ArtifactMapper.class);
        job.setReducerClass(ArtifactReducer.class);
        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
In the ArtifactDriver class, we create a MapReduce job and configure it to run our ArtifactMapper and ArtifactReducer classes.
- The main method creates a new Configuration object and a Job object with the custom name "Artifact Word Count".
- The setMapperClass and setReducerClass methods specify the Mapper and Reducer classes to be used in the job.
- The setOutputKeyClass and setOutputValueClass methods specify the output key and value types for the job.
- The FileInputFormat.addInputPath method specifies the input path for the job, taken from the first command-line argument.
- The FileOutputFormat.setOutputPath method specifies the output path for the job, taken from the second command-line argument.
- The job.waitForCompletion method submits the job and waits for its completion. The program exits with a status code of 0 if the job succeeds, or 1 if it fails.
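A common optional optimization, not used in this lab, is to reuse ArtifactReducer as a combiner so that partial counts are summed on the map side before the shuffle; this is safe here because summing counts is associative and commutative. A sketch of such a driver variant (the class name ArtifactDriverWithCombiner is hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ArtifactDriverWithCombiner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Artifact Word Count");
        job.setJarByClass(ArtifactDriverWithCombiner.class);
        job.setMapperClass(ArtifactMapper.class);
        // Reuse the Reducer as a combiner to pre-aggregate map output
        job.setCombinerClass(ArtifactReducer.class);
        job.setReducerClass(ArtifactReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}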
Compiling and Running the Job
In this step, we will compile the Java classes and run the MapReduce job on the Hadoop cluster.
First, we need to compile the Java classes:
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. *.java
This command compiles the Java classes and places the resulting .class files in the current directory. The -classpath option includes the Hadoop library paths needed to compile code that uses Hadoop classes. The -source and -target options pin the Java source and bytecode versions so the compiled classes match the Java version used by the Hadoop cluster.
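If compilation succeeds, one .class file per class should appear alongside the sources; you can confirm that with:
ls *.class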
Next, package the compiled class files into a JAR with the jar command:
jar -cvf Artifact.jar *.class
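To double-check what was packaged, you can optionally list the archive's contents:
jar -tf Artifact.jar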
Finally, we can run the MapReduce job. All the desert data is already stored in the /input directory in HDFS:
hadoop jar Artifact.jar ArtifactDriver /input /output
After executing the command, you should see logs indicating the progress of the MapReduce job. Once the job is complete, you can find the output files in the /output HDFS directory. Use the following commands to list and view the results:
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
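Each line of part-r-00000 holds one word and its total count, separated by a tab, with the words in sorted key order. The actual words and numbers depend entirely on the contents of the data files; a purely illustrative example:
amulet	3
golden	7
idol	2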
Summary
Congratulations! You have successfully explored the process of coding Mappers and Reducers for a Hadoop MapReduce job. Guided by a scenario involving a desert merchant seeking ancient relics, you harnessed the power of Hadoop MapReduce to analyze a large volume of desert data. The ArtifactMapper class extracted the relevant data, the ArtifactReducer class aggregated the Mapper's output, and the ArtifactDriver class orchestrated the job from configuration through completion. Along the way, the emphasis was on complete code examples and verification checks, and this hands-on experience deepened your grasp of Hadoop MapReduce.