Implement the Join Operation
In this step, we will implement a MapReduce job to perform a join operation on the dinosaurs.txt
and locations.txt
files.
Create a new Java file named JoinDinosaurs.java
in the /home/hadoop/join-lab
directory with the following content:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class JoinDinosaurs {
public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
private final Text outKey = new Text();
private final Text outValue = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] parts = line.split(",");
if (parts.length == 2) { // locations.txt
outKey.set(parts[0]);
outValue.set("LOC:" + parts[1]);
} else if (parts.length == 3) { // dinosaurs.txt
outKey.set(parts[0]);
outValue.set("DIN:" + parts[1] + "," + parts[2]);
}
context.write(outKey, outValue);
}
}
public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
private final Text outValue = new Text();
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<String, String> dinMap = new HashMap<>();
StringBuilder locBuilder = new StringBuilder();
for (Text value : values) {
String valStr = value.toString();
if (valStr.startsWith("DIN:")) {
dinMap.put("DIN", valStr.substring(4));
} else if (valStr.startsWith("LOC:")) {
locBuilder.append(valStr.substring(4)).append(",");
}
if (locBuilder.length() > 0) {
locBuilder.deleteCharAt(locBuilder.length() - 1);
}
}
StringBuilder outBuilder = new StringBuilder();
for (Map.Entry<String, String> entry : dinMap.entrySet()) {
outBuilder.append(entry.getValue()).append("\t").append(locBuilder.toString().trim());
}
outValue.set(outBuilder.toString());
context.write(key, outValue);
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
if (args.length != 2) {
System.err.println("Usage: JoinDinosaurs <input_dir> <output_dir>");
System.exit(1);
}
Job job = Job.getInstance();
job.setJarByClass(JoinDinosaurs.class);
job.setJobName("Join Dinosaurs");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(JoinMapper.class);
job.setReducerClass(JoinReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
This code defines a MapReduce job with a custom JoinMapper
and JoinReducer
. The mapper reads the input data from dinosaurs.txt
and locations.txt
and emits key-value pairs with the dinosaur name as the key and the data type ("DIN" or "LOC") along with the corresponding value. The reducer then performs the join operation by grouping the values by key and combining the dinosaur information with the location.
To compile the code, run the following command:
mkdir classes
javac -source 8 -target 8 -cp "/home/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/common/lib/*" -d classes JoinDinosaurs.java
jar -cvf join-dinosaurs.jar -C classes/ .
Next, run the MapReduce job using the following command:
hadoop jar join-dinosaurs.jar JoinDinosaurs /home/hadoop/join-lab /home/hadoop/join-lab/output
This command runs the JoinDinosaurs
class from the join-dinosaurs.jar
file, with the input directory /home/hadoop/join-lab
(containing dinosaurs.txt
and locations.txt
) and the output directory /home/hadoop/join-lab/output
.
After the job completes successfully, you can view the output in the /home/hadoop/join-lab/output
directory.