Introduction
In the era of dinosaurs, a fearless dinosaur hunter named Alex embarks on an exciting mission to uncover the secrets of these prehistoric creatures. Alex's goal is to gather valuable data from various sources and perform advanced analysis to gain insights into the behavior, diet, and evolution of different dinosaur species.
To achieve this, Alex needs to leverage the power of Hadoop MapReduce and its ability to perform join operations efficiently. By joining data from multiple sources, Alex can combine information about dinosaur fossils, their geological locations, and environmental conditions to paint a comprehensive picture of the dinosaur world.
Set up the Environment and Data
In this step, we will set up the necessary environment and prepare the data for the join operation.
First, switch to the hadoop user; the su - command also places you in the hadoop user's home directory:
su - hadoop
Create a new directory called join-lab to store our files:
mkdir join-lab
cd join-lab
Next, let's create two data files: dinosaurs.txt and locations.txt. These files will contain information about dinosaurs and their fossil locations, respectively.
Create dinosaurs.txt with the following content:
trex,Tyrannosaurus Rex,carnivore
velociraptor,Velociraptor,carnivore
brachiosaurus,Brachiosaurus,herbivore
stegosaurus,Stegosaurus,herbivore
Create locations.txt with the following content:
trex,North America
velociraptor,Asia
brachiosaurus,Africa
stegosaurus,North America
Finally, upload the join-lab directory to HDFS using the following commands:
hadoop fs -mkdir -p /home/hadoop
hadoop fs -put /home/hadoop/join-lab /home/hadoop/
Implement the Join Operation
In this step, we will implement a MapReduce job to perform a join operation on the dinosaurs.txt and locations.txt files.
Create a new Java file named JoinDinosaurs.java in the /home/hadoop/join-lab directory with the following content:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class JoinDinosaurs {

    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");

            if (parts.length == 2) { // locations.txt: id,location
                outKey.set(parts[0]);
                outValue.set("LOC:" + parts[1]);
                context.write(outKey, outValue);
            } else if (parts.length == 3) { // dinosaurs.txt: id,species,diet
                outKey.set(parts[0]);
                outValue.set("DIN:" + parts[1] + "," + parts[2]);
                context.write(outKey, outValue);
            }
            // Malformed lines are skipped rather than re-emitting stale values
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        private final Text outValue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String dinosaurInfo = null;
            StringBuilder locBuilder = new StringBuilder();

            for (Text value : values) {
                String valStr = value.toString();
                if (valStr.startsWith("DIN:")) {
                    dinosaurInfo = valStr.substring(4);
                } else if (valStr.startsWith("LOC:")) {
                    if (locBuilder.length() > 0) {
                        locBuilder.append(",");
                    }
                    locBuilder.append(valStr.substring(4));
                }
            }

            // Emit only keys present in dinosaurs.txt (an inner join on the id)
            if (dinosaurInfo != null) {
                outValue.set(dinosaurInfo + "\t" + locBuilder);
                context.write(key, outValue);
            }
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        if (args.length != 2) {
            System.err.println("Usage: JoinDinosaurs <input_dir> <output_dir>");
            System.exit(1);
        }

        Job job = Job.getInstance();
        job.setJarByClass(JoinDinosaurs.class);
        job.setJobName("Join Dinosaurs");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This code defines a MapReduce job with a custom JoinMapper and JoinReducer. The mapper tags each record by its source: lines with two fields come from locations.txt and are emitted with a "LOC:" prefix, while lines with three fields come from dinosaurs.txt and are emitted with a "DIN:" prefix, in both cases keyed by the dinosaur id (the first field). Because records from both files share the same key, the shuffle phase groups them together, and the reducer performs the join by combining the dinosaur details with its locations.
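The tag-and-group pattern can be sketched locally, without Hadoop, to see what the reducer receives for a single key. The class and method names below (JoinSketch, shuffle, reduceOne) are illustrative only and are not part of the lab's job:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A minimal local sketch of the reduce-side join: records are tagged with
// "DIN:"/"LOC:" by the mapper, grouped by key as in the shuffle phase, and
// combined per key as in the reducer.
public class JoinSketch {

    // Group tagged (key, value) pairs by key, as the shuffle phase would.
    static Map<String, List<String>> shuffle(List<String[]> taggedPairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] pair : taggedPairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Combine the "DIN:" record with any "LOC:" records for one key.
    static String reduceOne(List<String> values) {
        String din = "";
        List<String> locs = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("DIN:")) {
                din = v.substring(4);
            } else if (v.startsWith("LOC:")) {
                locs.add(v.substring(4));
            }
        }
        return din + "\t" + String.join(",", locs);
    }

    public static void main(String[] args) {
        List<String[]> mapped = new ArrayList<>();
        // Mapper output for one line of each input file:
        mapped.add(new String[]{"trex", "DIN:Tyrannosaurus Rex,carnivore"});
        mapped.add(new String[]{"trex", "LOC:North America"});

        for (Map.Entry<String, List<String>> e : shuffle(mapped).entrySet()) {
            // trex<TAB>Tyrannosaurus Rex,carnivore<TAB>North America
            System.out.println(e.getKey() + "\t" + reduceOne(e.getValue()));
        }
    }
}
```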
To compile the code and package it into a JAR, run the following commands:
mkdir classes
javac -source 8 -target 8 -cp "/home/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/common/lib/*" -d classes JoinDinosaurs.java
jar -cvf join-dinosaurs.jar -C classes/ .
Next, run the MapReduce job using the following command:
hadoop jar join-dinosaurs.jar JoinDinosaurs /home/hadoop/join-lab /home/hadoop/join-lab/output
This command runs the JoinDinosaurs class from the join-dinosaurs.jar file, with the input directory /home/hadoop/join-lab (containing dinosaurs.txt and locations.txt) and the output directory /home/hadoop/join-lab/output.
After the job completes successfully, you can view the output in the /home/hadoop/join-lab/output directory.
Analyze the Output
In this step, we will analyze the output of the join operation to gain insights into the dinosaur world.
First, let's check the contents of the output directory:
hadoop fs -ls /home/hadoop/join-lab/output
hadoop fs -cat /home/hadoop/join-lab/output/part-r-00000
You should see output similar to the following:
brachiosaurus Brachiosaurus,herbivore Africa
stegosaurus Stegosaurus,herbivore North America
trex Tyrannosaurus Rex,carnivore North America
velociraptor Velociraptor,carnivore Asia
This output shows the joined data, with each line containing the dinosaur name, its details (species and diet), and the location where its fossils were found.
Based on the output, we can make the following observations:
- The Tyrannosaurus Rex (T-Rex) and Velociraptor were carnivorous dinosaurs, while the Brachiosaurus and Stegosaurus were herbivores.
- The Brachiosaurus fossils were found in Africa, the Stegosaurus and Tyrannosaurus Rex fossils were found in North America, and the Velociraptor fossils were found in Asia.
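Simple follow-up analyses can be run directly on the joined lines. The sketch below, with an illustrative class name (OutputAnalysis) that is not part of the lab, counts fossils per location from lines in the same tab-separated format as the job's output; the sample lines are hard-coded here, whereas in practice you would read them from part-r-00000:

```java
import java.util.Map;
import java.util.TreeMap;

// Count dinosaurs per fossil location from joined output lines.
public class OutputAnalysis {

    // Each line is: id <TAB> species,diet <TAB> location
    static Map<String, Integer> countByLocation(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            counts.merge(fields[2], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "brachiosaurus\tBrachiosaurus,herbivore\tAfrica",
            "stegosaurus\tStegosaurus,herbivore\tNorth America",
            "trex\tTyrannosaurus Rex,carnivore\tNorth America",
            "velociraptor\tVelociraptor,carnivore\tAsia"
        };
        // prints {Africa=1, Asia=1, North America=2}
        System.out.println(countByLocation(lines));
    }
}
```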
These insights can help paleontologists better understand the distribution, behavior, and evolution of different dinosaur species across different geological regions.
Summary
In this lab, we explored the implementation of a join operation using Hadoop MapReduce. By combining data from multiple sources, we were able to gain valuable insights into the world of dinosaurs, including their species, diets, and fossil locations.
The lab introduced the concept of joining data using MapReduce, where the mapper prepares the data for the join operation, and the reducer performs the actual join by grouping the values by key and combining the information.
Through the hands-on experience of setting up the environment, preparing the data, implementing the MapReduce job, and analyzing the output, we gained practical knowledge of how to leverage Hadoop's powerful data processing capabilities to solve complex analytical problems.
This lab not only strengthened our understanding of join operations but also reinforced our skills in working with Hadoop MapReduce, writing Java code, and executing commands in a Linux environment. The experience of designing and implementing a complete solution from scratch was invaluable and will undoubtedly contribute to our growth as data engineers or data scientists.