Magical Serialization Mastery


Introduction

In this lab, you will learn how to streamline the cataloging process for the vast collection of magical tomes at the illustrious Magical Academy. Through the power of Hadoop MapReduce, you will handle the serialization of book data, ensuring seamless processing and analysis. This will enable you to efficiently store and process the book information, ultimately better serving the academy's faculty and students.


Skills Graph

This lab covers the following Hadoop MapReduce skills: setting up MapReduce jobs, coding Mappers and Reducers, and handling serialization.

Implementing the Writable Interface

In this step, we will create a custom Writable class to represent the book data. This class will implement the Writable interface provided by Apache Hadoop, allowing for efficient serialization and deserialization of data during the MapReduce process.

First, you need to use the su - hadoop command to switch to the hadoop user, then create a new Java file named Book.java in the /home/hadoop directory with the following content:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// The Book class represents a book with a title, author, and publication year.
// It implements the Writable interface for Hadoop's serialization.
public class Book implements Writable {
    // Private fields for the book's title, author, and year of publication.
    private String title;
    private String author;
    private int year;

    // Default constructor - required for creating a new instance for deserialization.
    public Book() {}

    // Constructor with parameters to initialize the fields with given values.
    public Book(String title, String author, int year) {
        this.title = title;
        this.author = author;
        this.year = year;
    }

    // The write method serializes the object's fields to the DataOutput.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(title); // Writes the title as UTF-8
        out.writeUTF(author); // Writes the author as UTF-8
        out.writeInt(year); // Writes the publication year as an integer
    }

    // The readFields method deserializes the object's fields from the DataInput.
    @Override
    public void readFields(DataInput in) throws IOException {
        title = in.readUTF(); // Reads the title as UTF-8
        author = in.readUTF(); // Reads the author as UTF-8
        year = in.readInt(); // Reads the publication year as an integer
    }

    // The toString method provides a string representation of the object,
    // which is useful for printing and logging.
    @Override
    public String toString() {
        return "Title: " + title + ", Author: " + author + ", Year: " + year;
    }

    // Getters and setters are omitted for brevity; add them if other code needs to access these fields.
}

This Book class contains fields for the book's title, author, and publication year. The write method serializes the book data to a byte stream, while the readFields method deserializes the data from a byte stream. Both methods are required by the Writable interface.
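
To see how these two methods cooperate, the sketch below (not part of the lab, and compiled alongside Book.java only for illustration) serializes a hypothetical Book to an in-memory byte array and reads it back, which is essentially what Hadoop does when it moves records between the map and reduce phases:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

// Illustrative only: a manual round trip through Book's write/readFields methods.
public class BookRoundTrip {
    public static void main(String[] args) throws Exception {
        Book original = new Book("A History of Magic", "Bathilda Bagshot", 1947);

        // Serialize the Book into an in-memory byte stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize the bytes back into a fresh Book instance.
        Book copy = new Book();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy); // Title: A History of Magic, Author: Bathilda Bagshot, Year: 1947
    }
}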

Then, you will need to compile the Java class using the following command:

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. Book.java

Implementing the Mapper and Reducer

In this step, we will create a Mapper and a Reducer class to process the book data using the MapReduce paradigm.

Custom BookMapper

First, create a new Java file named BookMapper.java in the /home/hadoop directory with the following content:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// BookMapper extends the Mapper class to process text input files
// Input key-value pairs are LongWritable (line number) and Text (line content)
// Output key-value pairs are Text (author name) and Book (book details)
public class BookMapper extends Mapper<LongWritable, Text, Text, Book> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line by comma
        String[] bookData = value.toString().split(",");
        // Extract title, author, and year from the input line
        String title = bookData[0];
        String author = bookData[1];
        int year = Integer.parseInt(bookData[2]);
        // Write the author and book details to the context
        context.write(new Text(author), new Book(title, author, year));
    }
}

This BookMapper class takes a line of input data in the format "title,author,year" and emits a key-value pair with the author as the key and a Book object as the value.
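
For example, a hypothetical input line such as

A History of Magic,Bathilda Bagshot,1947

would cause the mapper to emit the key Bathilda Bagshot (as a Text) paired with a Book value holding that title, author, and year.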

Custom BookReducer

Next, create a new Java file named BookReducer.java in the /home/hadoop directory with the following content:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// BookReducer extends the Reducer class to aggregate book details by author
// Input key-value pairs are Text (author name) and Book (book details)
// Output key-value pairs are Text (author name) and Book (aggregated book details)
public class BookReducer extends Reducer<Text, Book, Text, Book> {
    @Override
    protected void reduce(Text key, Iterable<Book> values, Context context) throws IOException, InterruptedException {
        // Iterate through books for the same author and write each book to the context
        for (Book book : values) {
            context.write(key, book);
        }
    }
}

This BookReducer class simply emits the input key-value pairs as-is, effectively grouping the books by author.

Compile the Files

Finally, you will need to compile the Java classes using the following command:

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. BookMapper.java BookReducer.java

Running the MapReduce Job

In this step, we will create a Driver class to run the MapReduce job and process the book data.

Custom BookDriver

First, create a new Java file named BookDriver.java in the /home/hadoop directory with the following content:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// BookDriver sets up and submits the MapReduce job
public class BookDriver {
    public static void main(String[] args) throws Exception {
        // Create a new Hadoop job configuration
        Configuration conf = new Configuration();
        // Instantiate a Job object with the job configuration
        Job job = Job.getInstance(conf, "Book Processing");
        // Specify the job's jar file by class
        job.setJarByClass(BookDriver.class);
        // Set the Mapper class
        job.setMapperClass(BookMapper.class);
        // Set the Reducer class
        job.setReducerClass(BookReducer.class);
        // Set the output key class for the job
        job.setOutputKeyClass(Text.class);
        // Set the output value class for the job
        job.setOutputValueClass(Book.class);
        // Set the job's input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Set the job's output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with the job's completion status
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This BookDriver class sets up and runs the MapReduce job. It configures the job with the BookMapper and BookReducer classes, sets the input and output paths, and waits for the job to complete.
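
One detail worth noting: because BookMapper and BookReducer emit the same key and value types, setOutputKeyClass and setOutputValueClass cover both the map output and the final job output. If the two ever differed, the map output types would have to be declared separately, roughly as in this sketch:

// Only needed when the map output types differ from the job's final output types.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Book.class);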

Executing the Job

To run the MapReduce job, you will need to compile the BookDriver class and package all of the compiled classes into a JAR file. You can use the following commands:

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. BookDriver.java

## Create a JAR file
jar -cvf book.jar *.class

Next, you will need to create an input directory and copy some sample book data into it. You can use the following commands:

## Create an input directory
hdfs dfs -mkdir /input

## Copy sample data to the input directory
hdfs dfs -put ./data.txt /input
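
The commands above assume a data.txt file in the current directory. Its exact contents are not fixed by the lab; if you need to create it yourself, a hypothetical sample in the expected title,author,year format might look like this:

A History of Magic,Bathilda Bagshot,1947
Magical Drafts and Potions,Arsenius Jigger,1950
The Standard Book of Spells,Miranda Goshawk,1956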

Finally, you can run the MapReduce job using the following command:

hadoop jar book.jar BookDriver /input /output

This command runs the BookDriver class, reading the input data from /input and writing the results to /output. You can view the results with the following command:

hdfs dfs -cat /output/part-r-00000
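
With the hypothetical data.txt shown earlier, the default TextOutputFormat writes each author key and the corresponding Book's toString() output separated by a tab, so the result would look roughly like this:

Arsenius Jigger	Title: Magical Drafts and Potions, Author: Arsenius Jigger, Year: 1950
Bathilda Bagshot	Title: A History of Magic, Author: Bathilda Bagshot, Year: 1947
Miranda Goshawk	Title: The Standard Book of Spells, Author: Miranda Goshawk, Year: 1956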

Summary

Congratulations! You have successfully mastered serialization in Hadoop MapReduce by building custom Writable classes. In a scenario where the Head Librarian of a Magical Academy manages a vast book collection, you created a Book class that implements the Writable interface for seamless serialization and deserialization of data. You then wrote a BookMapper class to extract book information, a BookReducer class to group books by author, and a BookDriver class to orchestrate the whole process. Along the way you compiled the Java classes, packaged them into a JAR file, and executed the job on the Hadoop cluster, gaining valuable experience with custom Writable classes, Mappers and Reducers, and running MapReduce jobs for large-scale data processing.
