Streamlining Magical Book Serialization with Hadoop MapReduce

Introduction

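In this lab, you will learn how to streamline the cataloging of a vast collection of magical books at a renowned academy of magic. Using Hadoop MapReduce, you will serialize the book data so that it can be stored, processed, and analyzed efficiently, which in turn lets the library serve the academy's faculty and students better.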

Implementing the Writable Interface

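In this step, you will create a custom Writable class that represents book data. The class implements the Writable interface provided by Apache Hadoop, which lets Hadoop serialize and deserialize the data efficiently during a MapReduce job.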

First, switch to the hadoop user with the su - hadoop command, then create a new Java file named Book.java in the /home/hadoop directory with the following content:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// The Book class represents a book with a title, author, and publication year.
// It implements the Writable interface for Hadoop's serialization.
public class Book implements Writable {
    // Private fields for the book's title, author, and year of publication.
    private String title;
    private String author;
    private int year;

    // Default constructor - required for creating a new instance for deserialization.
    public Book() {}

    // Constructor with parameters to initialize the fields with given values.
    public Book(String title, String author, int year) {
        this.title = title;
        this.author = author;
        this.year = year;
    }

    // The write method serializes the object's fields to the DataOutput.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(title); // Writes the title in modified UTF-8 with a 2-byte length prefix
        out.writeUTF(author); // Writes the author in the same format
        out.writeInt(year); // Writes the publication year as a 4-byte integer
    }

    // The readFields method deserializes the object's fields from the DataInput.
    // Fields must be read in the same order in which write() wrote them.
    @Override
    public void readFields(DataInput in) throws IOException {
        title = in.readUTF(); // Reads the title
        author = in.readUTF(); // Reads the author
        year = in.readInt(); // Reads the publication year as an integer
    }

    // The toString method provides a string representation of the object,
    // which is useful for printing and logging.
    @Override
    public String toString() {
        return "Title: " + title + ", Author: " + author + ", Year: " + year;
    }

    // Getters and setters are omitted for brevity but are necessary for accessing the fields.
}

The Book class holds fields for a book's title, author, and publication year. The write method serializes the book data to a byte stream, and the readFields method deserializes it from a byte stream. The Writable interface requires both methods.
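To see what write and readFields do to the bytes, the round trip can be tried outside Hadoop. The following is a minimal sketch that uses only JDK streams; the class name BookRoundTrip, its helper methods, and the sample values are hypothetical and are not part of the lab's files. It writes the three fields in the same order as Book.write and reads them back in the same order as Book.readFields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-alone demo: mirrors Book's field order but does not use Hadoop.
class BookRoundTrip {
    // Serializes the three fields in the same order as Book.write().
    static byte[] serialize(String title, String author, int year) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(title);
            out.writeUTF(author);
            out.writeInt(year);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the fields back in the same order as Book.readFields()
    // and formats them the way Book.toString() does.
    static String deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String title = in.readUTF();
            String author = in.readUTF();
            int year = in.readInt();
            return "Title: " + title + ", Author: " + author + ", Year: " + year;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        byte[] data = serialize("Sample Title", "Sample Author", 1999);
        System.out.println(data.length + " bytes");
        System.out.println(deserialize(data));
    }
}
```

Each writeUTF call stores a two-byte length followed by the encoded characters, and writeInt stores four bytes, so the record stays compact. Reading the fields in a different order than they were written would corrupt the values, which is why write and readFields must stay in sync.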

Next, compile the Java class with the following command:

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. Book.java

Implementing the Mapper and Reducer

In this step, you will create the Mapper and Reducer classes that process the book data using the MapReduce paradigm.

Custom BookMapper

First, create a new Java file named BookMapper.java in the /home/hadoop directory with the following content:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// BookMapper extends the Mapper class to process text input files
// Input key-value pairs are LongWritable (byte offset of the line in the file) and Text (line content)
// Output key-value pairs are Text (author name) and Book (book details)
public class BookMapper extends Mapper<LongWritable, Text, Text, Book> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line by comma
        String[] bookData = value.toString().split(",");
        // Extract title, author, and year from the input line
        String title = bookData[0];
        String author = bookData[1];
        int year = Integer.parseInt(bookData[2]);
        // Write the author and book details to the context
        context.write(new Text(author), new Book(title, author, year));
    }
}

The BookMapper class takes one line of input in the "title,author,year" format and emits a key-value pair with the author as the key and a Book object as the value.
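The mapper above assumes that every line is well formed. As a hedged illustration of how a "title,author,year" line could be checked before a Book is built, here is a small JDK-only sketch. The BookLineParser class and its parse method are hypothetical and are not part of the lab's code.

```java
import java.util.Optional;

// Hypothetical helper, not part of the lab's code: validates one "title,author,year" line.
class BookLineParser {
    // Returns {title, author, year} with surrounding whitespace trimmed,
    // or an empty Optional if the line does not have exactly three fields
    // or the year is not an integer.
    static Optional<String[]> parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            return Optional.empty();
        }
        String title = parts[0].trim();
        String author = parts[1].trim();
        String year = parts[2].trim();
        try {
            Integer.parseInt(year);
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
        return Optional.of(new String[] {title, author, year});
    }

    public static void main(String[] args) {
        // Sample lines are made up for illustration.
        System.out.println(parse("Sample Title, Sample Author, 1999").isPresent());
        System.out.println(parse("missing fields").isPresent());
        System.out.println(parse("Sample Title,Sample Author,not-a-year").isPresent());
    }
}
```

A mapper could call such a helper and skip lines for which it returns an empty result instead of failing the task. Note that this simple split rejects titles that themselves contain commas; handling quoted fields would need a real CSV parser.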

Custom BookReducer

Next, create a new Java file named BookReducer.java in the /home/hadoop directory with the following content:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// BookReducer extends the Reducer class to group book details by author
// Input key-value pairs are Text (author name) and Book (book details)
// Output key-value pairs are Text (author name) and Book (book details, passed through unchanged)
public class BookReducer extends Reducer<Text, Book, Text, Book> {
    @Override
    protected void reduce(Text key, Iterable<Book> values, Context context) throws IOException, InterruptedException {
        // Iterate through books for the same author and write each book to the context
        for (Book book : values) {
            context.write(key, book);
        }
    }
}

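The BookReducer class emits its input key-value pairs unchanged. Because the framework delivers all values that share a key to the same reduce call, the output ends up grouped by author.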
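The grouping itself happens between the map and reduce phases, when the framework sorts the mapper output by key and collects the values for each key. The sketch below imitates that step in plain Java with a sorted map. It is only an approximation for illustration: the GroupByAuthorDemo class and the sample pairs are hypothetical, and real Hadoop compares Text keys by their bytes and makes no promise about the order of values within a key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo: imitates the sort-and-group step that runs before reduce().
class GroupByAuthorDemo {
    // Each input pair is {author, title}. The TreeMap keeps authors sorted,
    // and each author maps to the list of titles emitted under that key.
    static Map<String, List<String>> group(String[][] pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Sample mapper output, made up for illustration.
        String[][] mapperOutput = {
            {"Author B", "Title 1"},
            {"Author A", "Title 2"},
            {"Author B", "Title 3"},
        };
        // Each map entry corresponds to one reduce() call: a key plus its values.
        for (Map.Entry<String, List<String>> entry : group(mapperOutput).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

In the real job, BookReducer receives each author once together with an Iterable of that author's Book values, and simply writes every value back out.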

Compiling the Files

Finally, compile the Java classes with the following command:

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. BookMapper.java BookReducer.java

Running the MapReduce Job

In this step, you will create a Driver class that runs the MapReduce job to process the book data.

Custom BookDriver

First, create a new Java file named BookDriver.java in the /home/hadoop directory with the following content:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// BookDriver sets up and submits the MapReduce job
public class BookDriver {
    public static void main(String[] args) throws Exception {
        // Create a new Hadoop job configuration
        Configuration conf = new Configuration();
        // Instantiate a Job object with the job configuration
        Job job = Job.getInstance(conf, "Book Processing");
        // Specify the job's jar file by class
        job.setJarByClass(BookDriver.class);
        // Set the Mapper class
        job.setMapperClass(BookMapper.class);
        // Set the Reducer class
        job.setReducerClass(BookReducer.class);
        // Set the output key class for the job
        job.setOutputKeyClass(Text.class);
        // Set the output value class for the job
        job.setOutputValueClass(Book.class);
        // Set the job's input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Set the job's output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with the job's completion status
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The BookDriver class configures and runs the MapReduce job. It sets up the job with the BookMapper and BookReducer classes, sets the input and output paths, and waits for the job to complete.

Running the Job

To run the MapReduce job, you need to compile the Java classes and package them into a JAR file. You can use the following commands:

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. BookDriver.java

## Create a JAR file
jar -cvf book.jar *.class

Next, create an input directory in HDFS and copy the sample book data into it. You can use the following commands:

## Create an input directory
hdfs dfs -mkdir /input

## Copy sample data to the input directory
hdfs dfs -put ./data.txt /input

Finally, run the MapReduce job with the following command:

hadoop jar book.jar BookDriver /input /output

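This command runs the BookDriver class, reads the input data from /input, and writes the results to /output. The /output directory must not already exist, or the job fails when it is submitted. To view the results, use the following command: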

hdfs dfs -cat /output/part-r-00000
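Because BookDriver does not set an output format, the job uses Hadoop's default text output, which writes each record as the key, a tab character, and the value's toString() result. The following sketch reproduces that layout in plain Java so you know what a line in part-r-00000 should look like. The OutputLineDemo class and the sample values are hypothetical.

```java
// Hypothetical demo: shows the shape of one output line, assuming the default
// text output format (key, tab, value.toString()) and Book.toString() as defined above.
class OutputLineDemo {
    static String outputLine(String author, String title, int year) {
        // Same string that Book.toString() builds.
        String value = "Title: " + title + ", Author: " + author + ", Year: " + year;
        // The key (author) comes first, separated from the value by a tab.
        return author + "\t" + value;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        System.out.println(outputLine("Sample Author", "Sample Title", 1999));
    }
}
```

If Book did not override toString(), each value would instead be printed with the default Object representation, so the toString method is what makes the output readable.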

Summary

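Congratulations! You have worked through serialization in Hadoop MapReduce by building a custom Writable class. In the scenario of a head librarian managing a vast book collection at an academy of magic, you created a Book class that implements the Writable interface so that its data can be serialized and deserialized. You then wrote a BookMapper class that extracts book information, a BookReducer class that groups books by author, and a BookDriver class that ties the job together. Along the way you compiled the Java classes, packaged them into a JAR file, and ran the job on a Hadoop cluster, gaining hands-on experience with custom Writable classes, Mapper and Reducer classes, and job setup for large-scale data processing.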
