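Streamlining Magical Book Serialization with Hadoop MapReduce

Introduction

In this lab, you will learn how to streamline the cataloging process for a prestigious magic academy's vast collection of magical books. Using Hadoop MapReduce, you will handle the serialization of book data so that it can be processed and analyzed smoothly. This lets you store and process book information efficiently and, ultimately, serve the academy's faculty and students better.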

Implementing the Writable Interface

In this step, you will create a custom Writable class to represent book data. The class implements the Writable interface provided by Apache Hadoop, which enables efficient serialization and deserialization of data during MapReduce processing.

First, switch to the hadoop user with the su - hadoop command, then create a new Java file named Book.java in the /home/hadoop directory with the following content.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// The Book class represents a book with a title, author, and publication year.
// It implements the Writable interface for Hadoop's serialization.
public class Book implements Writable {
    // Private fields for the book's title, author, and year of publication.
    private String title;
    private String author;
    private int year;

    // Default constructor - required for creating a new instance for deserialization.
    public Book() {}

    // Constructor with parameters to initialize the fields with given values.
    public Book(String title, String author, int year) {
        this.title = title;
        this.author = author;
        this.year = year;
    }

    // The write method serializes the object's fields to the DataOutput.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(title); // Writes the title as length-prefixed modified UTF-8
        out.writeUTF(author); // Writes the author as length-prefixed modified UTF-8
        out.writeInt(year); // Writes the publication year as a 4-byte integer
    }

    // The readFields method deserializes the object's fields from the DataInput.
    // Fields must be read in the same order in which write emitted them.
    @Override
    public void readFields(DataInput in) throws IOException {
        title = in.readUTF(); // Reads the title written by writeUTF
        author = in.readUTF(); // Reads the author written by writeUTF
        year = in.readInt(); // Reads the publication year as an integer
    }

    // The toString method provides a string representation of the object,
    // which is useful for printing and logging.
    @Override
    public String toString() {
        return "Title: " + title + ", Author: " + author + ", Year: " + year;
    }

    // Getters and setters are omitted for brevity but are necessary for accessing the fields.
}

The Book class contains fields for a book's title, author, and publication year. The write method serializes the book data to a byte stream, and the readFields method deserializes it from a byte stream. Both methods are required by the Writable interface.
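The round trip described above can be tried without a Hadoop installation. The following is a minimal sketch, not the lab's own code: BookRecord and RoundTripSketch are hypothetical names, and BookRecord only mirrors the field layout of Book using the JDK's java.io streams instead of implementing Hadoop's Writable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: exercises the write/readFields pattern with plain JDK streams.
class RoundTripSketch {

    // Hypothetical stand-in for Book: same fields and field order, no Hadoop dependency.
    static class BookRecord {
        private String title;
        private String author;
        private int year;

        // No-argument constructor, used to create an empty instance before readFields.
        BookRecord() {}

        BookRecord(String title, String author, int year) {
            this.title = title;
            this.author = author;
            this.year = year;
        }

        // Same field order as Book.write.
        void write(DataOutput out) throws IOException {
            out.writeUTF(title);
            out.writeUTF(author);
            out.writeInt(year);
        }

        // Reads the fields back in exactly the order write emitted them.
        void readFields(DataInput in) throws IOException {
            title = in.readUTF();
            author = in.readUTF();
            year = in.readInt();
        }

        @Override
        public String toString() {
            return "Title: " + title + ", Author: " + author + ", Year: " + year;
        }
    }

    // Serializes a record into an in-memory byte array.
    static byte[] serialize(BookRecord book) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            book.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds a record from bytes: create an empty instance, then fill it via readFields.
    static BookRecord deserialize(byte[] bytes) {
        BookRecord copy = new BookRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            copy.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copy;
    }

    public static void main(String[] args) {
        // Sample values are illustrative only.
        BookRecord original = new BookRecord("Example Title", "Example Author", 1999);
        byte[] bytes = serialize(original);
        System.out.println(bytes.length + " bytes");
        System.out.println(deserialize(bytes));
    }
}
```

With these sample values the payload is 35 bytes: each string costs a 2-byte length prefix plus one byte per ASCII character, and the year takes 4 bytes. The sketch also shows why the no-argument constructor matters: deserialization starts from an empty object and fills it in.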

Next, compile the Java class with the following command.

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. Book.java

Implementing the Mapper and Reducer

In this step, you will create Mapper and Reducer classes to process the book data using the MapReduce paradigm.

Custom BookMapper

First, create a new Java file named BookMapper.java in the /home/hadoop directory with the following content.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// BookMapper extends the Mapper class to process text input files
// Input key-value pairs are LongWritable (byte offset of the line in the file) and Text (line content)
// Output key-value pairs are Text (author name) and Book (book details)
public class BookMapper extends Mapper<LongWritable, Text, Text, Book> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line by comma
        String[] bookData = value.toString().split(",");
        // Extract title, author, and year from the input line
        String title = bookData[0];
        String author = bookData[1];
        int year = Integer.parseInt(bookData[2]);
        // Write the author and book details to the context
        context.write(new Text(author), new Book(title, author, year));
    }
}

The BookMapper class takes one line of input in the format "title,author,year" and emits a key-value pair with the author as the key and a Book object as the value.
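The mapper above assumes every line is well formed; a line with a missing field or a non-numeric year would throw an exception inside map. The sketch below shows one possible way to parse a line defensively. BookLineParser and ParsedBook are hypothetical names that do not appear in the lab, and the sample lines are illustrative.

```java
import java.util.Optional;

// Hypothetical helper: parses one "title,author,year" line and rejects malformed input.
class BookLineParser {

    // Hypothetical value holder for the three parsed fields.
    static final class ParsedBook {
        final String title;
        final String author;
        final int year;

        ParsedBook(String title, String author, int year) {
            this.title = title;
            this.author = author;
            this.year = year;
        }
    }

    // Returns the parsed fields, or Optional.empty() if the line is malformed.
    static Optional<ParsedBook> parse(String line) {
        // A limit of -1 keeps trailing empty fields, so "a,b,1999," counts as four parts.
        String[] parts = line.split(",", -1);
        if (parts.length != 3) {
            return Optional.empty();
        }
        String title = parts[0].trim();
        String author = parts[1].trim();
        if (title.isEmpty() || author.isEmpty()) {
            return Optional.empty();
        }
        try {
            int year = Integer.parseInt(parts[2].trim());
            return Optional.of(new ParsedBook(title, author, year));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("Example Title, Example Author, 1999").isPresent());
        System.out.println(parse("Example Title, Example Author").isPresent());
        System.out.println(parse("Example Title, Example Author, unknown").isPresent());
    }
}
```

In a mapper, an empty result could simply be skipped instead of written to the context. Note that a plain comma split cannot handle titles that themselves contain commas; that would need a quoted format and a real CSV parser.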

Custom BookReducer

Next, create a new Java file named BookReducer.java in the /home/hadoop directory with the following content.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// BookReducer extends the Reducer class and receives all books that share an author
// Input key-value pairs are Text (author name) and Book (book details)
// Output key-value pairs are Text (author name) and Book (book details, passed through unchanged)
public class BookReducer extends Reducer<Text, Book, Text, Book> {
    @Override
    protected void reduce(Text key, Iterable<Book> values, Context context) throws IOException, InterruptedException {
        // Iterate through books for the same author and write each book to the context
        for (Book book : values) {
            context.write(key, book);
        }
    }
}

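The BookReducer class emits its input key-value pairs unchanged. Because the framework delivers all values that share a key to the same reduce call, the output is effectively grouped by author.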
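To make the grouping concrete, here is a small in-memory simulation of the map, shuffle, and reduce phases written with plain JDK collections. It is a sketch under stated assumptions rather than Hadoop code: GroupByAuthorSketch is a hypothetical name, the sample lines are illustrative, and a TreeMap stands in for the framework's sort-by-key step. The rendered lines follow the key, tab, value layout that Hadoop's default text output uses, with the value rendered the way Book.toString does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of map -> shuffle/sort -> identity reduce.
class GroupByAuthorSketch {

    // Takes "title,author,year" lines and returns "author<TAB>book" output lines.
    static List<String> run(List<String> lines) {
        // "Shuffle": collect values per author; TreeMap keeps the keys sorted.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            // "Map": split the line and emit (author, book description).
            String[] fields = line.split(",");
            String title = fields[0];
            String author = fields[1];
            int year = Integer.parseInt(fields[2]);
            String book = "Title: " + title + ", Author: " + author + ", Year: " + year;
            grouped.computeIfAbsent(author, k -> new ArrayList<>()).add(book);
        }

        // "Reduce": write every value back out under its key, unchanged.
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            for (String book : entry.getValue()) {
                output.add(entry.getKey() + "\t" + book);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Book B,Zed,2001",
                "Book A,Amy,1999",
                "Book C,Amy,2005");
        for (String line : run(lines)) {
            System.out.println(line);
        }
    }
}
```

In this simulation both of Amy's books come out before Zed's, even though Zed's line appears first in the input. On a real cluster the keys arrive sorted in the same way, but the order of values within a single key is not guaranteed.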

Compiling the Files

Finally, compile the Java classes with the following command.

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. BookMapper.java BookReducer.java

Running the MapReduce Job

In this step, you will create a Driver class to run the MapReduce job and process the book data.

Custom BookDriver

First, create a new Java file named BookDriver.java in the /home/hadoop directory with the following content.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// BookDriver sets up and submits the MapReduce job
public class BookDriver {
    public static void main(String[] args) throws Exception {
        // Create a new Hadoop job configuration
        Configuration conf = new Configuration();
        // Instantiate a Job object with the job configuration
        Job job = Job.getInstance(conf, "Book Processing");
        // Specify the job's jar file by class
        job.setJarByClass(BookDriver.class);
        // Set the Mapper class
        job.setMapperClass(BookMapper.class);
        // Set the Reducer class
        job.setReducerClass(BookReducer.class);
        // Set the output key class for the job
        job.setOutputKeyClass(Text.class);
        // Set the output value class for the job
        job.setOutputValueClass(Book.class);
        // Set the job's input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Set the job's output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with the job's completion status
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

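The BookDriver class sets up and runs the MapReduce job. It configures the job with the BookMapper and BookReducer classes, sets the input and output paths from the two command-line arguments, and waits for the job to complete.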

Running the Job

To run the MapReduce job, you need to compile the Java classes and create a JAR file. You can use the following commands.

## Compile the Java classes
javac -source 8 -target 8 -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:. BookDriver.java

## Create a JAR file
jar -cvf book.jar *.class

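Next, create an input directory in HDFS and copy some sample book data into it. The data file should contain one "title,author,year" record per line. You can use the following commands.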

## Create an input directory
hdfs dfs -mkdir /input

## Copy sample data to the input directory
hdfs dfs -put ./data.txt /input

Finally, run the MapReduce job with the following command.

hadoop jar book.jar BookDriver /input /output

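This command runs the BookDriver class on the input data in /input and writes the results to /output. The /output directory must not already exist, or the job will fail. You can view the results with the following command.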

hdfs dfs -cat /output/part-r-00000

Summary

Congratulations! You have worked through serialization in Hadoop MapReduce by creating a custom Writable class. Using the scenario of a magic school librarian managing a vast book collection, you built a Book class that implements the Writable interface for data serialization and deserialization. You wrote a BookMapper class that extracts book information and a BookReducer class that groups books by author, then coordinated the process with a BookDriver class. Along the way you compiled the Java classes, created a JAR file, and ran the job on a Hadoop cluster. This gave you hands-on experience with custom Writable classes, Mapper and Reducer classes, and the orchestration of MapReduce jobs for large-scale data processing.

Magical Serialization Mastery

Introduction