Unlock the Secrets of the Distributed Cache in Hadoop

Introduction

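In the ancient ruins of a lost civilization, a group of modern explorers discovered a hidden temple dedicated to the god of knowledge and wisdom. Its walls were covered with intricate hieroglyphs that held the secrets of an advanced data processing system once used by the ancient priests.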

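One of the explorers, a skilled Hadoop engineer, took on the role of high priest, deciphering the hieroglyphs and unraveling the temple's mysteries. The goal was to reconstruct that ancient data processing system and, as the priests had done centuries before, use Hadoop's distributed cache to process large datasets efficiently.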

Preparing the Dataset and Code

In this step, we set up the files and code needed to simulate the ancient data processing system.

First, switch to the hadoop user. Because the command starts a login shell, it also places you in that user's home directory.

su - hadoop

Create a new directory named distributed-cache-lab and change into it.

mkdir distributed-cache-lab
cd distributed-cache-lab

Next, create a text file named ancient-texts.txt with the following content.

The wisdom of the ages is eternal.
Knowledge is the path to enlightenment.
Embrace the mysteries of the universe.

This file represents the ancient texts we want to process.

Now create a Java file named AncientTextAnalyzer.java with the following code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class AncientTextAnalyzer {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        // Constant count of 1 emitted for every token.
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: AncientTextAnalyzer <in> <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "Ancient Text Analyzer");
        job.setJarByClass(AncientTextAnalyzer.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This code is a simple MapReduce program that counts how many times each word occurs in the input file. We will use it to demonstrate the distributed cache in Hadoop.
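To make the map and reduce steps concrete, the following is a minimal plain-Java sketch of the same counting logic with no Hadoop dependency. The class name WordCountSketch and the method countWords are hypothetical and not part of the lab; a TreeMap stands in for the sorted reducer output, which is an assumption that holds for ASCII keys like the ones used here.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of the lab's code.
class WordCountSketch {

    // Mirrors TokenizerMapper + IntSumReducer: split each line on whitespace
    // and add 1 to the running total of every token.
    static Map<String, Integer> countWords(String... lines) {
        Map<String, Integer> counts = new TreeMap<>(); // keys kept in sorted order
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(
            "The wisdom of the ages is eternal.",
            "Knowledge is the path to enlightenment.",
            "Embrace the mysteries of the universe.");
        // Print one tab-separated "word<TAB>count" pair per line.
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Note that tokens are case-sensitive and keep their punctuation, so "The" and "the" are counted separately and "eternal." includes the trailing period.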

Compiling and Packaging the Code

In this step, we compile the Java code and build a JAR file for deployment.

First, make sure the Hadoop core JAR files are on the classpath. You can download them from the Apache Hadoop website or use the ones shipped with your Hadoop installation.

Compile AncientTextAnalyzer.java.

javac -source 8 -target 8 -classpath "/home/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/common/lib/*" AncientTextAnalyzer.java

Now create a JAR file from the compiled class files.

jar -cvf ancient-text-analyzer.jar AncientTextAnalyzer*.class

Running the MapReduce Job with the Distributed Cache

In this step, we run the MapReduce job and use the distributed cache to make a copy of the input file available on the nodes of the cluster.

First, copy the input file ancient-texts.txt to the Hadoop Distributed File System (HDFS).

hadoop fs -mkdir /input
hadoop fs -put ancient-texts.txt /input/ancient-texts.txt

Next, run the MapReduce job with the distributed cache option.

hadoop jar ancient-text-analyzer.jar AncientTextAnalyzer -files ancient-texts.txt /input/ancient-texts.txt /output

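This command runs the AncientTextAnalyzer MapReduce job. The -files option places the local ancient-texts.txt file in the distributed cache, so each node that runs a task for the job receives a copy. The job's input is still read from HDFS at /input/ancient-texts.txt, and the results are written to /output, which must not already exist.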
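Files shipped with -files are made available in each task's working directory under their base name, so task code can open them like ordinary local files. The lab's mapper does not read the cached copy itself, so the following is only a sketch, under stated assumptions, of the loading step that a mapper's setup() method might perform. The class CachedFileLoader and the method loadWords are hypothetical, and the code is plain Java so that it runs without a cluster.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper; not part of the lab's code.
class CachedFileLoader {

    // Reads every whitespace-separated token of a local file into a set.
    // In a real job this would be called once from Mapper.setup() with the
    // base name of a file shipped via -files (assumed to be linked into the
    // task's working directory).
    static Set<String> loadWords(String localName) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    words.add(itr.nextToken());
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Assumes ancient-texts.txt exists in the current directory.
        Set<String> words = loadWords("ancient-texts.txt");
        System.out.println("Distinct tokens: " + words.size());
    }
}
```

Loading the file once per task and keeping the result in memory avoids re-reading it for every input record, which is the usual reason to ship small side files through the distributed cache.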

After the job completes, you can check the output.

hadoop fs -cat /output/part-r-00000

You should see word counts similar to the following.

Embrace 1
Knowledge       1
The     1
ages    1
enlightenment.  1
eternal.        1
is      2
mysteries       1
of      2
path    1
the     4
to      1
universe.       1
wisdom  1

Summary

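In this lab, we explored Hadoop's distributed cache by implementing an ancient text analysis system. Using the distributed cache, we distributed a file to the nodes of the cluster so that tasks running in parallel can read it locally instead of fetching it repeatedly over the network.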

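Through this hands-on exercise, we gained a deeper understanding of how Hadoop's distributed cache optimizes data processing in a distributed computing environment. Caching frequently accessed data across the cluster can significantly improve performance and reduce network traffic, especially when working with large datasets or complex computations.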

The lab also provided practical experience with Hadoop MapReduce, Java programming, and running jobs on a Hadoop cluster. Combining theory with hands-on practice strengthens your big data processing skills and prepares you for more advanced Hadoop challenges.
