Understanding the Mapper Concept
The Mapper is a crucial component in the Hadoop MapReduce framework. It is responsible for processing the input data and transforming it into key-value pairs, which are then passed to the Shuffle and Sort phase.
The Role of the Mapper
The primary role of the Mapper is to apply a user-defined function to each input record, generating one or more key-value pairs as output. This transformation process is known as the Map phase.
The input to the Mapper is a set of key-value pairs, where the key is typically the byte offset of the record within the input split and the value is the record itself (for text input, a single line). The Mapper processes this input and produces a set of intermediate key-value pairs, which are then passed to the Reduce phase for further processing.
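To make the input side concrete, here is a small plain-Java sketch (no Hadoop dependencies; the class and method names are illustrative, not part of the Hadoop API) of how a text file's lines map to the byte offsets that TextInputFormat supplies as keys. It assumes single-byte characters and a single `\n` terminator per line:

```java
import java.util.ArrayList;
import java.util.List;

public class InputOffsets {
    // Compute the byte offset of each line, assuming ASCII text and
    // '\n' line terminators. TextInputFormat passes this offset to the
    // Mapper as the LongWritable key; the line itself becomes the value.
    public static List<Long> lineOffsets(String[] lines) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            offsets.add(offset);
            offset += line.length() + 1; // +1 for the '\n'
        }
        return offsets;
    }

    public static void main(String[] args) {
        String[] lines = {"hello world", "hadoop mapreduce"};
        List<Long> offs = lineOffsets(lines);
        for (int i = 0; i < lines.length; i++) {
            System.out.println("(" + offs.get(i) + ", \"" + lines[i] + "\")");
        }
        // prints (0, "hello world") then (12, "hadoop mapreduce")
    }
}
```

So for this two-line file, the Mapper would be invoked once with key 0 and value "hello world", and once with key 12 and value "hadoop mapreduce".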
Mapper Implementation
To implement a Mapper in Hadoop MapReduce, you create a custom class that extends the org.apache.hadoop.mapreduce.Mapper class and overrides its map() method, which is the core of the Mapper implementation.
The map() method takes three arguments:
- The input key: the byte offset of the record within the input split.
- The input value: the actual input data (for text input, one line).
- The Context: used to emit the intermediate key-value pairs via context.write().
Here's an example of a simple Mapper implementation in Java:
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // emit (word, 1)
        }
    }
}
```
In this example, the WordCountMapper class processes each line of input text and emits a key-value pair for every word, where the key is the word and the value is the integer 1, representing a single occurrence of that word.
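To see what that emission looks like concretely, here is a plain-Java sketch (no Hadoop dependencies; the class name is illustrative) that applies the same StringTokenizer logic to one line and collects the (word, 1) pairs into a list instead of writing them to the Context:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapEmissionSketch {
    // Mirror WordCountMapper's logic: split the line on whitespace and
    // record a "(word, 1)" pair for each token.
    public static List<String> mapLine(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            emitted.add("(" + tokenizer.nextToken() + ", 1)");
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [(the, 1), (quick, 1), (brown, 1), (fox, 1)]
        System.out.println(mapLine("the quick brown fox"));
    }
}
```

Note that duplicate words in a line each produce their own pair; the Mapper does no aggregation, leaving that to the Reducer.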
```mermaid
graph TD
A[Input Data] --> B[WordCountMapper]
B --> C["(word, 1)"]
C --> D[Shuffle & Sort]
```
The Mapper implementation is a crucial part of the Hadoop MapReduce workflow: it determines how the input data is transformed into intermediate key-value pairs, which the Reducer then consumes to produce the final output.
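As a final illustration of the hand-off to the Reducer, the following plain-Java sketch (no Hadoop dependencies; the class name is illustrative) approximates what Shuffle & Sort does with the Mapper's output: it groups the (word, 1) pairs by key in sorted order, so each word ends up with a list of its counts, which is exactly the shape the Reducer receives:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class ShuffleSketch {
    // Group the (word, 1) pairs produced by the map step by key.
    // TreeMap keeps the keys sorted, mimicking the sort in Shuffle & Sort.
    public static TreeMap<String, List<Integer>> shuffle(String[] lines) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                grouped.computeIfAbsent(tok.nextToken(), k -> new ArrayList<>())
                       .add(1);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        // prints {be=[1, 1], not=[1], or=[1], to=[1, 1]}
        System.out.println(shuffle(new String[]{"to be or not to be"}));
    }
}
```

A word-count Reducer would then simply sum each list, yielding be=2, not=1, or=1, to=2.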