|
CS 4300 / INFO 4300
Information Retrieval
Fall 2008
Map Reduce Programming Hints
Getting started
These notes are intended for Assignment 4 of CS/Info 4300. To complete this assignment you need to do two tasks:
You can write your Java program on any convenient system using the development environment that you are most comfortable with. You need to install the Hadoop libraries on the computer and ensure that you add the Hadoop libraries to the build path of your programming project. Instructions for how to do this for Windows and Macintosh computers are at:
Documentation about release 0.17.2 of the Hadoop API is at: http://hadoop.apache.org/core/docs/r0.17.2/api/. Refer to the API specification for the details of the classes that are introduced in these notes.
These notes refer to a sample program Indexer, with an associated class WordinDoc.
Keys and values in a Map Reduce program
Choice of keys
A MapReduce program consists of several passes through the data, with the output from one pass being the input to the next. Each pass goes through the following steps:
input (k1, v1) -> map -> (k2, v2) -> combine -> (k2, {v2}) -> reduce -> (k3, v3) output
In these steps, k1, k2, and k3 are keys; v1, v2, and v3 are values. Note that the output of the map step is a set of pairs, (k2, v2), while the input to the reduce step is a single record for each key, k2, and a list of values {v2}, all the values that have that key.
In designing a MapReduce program focus on the reduce step. Use the map and combine steps to bring together all the data that you wish to process together for a given key. For example:
If your program needs to group the data by different keys, use several passes with a different choice of k2 for each pass. The output of one pass is the input to the next.
Types of keys and values
Hadoop requires that keys and values implement the Writable interface and that keys implement the Writable Compare interface. For this purpose, Hadoop provides several classes, which are described in the API documentation.
The Text class is particularly versatile since it can represent any data that can be expressed as a string. One way to write a Hadoop programs is to have all keys and values as Text objects. With this approach, programs do a great deal of work converting other types to String and hence to Text, or in parsing String objects. It is better to use other classes provided by Hadoop, such as IntWritable or LongWritable.
Writing your own classes that implement the Writable or Writable Compare interfaces is straightforward. See WordinDoc for a simple example.
Hadoop assumes that the keys k2 and k3 are of the same type, and that the values v2 and v3 are of the same type.
The structure of a MapReduce program
Hadoop provides great flexibility and there are many ways to structure a Hadoop program. Indexer uses a structure that has proved suitable for a variety of simple programs.
The public class Indexer extends Configured and implements Tool. It has the following methods:
For each pass there must be:
Passing shared data to the map and reduce tasks
Because Mapper and Reducer tasks are typically running on different nodes of a cluster there is no direct communication between them. Special procedures are needed to initialize shared data. Indexer uses two such methods:
Input and output to the map and reduce tasks
Map input
The input to a MapReduce pass consists of one or more directories, each consisting of many files, which may themselves be very large. The InputFormat interface is complex and we recommend that you use one of the Hadoop standard FileInputFormat classes. You can specify the input format with the setInputFormat method of the JobConf class. Two input formats are particularly valuable.
These input formats do a lot of work for you, including the division of the input data into splits that are processed by separate map tasks.
The disadvantage is that the values are passed to the map method as Text, which has to be parsed by the program. The map method of Indexer shows how this can be done by converting to String and creating a StringTokenizer object.
Reduce input
For each key, k2, the input to the reduce method is a list of values, {v2}. Sometimes this may be a long list. Hadoop provides a simple way to read through this using the Iterator interface, as shown in the Indexer example.
Map and reduce output
The map and reduce methods use the same mechanism to write output records. This is the output.collect method. It has a key, value pair as its parameters and writes them out separated by a tab character.
[ Home | Syllabus | Readings | Assignments | Examinations | Academic Integrity ]
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 12, 2008