CS 4300 / INFO 4300
Information Retrieval
Fall 2008

Hadoop Hints


Getting started

These notes are intended for Assignment 4 of CS/Info 4300. To complete this assignment you have two tasks:

  1. Write a MapReduce program and create a jar file TestQuery.jar. (For hints, see MapReduceHints.html.)
  2. Run TestQuery.jar with the test data on the Hadoop cluster. This page provides hints on how to do this.

Information about the Hadoop cluster is at http://www.infosci.cornell.edu/hadoop/. Documentation about Hadoop and MapReduce is at http://www.infosci.cornell.edu/hadoop/docs.html.

Setting up your development environment

The Hadoop cluster is a Linux system with a specialized file system. To run programs on the cluster you will need to interact with it at the command level.

If you are inexperienced with Linux, you may prefer to write your MapReduce program on a Macintosh or Windows computer, create a jar file, and test it there with small data sets. When your program is working, move the jar file to the cluster for final testing and to run it with larger data sets.

If you choose to do program development on a Macintosh or Windows computer, you need to install the Hadoop libraries on the computer. Instructions for how to do this are at:

These instructions describe how to add the Hadoop libraries to your build path and run MapReduce programs on your local machine with small data sets. They also describe how to connect to the cluster using the ssh protocol.

Connecting to the Hadoop cluster

The address of the cluster is wl01.cac.cornell.edu. (Note that the second character is the letter "l".) To interact with the cluster you need to open a terminal window on your computer and connect to the cluster using the ssh (Secure Shell) protocol. If your netID is abc123, you will use the following command to connect:

   ssh abc123@wl01.cac.cornell.edu

You will then be prompted for your password. After you log in, you will see a prompt such as -sh-3.2$. In the following examples, the prompt is written simply as $.

Once connected using ssh, you are working entirely in a Linux environment. If you are unfamiliar with Linux commands, here is a very basic guide. We will schedule a CS/Info 4300 workshop that will help you learn the few commands that you need for your assignment.

The first time that you log in you will be asked to change your password. (When prompted "Kerberos 5 Password:" type in your current password.)

Transferring files between your local computer and the Hadoop cluster

When transferring files between your local computer and the Hadoop cluster, be very careful about file formats.

To copy files between your local computer and the cluster you have two options.
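
One widely used way to copy files over an ssh connection is scp. The sketch below is an assumption, not a course-prescribed method. It assumes that scp is installed on your local computer, reuses the example netID abc123, and uses a hypothetical file name results.txt for a file to bring back. It only prints the two commands; run the printed lines in a terminal on your local computer.

```shell
# Assumed values: replace with your own netID.
NETID=abc123
CLUSTER=wl01.cac.cornell.edu

# Copy the jar file from the local computer to your home directory on the cluster.
UPLOAD="scp TestQuery.jar ${NETID}@${CLUSTER}:"

# Copy a (hypothetical) results file from the cluster to the current local directory.
DOWNLOAD="scp ${NETID}@${CLUSTER}:results.txt ."

echo "$UPLOAD"
echo "$DOWNLOAD"
```

scp prompts for the same password as ssh. It copies files byte for byte, so any format differences between the two systems still need to be checked separately.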

Using the Hadoop cluster

When you are at the command prompt on wl01.cac.cornell.edu, you have access to two file systems.

The Hadoop file system

All commands to the Hadoop distributed file system begin with:

   $ hadoop dfs

The command:

   $ hadoop dfs -help

lists the commands supported by the Hadoop shell, and the command:

   $ hadoop dfs -help command-name

displays more detailed help for a command. These commands support most of the normal file system operations, such as copying files or changing file permissions, and a few special operations, such as changing the replication of files. At the CS/Info 4300 workshop we will demonstrate how to interact with the Hadoop file system.
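
A few commonly used file system commands are sketched below. The subcommands -ls, -mkdir, -put, -cat, and -get are standard Hadoop shell commands. The directory name docssmall is the input directory used in the example later on this page, while the file names doc1.txt and copy.txt are hypothetical. The helper function run is also an illustration, not part of Hadoop: it only prints each command unless the variable RUN is set to 1, so the sketch can be tried safely away from the cluster.

```shell
# Helper (illustrative): print the command by default; execute it only when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

run hadoop dfs -ls                                # list your directory in the Hadoop file system
run hadoop dfs -mkdir docssmall                   # create an input directory
run hadoop dfs -put doc1.txt docssmall/doc1.txt   # copy a local file into the Hadoop file system
run hadoop dfs -cat docssmall/doc1.txt            # print a file to the terminal
run hadoop dfs -get docssmall/doc1.txt copy.txt   # copy a file back to the local file system
```

On the cluster, you can either type the printed commands directly at the $ prompt or save the sketch to a file and run it with RUN=1 set.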

Running a MapReduce job using Hadoop on Demand

The basic Hadoop command to run a job is:

   $ hadoop jar jar-filename main-class arguments

For example, suppose that your jar file is called Indexer.jar and the main class is called setup.Indexer. To run this program with input directory docssmall and output directory testsmall, the Hadoop command is:

   $ hadoop jar Indexer.jar setup.Indexer docssmall testsmall

Before you execute this command, be sure that you do not already have a directory called testsmall. Hadoop throws an error if the output directory already exists.

Hadoop on Demand

Hadoop on Demand is a scheduling system for jobs to be run on the cluster. Please use it for all jobs, so that you do not use up the entire cluster and prevent other users from doing their work. Hadoop on Demand will create a virtual cluster with the number of nodes that you request. For CS/Info 4300 assignments, please do not request more than 8 nodes.

To submit a job using Hadoop on Demand, you create a file containing a list of Hadoop commands. For example, you might create a file called runIndexer that contains two commands:

   hadoop dfs -rmr testsmall
   hadoop jar Indexer.jar setup.Indexer docssmall testsmall

The first of these commands removes the directory testsmall if it already exists. The second runs the program.

To submit the job, the Hadoop on Demand command is (assuming that your netID is abc123):

   $ hod script -d /hdfs/abc123 -n 8 -s runIndexer

This command creates a virtual cluster for user abc123. The -n option allocates 8 nodes to the job, and the -s option gives the name of the script to execute, which here is runIndexer.
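
The steps above can be collected into a short shell session. This sketch writes the two-line script file runIndexer and then assembles the hod command from variables. The variable names NETID, NODES, SCRIPT, and HOD_CMD are illustrative. The hod command is printed rather than executed, so the sketch can be tried off the cluster. Note that -rmr takes a leading dash, like the other hadoop dfs subcommands.

```shell
NETID=abc123        # replace with your own netID
NODES=8             # CS/Info 4300 jobs should request no more than 8 nodes
SCRIPT=runIndexer

# Write the script file: remove any old output directory, then run the job.
cat > "$SCRIPT" <<'EOF'
hadoop dfs -rmr testsmall
hadoop jar Indexer.jar setup.Indexer docssmall testsmall
EOF

# Assemble the Hadoop on Demand command; on the cluster, run the printed line.
HOD_CMD="hod script -d /hdfs/${NETID} -n ${NODES} -s ${SCRIPT}"
echo "$HOD_CMD"
```

Keeping the node count in one variable makes it easy to check that a submission stays within the course limit before it is sent to the cluster.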




William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 12, 2008