Running a MapReduce Job on the Hadoop Cluster
This page shows you how to take a JAR file, along with its test data, and run the job on the CAC cluster using SSH.
You will need to run a Bash shell on your local computer. On a Macintosh, use the Terminal application; on a Windows computer, use the Cygwin Bash Shell.
- Obtain a CAC account
If you are taking a course which requires the use of the cluster, the instructor should organize the CAC account for you. If you are using the cluster for research, the Principal Investigator will add you to their CAC project. In either case, you will receive an email to your Cornell email address with your username and password for the CAC. Then, you will need to log in and change your password to something secure and easy to remember. The easiest way to do this is via Remote Desktop Connection, but you can also use SSH.
- Use SSH to connect to the job tracker node
Run the following from the Bash Shell command line. See instructions above for opening the Bash Shell if you have closed it.
ssh netid@wl01.cac.cornell.edu
where netid is replaced by your CAC username. Note: the address starts with doubleu-el-zero-one (wl01), NOT doubleu-zero-one-zero. Enter your CAC password when prompted. Once you are logged in over SSH, you will be placed in your CAC home directory, the same directory that you previously copied files into. You can run "ls" to list the files and ensure that your files were copied over successfully.
- Copy JAR and input files into HDFS
Make a directory in the Hadoop Distributed File System (dfs) for your input files. You can see the list of commands available for working on the dfs by executing the following:
/usr/local/hadoop/bin/hadoop dfs
More information about the commands is available in the Hadoop documentation. Note that to execute any hadoop dfs command, you must type /usr/local/hadoop/bin/hadoop dfs -command, where command is the dfs command to run.
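Because every dfs command needs the same long prefix, a small shell function can save typing. The following is a minimal sketch, not part of the cluster setup: the function name hdfs_cmd and the HADOOP variable are hypothetical names chosen for illustration, and the launcher path is the one used on this page.

```shell
# Path to the hadoop launcher used on this page (override by setting HADOOP).
HADOOP="${HADOOP:-/usr/local/hadoop/bin/hadoop}"

# hdfs_cmd (hypothetical helper): runs "hadoop dfs -<command> <args...>".
# The first argument is the dfs command name without the leading dash.
hdfs_cmd() {
    local command="$1"
    shift
    "$HADOOP" dfs "-$command" "$@"
}

# Example usage once logged in to the cluster:
#   hdfs_cmd ls
#   hdfs_cmd mkdir input
```

With this function defined in your shell session, hdfs_cmd ls is equivalent to typing /usr/local/hadoop/bin/hadoop dfs -ls.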
To copy input data files into dfs from your home directory, do the following:
/usr/local/hadoop/bin/hadoop dfs -copyFromLocal input input
- Run your job
Perform the following:
/usr/local/hadoop/bin/hadoop jar WordCount.jar WordCount input output
This will place the result files in a directory called "output" in the dfs. You can then copy these files back to your CAC home directory by executing the following:
/usr/local/hadoop/bin/hadoop dfs -copyToLocal output output
Now you can retrieve the output files in the same fashion that you copied the input files to your home directory. Note that one output file is produced for each reduce task that runs. The WordCount example uses the system-configured default for the number of reduce tasks, so do not be surprised to see 10-20 output files (the exact number depends on the number of cluster nodes running and their configuration). You can control this number programmatically via the setNumReduceTasks() method of the JobConf class in the Hadoop API. Refer to the MapReduce tutorial for more details on running MapReduce jobs.
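If you would rather work with a single result file, the copied part files can be concatenated locally. The sketch below assumes the reducer output files follow the usual part-00000, part-00001, ... naming; the function name merge_parts and the file name wordcount.txt are hypothetical and used only for illustration.

```shell
# merge_parts (hypothetical helper): concatenates the part files found in a
# local output directory into a single file.
#   $1 = local directory produced by -copyToLocal
#   $2 = path of the merged file to write (must be outside that directory)
merge_parts() {
    local out_dir="$1"
    local merged="$2"
    # Assumption: output files are named part-00000, part-00001, ...
    # The shell expands the glob in sorted order, so parts are joined in order.
    cat "$out_dir"/part-* > "$merged"
}

# Example usage after copying the results to your home directory:
#   merge_parts output wordcount.txt
```

Each part file holds the results for a different set of keys, so the merged file is a plain concatenation and is not sorted across parts.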
When you are finished with the output files, you should delete the output directory in the dfs. Hadoop will not do this for you automatically, and the job will fail with an error if you run it while an old output directory exists. To delete the directory, execute:
/usr/local/hadoop/bin/hadoop dfs -rmr output
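Since a leftover output directory causes the next run to fail, the cleanup and the job can be combined in one step. This is a minimal sketch under stated assumptions: run_wordcount, HADOOP and DRY_RUN are hypothetical names introduced here, and the JAR and class names are the WordCount ones used on this page.

```shell
# Path to the hadoop launcher used on this page (override by setting HADOOP).
HADOOP="${HADOOP:-/usr/local/hadoop/bin/hadoop}"

# run_wordcount (hypothetical helper): removes any old output directory in
# the dfs, then runs the WordCount job.
#   $1 = input directory in the dfs
#   $2 = output directory in the dfs
# With DRY_RUN=1 the commands are printed instead of executed.
run_wordcount() {
    local input_dir="$1"
    local output_dir="$2"
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    # If the directory does not exist, -rmr may report an error;
    # "|| true" lets the function continue to the job itself.
    $run "$HADOOP" dfs -rmr "$output_dir" || true
    $run "$HADOOP" jar WordCount.jar WordCount "$input_dir" "$output_dir"
}

# Example usage on the cluster:
#   run_wordcount input output
```

Running it first with DRY_RUN=1 prints the two commands so you can check them before anything is deleted.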
Last revised: October 20, 2008