I provisioned a simple EMR cluster and wanted to run my own WordCount job on it. It took a few tries, so here are the lessons learnt:
When you ‘Add Step’ to run a job on the new cluster, the key properties are as follows:
JAR location: s3n://poobar/wordcount.jar
Arguments: org.adrian.WordCount s3n://poobar/hamlet111.txt hdfs://10.32.43.156:9000/poobar/out
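The same step can also be added from the AWS CLI rather than the console – a sketch, where j-XXXXXXXXXXXXX is a placeholder for your actual cluster id:

```shell
# Sketch: add the WordCount step via the AWS CLI.
# j-XXXXXXXXXXXXX is a placeholder for the real cluster id.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps "Type=CUSTOM_JAR,Name=WordCount,ActionOnFailure=CONTINUE,Jar=s3n://poobar/wordcount.jar,Args=[org.adrian.WordCount,s3n://poobar/hamlet111.txt,hdfs://10.32.43.156:9000/poobar/out]"
```

The Args list mirrors the console's Arguments field exactly: main class first, then the job's own arguments.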
So my uberjar with the MapReduce job has been uploaded to S3, in a top-level bucket called ‘poobar’.
There are three arguments in the step.
The first is the main class – on EMR this is always the first argument, and EMR consumes it itself rather than passing it on.
The remaining arguments get passed into the job’s
public static void main(String[] args), i.e. the main method.
The job uses args[0] for the input file and args[1] for the output folder, which is fairly standard.
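To make the indexing concrete, here is a minimal runnable sketch (ArgDemo is a made-up name, not part of the real job) of what the main method sees once EMR has consumed the main-class argument:

```java
// Minimal sketch of how the job sees its arguments on EMR.
// EMR consumes the main-class argument itself, so main()
// receives only the remaining ones: input first, then output.
public class ArgDemo {

    // Pull the input and output locations out of the argument array.
    static String describe(String[] args) {
        return "input=" + args[0] + ", output=" + args[1];
    }

    public static void main(String[] args) {
        // Fall back to the values from this post when run with no args.
        if (args.length < 2) {
            args = new String[] {
                "s3n://poobar/hamlet111.txt",
                "hdfs://10.32.43.156:9000/poobar/out"
            };
        }
        System.out.println(describe(args));
    }
}
```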
The input file, like the jar itself, has already been uploaded by me to S3.
The output folder has to be addressed using the HDFS protocol – hdfs:// followed by the namenode’s address, with a folder location relative to the HDFS root, seems to do the trick.
So with my setup, everything lives under ‘poobar’ – the S3 bucket holds the jar and the input, and a matching /poobar folder on HDFS receives the output.
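For completeness, this is how a typical WordCount driver wires those two arguments into the job via FileInputFormat and FileOutputFormat – a sketch based on the standard Hadoop WordCount example, not my exact org.adrian.WordCount, and it needs the hadoop-client libraries on the classpath to compile:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0]: the s3n:// input file; args[1]: the hdfs:// output folder.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that Hadoop will refuse to start the job if the output folder already exists, so each run needs a fresh output path.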