Deploying a JAR to Amazon EMR

I provisioned a simple EMR cluster and wanted to run my own WordCount on it. It took a few tries, so here are the lessons learnt:

When you ‘Add Step’ to run a job on the new cluster, the key properties are as follows:

JAR location: s3n://poobar/wordcount.jar
Arguments: org.adrian.WordCount s3n://poobar/hamlet111.txt hdfs://10.32.43.156:9000/poobar/out

So my uberjar with the MapReduce job has been uploaded to S3, in a top-level bucket called ‘poobar’.
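I added the step through the EMR console, but the same thing can be done programmatically. Below is a rough sketch using the AWS SDK for Java's EMR client; the cluster ID and the step name are placeholders of my own, not values taken from the console:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class AddWordCountStep {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // Same JAR and arguments as the 'Add Step' form above: the main class
            // comes first, then the input file and the output folder.
            HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                    .withJar("s3n://poobar/wordcount.jar")
                    .withArgs("org.adrian.WordCount",
                              "s3n://poobar/hamlet111.txt",
                              "hdfs://10.32.43.156:9000/poobar/out");

            StepConfig step = new StepConfig()
                    .withName("WordCount")               // placeholder step name
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(jarStep);

            // "j-XXXXXXXXXXXXX" is a placeholder for the real cluster (job flow) ID.
            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")
                    .withSteps(step));
        }
    }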

There are three arguments to the job.

The first is the main class; on EMR this is always the first argument.
The following arguments get passed into the job’s public static void main(String[] args), i.e. the main method.
The job uses args[0] for the input file and args[1] for the output folder, which is fairly standard.
The input file has already been uploaded to S3, like the JAR itself.
The output folder has to be addressed using the HDFS protocol; a relative folder location seems to do the trick.
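For reference, a WordCount job along these lines looks roughly like the sketch below. It follows the shape of the standard Hadoop example (the newer org.apache.hadoop.mapreduce API); the package and class names are my guesses to match the arguments above, not necessarily what is inside wordcount.jar. The main method wires args[0] and args[1] straight into the input and output paths:

    package org.adrian;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in a line of input.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer (also used as combiner): sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // args[0] = input file (e.g. s3n://poobar/hamlet111.txt)
            // args[1] = output folder (e.g. hdfs://.../poobar/out), which must not already exist
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }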

So with my setup, everything ends up in the ‘poobar’ bucket.
