Ant, Sonar and Jacoco Working Example

Much to my chagrin, legacy code exists. Sometimes, more often than I would like, it is built using an Ant script. The following is an anonymized example of a simple Ant project with unit tests running a Sonar target.

The project structure is simple:

/src
/test
build.xml
ivy.xml

Classpaths (dependencies) are managed through Ivy. The script will download and install Ivy under your home folder if required. Also, regarding Ivy:

  • use Ant’s <setproxy> if the process hangs while attempting to download Ivy or any of the dependencies (a sample is shown just below this list)
  • add an Ivy plugin to your favourite IDE if you are managing your code there
  • specify your local binary artifact repo, if you have one, using ivysettings.xml
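
If the download does hang, a minimal <setproxy> call looks like the following sketch (the host and port are placeholders for your own proxy, and it must run in a target before the bootstrap/resolve targets):

<setproxy proxyhost="proxy.example.com" proxyport="8080"/>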

If you don’t want to use Ivy, just replace the classpathref values below with whatever you use for your classpaths.

The most difficult part was getting Sonar to pick up the unit test metrics. The following message can generally be ignored; it shows up even when all the unit test analysis has succeeded, and I wasted a fair amount of time googling it:

No information about coverage per test.

The integration between jacoco and the Sonar plugin is not well documented and is pretty tricksy, but here are the essential points:

  • jacoco:coverage generates coverage metrics and stores them in a binary file specified by the destfile attribute (typically called ‘jacoco.exec’). The Sonar plugin then looks for this file using the property sonar.jacoco.reportPath.
  • jacoco:coverage writes test execution data (total number of tests, execution times etc.) NOT to the jacoco.exec binary but to separate XML files. It uses the same format as Maven’s Surefire plugin, which ensures they will work with Sonar. The files are created ONLY if you include the node <formatter type="xml"/>. They are written to the folder specified by the batchtest ‘todir’ attribute. The Sonar plugin looks for them using the property sonar.junit.reportsPath.
  • jacoco:report generates coverage HTML and XML reports, but these are not actually used by the Sonar plugin. In the example below, I use this task to generate an HTML report but, to repeat, this is not needed by the Sonar plugin.

With all that said, here is the full sample build.xml:

<project basedir="." xmlns:sonar="antlib:org.sonar.ant" xmlns:ivy="antlib:org.apache.ivy.ant">

	<property name="src.dir" value="src" />
	<property name="test.src.dir" value="test" />
	<property name="build.dir" value="build" />
	<property name="classes.dir" value="${build.dir}/classes" />
	<property name="test.classes.dir" value="${build.dir}/test-classes" />
	<property name="reports.dir" value="${build.dir}/reports" />

	<!-- Sonar connection details here -->

	<property name="sonar.projectKey" value="org.adrian:hello" />
	<property name="sonar.projectVersion" value="0.0.1-SNAPSHOT" />
	<property name="sonar.projectName" value="Adrian Hello" />

	<property name="sonar.sources" value="${src.dir}" />
	<property name="sonar.binaries" value="${classes.dir}" />
	<property name="sonar.tests" value="${test.src.dir}" />

	<property name="sonar.junit.reportsPath" value="${reports.dir}/junit" />
	<property name="sonar.dynamicAnalysis" value="reuseReports" />
	<property name="sonar.java.coveragePlugin" value="jacoco" />
	<property name="sonar.jacoco.reportPath" value="${build.dir}/jacoco.exec" />

	<target name="clean" description="Cleanup build files">
		<delete dir="${build.dir}"/>
	</target>

	<target name="ivy-check">
		<available file="${user.home}/.ant/lib/ivy.jar" property="ivy.isInstalled"/>
	</target>

	<target name="bootstrap" description="Install ivy" depends="ivy-check" unless="ivy.isInstalled">
		<mkdir dir="${user.home}/.ant/lib"/>
		<get dest="${user.home}/.ant/lib/ivy.jar" src="http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.3.0/ivy-2.3.0.jar"/>
	</target>

	<target name="resolve" depends="bootstrap" description="Download dependencies and setup classpaths">
		<ivy:resolve/>
		<ivy:report todir='${reports.dir}/ivy' graph='false' xml='false'/>

		<ivy:cachepath pathid="compile.path" conf="compile"/>
		<ivy:cachepath pathid="test.path"    conf="test"/>
		<ivy:cachepath pathid="build.path"   conf="build"/>
	</target>

	<target name="init" depends="resolve" description="Create build directories">
		<mkdir dir="${classes.dir}"/>
		<mkdir dir="${test.classes.dir}"/>
		<mkdir dir="${reports.dir}/"/>
		<mkdir dir="${reports.dir}/junit"/>
	</target>

	<target name="compile" depends="init" description="Compile source code">
		<javac srcdir="${src.dir}" destdir="${classes.dir}" includeantruntime="false" debug="true" classpathref="compile.path"/>
	</target>

	<target name="compile-tests" depends="compile" description="Compile test source code">
		<javac srcdir="${test.src.dir}" destdir="${test.classes.dir}" includeantruntime="false" debug="true">
			<classpath>
				<path refid="test.path"/>
				<pathelement path="${classes.dir}"/>
			</classpath>
		</javac>
	</target>
		
	<target name="junit" depends="compile-tests" description="Run unit tests and code coverage reporting">
		<taskdef uri="antlib:org.jacoco.ant" resource="org/jacoco/ant/antlib.xml" classpathref="build.path"/>
		<jacoco:coverage destfile="${build.dir}/jacoco.exec" xmlns:jacoco="antlib:org.jacoco.ant">
			<junit haltonfailure="no" fork="true" forkmode="once">
				<classpath>
					<path refid="test.path"/>
					<pathelement path="${classes.dir}"/>
					<pathelement path="${test.classes.dir}"/>
				</classpath>
				<formatter type="xml"/>
				<batchtest todir="${reports.dir}/junit">
					<fileset dir="${test.src.dir}">
						<include name="**/*Test*.java"/>
					</fileset>
				</batchtest>
			</junit>
		</jacoco:coverage>
	</target>
		
	<target name="test-report" depends="junit"> 
		<taskdef uri="antlib:org.jacoco.ant" resource="org/jacoco/ant/antlib.xml"  classpathref="build.path"/>
		<jacoco:report xmlns:jacoco="antlib:org.jacoco.ant">
			<executiondata>
				<file file="${build.dir}/jacoco.exec" />
			</executiondata>

			<structure name="JaCoCo Ant Example">
				<classfiles>
					<fileset dir="${classes.dir}" />
				</classfiles>
				<sourcefiles encoding="UTF-8">
					<fileset dir="${src.dir}" />
				</sourcefiles>
			</structure>
			
			<html destdir="${reports.dir}" />
		</jacoco:report>
	</target>

	<target name="sonar" depends="test-report" description="Upload metrics to Sonar">
		<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml" classpathref="build.path"/>

		<ivy:cachepath pathid="sonar.libraries" conf="compile"/>

		<sonar:sonar xmlns:sonar="antlib:org.sonar.ant"/>
	</target>

	<target name="clean-all" depends="clean" description="Additionally purge ivy cache">
		<ivy:cleancache/>
	</target>
</project>

Here is the ivy.xml file, specifying compilation, test (i.e. junit) and ‘build’ (i.e. jacoco and sonar) dependencies:

<ivy-module version="2.0">
    <info organisation="org.adrian" module="demo"/>

    <configurations defaultconfmapping="compile->default">
        <conf name="compile" description="Required to compile application"/>
        <conf name="test"    description="Required for test only" extends="compile"/>
        <conf name="build"   description="Build dependencies"/>
    </configurations>

    <dependencies>
        <!-- compile dependencies -->
        <!-- set up your classpath here -->

        <!-- test dependencies -->
        <dependency org="junit" name="junit" rev="4.11" conf="test->default"/>

        <!-- build dependencies -->
        <dependency org="org.codehaus.sonar-plugins" name="sonar-ant-task" rev="2.2" conf="build->default"/>
        <dependency org="org.jacoco" name="org.jacoco.ant" rev="0.7.2.201409121644" conf="build->default"/> 

        <!-- Global exclusions -->
        <exclude org="org.apache.ant"/>
    </dependencies>
</ivy-module>

In production code I would move most of the Ant properties to an external build.properties file, but here I have bundled it all together.

Option and getOrElse in Scala

An Option can be thought of as a container (or proxy) for a value that may be absent, used to prevent null pointer exceptions.

The paradigmatic case is returning an Option type from function calls, instantiated by either a Some or None subclass:

def toInt(in: String): Option[Int] = {
    try {
        Some(Integer.parseInt(in.trim))
    } catch {
        case e: NumberFormatException => None
    }
}

Some or None can be matched to functions by consumers:

toInt(someString) match {
    case Some(i) => println(i)
    case None => println("That didn't work.")
}

Say we have this type:

case class Person(age: Int)

In the case where it has a value:

val person = Option(Person(100))   // Some(Person(100)), i.e. the subclass Some
val age = person.map(_.age + 5)    // Some(105)
age.getOrElse(0)                   // 105

In the case where it has no value:

val person: Option[Person] = Option(null)   // None
val age = person.map(_.age + 5)             // None
age.getOrElse(0)                            // 0

You can see how getOrElse specifically requires the developer to supply a value for the null case.
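
Putting the pieces together, map and getOrElse are typically chained in a single expression. A minimal sketch (ageIn5Years is just an illustrative name):

def ageIn5Years(p: Option[Person]): Int = p.map(_.age + 5).getOrElse(0)

ageIn5Years(Option(Person(100)))   // 105
ageIn5Years(Option(null))          // 0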

Refactoring – towards a better definition

According to SAFe, refactoring is

“modifying an entity, such as a module, method or program in such a way that its external appearance, i.e. its functionality, remains unchanged.”

Refactoring, the author says, may increase speed or address security concerns. According to the preface of Martin Fowler’s book ‘Refactoring’, refactoring is

“changing a software system in such a way that it does not alter the external behaviour of the code yet improves the internal structure. It is a disciplined way to clean up code that minimizes the chances of introducing bugs. In essence, when you refactor you are improving the design of the code after it has been written.”

Both agree that refactoring leaves the system ‘externally’ unchanged. Both agree it makes no sense to say, as I have heard developers say in the past, ‘I want to refactor the code to add this new feature for the end user.’ That is not a refactoring; it is a new feature.

I have a question about this, though. Are performance improvements, such as adding internal caching or adding an index, examples of refactoring? SAFe thinks so. But is a performance improvement not a change for the end user? Moreover, you can make a horrible mess in the code when you implement a performance improvement. Is it helpful to call that a refactoring? Do we want to be able to say ‘I refactored the IdLookup component to make it faster, but I made a mess; we need to refactor it again to make it cleaner’? Performance improvements are non-functional improvements or, better, quality improvements that affect external users. That is not exactly refactoring, which has a focus on the ‘internal’.

What about moving configuration from code to an external config file, to improve maintainability or flexibility? This is another example of refactoring given by SAFe. I can see how, from a certain point of view, this could be thought of as an ‘internal’ change. But some ‘external’ stakeholders will care greatly about it. The time it takes to deploy a configuration change can be greatly reduced by externalizing a hard-coded value, since the code does not have to be recompiled and rebuilt. Build and package enhancements are quality improvements that affect external stakeholders such as testing and operations; strictly speaking, that is not refactoring.

When we do TDD, we say we ‘refactor’ after each passing test. What does this really mean? At its simplest, it typically involves removing duplication, reducing complexity, applying certain patterns, fixing style violations, and so forth. Note that these are quality attributes of the system that are generally measurable with static analysis and code review. And there is a stakeholder who cares about these qualities: the dev team as a whole, the team lead, or, better, QA. These improvements are about code quality and technical debt, and they primarily affect the internal development team. These are true refactorings.
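
As a trivial illustration (my own example, not SAFe’s or Fowler’s), here is a duplication-removing refactor in Scala that leaves external behaviour untouched. Before:

def totalWithTax(prices: List[Double]): Double = prices.map(p => p + p * 0.2).sum
def maxWithTax(prices: List[Double]): Double = prices.map(p => p + p * 0.2).max

After, with the duplicated tax rule extracted into one place; callers see identical results:

def withTax(p: Double): Double = p + p * 0.2
def totalWithTax(prices: List[Double]): Double = prices.map(withTax).sum
def maxWithTax(prices: List[Double]): Double = prices.map(withTax).max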

Clojure Multimethods by Example

Essentially, multimethods are similar to switch/case logic or, perhaps more closely, to the strategy pattern. defmulti will select a defmethod, to which it will dispatch your call.

From the popular clojure-koans, we have this example:

(defmulti diet (fn [x] (:eater x)))
(defmethod diet :herbivore [a] (str (:name a) " eats veggies."))
(defmethod diet :carnivore [a] (str (:name a) " eats animals."))
(defmethod diet :default [a] (str "I don't know what " (:name a) " eats."))

The defmulti ‘diet’ will select the appropriate defmethod from among the three options by evaluating (fn [x] (:eater x)) for x.

We can see from (:eater x) that x is expected to be a map with an :eater key. The value of that key will determine which defmethod to call.

So if x contains a value of :herbivore for :eater, then when we call diet, (str (:name a) " eats veggies.") will be evaluated.

(diet {:name "Bambi"  :eater :herbivore})
>= "Bambi eats veggies."

Likewise, if x contains a value of :carnivore for :eater, then when we call diet, (str (:name a) " eats animals.") will be evaluated.

(diet {:name "Simba&amp;quot; :eater :carnivore})
>= "Simba eats animals."

And the default defmethod will get executed when there is no other match.
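
For instance, with an :eater value that no defmethod handles (:omnivore here is arbitrary):

(diet {:name "Rex" :eater :omnivore})
=> "I don't know what Rex eats."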

That is all.

Deploying a JAR to Amazon EMR

I provisioned a simple EMR cluster and wanted to run my own WordCount on it. It took a few tries, so here are the lessons learnt:

When you ‘Add Step’ to run a job on the new cluster, the key properties are as follows:

JAR location: s3n://poobar/wordcount.jar
Arguments: org.adrian.WordCount s3n://poobar/hamlet111.txt hdfs://10.32.43.156:9000/poobar/out

So my uberjar with the MapReduce job has been uploaded to S3, in a top-level bucket called ‘poobar’.

There are three arguments to the job:

  • The first is the main class – this is always the first argument on EMR.
  • The following args get passed into the job’s public static void main(String[] args), i.e. the main method.
  • The job uses args[0] for the input file and args[1] for the output folder, which is fairly standard (see the sketch after this list).
  • The input file has already been uploaded by me to S3, like the JAR itself.
  • The output folder has to be addressed using the HDFS protocol – a relative folder location seems to do the trick.
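
For reference, here is a minimal sketch of the kind of driver main method this assumes. The class name and the job wiring are illustrative, and the mapper/reducer setup is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        // mapper and reducer classes would be configured here
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. s3n://poobar/hamlet111.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. hdfs://10.32.43.156:9000/poobar/out
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}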

So with my setup, everything ends up in the ‘poobar’ bucket.

Python Streaming vs Clojure-Hadoop for MapReduce Implementations

You have a number of options these days for coding MR jobs. Broadly, these fall into two categories:

  • Custom Jar
    • Hadoop allows programs packaged as Java JAR files to run as Mappers and Reducers
    • Any language which can be compiled into a JAR and can interop with the Hadoop API can be used to implement MR jobs (such as Clojure, Scala, Groovy etc.)
  • Streaming
    • Create MR jobs in non-JVM languages such as Python or C++, with no custom JAR.
    • These are programs which read from stdin and write to stdout, and are managed by Hadoop as Mappers and Reducers.

Java is not good at this. The code is horrible. I wanted to compare the syntax of a couple of alternatives: Python Streaming, and a custom JAR written in Clojure.

1. Python Streaming

Wordcount is implemented here as two Python scripts passed to Hadoop’s streaming JAR:

Execute with:

hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input  hdfs-input-file  \
-output hdfs-output-folder

MR implementation:

mapper.py:

import sys

# output tuples [word, 1] in tab-delimited format to stdout
for line in sys.stdin:
  for word in line.split():
    print '%s\t%s' % (word, "1")

reducer.py

import sys

words = {}

# transform to a map, accumulating a count for each word key
for line in sys.stdin:
  word, count = line.split('\t', 1)
  words[word] = words.get(word, 0) + int(count)

# write the tuples to stdout
for word in words.keys():
  print '%s\t%s' % (word, words[word])

Note that because the input to the reducer is just sys.stdin, logic is required to transform the lines into a format that can be readily reduced – here, a map (‘words’). In custom JARs, this is done by the Hadoop framework itself.

One nice feature of streaming is that it is very easy to run this outside of Hadoop, using regular text files for the IO streams.
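
For example, something like this runs the whole pipeline locally (the sort mimics Hadoop’s shuffle phase, though this particular dict-based reducer does not strictly need it):

cat input.txt | python mapper.py | sort | python reducer.py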

2. Pydoop

With pydoop, we can simplify this since we don’t deal directly with stdin/out streams:

def mapper(_, line, writer):
  for word in line.split():
    writer.emit(word, "1")

def reducer(word, icounts, writer):
  writer.emit(word, sum(map(int, icounts)))

Notice the reducer is simpler than in streaming, since icounts arrives in the form [1 1 1 1 1 1 1], which needs no transformation before reduce functions can be applied to it.

Execute with:

pydoop script wordcount.py hdfs-input-file hdfs-output-file

3. Clojure-Hadoop

This Clojure library wraps the Hadoop Java API so that Clojure-Java interop is hidden from the developer. You can pass Clojure functions directly to Hadoop as Mappers and Reducers.

Execute with:

java -cp mycustomjar.jar clojure_hadoop.job \
-input hdfs-input-file \
-output  hdfs-output-folder \
-map clojure-hadoop.examples.wordcount/my-map \
-map-reader clojure-hadoop.wrap/int-string-map-reader \
-reduce clojure-hadoop.examples.wordcount/my-reduce \
-input-format text

MR implementation:

(ns clojure-hadoop.examples.wordcount
  (:import (java.util StringTokenizer)))

(defn my-map [key value]
  (map (fn [token] [token 1]) (enumeration-seq (StringTokenizer. value))))

(defn my-reduce [key values-fn]
  [[key (reduce + (values-fn))]])

my-map returns a vector ["word" 1] for each token in the parameter ‘value’.
my-reduce returns a vector ["word" n] where n is the sum of occurrences, i.e. the wordcount.

Note that the entirety of Clojure and this library must be packaged into an uberjar for this to run on Hadoop.

Sonar Unit and Integration Test Coverage with Maven

There are lots of posts on the web about this; few seem to work. Here is my config, which works.

First, configuration of the Maven Sonar plugin (these go in the pom’s <properties> section):

<sonar.dynamicAnalysis>reuseReports</sonar.dynamicAnalysis>  <!-- Ensure you run mvn install before sonar:sonar -->
<sonar.java.codeCoveragePlugin>jacoco</sonar.java.codeCoveragePlugin>
<sonar.surefire.reportsPath>target/surefire-reports</sonar.surefire.reportsPath>
<sonar.jacoco.reportPath>target/jacoco.exec</sonar.jacoco.reportPath>    <!-- This is the default, put here to be explicit -->
<sonar.jacoco.itReportPath>target/jacoco-it.exec</sonar.jacoco.itReportPath>

Next, the Jacoco plugin. Here we use the default output file for unit tests, and a separate output file for integration tests:

<plugin>
    <groupId>org.jacoco</groupId>
    <artifactId>jacoco-maven-plugin</artifactId>
    <version>0.6.2.201302030002</version>
    <executions>
        <execution>
            <id>pre-unit-test</id>
            <goals>
                <goal>prepare-agent</goal>
            </goals>
        </execution>
        <execution>
            <id>post-unit-test</id>
            <phase>test</phase>
            <goals>
                <goal>report</goal>
            </goals>
        </execution>
        <execution>
            <id>pre-integration-test</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>prepare-agent</goal>
            </goals>
            <configuration>
                <destFile>target/jacoco-it.exec</destFile>
                <propertyName>failsafe.argLine</propertyName>
            </configuration>
        </execution>
        <execution>
            <id>post-integration-test</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>report</goal>
            </goals>
            <configuration>
                <dataFile>target/jacoco-it.exec</dataFile>
            </configuration>
        </execution>
    </executions>
</plugin>

Finally, the failsafe plugin. This ensures tests will be instrumented during the integration test phase, and the results collected during the verify phase. The reference to argLine is critical, because this causes failsafe to write to the correct Jacoco output file.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-failsafe-plugin</artifactId>
    <version>2.14</version>
    <configuration>
        <argLine>${failsafe.argLine}</argLine>
    </configuration>
    <executions>
        <execution>
            <id>integration-test</id>
            <goals>
                <goal>integration-test</goal>
            </goals>
        </execution>
        <execution>
            <id>verify</id>
            <goals>
                <goal>verify</goal>
            </goals>
        </execution>
    </executions>
</plugin>
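
With all of this in place, and remembering the comment above about running the install first, the analysis is kicked off along these lines:

mvn clean install sonar:sonar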