Thoughts on Systems

Emil Sit

Sep 2, 2012 - 4 minute read - gradle hadoop cloudera maven

Developing Cloudera Applications with Gradle and Eclipse

This post is a Gradle translation/knock-off of Cloudera’s post on developing CDH applications with Maven and Eclipse. It should help you get started using Gradle with Cloudera’s Hadoop distribution. Hadapt makes significant use of Gradle for exactly this purpose.

Gradle is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Gradle is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Gradle project that will be able to build applications against CDH (Cloudera’s Distribution including Apache Hadoop) binaries.

Gradle projects are defined using a file called build.gradle, which describes things like the project’s dependencies on other modules, the build order, and any other plugins that the project uses. The complete build.gradle described below, which can be used with CDH, is available as a gist. Gradle’s build files are short and simple, combining the power of Apache Maven’s convention-over-configuration approach with the ability to customize that convention easily (and in enterprise-friendly ways).

The most basic Java project can be compiled with a simple build.gradle that contains the one line:

apply plugin: "java"

While optional, it is helpful to start off your build.gradle by declaring project metadata as well:

// Set up group and version info for the artifact
group = "com.mycompany.hadoopproject"
version = "1.0"

Since we want to use this project for Hadoop development, we need to add some dependencies on the Hadoop libraries. Gradle resolves dependencies by downloading jar files from remote repositories. This must be configured, so we add both the Maven Central Repository (that contains useful things like JUnit) and the CDH repository. This is done in the build.gradle like this:

repositories {
    // Standard Maven Central repository
    mavenCentral()
    maven {
        url "https://repository.cloudera.com/artifactory/cloudera-repos/"
    }
}

The Cloudera repository enables us to add a Hadoop dependency in the dependencies section, while Maven Central provides the JUnit dependency.

dependencies {
    compile "org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.0.1"
    testCompile "junit:junit:4.8.2"
}

A project with the above dependency would compile against the CDH4 MapReduce v1 library. Cloudera provides a list of Maven artifacts included in CDH4 for finding HBase and other components.
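For example, to also compile against HBase, you could add one more line to the dependencies block. This is a sketch: the HBase coordinates and version below are illustrative, and should be confirmed against Cloudera’s artifact list for your CDH release.

dependencies {
    compile "org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.0.1"
    // Hypothetical HBase coordinates; confirm against Cloudera's artifact list
    compile "org.apache.hbase:hbase:0.92.1-cdh4.0.1"
    testCompile "junit:junit:4.8.2"
}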

Since Hadoop requires at least Java 1.6, we should also specify the compiler source and target versions for Gradle:

// Java version selection
sourceCompatibility = 1.6
targetCompatibility = 1.6

This gets us to a fully functional project, and we can build a jar by running gradle build.
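Gradle follows Maven’s standard directory layout by convention, so it expects sources and tests in locations like these (the class names here are hypothetical):

src/main/java/com/mycompany/hadoopproject/MyMapper.java
src/test/java/com/mycompany/hadoopproject/MyMapperTest.java

After gradle build, the resulting jar lands in build/libs/.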

In practice, it’s good to declare the version string as a property, since you are likely to depend on more than one artifact with the same version.

ext.hadoopVersion = "2.0.0-mr1-cdh4.0.1"
dependencies {
    compile "org.apache.hadoop:hadoop-client:${hadoopVersion}"
    testCompile "junit:junit:4.8.2"
}

Now, whenever we want to upgrade our code to a new CDH version, we only need to change the version string in one place.

Note that the configuration here produces a jar that does not contain the project dependencies within it. This is fine, so long as we only require Hadoop dependencies, since the Hadoop daemons will include all the Hadoop libraries in their own classpaths. If the Hadoop dependencies are not sufficient, it will be necessary to package the other dependencies into the jar. We can configure Gradle to package a jar with dependencies included by adding the following block:

// Emulate Maven shade plugin with a fat jar.
// http://docs.codehaus.org/display/GRADLE/Cookbook#Cookbook-Creatingafatjar
jar {
    from configurations.compile.collect { it.isDirectory() ? it : zipTree(it) }
}

Unfortunately, the jar now contains all the Hadoop libraries, which would conflict with the Hadoop daemons’ classpaths. We can indicate to Gradle that certain dependencies need to be downloaded for compilation but will be provided to the application at runtime, by defining a custom provided configuration and moving the Hadoop dependencies into it. The build file then looks like this, with an added dependency on Guava:

// Provided configuration as suggested in GRADLE-784
configurations {
    provided
}
sourceSets {
    main {
        compileClasspath += configurations.provided
    }
}

ext.hadoopVersion = "2.0.0-mr1-cdh4.0.1"
dependencies {
    provided "org.apache.hadoop:hadoop-client:${hadoopVersion}"

    compile "com.google.guava:guava:11.0.2"

    testCompile "junit:junit:4.8.2"
}
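Note that the sourceSets block above only adds the provided configuration to the main compile classpath. If your unit tests also exercise the Hadoop APIs, the test source set needs similar wiring; a sketch, under the same GRADLE-784 approach:

sourceSets {
    test {
        // Compile and run tests against the provided (Hadoop) jars too
        compileClasspath += configurations.provided
        runtimeClasspath += configurations.provided
    }
}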

Gradle also integrates with a number of IDEs, such as Eclipse and IntelliJ IDEA. The default integrations can be enabled by adding

apply plugin: "eclipse"
apply plugin: "idea"

which add support for generating Eclipse .classpath and .project files and IntelliJ .iml files, respectively. The default build output locations may not be desirable, so we configure Eclipse as follows:

eclipse {
    // Ensure Eclipse build output appears in build directory
    classpath {
        defaultOutputDir = file("${buildDir}/eclipse-classes")
    }
}
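One wrinkle: the provided configuration described above is not part of Gradle’s standard dependency model, so the generated Eclipse classpath will not pick up those jars automatically. A sketch of one way to add them, assuming the GRADLE-784 style provided configuration and a 2012-era Gradle:

eclipse {
    classpath {
        // Expose the provided (Hadoop) jars to Eclipse as well
        plusConfigurations += configurations.provided
    }
}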

For Eclipse, simply run gradle eclipse and then import the project into Eclipse. As you update/add dependencies, re-run gradle eclipse to update the .classpath file and refresh in Eclipse. Gradle automatically handles generating a classpath, including linking to source jars.

Recent versions of IntelliJ IDEA and the SpringSource Tool Suite also support direct import of Gradle projects. When using that integration, the apply plugin lines are not necessary.

Gradle is a well-documented and powerful alternative to Maven. While not without its quirks, I am significantly happier maintaining an enterprise build in Gradle at Hadapt than I was maintaining the complex Maven build at VMware. Give it a try.