Thoughts on Systems

Emil Sit

How to Configure Linux Networking for EC2 AMIs

At Hadapt we provision a lot of EC2 instances for development and test purposes. This gives us some unhappily deep experience with error cases around EC2 provisioning (at least, in us-east, where we provision most of our nodes).

One problem we have seen is nodes transitioning into a running state but not becoming available to SSH. This can be detected programmatically (or on the EC2 dashboard) as a node that fails the reachability check. If you are looping trying to reconnect via SSH, it might look something like this:

ssh: connect to host 23.20.165.232 port 22: Connection timed out

looping forever, whereas a normally operating host would transition from timing out to connection refused (IP address acquired but before SSH starts) to normal operation:

ssh: connect to host 107.21.164.85 port 22: Connection timed out
ssh: connect to host 107.21.164.85 port 22: Connection refused
Warning: Permanently added '107.21.164.85' (RSA) to the list of known hosts.

We’ve learned, through working with EC2 support, that this can happen if the new instance misses the DHCP offer from the local DHCP server and thus never acquires its IP address.

The Amazon Linux AMI uses the following configuration for /etc/sysconfig/network-scripts/ifcfg-eth0 to avoid this problem:

DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
TYPE=Ethernet
USERCTL=yes
PEERDNS=yes
IPV6INIT=no
PERSISTENT_DHCLIENT=yes

Notably, it disables the use of NetworkManager (which may cause an interface to be disabled) and configures dhclient to be persistent and retry until it gets a lease. It also disables IPv6.

Neither the CentOS public AMIs nor the Baston cloud-init-enabled CentOS AMIs use this configuration. If you are using those frequently and seeing SSH timeout issues, you may wish to re-capture those images with this networking configuration. If you are constructing your own images based on CentOS (e.g., using Packer), it would probably be a good idea to use this to configure network interfaces in your AMI.
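
If you bake your own images (for example with Packer, as mentioned above), the fix amounts to writing that file before the image is captured. A minimal sketch of a shell provisioning step, run as root; the interface name eth0 and the sysconfig path are assumptions matching the Amazon Linux layout above:

cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
TYPE=Ethernet
USERCTL=yes
PEERDNS=yes
IPV6INIT=no
PERSISTENT_DHCLIENT=yes
EOF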

SCNA 2012 Summary

Software Craftsmanship North America is an annual conference bringing together software craftsmen—developers who are interested in improving their own ability to program. In his opening remarks at SCNA 2012, 8th Light co-founder Micah Martin described the conference as the “yearly attitude adjustment” for software craftsmen.

The speakers covered topics ranging from professional and product development, to engineering practices like testing and architecture, to theoretical CS concepts like monoids and logic programming. I have a complete-ish set of notes on my Flickr but here are some highlights.

Cory Foy talked about a model for teaching programmers (slides) that starts with work that has low context and low cognitive demand (such as katas and koans) and brings them up to work with high context and high cognitive demand (such as adding features and listening to code). This closely mirrors the 8th Light apprenticeship model. He also talked about how we need to learn to listen to the code and not try to force it to do things that it is not suited for; to listen requires understanding, to understand requires practice, and to practice requires context.

There were several discussions about apprenticeship. My sense is that 3 months is enough time to train people in basic craftsmanship suitable for basic web development (the equivalent of a semester, so maybe 4 courses’ worth). It obviously isn’t the ten thousand hours necessary to produce a master. The successes described also suggest that apprenticeship is not necessarily good at producing developers who can be hired by other companies. Of the 20 or so apprentices trained by apprentice.io (a program at Thoughtbot to try to commercialize apprenticeships), only one has actually been placed at an external company despite over a hundred companies interested in hiring out of the apprentice pool. On the other hand, they’ve hired about eight themselves. 8th Light has similarly grown much of its current 20+ craftsmen through its internal apprenticeship program.

8th Light has shared their internal syllabus for training craftsmen. Thoughtbot, the team behind apprentice.io, has also produced a set of basic trailmaps for learning basic techniques that the community can contribute to on GitHub.

I’m curious about adopting a more formal apprenticeship/mentoring program at places not primarily doing web app development, and in particular at systems-y companies like Hadapt (where time and money are limited) and VMware (where there is more existing training and resources are less scarce). Certainly, some of the basic skills and culture do need to be acquired, but so does the knowledge necessary to build a distributed query execution engine or a shadow page table walker.

Uncle Bob’s talk (video/summary) took a broader view. He argued that we need to behave professionally because, one day, some software glitch will result in lots of deaths (think Therac-25) and the world will demand an answer from the tech industry. If we don’t want government regulation, we had better behave professionally. As Uncle Bob put it, to be professional means that we do not ship shit. That we say no. That we work as a team. And that we learn continuously: Uncle Bob proposed upwards of 20 hours a week of learning on our own time.

There were many talks about aspects of testing. Michael Feathers gave a talk that questioned some of one’s assumptions about testing by focusing on the value delivered by tests. He talked about, for example, deleting tests—if they no longer provide value. The value of tests can come from many places: guiding the design of objects, detecting changes in behavior, acting as documentation, guiding acceptance criteria. The value of a test can change over time and we should not over-venerate any specific test. He argued that it is more appropriate to set a time budget for testing.

Gary Bernhardt gave a beautiful talk about mixing functional programming and object-oriented programming. He noted that mocks and stubs cause tests to become isolated from reality but that purely functional code does not require mocking: it always behaves the same way given the same inputs. Thus, he argued that code should be structured to have a functional core surrounded by a more imperative/OO shell that sequences the results with actions, a style he called “Faux-O”. By focusing on providing values (functional results), we free the computation from the execution model (for example, how Java Callables can be plugged into a variety of ExecutorServices).

Justin Searls took a different tack to testing, bridging Michael and Gary’s talks in a sense. His big picture observation is that different kinds of testing deliver different amounts of reality and we should choose tests that give us the amount of reality we need. (He has a nice taxonomy of tests on his blog.) One takeaway from his talk is that we should adopt a standard for what kind of testing we do and stick to it: he liked the GOOS style of using isolation tests to guide design and more end-to-end acceptance tests to prove functionality, but listed a few others.

Drilling down into more specific tools/techniques, Brian Marick gave a talk about generating data for tests using logic programming, using an example in Clojure. His goal was to ensure that he only says as much about the data used for a test as is absolutely necessary for the test and to allow other aspects of that data to vary; this can be achieved by writing a logic program to state the test’s requirements and allowing the runtime to solve for the right data. In fact, you could imagine automatically testing all valid values that the logic program generated, instead of just one (much like Guava’s Collections test suite builder does more imperatively). We have explored this idea for system-level testing at both VMware and Hadapt, where it would be useful for tests to declare their dependencies on the system (e.g., requires a system configured in a particular way) and have the test framework automatically satisfy those dependencies in some way that the test does not care about. Logic programming would provide a way to bind the resulting dependencies to variables that could be used by the test.

Susan Potter gave a talk about monoids at a very theoretical level, but they have a practical impact on code expressiveness. A nice way to understand monoids is to see how monoids apply to FizzBuzz. At a more systems level, monoids are used by Twitter in their services stack to compose asynchronous results. As we develop tools at Hadapt for provisioning systems or manipulating internal plan trees, I expect to apply monoids to help ensure composable abstractions.

The last talk of the conference was by Leon Gersing and was a great motivational talk about personal development. You should watch it.

The talks took up only half the time at SCNA. The rest was spent networking with other developers, intermixed with fun activities like kata battles (wherein two developers race to complete a basic coding kata live on screen in front of the audience) and Jeopardy. There was also a refactoring kata fishbowl where I narrowly missed an opportunity to pair with Uncle Bob. While I got a lot of value from the talks, I wished there had been more time for pairing and working on code with the other developers there. On the last day, I got a tutorial from Randy Coulman, who has been programming in Smalltalk for 10 years, as he did the coin changer kata in Smalltalk. More explicit time for that sort of impromptu practice (not just chatting about work) would have made the conference even better.

Overall, SCNA was a great conference and I hope to be able to spend more time with software craftsmen in the future.

Growing a Software Craftsman Engineering Organization

One of the hallmarks of a software craftsman is the desire to improve and hone one’s abilities. Certainly, this is one of the reasons that I am attending Software Craftsmanship North America (SCNA) this year. As a leader in an engineering organization, however, I am also curious about how to grow an engineering organization that is focused on not only delivering value, but doing so in a way that values well-crafted software.

The population of people who are already craftsmen (outside of conferences such as this) is somewhat limited, so hiring solely craftsmen is not likely to be scalable. At the SCNA mixer last night, I heard two basic approaches to developing a team of craftsmen.

8th Light uses an apprenticeship model. 8th Light hires people in as apprentices: there is a clear understanding that an apprentice is learning about the craft and about how 8th Light works. There is a good ratio of craftsmen to apprentices and everyone is invested in teaching and learning. During the apprentice period, the apprentice may be unpaid or paid at below-market rates as they finish training/learning. (I’m sure this is done in a fair way and everyone gets value from the arrangement.) What was surprising to me was that not only do they hire in experienced developers (who have self-selected as being interested in improving/learning), but they also hire people with aptitude but relatively little programming experience. Over the course of a year, these true apprentices grow into journeymen and craftsmen. It appears one successful model is to budget time and money to train up your own pipeline of craftsmen.

A second approach I heard about was through injection of a leader/manager who drove craftsmanship into the organization. I spoke with people at a financial services company and at a publishing company; in both cases, about a year ago, someone was brought in who drove the engineers in the direction of craftsmanship. Today, those teams practice TDD/BDD, watch Clean Coders videos to learn, and attend conferences like SCNA.

I hope over the next few days, and through continuing conversations afterwards, to get more insight into organizations that successfully balance the need for delivery with training its team to deliver high quality code, and what principles and tactics they use to transition to a high productivity state.

If you have any thoughts, please share them!

Developing Cloudera Applications With Gradle and Eclipse

This post is a Gradle translation/knock-off of Cloudera’s post on developing CDH applications with Maven and Eclipse. It should help you get started using Gradle with Cloudera’s Hadoop. Hadapt makes significant use of Gradle for exactly this purpose.

Gradle is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Gradle is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Gradle project that will be able to build applications against CDH (Cloudera’s Distribution including Apache Hadoop) binaries.

Gradle projects are defined using a file called build.gradle, which describes things like the project’s dependencies on other modules, the build order, and any other plugins that the project uses. The complete build.gradle described below, which can be used with CDH, is available as a gist. Gradle’s build files are short and simple, combining the power of Apache Maven’s configuration by convention with the ability to customize that convention easily (and in enterprise-friendly ways).

The most basic Java project can be compiled with a simple build.gradle that contains the one line:

apply plugin: "java"

While optional, it is helpful to start off your build.gradle declaring project metadata as well:

// Set up group and version info for the artifact
group = "com.mycompany.hadoopproject"
version = "1.0"

Since we want to use this project for Hadoop development, we need to add some dependencies on the Hadoop libraries. Gradle resolves dependencies by downloading jar files from remote repositories. This must be configured, so we add both the Maven Central Repository (that contains useful things like JUnit) and the CDH repository. This is done in the build.gradle like this:

repositories {
    // The standard Maven Central repository
    mavenCentral()
    maven {
        url "https://repository.cloudera.com/artifactory/cloudera-repos/"
    }
}

The second repository enables us to add a Hadoop dependency in the dependencies section. The first repository enables us to add a JUnit dependency.

dependencies {
    compile "org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.0.1"
    testCompile "junit:junit:4.8.2"
}

A project with the above dependency would compile against the CDH4 MapReduce v1 library. Cloudera provides a list of Maven artifacts included in CDH4 for finding HBase and other components.

Since Hadoop requires at least Java 1.6, we should also specify the compiler version for Gradle:

// Java version selection
sourceCompatibility = 1.6
targetCompatibility = 1.6

This gets us to a point where we’ve got a fully functional project, and we can build a jar by running gradle build.
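
As a quick check, the jar ends up under build/libs; a short sketch (the jar name assumes a project directory called hadoop-project together with the version declared earlier):

gradle build
ls build/libs/
# hadoop-project-1.0.jar -- Gradle names the jar after the project directory and version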

In practice, it’s good to declare the version string as a property, since there is a high likelihood of dependencies on more than one artifact with the same version.

ext.hadoopVersion = "2.0.0-mr1-cdh4.0.1"
dependencies {
    compile "org.apache.hadoop:hadoop-client:${hadoopVersion}"
    testCompile "junit:junit:4.8.2"
}

Now, whenever we want to upgrade our code to a new CDH version, we only need to change the version string in one place.
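
After bumping hadoopVersion, Gradle’s built-in dependency report is a handy way to confirm what actually gets resolved; for example:

# Print the resolved dependency graph (add --configuration compile to narrow the report)
gradle dependencies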

Note that the configuration here produces a jar that does not contain the project dependencies within it. This is fine, so long as we only require Hadoop dependencies, since the Hadoop daemons will include all the Hadoop libraries in their own classpaths. If the Hadoop dependencies are not sufficient, it will be necessary to package the other dependencies into the jar. We can configure Gradle to package a jar with dependencies included by adding the following block:

// Emulate Maven shade plugin with a fat jar.
// http://docs.codehaus.org/display/GRADLE/Cookbook#Cookbook-Creatingafatjar
jar {
    from configurations.compile.collect { it.isDirectory() ? it : zipTree(it) }
}

Unfortunately, the jar now contains all the Hadoop libraries, which would conflict with the Hadoop daemons’ classpaths. We can tell Gradle that certain dependencies need to be downloaded for compilation but will be provided to the application at runtime, by moving the Hadoop dependencies into a provided configuration. The code then looks like this, with an added dependency on Guava:

// Provided configuration as suggested in GRADLE-784
configurations {
    provided
}
sourceSets {
    main {
        compileClasspath += configurations.provided
    }
}

ext.hadoopVersion = "2.0.0-mr1-cdh4.0.1"
dependencies {
    provided "org.apache.hadoop:hadoop-client:${hadoopVersion}"

    compile "com.google.guava:guava:11.0.2"

    testCompile "junit:junit:4.8.2"
}

Gradle also has integration with a number of IDEs, such as Eclipse and IntelliJ IDEA. The default integrations are enabled by adding

apply plugin: "eclipse"
apply plugin: "idea"

which add support for generating Eclipse .classpath and .project files and IntelliJ .iml files. The default build output locations may not be desirable, so we configure Eclipse as follows:

eclipse {
    // Ensure Eclipse build output appears in build directory
    classpath {
        defaultOutputDir = file("${buildDir}/eclipse-classes")
    }
}

For Eclipse, simply run gradle eclipse and then import the project into Eclipse. As you update/add dependencies, re-run gradle eclipse to update the .classpath file and refresh in Eclipse. Gradle automatically handles generating a classpath, including linking to source jars.

Recent versions of IntelliJ and the SpringSource Tool Suite also support direct import of Gradle projects. When using this integration, the apply plugin lines are not necessary.

Gradle is a well-documented and powerful alternative to Maven for building projects. While not without its quirks, I am significantly happier maintaining an enterprise build in Gradle at Hadapt than I was maintaining the complex Maven build at VMware. Give it a try.

Let’s Improve Our Code

New Year’s is a good time to set intentions for the coming year. Many people come off the holidays with the intention to exercise more, but if you’re reading this blog, you’re probably a programmer (if you’re not, consider signing up for Code Year…), so let’s set an intention about our programming. But first, a musical interlude.

Earl Hines was a jazz pianist; in this 9-minute video, he describes how his early playing evolved.

As you watch it, notice how he not only describes and demonstrates how his style evolved, he also describes why. For example, he talks about how his melodic line was drowned out in the larger bands, so he picked up playing in octaves (doubling up the notes).

In his TED talk, David Byrne generalizes the idea of environment influencing music by talking about how music has always evolved to fit the architecture in which it was performed: from how the ethereal sounds of early church music were driven by the open acoustics of churches to how the smaller rooms of the 18th and 19th centuries allowed for the more complex rhythms and patterns of classical music to be heard. (Watch it here.)

Can we as programmers reflect similarly about our programming styles? What influences the way our programs look? And more importantly, perhaps, why should we care?

For music, Byrne argues that the evolution of styles was driven by the needs of the audience and the acoustics of the performance hall. Understanding these consciously allows contemporary musicians to make more informed choices about what and how they perform.

As programmers, our programs must communicate: with the compiler, of course, so that it will render our code executable, but also with the human readers of our code, be that our future selves or our colleagues. So to write better programs—programs that communicate their intent more concisely and clearly, as opposed to those that execute more efficiently or that are more clever—we should consider what affects the structure and readability of the programs we write.

The frameworks and mechanisms available to us most obviously affect the structure of code. Write a program in a system based on callbacks, such as the async XML HTTP request that underlies AJAX, and you will find yourself with code that chains callbacks together, preserves state in various heap objects, and requires that callbacks be called from the right contexts to work properly. Write code for a threaded system and your code will have all manner of locks and constructs to control memory write visibility. Regular expressions can be called from Perl with the overhead of only m//, so it is easier to write text munging code in Perl than in almost any other language.

Our methodologies, tools, and processes—how we program—also determine how our code looks. Test-driven development will tend to produce stronger and more usable abstractions. Stream of consciousness programming results in a mess. Using an editor that supports refactoring patterns will make it more likely that you will refactor. Code review or pair programming will similarly result in code improvements, simply because you had to communicate while writing the code. (Even just commenting your code helps in this regard.) The end result of these practices is code that is more understandable.

Our audience (that is, our teammates) also affects our code. This is the role of engineering culture. What will your teammates accept versus some ideal? To get code committed to the Linux kernel requires detailed commit messages, a well structured patch series and surviving code review on the kernel mailing list. To get code committed to your personal project requires nothing outside of what you ask of yourself.

We have control over these factors. We can vary our tools, our practices, our choice of frameworks, and influence our team culture. If we are framework or API developers, we can consciously evaluate what code we induce our users to write and improve on what we provide to simplify their lives, and facilitate their communication and self-expression.

This year, let’s set an intention to examine our code and improve how it reads. Let’s experiment and play with the factors under our control to see which choices work better for our teams. Ask your teammates whether one way or another works better for them. Spend some time analyzing your own code and consider how it got that way.

I’ll try to share some of what I learn from my team at Hadapt and I’m curious to hear what you learn from yours.

Git Is More Usable Than Mercurial

Once upon a time, I used Mercurial for development. When I moved to VMware, people there seemed to favor Git and so I spent the past few years learning Git and helping to evangelize its use within VMware. I have written about why I chose Mercurial, as well as my initial reactions upon starting to use Git. Hadapt happens to be using Mercurial today and so I have been re-visiting Git and Mercurial.

What I wrote about Git and Mercurial in 2008 is still true: Git and Mercurial are similar in many respects—for example, you can represent the same commit graph structure in both—and they are both certainly better than Subversion and CVS. However, there are a lot of differences to appreciate in terms of user experience that I am now in a better position to evaluate.

In using Mercurial, I find myself oddly hobbled in my ability to do things. At first, I thought this might be because some things are simply done differently in Mercurial, but at this point I think that Git’s design and attention to detail result in it being more usable than Mercurial.

There are three “philosophical” distinctions that are in Git’s favor:

  1. Git has one branching model. Mercurial has several that have evolved over time; Steve Losh has a comprehensive essay describing ways to branch in Mercurial. The effect of this is that different Mercurial users branch in different ways and the different styles don’t really mix well in one repo. Git users, once they learn how branching works, are unlikely to be confused by branches.

  2. Git has names (refs) that don’t change unexpectedly. Every Git commit you care about has a name that you can choose. Some Mercurial commits that you might care about do not have a name. For example, the default branch in Mercurial can have multiple heads, so it interprets -r default as the tip-most commit. Unfortunately, that commit will vary depending on who has committed what to which head (and when you see it).

    Further, Git exposes relative naming by allowing you to refer to the branches in remote repositories by name, without affecting your own names.

    Putting this together, consider what happens after you pull in Mercurial. Your last commit used to be called default but after the pull, default is something from the upstream. Your commit is a separate head that now has no name. In Git, your master doesn’t move after a fetch and the remote’s branch is called origin/master.

    Git even tracks changes to what commit each name refers to in a reflog, so you can easily refer to whatever a name used to point at. In Mercurial, branch names don’t have reliable meanings, and their history is not tracked. (A short sketch of this follows the list.)

  3. Git commands operate in a local context by default. Mercurial commands often operate on a repository context. For example, git grep operates on the current sub-directory of your work tree, hg grep operates on your history. The Git analog of hg grep is using the log pick-axe; the Mercurial analog of git grep is to use ack, or if you must, something like hg grep -r reverse(::.) pattern . (Seriously?)

    Another example is the log command. Git’s log command shows you the history of the commit you are on right now. Mercurial’s log command shows you something about the whole repository unless you restrict with some combination of -b and -f. Combined with Mercurial’s way of resolving branch names to commits, it becomes very difficult to use hg log to compare two heads or explore what has changed in another head of the same branch.

    More often than not, I care about things in my current tree more than about how things are in some random other branch that I am not working on, and Mercurial makes that hard.
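
A quick sketch of the naming behavior from point 2, using Git’s default branch and remote names:

git fetch origin                # your master stays put; new upstream work lands on origin/master
git log master..origin/master   # commits upstream has that your master does not
git reflog master               # what master has pointed at over time: master@{1}, master@{2}, ...
git show master@{1}             # inspect the commit master referred to before its last move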

There are other usability issues that I’ve found that are more detail-oriented than philosophical. I’ll note a few here.

hg log doesn’t display the full text of the commit message unless you hg log --debug. This is an unfortunate disincentive to writing good commit messages.

hg log -p doesn’t pay as much attention to merge commits as Git does; the help for hg log reads:

log -p/--patch may generate unexpected diff output for merge changesets, as it will only compare the merge changeset against its first parent. Also, only files different from BOTH parents will appear in files:.

whereas git log has a variety of options to control how the merge diff is displayed, including showing diffs to both parents, removing “uninteresting” changes that did not conflict, or showing the full merge against either just the first or all parents of the merge commit.
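
For reference, the Git options in question look roughly like this (a sketch):

git log -p -m                  # for a merge, show a separate diff against each parent
git log -p --cc                # combined diff that drops hunks identical to one of the parents
git log -p -m --first-parent   # follow and diff only the first-parent line of history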

Both Mercurial and Git have lots of configurable options; Git has a thin veneer over editing a config file in the form of the git config sub-command, while Mercurial involves editing a file even just to set up your initial username or enable extensions. I often wound up editing Git config files directly, but having the commands was nice for sharing instructions with others.
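
For example (the values here are placeholders):

git config --global user.name "Ada Lovelace"
git config --global user.email ada@example.com
git config --global alias.lg "log --oneline --graph --decorate"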

Git support for working with patches natively is better. Mercurial supports e-mailing and applying patches, but oddly the extension for sending out patches (patchbomb) is built in while the extension for importing from an mbox (mbox) is not. There’s no direct analog of git apply; instead you have to use a patch queue. Patch queues are okay, but branches and well-integrated rebase/e-mail/apply support are much nicer than patch queues: you don’t need to manually find some .hg/patches/series file and edit it to re-order stuff.
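
The native Git patch workflow referred to here looks roughly like this (addresses and file names are placeholders):

git format-patch origin/master                # one .patch file per commit since origin/master
git send-email --to dev@example.com *.patch   # mail the series
git am series.mbox                            # apply a mailed series, keeping authorship and messages
git apply hotfix.diff                         # apply a plain diff to the working tree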

I could write more and indeed many people have written about Git and Mercurial—you can explore my bookmarks about git for some of the better ones. Let me close here with three interesting features in Mercurial 2.0:

  • the new largefiles extension allows users to not transfer large files down until they are needed;
  • subrepos can be Git or Subversion in addition to Mercurial;
  • revsets allow you to search your history in very flexible ways.

Overall, I feel that Git is significantly more usable for day-to-day development than Mercurial. I’d be curious to hear if you think the opposite is true.

A New Adventure

Friday, 4 November, was my last day at VMware.

I started at VMware in 2008, working on a project that has now become VMware’s Horizon Mobile. Last year, I switched to working on the latest release of VMware’s vCloud Director.

VMware has a lot going for it as a place to work, and it wasn’t an easy choice to leave.

Last month, I became aware that a startup in the big data space was moving to Boston. I’d been wondering about life outside VMware and this opportunity seemed just about perfect. So I’m beginning a new adventure at Hadapt. As an early employee, I imagine I’ll be doing a little bit of everything. I hope to combine the skills and knowledge I built up from my graduate work with the practical experience of delivering enterprise software at VMware to help Hadapt build a powerful, scalable data analytics platform and make Hadapt a successful company.

I’m excited to get started and I hope to share here with you some of my experiences as I go.

Rules for Development Happiness

Inspired by Alex Payne’s Rules for Computing Happiness, some rules for having happy developers and being happy as a developer.

  • Use version control. (See The Joel Test.) In particular, use a distributed version control system (like Mercurial or Git). This ensures you can commit offline and also conduct code archaeology offline.
  • Have a correct and fast incremental build (e.g., non-recursive Make or Gradle) so that you spend your time developing rather than waiting on the build.
  • Have a system for testing your changes in a safe environment prior to code submission.
  • Avoid dependencies on system tools. Different developers tend to have different systems and hence different versions of tools.
  • Be able to work offline. Offline may mean when you’re on a plane, but it may also happen when the office network goes down. Both happen. Notably, the latter happens even when you work on a desktop with a wired connection. (It’s been pointed out to me that the network going down can be a good team-building experience.)
    • Be able to build offline. That means having all build dependencies cached locally.
    • Have all e-mail cached locally. Don’t be unable to find those key instructions someone mailed you just because GMail is restoring your mail from tape. Helpful tools here are isync or offlineimap; index your mail with mu. (Or configure Thunderbird/Apple Mail/etc. to keep everything offline.) A short sketch of this setup follows the list.
    • Be able to send mail offline; e.g., have it queue locally for delivery when the network comes back. But make sure you keep a copy locally in case the hotel’s WiFi is transparently re-directing out-bound SMTP connections to /dev/null. (This really happened to me.)
    • Have other documentation cached locally. (Use something like gollum for your wiki.)
    • If you work somewhere with a shared-storage home directory, make sure you can login when the network is down!
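
A minimal sketch of the offline-mail setup mentioned in the list above, assuming offlineimap and mu are installed and ~/.offlineimaprc already points at your account (the Maildir path is an assumption):

offlineimap                         # sync IMAP folders down to a local Maildir
mu index --maildir=~/Maildir        # build a local full-text index of the mail
mu find from:alice subject:flight   # search your mail entirely offline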

Store Hudson Configuration in Git

For any kind of server, it’s a good idea to keep its configuration in some sort of version control system. Hudson is a pluggable continuous integration system. Recently, I was trying to set one up and was wondering about the best way to store Hudson’s configuration in version control (StackOverflow summary). The most complete answer is a post on the Hudson blog about how to keep Hudson’s configuration in Subversion; there are also plugins like a nascent SCM Sync configuration plugin. But the former is very Subversion-specific and the latter does not seem particularly mature. So, to understand how to do it in your workflow, there are two things to consider.

First, which files are relevant? Hudson puts configuration, run-time state, source code and build output all in the same sub-directory (called HUDSON_HOME). Second, relatedly, since normally you edit Hudson’s configuration through the GUI, when should you commit changes? Should it be automated (e.g., nightly at midnight) or manual (e.g., ssh into the server and manually commit)? I’ll answer those questions with an implementation in Git but you can translate the information easily to your preferred VCS.

Identify relevant files by using the following .gitignore file:
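
(The exact contents live in the gist cloned below; as a rough sketch, assuming the HUDSON_HOME layout just described, it looks something like this:)

cat > $HUDSON_HOME/.gitignore <<'EOF'
# Runtime state and build output, not configuration
jobs/*/builds/
jobs/*/workspace/
jobs/*/lastStable
jobs/*/lastSuccessful
jobs/*/nextBuildNumber
war/
updates/
fingerprints/
logs/
*.log
*.tmp
*.bak
# Commit plugin binaries (*.hpi) but not their exploded directories
plugins/*/
EOF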

This ignores the uninteresting files and will allow git status to show you interesting new files. Note that I prefer to actually commit the binaries of plugins since I don’t want to rely on outside sources (namely, the mirror network) having the particular version of the plugin that I was using for the given configuration files. To use this if you are installing a new Hudson server, you can just

cd $HUDSON_HOME/.. # Default is /var/lib
rm -r hudson
git clone git://gist.github.com/780105.git hudson
# Don't forget to chown hudson hudson as appropriate for your environment

before starting Hudson for the first time. Then once it has started, run git commit to track the default config that Hudson creates.

The second question is when. The Hudson blog’s recommendation is to create a Hudson job that runs nightly at midnight to check for differences and automatically commit them. I prefer manually committing the changes on the server and then pushing it. This allows me to identify specific functional changes (using git add -p) and commit them individually. If you want to do it automatically, simply write a script or add a job that will

git commit -a -m "Automated commit of Hudson configuration"
git push

once you set up an appropriate origin.

Once you have this set up, you can even use something like Chef to automatically pull down updated configuration that you manage and test elsewhere and restart the Hudson server when necessary. Then you can re-create your Hudson server in case of failure at any time!

Programming Without Fear

This past weekend, I attended Gil Broza’s seminar on Programming Without Fear, organized by the Greater Boston Chapter of the ACM’s Journeyman Programmer initiative. The seminar was as advertised, covering code smells, refactoring, unit testing, and mocking.

For anyone with more than a passing interest in agile, the material Gil presented will not be new.

The benefits of the seminar came from two things. First, Gil presented the information in a somewhat “formal” framework: a taxonomy of code smells, a set of refactoring patterns, and a pair of mnemonics (PRICELESS unit tests and TRUST your refactoring process) to help remember basic techniques. This gives someone new to the material an organized set of knowledge to internalize. Second, Gil has prepared a series of exercises, interspersed with the lecture-y sections, that seminar participants work through in pairs, designed to reinforce the theoretical frameworks with practical experience. Even for someone moderately experienced with these concepts, the exercises are useful in that they focus on the fundamentals and force you to actively strengthen them. (The weakest section, I thought, was the one on mocking, which received insufficient exposition and dropped the class directly into jMock, which was a bit opaque.)

Gil is not the most exciting or funny teacher but he kept the attendees engaged by teaching with a Socratic flavor—he presented examples and solicited audience evaluations, allowing the audience to interact to reach conclusions. The practical exercises were followed by group de-briefs. This encouraged the audience to stay engaged and better absorb the material.

My main worry about the techniques is the overall reliance on Eclipse (or another IDE) as a developer’s assistant: while the tooling is certainly convenient, it makes me worry about Java and whether the use of tools and wizards weakens developers who may never learn how to do things themselves.

What I really enjoyed was the experience of actually developing and refactoring with the protection of a unit test suite and learning techniques to perform refactoring without more than a moment or two of compiler errors. This was in sharp contrast to my normal refactoring experience of making a top-level change and then following all the compiler warnings until the work is done. Now if only every codebase I worked on came with such a set of tests…