Thoughts on Systems

Emil Sit

Jun 23, 2015 - 8 minute read - tips migrations

4 Lessons for Smoother Technology Migrations

Sooner or later, every technology organization faces a necessary evil: migrations. Maybe the old system is not scaling or the implementation is too brittle to adapt to a new feature. Maybe it’s not capable of interoperating with newer systems. Whatever the reason, the time will inevitably come to migrate part of your application from one implementation to another.

Performing migrations can be a thankless task. The old system, flaws and all, works. And, everyone expects the new system to do everything the old one does, in addition to whatever new functionality is required. Even if users can be moved to the new system incrementally, people don’t have much patience for products behaving differently and functionality being broken or missing. This is something we learned the hard way this past year.

My team is one of the teams that maintain the infrastructure behind HubSpot’s Sidekick product, a freemium tool that provides email productivity features such as email open and click tracking. Over the past year, we made two big migrations. First, in order to achieve a better balance of performance, cost, and control, we moved the primary storage for our email open and click tracking feature from one data storage platform to another. This required copying and transforming hundreds of gigabytes of data and building an entirely new pipeline to reliably process clicks and opens into the new storage system.

Second, after several months of manually modifying settings in our billing backend to support our new Sidekick for Business product and pricing, we rewrote much of our billing system to standardize and automate handling of multiple pricing levels. This required significant changes to our team membership flows and internal accounting.

Neither of these migrations was seamless. We followed the general principles behind our known tactics for safely rewriting a big API (such as gating and routing traffic incrementally), but we neglected to do a few things that could have helped us avoid some potholes. So, here are four lessons we learned from these Sidekick migrations that will hopefully make your next migration a smoother process.

Understand What the Old System Did and Why

One big challenge in technology migrations is knowing everything the old system does and why. Over time, a code base accretes little tweaks and hacks that help it deal with edge cases that come up in a production system. Like scar tissue, this code is critical in keeping the product functioning smoothly, but as people join and leave teams, and as products evolve, it’s hard to keep track of the cuts that led to scarring in the first place. Without a team that knows the old system inside and out, important things can get lost in the migration.

Something important did get lost when we were migrating our open and click tracking system: the suppression of notifications when you open your own emails. Although the job of our tracking system is to tell our users when someone has opened an email that they’ve sent, users don’t want to be notified when they open their own emails. So, an earlier team had built special checks to detect when users open their own email and discard that notification. When my team built the updated pipeline, we didn’t realize this and didn’t build in these checks when we initially rolled out the new system. Users immediately began complaining and we had to scramble to fix the problem. This was only one of several instances of “scar tissue” that ended up getting lost in translation during the migration and causing us problems.

When we started our billing migration, we knew we needed to be more careful. We dug into the code and wrote a basic specification of the system. Writing the spec helped us document what needed to be done and why, and forced us to think through what we may have otherwise overlooked during the migration. While we certainly didn’t get everything right, having the document as a reference made a huge difference in glitch-proofing our process as much as we could.

Engage All Stakeholders Early On

As you start exploring the system’s behavior, you might discover that different people depend on it in different ways. It’s natural to involve other technical teams and management in the migration process, but we realized that non-technical parties, from your social media team to your support staff, from your salespeople to your finance department, should be clued in from the get-go. The goal is to understand what’s important to them so you can keep an eye on it from the technical side.

These stakeholders interact with your systems every day and have their own special tricks or patterns for getting their jobs done. In fact, over time, you’ve probably built a variety of little tools that they’ve become dependent on. Migrations are designed to improve the system for your customers (e.g. better performance, more features) but if the new system breaks or eliminates those tools, you’ve done the opposite for your internal stakeholders. Changing their workflows without warning means your technical team will be bombarded with questions that stakeholders used to handle themselves.

For example, when we re-engineered the billing system, we changed the semantics of some details on the internal billing administration page, and broke the ability for support to make certain account adjustments. As a result, our support team was often confused about how to interpret what they saw on the page for a given account, and was also unable to rectify common problems that they had previously been able to handle. Needless to say, this led to a lot of stress for everyone. By being more explicit about changes and keeping tooling changes in line with product changes, we could have made this much easier for everyone. Do your team and stakeholders a favor by communicating the migration early on and keeping them in the loop throughout the process.

An added benefit of working with other stakeholders is that they may help you spot problems that you didn’t even think to check for in testing. In our case, it’s often a race between our social media team and our support reps to see who gets the first word from customers that something is off after we deploy some new code. And one time, our finance department quickly pinged us when we “forgot” to enforce our freemium product limits for a week. Phew.

Detect Behavioral Differences

On that note, a powerful strategy for finding problems during a migration is to identify key business metrics and set up alerting on them. Often, alerting focuses on technical problems: is your 99th percentile response time spiking? Are there too many server errors? However, many product failures never trigger these alarms.

One example of a business metric we use concerns our user onboarding process. During onboarding, we show a new user how to send a tracked email and what it is like to receive a notification. Based on historical data, we know how many people should experience this interaction each hour. If it doesn’t happen at the right rate, we know we’ve broken something. Because anomaly detection can be difficult, just setting thresholds for extremely abnormal behaviors can be helpful (e.g., if zero new users receive a notification, that’s a bad sign). This means thinking about your business metrics before you start coding. Business metrics tend to be more robust to technology change than technical metrics because you still need to provide the same business value even if your technical implementation is changing.
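
As a rough illustration, here is a minimal sketch (in Java) of the kind of floor-based check this implies; the metric names, expected rate, and alerting hook are hypothetical, not our actual code:

import java.time.Duration;

public class OnboardingNotificationCheck {
    private static final long EXPECTED_PER_HOUR = 500;   // hypothetical historical rate
    private static final double MINIMUM_FRACTION = 0.10; // only fire on extreme anomalies

    // Returns true when the observed count is above a conservative floor.
    public static boolean isHealthy(long notificationsInWindow, Duration window) {
        double hours = window.toMinutes() / 60.0;
        long floor = Math.round(EXPECTED_PER_HOUR * hours * MINIMUM_FRACTION);
        return notificationsInWindow >= floor;
    }

    public static void main(String[] args) {
        long observed = 0; // in practice, read from your metrics store
        if (!isHealthy(observed, Duration.ofHours(1))) {
            // In a real system this would page on-call or post to an alerting channel.
            System.err.println("ALERT: onboarding notifications fell below the expected floor");
        }
    }
}

The deliberately low floor keeps a check like this quiet during normal fluctuation, but guarantees a page when the flow breaks outright.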

Another technique is to compare end-to-end output, if possible. If you are changing your data source, make sure the output rendered for the user remains the same. Here, “same” can mean anything you want—you can literally compare the rendered output pixel by pixel, or you can just make sure certain div elements contain the right text. For developers, it can be very helpful to have a mechanism that forces the application to use either the old component or the new component at runtime. We used a secret URL that let us view user activity data from both the new and the old data stores as a way to detect issues and verify fixes.
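
A minimal sketch of what such a toggle can look like, assuming both data stores sit behind a shared interface (the names here are hypothetical, not our actual code):

import java.util.List;

public class ActivityFeedToggle {

    // Both the legacy and the new data store implement this (hypothetical) interface.
    interface ActivityStore {
        List<String> recentActivity(String userId);
    }

    private final ActivityStore legacyStore;
    private final ActivityStore newStore;

    public ActivityFeedToggle(ActivityStore legacyStore, ActivityStore newStore) {
        this.legacyStore = legacyStore;
        this.newStore = newStore;
    }

    // useNewStore would be derived from a query parameter on the secret URL.
    public List<String> render(String userId, boolean useNewStore) {
        ActivityStore store = useNewStore ? newStore : legacyStore;
        return store.recentActivity(userId);
    }

    // Render the same user from both stores; a mismatch is a migration bug to investigate.
    public boolean outputsMatch(String userId) {
        return legacyStore.recentActivity(userId).equals(newStore.recentActivity(userId));
    }
}

The same structure makes it easy to diff the two implementations’ output for a sample of users before flipping the default.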

Prepare Your Infrastructure and Architecture for Iteration

Having a plan, working with stakeholders, and monitoring key metrics let you catch problems before or as they happen. But there will still be things that slip through the cracks. That’s why it’s important that both your infrastructure and your application’s architecture support rapid iteration.

Our billing system and email tracking pipeline were initially part of a more monolithic system, so despite HubSpot’s microservices architecture and deployment infrastructure, we could not deploy pieces of it independently, and big changes were risky to roll out. This was frustrating because, as we migrated users and data incrementally (in both cases!), we found a lot of edge cases and anomalous data that would not work properly in the new system. These were not even cases that you can really plan for: you often won’t know what kinds of weird data you and your users have put into the system over the years until you try to migrate it.

To address this, we improved our architecture by extracting these components from our monolithic system, enabling us to iterate more quickly. Our email tracking pipeline now has many incremental processing stages that can be independently and reliably deployed. We extracted our billing system’s UI into a separately deployable front-end component, so we could make front-end improvements without worrying about changes being coupled to our larger monolithic system. In both cases, being able to improve, tweak, and test our systems more quickly and safely benefited both us and our users.

These four lessons have helped us manage the constant technology change involved in operating a SaaS product. What do you do to help ensure successful technology migrations?

This post was originally published on the HubSpot Engineering Blog.

Jun 5, 2015 - 9 minute read - tips management

Four Things I Wish I Knew When I Became a Tech Lead

Years ago, a mentor of mine talked to me about the distinction between leadership (coping with change) and management (coping with complexity). A tech lead does a little bit of both: we have to come up with the vision for growing the technical systems that solve problems for our customers and shepherd the solution from concept to production. While all that’s going on, we need to guide, manage, and support the people on our team so that they can always be growing. For first-time TLs, this can be an intimidatingly wide spectrum of responsibility. I know it was for me.

Now, as a senior TL, I have found a modicum of success at HubSpot. There was no rulebook or guide that helped me master this role overnight. But plenty of mistakes and ‘ah-ha!’ moments helped me realize things I wish I had known when my title changed to tech lead for the first time. Here are four that I think are key to getting started as a leader.

You Can’t Learn Everything on the Job

Unfortunately, just as an education only prepares you in the most basic ways to be an effective individual contributor, being an effective individual contributor only gives you a fraction of the skills you need to be an effective leader. That’s why I think it’s important to be proactive about learning as much as you can about leading a team before, during, and after you become a TL.

I was lucky enough when I started my career to have a manager who hooked me up with someone outside of our company who helped me navigate my career path. He shared tons of insight with me on what’s expected of a TL, how to recognize the different strengths of people on your team, and how to balance people and product demands simultaneously. Whether they’re your manager or someone with a similar career trajectory at another company, finding a mentor is a serious asset in becoming a leader. But you have to be intentional about it; no one’s going to knock on your door begging you to let them mentor you. Be on the lookout for events where you might make a good connection, or communities where leaders connect.

Depending on the company, you might not have to go too far to find ways to cultivate your leadership style. One of the most exciting things I learned after joining HubSpot was that there’s an internal team here that provides explicit training for managers. They run a 12-course program on everything from running effective 1:1s, to providing coaching, to discussing career goals. These classes have been invaluable to me as a manager and are full of practical insights that I’m still trying to master. Luckily, if these types of programs aren’t available at the office, organizations like Intelligent.ly (in Boston) host leadership workshops and management trainings, too.

There are a handful of books I’ve gleaned leadership insight from over the years, too. In his book Turn the Ship Around!, former submarine captain L. David Marquet frames the success of a leader as creating more leaders instead of followers: if you are a good leader, when you leave your organization, it continues to function well. In order to do that, you have to establish technical competence in all the members of your team and provide them with the organizational clarity to know what to do. This has been really helpful for me in navigating HubSpot’s culture of small autonomous teams. On a more personal note, I recently read The Heart and the Fist by Eric Greitens, which captures the importance of resilience and willpower when facing challenging situations, something every TL does frequently.

There are a million resources out there that have the potential to change your thinking and prepare you just a little bit more for running a team. But you have to be proactive about seeking them out and making learning part of the job.

Leadership Style Should Reflect the Team, Not the Leader

When I first became a leader at a previous company, I knew what I didn’t like. I had been on teams that were “agile” because “agile” (aka scrum) was the thing to do. We had standups where people who had just been working together would stand up and say what it was that they were just working on. We had standups where people who didn’t work on any of the same things (but were on the same “team”) gave cursory summaries of work they were doing for people who didn’t have any context. I went to retrospectives where no one wanted to say anything.

I dreamed of being part of a self-organizing agile team, the kind you hear about in scrummaster training but are never part of, where individuals pick up the work necessary and make it happen amongst themselves. I wanted meetings that were interactive and inclusive. So, I just tried running my team and meetings that way.

Parts of my strategy worked out okay. For example, having people write ideas on sticky notes during a quiet brainstorming period at the start of the meeting, instead of calling them out in person (a trick I stole from Dave Gray’s Gamestorming book), helped some of the quieter people on my team engage during planning and retrospective meetings instead of being overpowered by a few forceful personalities. It may feel strange to do the first time, but a few women on my team told me it made a big difference for them.

Other parts fell flat. My team wasn’t ready to be self-organizing because they had never had to be that autonomous before. We failed to deliver projects on time. Tasks didn’t get done; it was like watching a volleyball hit the ground because no one called it. Instead of leading and managing, I let the team run itself sideways.

Don’t assume that what’s right for you will be right for your team. I’ve realized that when you match your leadership style with what the individual (and team) is ready for, you feel more confident in their output and they feel more comfortable doing it. Once you’ve established a rhythm, you can work on growing their skills and giving them more independence.

(Over)communication is Key

I was having dinner once with our CEO, Brian Halligan, when he asked every HubSpotter at the table what we would do if we were CEO. I said I would make scaling communication and transparency a priority. I had been at companies before where the core values and mission were diluted by the time they trickled down from management to individual contributors. Every quarter, the CEO or the VP of our business unit would hammer home the vision. But there were so many layers of middle management and so much PowerPoint markitecture that it was hard to link our day-to-day individual contributions to the bigger picture.

When companies grow, especially when they grow quickly, it gets harder to be proactive and intentional about every moving piece. But good internal communication should never get lost in the shuffle. Upper management has to be thoughtful about keeping an entire company in tune with how they’re driving the business, and as a TL, you have to be fixed on doing the same for your team.

My team here is hungry to know how they can contribute, find something to own, and make an impact. I just need to make sure I’m guiding that energy and skill in the right direction for the business. Sometimes, I realize I haven’t communicated the context of a project or technical decision as well as I could have. But when I do distill the bigger picture in a way that’s actionable and personalized to our team, their world becomes clearer, their ability to work independently improves, and they can tie their efforts to our larger mission.

It’s also important to communicate beyond your team to help the larger organization understand what you’re up to. Especially when things aren’t going well, I’ve found it very helpful to document everything in a shared document (we use Google Docs) that has the most up-to-date information. This allows the team to collaborate and contribute to solving the problem, but it also becomes a resource to bring other parties (e.g., legal, operations, other development teams) up to speed quickly, without requiring that people read through hours of chat logs or email threads.

You’ll know when you fall short on communication. You can see it in the direction your team took a project and in their confusion when they need to rethink their solution. You will hear hard questions coming down from your management. That’s why it’s best to overcommunicate.

Lean On People Smarter Than You

In our organization, the TL is almost always the final arbiter of technical decisions about the product. We love giving people responsibility, and we can afford to do that because we make sure they have all the tools and knowledge they need to make the right decision. As a TL, I have multiple “spotters” who watch out for me and help me get back on track if things go off the rails.

My first spotter is my manager, who holds a regular 1:1 with me. My manager has proposed alternatives that I had forgotten to consider, and also pointed out times when I wasn’t providing enough leadership and my team was getting lost. We also use 15five as a tool to encourage everyone to reflect on their week; my 15five reports give my manager a different view of what’s going on and allow him to ask questions and make suggestions. Having a more experienced and detached eye that can look at the situation and provide feedback has been invaluable.

As a SaaS company, we also need our operational systems to be highly reliable. Our Director of Reliability functions as a spotter by helping TLs manage operational crises, conduct post mortems successfully, and implement remediations. For me, working with our reliability team has taught me how to think about the severity of issues and apply best practices from the rest of HubSpot to our Sidekick organization.

TLs are also given explicit opportunities to learn from one another. In addition to the daily work that spans teams (giving implicit opportunities to learn from peers), we have a weekly rotating TL lunch program. This gives us a platform to share problems and solutions with minds we might not get to work with every day. Beyond tapping into other TLs, we have another program that gives us the opportunity to lean on our senior executives from time to time (sometimes over dinner, drinks, or bowling). I’ve realized there’s nothing wrong with asking for help or looking outside yourself for guidance. In fact, it’s the only way to grow into an effective TL.

Becoming a great TL is a long-term investment. I’m still running into new problems and hard conversations all the time. That’s why the last thing I wish I had known a few years ago is that TLs, especially those just starting out, need to get comfortable with being uncomfortable. Being proactive about reaching out to mentors and learning as much as I could early on was in my control, but I learned just as much, if not more, from the things that weren’t. Instead of letting a mistake throw everything off course, it’s important to look at it as one more lesson that’ll make leadership come more naturally down the line.

This post was originally published on the HubSpot Engineering Blog.

Sep 3, 2013 - 2 minute read - ec2 tips

How to configure Linux networking for EC2 AMIs

At Hadapt we provision a lot of EC2 instances for development and test purposes. This gives us some unhappily deep experience with error cases around EC2 provisioning (at least, in us-east, where we provision most of our nodes).

One problem we have seen is nodes transitioning into a running state but not becoming available to SSH. This can be detected programmatically (or on the EC2 dashboard) as a node that fails the reachability check. If you are looping trying to reconnect via SSH, it might look something like this:

ssh: connect to host 23.20.165.232 port 22: Connection timed out

looping forever, whereas a normally operating host would transition from timing out to connection refused (IP address acquired but before SSH starts) to normal operation:

ssh: connect to host 107.21.164.85 port 22: Connection timed out
ssh: connect to host 107.21.164.85 port 22: Connection refused
Warning: Permanently added '107.21.164.85' (RSA) to the list of known hosts.

We’ve learned, through working with EC2 support, that this can happen if the new instance misses the DHCP offer from the local DHCP server and thus never acquires its IP address.

The Amazon Linux AMI uses the following configuration for /etc/sysconfig/network-scripts/ifcfg-eth0 to avoid this problem:

DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
TYPE=Ethernet
USERCTL=yes
PEERDNS=yes
IPV6INIT=no
PERSISTENT_DHCLIENT=yes

Notably, it disables the use of NetworkManager (which may cause an interface to be disabled) and configures dhclient to be persistent and retry until it gets a lease. It also disables IPv6.

Neither the CentOS public AMIs nor the Baston cloud-init-enabled CentOS AMIs use this configuration. If you are using those frequently and seeing SSH timeout issues, you may wish to re-capture those images with this networking configuration. If you are constructing your own images based on CentOS (e.g., using Packer), it would probably be a good idea to use this to configure network interfaces in your AMI.

Nov 23, 2012 - 7 minute read - scna craftsmanship conferences

SCNA 2012 summary

Software Craftsmanship North America is an annual conference bringing together software craftsmen—developers who are interested in improving their own ability to program. In his opening remarks at SCNA 2012, 8th Light co-founder Micah Martin described the conference as the “yearly attitude adjustment” for software craftsmen.

The speakers covered topics ranging from professional and product development, to engineering practices like testing and architecture, to theoretical CS concepts like monoids and logic programming. I have a complete-ish set of notes on my Flickr but here are some highlights.

Cory Foy talked about a model for teaching programmers (slides) that starts with work that has low context and low cognitive demand (such as katas and koans) and brings them up to doing work with high context and high cognitive demand (such as adding features and listening to code). This closely mirrors the 8th Light apprenticeship model. He also talked about how we need to learn to listen to the code and not try to force it to do things that it is not suited for; to listen requires understanding, to understand requires practice, and to practice requires context.

There were several discussions about apprenticeship. My sense is that 3 months is enough time to train people in basic craftsmanship suitable for basic web development (the equivalent of a semester, so maybe four courses’ worth). It obviously isn’t the ten thousand hours necessary to produce a master. The successes described also suggest that apprenticeship is not necessarily good at producing developers who can be hired by other companies. Of the 20 or so apprentices trained by apprentice.io (a program at Thoughtbot to try to commercialize apprenticeships), only one has actually been placed in an external company, despite over a hundred companies being interested in hiring out of the apprentice pool. On the other hand, they’ve hired about eight themselves. 8th Light has similarly grown many of its current 20+ craftsmen through its internal apprenticeship program.

8th Light has shared their internal syllabus for training craftsmen. Thoughtbot, the team behind apprentice.io, has also produced a set of basic trailmaps for learning basic techniques that the community can contribute to on GitHub.

I’m curious about adopting a more formal apprenticeship/mentoring program at places not primarily doing web app development, and in particular at systems-y companies like Hadapt (where time and money are limited) and VMware (where there is more existing training and resources are less scarce). Certainly, some of the basic skills and culture do need to be acquired, but so does the knowledge necessary to build a distributed query execution engine or a shadow page table walker.

Uncle Bob’s talk (video/summary) took a broader view. He argued that we need to behave professionally because, one day, some software glitch will result in lots of deaths (think Therac-25) and the world will demand an answer from the tech industry. If we don’t want government regulation, we had better behave professionally. As Uncle Bob put it, to be professional means that we do not ship shit. That we say no. That we work as a team. And that we learn continuously: Uncle Bob proposed spending upwards of 20 hours a week learning on our own time.

There were many talks about aspects of testing. Michael Feathers gave a talk that gently questioned some of one’s assumptions about testing by focusing on the value delivered by tests. He talked about, for example, deleting tests—if they no longer provide value. Tests can provide value in many ways: guiding the design of objects, detecting changes in behavior, acting as documentation, guiding acceptance criteria. The value of a test can change over time, and we should not over-venerate any specific test. He argued that it is more appropriate to set a time budget for testing.

Gary Bernhardt gave a beautiful talk about mixing functional programming and object-oriented programming. He noted that mocks and stubs cause tests to become isolated from reality, but that purely functional code does not require mocking: it always behaves the same way given the same inputs. Thus, he argued that code should be structured to have a functional core surrounded by a more imperative/OO shell that sequences the results with actions, a style he called “Faux-O”. By focusing on providing values (functional results), we free the computation from the execution model (for example, the way Java Callables can be plugged into a variety of ExecutorServices).
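
A rough sketch of that shape in Java (my own toy example, not code from the talk): the core is a pure function that is trivial to test without mocks, and the shell owns the IO and sequencing around it.

import java.util.List;
import java.util.stream.Collectors;

public class FauxO {

    // Functional core: a pure function, trivially testable without mocks or stubs.
    static List<String> formatNotifications(List<String> openedSubjects) {
        return openedSubjects.stream()
                .map(subject -> "Your email \"" + subject + "\" was opened")
                .collect(Collectors.toList());
    }

    // Imperative shell: gather input, call the core, perform the side effect.
    public static void main(String[] args) {
        List<String> opened = List.of("Q3 proposal", "Follow-up"); // stand-in for a real fetch
        formatNotifications(opened).forEach(System.out::println);
    }
}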

Justin Searls took a different tack to testing, bridging Michael and Gary’s talks in a sense. His big picture observation is that different kinds of testing deliver different amounts of reality and we should choose tests that give us the amount of reality we need. (He has a nice taxonomy of tests on his blog.) One takeaway from his talk is that we should adopt a standard for what kind of testing we do and stick to it: he liked the GOOS style of using isolation tests to guide design and more end-to-end acceptance tests to prove functionality, but listed a few others.

Drilling down into more specific tools/techniques, Brian Marick gave a talk about generating data for tests using logic programming, using an example in Clojure. His goal was to ensure that he only says as much about the data used for a test as is absolutely necessary for the test and to allow other aspects of that data to vary; this can be achieved by writing a logic program to state the test’s requirements and allowing the runtime to solve for the right data. In fact, you could imagine automatically testing all valid values that the logic program generated, instead of just one (much like Guava’s Collections test suite builder does more imperatively). We have explored this idea for system-level testing at both VMware and Hadapt, where it would be useful for tests to declare their dependencies on the system (e.g., requires a system configured in a particular way) and have the test framework automatically satisfy those dependencies in some way that the test does not care about. Logic programming would provide a way to bind the resulting dependencies to variables that could be used by the test.

Susan Potter gave a talk about monoids at a very theoretical level, but they have a practical impact on code expressiveness. A nice way to understand monoids is to see how monoids apply to FizzBuzz. At a more systems level, monoids are used by Twitter in their services stack to compose asynchronous results. As we develop tools at Hadapt for provisioning systems or manipulating internal plan trees, I expect to apply monoids to help ensure composable abstractions.
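
To make the FizzBuzz connection concrete, here is a toy Java sketch (my own, not from the talk): the labels form a monoid under string concatenation, with the empty string as the identity, so the Fizz and Buzz rules compose without nested conditionals.

import java.util.function.BinaryOperator;
import java.util.stream.IntStream;

public class FizzBuzzMonoid {

    // Strings under concatenation form a monoid: "" is the identity, concat is associative.
    static final String IDENTITY = "";
    static final BinaryOperator<String> COMBINE = String::concat;

    // Each rule contributes its label or the identity.
    static String rule(int n, int divisor, String label) {
        return n % divisor == 0 ? label : IDENTITY;
    }

    static String fizzBuzz(int n) {
        String combined = COMBINE.apply(rule(n, 3, "Fizz"), rule(n, 5, "Buzz"));
        return combined.isEmpty() ? Integer.toString(n) : combined;
    }

    public static void main(String[] args) {
        IntStream.rangeClosed(1, 15).mapToObj(FizzBuzzMonoid::fizzBuzz).forEach(System.out::println);
    }
}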

The last talk of the conference was by Leon Gersing and was a great motivational talk about personal development. You should watch it.

The talks took up only half the time at SCNA; the rest was networking with other developers, intermixed with fun activities like kata battles (wherein two developers race to complete a basic coding kata live on screen in front of the audience) and Jeopardy. There was also a refactoring kata fishbowl where I narrowly missed an opportunity to pair with Uncle Bob. While I got a lot of value from the talks, I wished there had been more time for pairing and working on code with the other developers there. On the last day, I got a tutorial from Randy Coulman, who has been programming in Smalltalk for 10 years, as he did the coin changer kata in Smalltalk. More explicit time for that sort of impromptu practice (not just chatting about work) would have made the conference even better.

Overall, SCNA was a great conference and I hope to be able to spend more time with software craftsmen in the future.

Nov 9, 2012 - 2 minute read - programming selfimprovement craftsmanship scna

Growing a Software Craftsman Engineering Organization

One of the hallmarks of a software craftsman is the desire to improve and hone one’s abilities. Certainly, this is one of the reasons that I am attending Software Craftsmanship North America (SCNA) this year. As a leader in an engineering organization, however, I am also curious about how to grow an engineering organization that is focused not only on delivering value, but on doing so in a way that values well-crafted software.

The population of people who are already craftsmen (outside of conferences such as this) is somewhat limited, so hiring solely craftsmen is not likely to be scalable. At the SCNA mixer last night, I heard two basic approaches to developing a team of craftsmen.

8th Light uses an apprenticeship model. 8th Light hires people in as apprentices: there is a clear understanding that an apprentice is learning about the craft and how they work at 8th Light. There is a good ratio of craftsmen to apprentices and everyone is invested in teaching and learning. During the apprentice period, the apprentice may be unpaid or paid at below-market rates as they finish training/learning. (I’m sure this is done in a fair way and everyone gets value from the arrangement.) What was surprising to me was that not only do they hire in experienced developers (who have self-selected as being interested in improving/learning), but they also hire people with aptitude but relatively little programming experience. Over the course of a year, these true apprentices grow into journeymen and craftsmen. It appears one successful model is to budget time and money for training up your own pipeline of craftsmen.

A second approach I heard about was through injection of a leader/manager who drove craftsmanship into the organization. I spoke with people at a financial services company and at a publishing company; in both cases, about a year ago, someone was brought in who drove the engineers in the direction of craftsmanship. Today, those teams practice TDD/BDD, watch Clean Coders videos to learn, and attend conferences like SCNA.

I hope over the next few days, and through continuing conversations afterwards, to get more insight into organizations that successfully balance the need for delivery with training their teams to deliver high quality code, and into the principles and tactics they use to transition to a high productivity state.

If you have any thoughts, please share them!

Sep 2, 2012 - 4 minute read - gradle hadoop cloudera maven

Developing Cloudera Applications with Gradle and Eclipse

This post is a translation (or knock-off) of Cloudera’s post on developing CDH applications with Maven and Eclipse, adapted for Gradle. It should help you get started using Gradle with Cloudera’s Hadoop. Hadapt makes significant use of Gradle for exactly this purpose.

Gradle is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Gradle is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Gradle project that will be able to build applications against CDH (Cloudera’s Distribution including Apache Hadoop) binaries.

Gradle projects are defined using a file called build.gradle, which describes things like the project’s dependencies on other modules, the build order, and any other plugins that the project uses. The complete build.gradle described below, which can be used with CDH, is available as a gist. Gradle’s build files are short and simple, combining the power of Apache Maven’s configuration by convention with the ability to customize that convention easily (and in enterprise-friendly ways).

The most basic Java project can be compiled with a simple build.gradle that contains the one line:

apply plugin: "java"

While optional, it is helpful to start off your build.gradle declaring project metadata as well:

// Set up group and version info for the artifact
group = "com.mycompany.hadoopproject"
version = "1.0"

Since we want to use this project for Hadoop development, we need to add some dependencies on the Hadoop libraries. Gradle resolves dependencies by downloading jar files from remote repositories. This must be configured, so we add both the Maven Central Repository (that contains useful things like JUnit) and the CDH repository. This is done in the build.gradle like this:

repositories {
    // Standard Maven 
    mavenCentral()
    maven {
        url "https://repository.cloudera.com/artifactory/cloudera-repos/"
    }
}

The second repository enables us to add a Hadoop dependency in the dependencies section. The first repository enables us to add a JUnit dependency.

dependencies {
    compile "org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.0.1"
    testCompile "junit:junit:4.8.2"
}

A project with the above dependency would compile against the CDH4 MapReduce v1 library. Cloudera provides a list of Maven artifacts included in CDH4 for finding HBase and other components.

Since Hadoop requires at least Java 1.6, we should also specify the compiler version for Gradle:

// Java version selection
sourceCompatibility = 1.6
targetCompatibility = 1.6

This gets us to a point where we’ve got a fully functional project, and we can build a jar by running gradle build.

In practice, it’s good to declare the version string as a property, since there is a high likelihood of dependencies on more than one artifact with the same version.

ext.hadoopVersion = "2.0.0-mr1-cdh4.0.1"
dependencies {
    compile "org.apache.hadoop:hadoop-client:${hadoopVersion}"
    testCompile "junit:junit:4.8.2"
}

Now, whenever we want to upgrade our code to a new CDH version, we only need to change the version string in one place.

Note that the configuration here produces a jar that does not contain the project dependencies within it. This is fine, so long as we only require Hadoop dependencies, since the Hadoop daemons will include all the Hadoop libraries in their own classpaths. If the Hadoop dependencies are not sufficient, it will be necessary to package the other dependencies into the jar. We can configure Gradle to package a jar with dependencies included by adding the following block:

// Emulate Maven shade plugin with a fat jar.
// http://docs.codehaus.org/display/GRADLE/Cookbook#Cookbook-Creatingafatjar
jar {
    from configurations.compile.collect { it.isDirectory() ? it : zipTree(it) }
}

Unfortunately, the jar now contains all the Hadoop libraries, which would conflict with the Hadoop daemons’ classpaths. We can indicate to Gradle that certain dependencies need to be downloaded for compilation but will be provided to the application at runtime, by moving the Hadoop dependencies into a provided configuration. The code then looks like this, with an added dependency on Guava:

// Provided configuration as suggested in GRADLE-784
configurations {
    provided
}
sourceSets {
    main {
        compileClasspath += configurations.provided
    }
}

ext.hadoopVersion = "2.0.0-mr1-cdh4.0.1"
dependencies {
    provided "org.apache.hadoop:hadoop-client:${hadoopVersion}"

    compile "com.google.guava:guava:11.0.2"

    testCompile "junit:junit:4.8.2"
}

Gradle also has integration with a number of IDEs, such as Eclipse and IntelliJ IDEA. The default integrations can be provided by adding

apply plugin: "eclipse"
apply plugin: "idea"

to add support for generating Eclipse .classpath and .project files and IntelliJ .iml files. The default build output locations may not be desirable, so we configure Eclipse as follows:

eclipse {
    // Ensure Eclipse build output appears in build directory
    classpath {
        defaultOutputDir = file("${buildDir}/eclipse-classes")
    }
}

For Eclipse, simply run gradle eclipse and then import the project into Eclipse. As you update/add dependencies, re-run gradle eclipse to update the .classpath file and refresh in Eclipse. Gradle automatically handles generating a classpath, including linking to source jars.

Recent versions of IntelliJ and the SpringSource Tool Suite also support direct import of Gradle projects. When using this integration, the apply plugin lines are not necessary.

Gradle represents a well-documented and powerful alternative to developing projects in Maven. While not without its quirks, I am significantly happier maintaining an enterprise build in Gradle at Hadapt, compared to the complex Maven build I maintained at VMware. Give it a try.

Jan 1, 2012 - 4 minute read - art computer science music programming selfimprovement tdd tools style

Let's improve our code

New Year’s is a good time to set intentions for the coming year. Many people come off the holidays with the intention to exercise more, but if you’re reading this blog, you’re probably a programmer (if you’re not, consider signing up for Code Year…), so let’s set an intention about our programming. But first, a musical interlude.

Earl Hines was a jazz pianist; in this 9 minute video, he describes how his early playing evolved.

As you watch it, notice how he not only describes and demonstrates how his style evolved, he also describes why. For example, he talks about how his melodic line was drowned out in the larger bands so he picks up playing in octaves (doubling up the notes).

In his TED talk, David Byrne generalizes the idea of environment influencing music by talking about how music has always evolved to fit the architecture in which it was performed: from how the ethereal sounds of early church music were driven by the open acoustics of churches to how the smaller rooms of the 18th and 19th centuries allowed for the more complex rhythms and patterns of classical music to be heard. (Watch it here.)

Can we as programmers reflect similarly about our programming styles? What influences the way our programs look? And more importantly, perhaps, why should we care?

For music, Byrne argues that the evolution of styles was driven by the needs of the audience and the acoustics of the performance hall. Understanding these consciously allows contemporary musicians to make more informed choices about what and how they perform.

As programmers, our programs must communicate: with the compiler, of course, so that it will render our code executable, but also with the human readers of our code, be that our future selves or our colleagues. So to write better programs—programs that communicate their intent more concisely and clearly, as opposed to those that execute more efficiently or that are more clever—we should consider what affects the structure and readability of the programs we write.

The frameworks and mechanisms available to us most obviously affect the structure of code. Write a program in a system based on callbacks, such as the async XML HTTP request that underlies AJAX, and you will find yourself with code that chains callbacks together, preserves state in various heap objects, and requires that callbacks be called from the right contexts to work properly. Write code for a threaded system and your code will have all manner of locks and constructs to control memory write visibility. Regular expressions can be called from Perl with the overhead of only m//, so it is easier to write text munging code in Perl than in almost any other language.
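
As a contrived Java illustration of how the mechanism shapes the code, compare a callback-chained version of “fetch then render” with a blocking version of the same logic:

import java.util.concurrent.CompletableFuture;

public class FetchAndRender {

    // Stand-in for an asynchronous HTTP fetch.
    static CompletableFuture<String> fetchAsync(String url) {
        return CompletableFuture.supplyAsync(() -> "<data from " + url + ">");
    }

    public static void main(String[] args) {
        // Callback style: the logic is spread across chained stages.
        fetchAsync("http://example.com")
                .thenApply(body -> "rendered: " + body)
                .thenAccept(System.out::println)
                .join();

        // Blocking style: the same logic reads top to bottom.
        String body = fetchAsync("http://example.com").join();
        System.out.println("rendered: " + body);
    }
}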

Our methodologies, tools, and processes—how we program—also determine how our code looks. Test-driven development will tend to produce stronger and more usable abstractions. Stream of consciousness programming results in a mess. Using an editor that supports refactoring patterns will make it more likely that you will refactor. Code review or pair programming will similarly result in code improvements, simply because you had to communicate while writing the code. (Even just commenting your code helps in this regard.) The end result of these practices is code that is more understandable.

Our audience (that is, our teammates) also affects our code. This is the role of engineering culture. What will your teammates accept versus some ideal? To get code committed to the Linux kernel requires detailed commit messages, a well structured patch series and surviving code review on the kernel mailing list. To get code committed to your personal project requires nothing outside of what you ask of yourself.

We have control over these factors. We can vary our tools, our practices, our choice of frameworks, and influence our team culture. If we are framework or API developers, we can consciously evaluate what code we induce our users to write and improve on what we provide to simplify their lives, and facilitate their communication and self-expression.

This year, let’s set an intention to examine our code and improve how it reads. Let’s experiment and play with the factors under our control to see which choices work better for our teams. Ask your teammates whether one way or another works better for them. Spend some time analyzing your own code and consider how it got that way.

I’ll try to share some of what I learn from my team at Hadapt and I’m curious to hear what you learn from yours.

Dec 5, 2011 - 5 minute read - git mercurial

Git is more usable than Mercurial

Once upon a time, I used Mercurial for development. When I moved to VMware, people there seemed to favor Git and so I spent the past few years learning Git and helping to evangelize its use within VMware. I have written about why I chose Mercurial, as well as my initial reactions upon starting to use Git. Hadapt happens to be using Mercurial today and so I have been re-visiting Git and Mercurial.

What I wrote about Git and Mercurial in 2008 is still true: Git and Mercurial are similar in many respects—for example, you can represent the same commit graph structure in both—and they are both certainly better than Subversion and CVS. However, there are a lot of differences to appreciate in terms of user experience that I am now in a better position to evaluate.

In using Mercurial, I find myself oddly hobbled in my ability to do things. At first, I thought that this might simply be because some things are done differently in Mercurial, but at this point I think that Git’s design and attention to detail make it genuinely more usable than Mercurial.

There are three “philosophical” distinctions that are in Git’s favor:

  1. Git has one branching model. Mercurial has several that have evolved over time; Steve Losh has a comprehensive essay describing ways to branch in Mercurial. The effect of this is that different Mercurial users branch in different ways and the different styles don’t really mix well in one repo. Git users, once they learn how branching works, are unlikely to be confused by branches.

  2. Git has names (refs) that don’t change unexpectedly. Every Git commit you care about has a name that you can choose. Some Mercurial commits that you might care about do not have a name. For example, the default branch in Mercurial can have multiple heads, so it interprets -r default as the tip-most commit. Unfortunately, that commit will vary depending on who has committed what to which head (and when you see it).

Further, Git exposes relative naming by allowing you to refer to the branches in remote repositories by name, without affecting your own names.

Putting this together, consider what happens after you pull in Mercurial. Your last commit used to be called default but after the pull, default is something from the upstream. Your commit is a separate head that now has no name. In Git, your master doesn’t move after a fetch and the remote’s branch is called origin/master.

Git even tracks changes to what commit each name refers to in a reflog. You can easily refer to things that the name used to refer to. In Mercurial, branch names don’t have reliable meanings, and it doesn’t track them.

  3. Git commands operate in a local context by default. Mercurial commands often operate on a repository context. For example, git grep operates on the current sub-directory of your work tree, while hg grep operates on your history. The Git analog of hg grep is using the log pick-axe; the Mercurial analog of git grep is to use ack, or if you must, something like hg grep -r reverse(::.) pattern . (Seriously?)

Another example is the log command. Git’s log command shows you the history of the commit you are on right now. Mercurial’s log command shows you something about the whole repository unless you restrict with some combination of -b and -f. Combined with Mercurial’s way of resolving branch names to commits, it becomes very difficult to use hg log to compare two heads or explore what has changed in another head of the same branch.

More often than not, I care about things in the tree I am currently working on, not how things are in some random other branch, and Mercurial makes it hard to focus on that.

There are other usability issues that I’ve found that are more detail-oriented than philosophical. I’ll note a few here.

hg log doesn’t display the full text of the commit message unless you hg log --debug. This is an unfortunate disincentive to writing good commit messages.

hg log -p doesn’t pay as much attention to merge commits as Git does; the help for hg log reads:

> log -p/--patch may generate unexpected diff output for merge changesets, as it will only compare the merge changeset against its first parent. Also, only files different from BOTH parents will appear in files:.

git log, by contrast, has a variety of options to control how the merge diff is displayed, including showing diffs to both parents, removing “uninteresting” changes that did not conflict, or showing the full merge against either just the first parent or all parents of the merge commit.

Both Mercurial and Git have lots of configurable options; Git has a thin veneer over editing a config file in the form of the git config sub-command. Mercurial involves editing a file even if you are just setting up your initial username or enabling extensions. I often wound up editing Git config files directly, but having the commands was nice for sharing instructions with others.

Git’s native support for working with patches is better. Mercurial supports e-mailing and applying patches, but oddly, the extension for sending out patches (patchbomb) is built in while the extension for importing from an mbox (mbox) is not. There’s no direct analog of git apply; instead you have to use a patch queue. Patch queues are okay, but branches and well-integrated rebase/e-mail/apply support are much nicer than patch queues: you don’t need to manually find some .hg/patches/series file and edit it to re-order stuff.

I could write more and indeed many people have written about Git and Mercurial—you can explore my bookmarks about git for some of the better ones. Let me close here with three interesting features in Mercurial 2.0:

  • the new largefiles extension allows users to not transfer large files down until they are needed;
  • subrepos can be Git or Subversion in addition to Mercurial;
  • revsets allow you to search your history in very flexible ways.

Overall, I feel that Git is significantly more usable for day-to-day development than Mercurial. I’d be curious to hear if you think the opposite is true.

Nov 6, 2011 - 2 minute read - Personal career hadapt transition vmware

A new adventure

Friday, 4 November, was my last day at VMware.

I started at VMware in 2008, working on a project that has now become VMware’s Horizon Mobile. Last year, I switched to working on the latest release of VMware’s vCloud Director.

VMware has a lot going for it as a place to work.

It wasn’t an easy choice to leave.

Last month, I became aware that a startup in the big data space was moving to Boston. I’d been wondering about life outside VMware and this opportunity seemed just about perfect. So I’m beginning a new adventure at Hadapt. As an early employee, I imagine I’ll be doing a little bit of everything. I hope to combine the skills and knowledge I built up from my graduate work and the practical experience of delivering enterprise software at VMware to help Hadapt build a powerful, scalable, data analytics platform and make Hadapt a successful company.

I’m excited to get started and I hope to share here with you some of my experiences as I go.

Oct 20, 2011 - 2 minute read - Hacking e-mail git gradle make reproducibility workflow

Rules for Development Happiness

Inspired by Alex Payne’s Rules for Computing Happiness, some rules for having happy developers and being happy as a developer.

  • Use version control. (See The Joel Test.) In particular, use a distributed version control system (like Mercurial or Git). This ensures you can commit offline and also conduct code archaeology offline.
  • Have a correct and fast incremental build (e.g., non-recursive Make or Gradle) so you are not left waiting on slow full rebuilds.
  • Have a system for testing your changes in a safe environment prior to code submission.
  • Avoid dependencies on system tools. Different developers tend to have different systems and hence different versions of tools.
  • Be able to work offline. Offline may mean when you’re on a plane, but it may also happen when the office network goes down. Both happen. Notably, the latter happens even when you work on a desktop with a wired connection. (It’s been pointed out to me that the network going down can be a good team-building experience.)
    • Be able to build offline. That means having all build dependencies cached locally.
    • Have all e-mail cached locally. Don’t be unable to find those key instructions someone mailed you just because GMail is restoring your mail from tape. Helpful tools here are isync or offlineimap. Index your mail with mu. (Or configure Thunderbird/Apple Mail/etc to keep everything offline.)
    • Be able to send mail offline; e.g., have it queue locally for delivery when the network comes back. But, make sure you keep a copy locally in case the hotel’s WiFi is transparently re-directing out-bound SMTP connections to /dev/null. (This really happened to me.)
    • Have other documentation cached locally. (Use something like gollum for your wiki.)
    • If you work somewhere with a shared-storage home directory, make sure you can login when the network is down!