Thoughts on Systems

Emil Sit

Feb 15, 2009 - 4 minute read - Research chord dhash dht history sfs sfslite

The afterlife of systems research code

When a graduate student completes their PhD, the software they wrote for that degree begins an almost inexorable decline into obscurity. Other things become more important and, unless that code serves as a platform for further research or has spun off into a startup, there is precious little time to maintain and further develop code written for the degree. The code has already served a purpose—namely, advancing the state of the art in the form of research papers and a dissertation.

For example, one student I know wanted to test a new Byzantine Fault Tolerant system against an earlier one and was told that the old one only worked on one specific machine in the corner of lab (running some ancient release of RedHat). Another student in my group was trying to compare file system synchronization tools and found that the published source for some tools was hopelessly incompatible with modern compilers and libraries; he wound up finding some old RedHat isos and testing in a virtual machine.

I’m thinking about this today in the context of some recent activity on the Chord mailing list about code I wrote for my PhD. The entirety of my time as a graduate student was spent hacking on the PDOS Chord/DHash implementation, which has served as the reference implementation of Chord in numerous papers. While I hesitate to declare this implementation dead, it has become clear from numerous mailing list posting that it will be increasingly hard for others to use the implementation if it is not more actively maintained.

Chord and DHash are implemented using the libasync C++ library, an obscure and mostly undocumented creation of Prof. David Mazieres, who wrote it as part of his thesis on a Self-Certifying File System (SFS) ten years ago, back when C++ compilers sucked and the STL didn’t work very portably. libasync offers a lot of nice features such as managing TCP buffering for nonblocking I/O and a comprehensive framework for writing user-level NFS servers; this came in handy for early Chord applications such as the Cooperative File System. As a result, the libasync toolkit was adopted by many PDOS research projects.

Unfortunately, as Prof. Mazieres moved on to other research projects, SFS releases and the libasync codebase languished. The libasync library was fortunately adopted by Max Krohn in the form of sfslite. I switched Chord to sfslite, so that it would be possible to point people at a real release instead of asking them to checkout some recent version of SFS from its (now defunct?) CVS server.

As part of sfslite, Max went on to develop tame—a language designed to simplify writing event driven programs with callbacks using libasync, and as part of my thesis, I started to write code in tame to help test it. When sfslite 1.x was released, tame had evolved from its v1 to v2 which changed the syntax. Since I never had the time to take the potential stability hit involved in upgrading, Chord still relies on sfslite 0.8, which Max no longer has time to maintain.

Which brings us back to recent activity on the chord mailing list: as people try to build Chord with ever newer compilers and Linux versions, they run into problems finding a version of sfslite and Chord that build successfully. I still don’t have the time to update the Chord tree to use newer versions of sfslite (though it might not be too hard), nor do I have access to all the resources I used to test it in deployment. So, people are left to patch their own source trees and cobble together working bits.

I wish I had better news for people interested in Chord—there are a lot of useful lessons embedded in our implementation, and many reimplementations that might benefit from those lessons. Perhaps one day soon I will try to write up some of those implementation lessons. Or maybe someone would be willing to take over maintaining Chord. If you volunteer, I’d be happy to help you get started!