Systems Researchers: Mike Freedman

Mike Freedman and I have known each other since we were Masters students at MIT, working on things like the Tarzan anonymizing network (a parallel, pre-cursor to Tor). He went on to build the hugely successful (“as seen on Slashdot”) Coral content distribution network, which figured largely in his dissertation. It’s a great treat to have him talk here about how Coral was built and deployed. Be sure also to check out his research group blog for more interesting thoughts from him and his students!

What did you build?
CoralCDN is a semi-open content distribution network. Our stated goal with CoralCDN was to “democratize content publication”: namely, allow websites to scale by demand, without requiring special provisioning or commercial CDNs to provide the “heavy lifting” of serving their bits. Publishing with CoralCDN is as easy as slightly modifying a URL to include .nyud.net in its hostname (e.g., http://www.cnn.com.nyud.net/), and the content is subsequently requested from and served by CoralCDN’s network of caching web proxies.

Our initial goal for deploying CoralCDN was a network of volunteer sites that would cooperate to provide such “automated mirroring” functionality, much like sites do somewhat manually with open-source software distribution. As we progressed, I also imagined that small groups of users could also cooperate in a form of time-sharing for network bandwidth: they each would provide some relatively constant amount of upload capacity, with the goal of being able to then handle any sudden spikes (from the Slashdot effect, for example) to any participant. This model fits well with how 95th-percentile billing works for hosting and network providers, as it then becomes very important to flatten out bandwidth spikes. We started a deployment of CoralCDN on PlanetLab, although it never really migrated off that network. (We did have several hundred users, and even some major Internet exchange points, initially contact us to run CoralCDN nodes, but we didn’t go down that path, both for manageability and security reasons.)

CoralCDN consists of three main components, all written from scratch: a special-purpose web proxy, nameserver, and distributed hash table (DHT) indexing node. CoralCDN’s proxy and nameserver are what they sound like, although they have some differences given that they are specially designed for our setting. The proxy has a number of design choices and optimizations well-suited for interacting with websites that are on their last legs—CoralCDN is designed for dealing with “Slashdotted” websites, after all—as well as being part of a big cooperative caching network. The nameserver, on the other hand, is designed to dynamically synthesize DNS names (of the form .nyud.net), provide some locality and load balancing properties when selecting proxies (address records it returns), and ensure that the returned proxies are actually alive (as the proxy network itself is comprised of unreliable servers). The indexing node forms a DHT-based structured routing and lookup structure that exposes a put/get interface for finding other proxies caching a particular web object. Coral’s indexing layer differs from traditional DHTs (such as MIT’s Chord/DHash) in that it creates a hierarchy of locality-based clusters, each which maintains a separate DHT routing structure and put/get table, and it provides weaker consistency properties within each DHT structure. These latter guarantees are possibly because Coral only needs to find some proxies (preferably nearby ones) caching a particular piece of content, not all such proxies.

Tell us about what you built it with.
CoralCDN is built in C++ using David Maziereslibasync and libarpc libraries, originally built for the Self-certifying File System (SFS). This came out of my own academic roots in MIT’s PDOS group, where SFS was developed by David and its libraries are widely used. (David was my PhD advisor at NYU/Stanford, and I got my MEng degree in PDOS.) Some of the HTTP parsing libraries used by CoralCDN’s web proxy were from OKWS, Max Krohn’s webserver also written using SFS libraries. Max was research staff with David at NYU during part of the time I was there. It’s always great to use libraries written by people you know and can bother when you find a bug (although for those two, that was a rare occurrence indeed!).

When I started building CoralCDN in late 2002, I initially attempted to build its hierarchical indexing layer on top of the MIT Chord/DHash implementation, which also used SFS libraries. This turned out to be a mistake (dare I say nightmare?), as there was a layering mismatch between the two systems: I wanted to build distinct, localized DHT clusters in a certain way, while Chord/DHash sought to build a single, robust, global system. It was thus rather promiscuous in maintaining group membership, and I was really fighting the way it was designed. Plus, MIT Chord was still research-quality code at the time, so bugs naturally existed, and it was really difficult to debug the resulting system with large portions of complex, distributed systems code that I hadn’t written myself. Finally, we initially thought that the “web proxy” part of the system would be really simple, so our original proxy implementation was just in python. CoralCDN’s first implementation was scrapped after about 6 months of work, and I restarted by writing my own DHT layer and proxy (in C++ now) from scratch. It turns out that the web proxy has actually become the largest code base of the three, continually expanded during the system’s deployment to add security, bandwidth shaping and fair-sharing, and various other robustness mechanisms.

Anyway, back to development libraries. I think the SFS libraries provide a powerful development library that makes it easy to build flexible, robust, fast distributed services…provided that one spends time overcoming their higher learning curve. Once you learn them, however, they make it really easy to program in an event-based style, and the RPC libraries prevent many of the silly bugs normally associated with writing your own networking protocols. I think Max’s tame libraries significantly improve the readability and (hopefully) lessen the learning curve of doing such event-based programming, as tame removes the “stack-ripping” that one normally sees associated with events. Perhaps I’ll use tame in future projects, but as I’ve already climbed the learning curve of libasync myself, I haven’t yet.

That said, one of my PhD students at Princeton, Jeff Terrace, is building a high-throughput, strongly-consistent object-based (key/value) storage system called CRAQ using tame. He’s seemed to really like it.

How did you test your system for correctness?
I deploy it? In seriousness, it’s very difficult to test web proxies, especially ones deployed in chaotic environments and interacting with poorly-behaving clients and servers.

I did most of my testing during initial closed experiments on about 150-300 PlanetLab servers, which is a distributed testbed deployed at a few hundred universities and other institutions that each operate two or more servers. Testing that the DHT “basically” worked was relatively easy: see if you actually get() what you put(). There are a lot of corner cases here, however, especially when one encounters weird network conditions, some of which only became apparent after we moved Coral from the network-friendly North American and European nodes to those PlanetLab servers in China, India, and Australia. Always be suspicious with systems papers that describe the authors’ “wide deployment” on “selected” (i.e., cherry-picked) U.S. PlanetLab servers.

Much of the testing was just writing the appropriate level of debug information so we could trace requests through the system. I got really tired of staring at routing table dumps at that time. Last year I worked with Rodrigo Fonseca to integrate X-Trace into CoralCDN, which would have made it significantly easier to trace transactions through the DHT and the proxy network. I’m pretty excited about such tools for debugging and monitoring distributed systems in a fine-grained fashion.

Testing all the corner cases for the proxy turned out to be another level of frustration. There’s really no good way to completely debug these systems without rolling them out into production deployments, because there’s no good suite of possible test cases: The potential “space” of inputs is effectively unlimited. You constantly run into clients and servers which completely break the HTTP spec, and you just need to write your server to deal with these appropriately. Writing a proxy thus because a little bit of learning to “guess” what developers mean. I think this actually has become worse with time. Your basic browser (FireFox, IE) or standard webserver (Apache) is going to be quite spec-compliant. The problem is that you now have random developers writing client software (like podcasters, RSS readers, etc.) or generating Ajax-y XmlHttpRequest’s. Or casual developers dynamically generating HTTP on the server-side via some scripting language like PHP. Because who needs to generate vaguely spec-compliant HTTP if you are writing both the client and server? (Hint: there might be a middlebox on path.) And as it continues to become even easier to write Web services, you’ll probably continue to see lots of messy inputs and outputs from both sides.

So while I originally tested CoralCDN using its own controlled PlanetLab experiments, after the system went live, I started testing new versions by just pushing them out to one or a few nodes in the live deployment. Then I just monitor these new versions carefully and, if things seemed to work, slowly push them out across the entire network. Coral nodes include a shared secret in their packet headers, which excludes random people from joining our deployment. I also use these shared secrets to deploy new (non-backwards-compatible) versions of the software, as the new version (with a new secret nonce) won’t link up with DHT nodes belonging to previous versions.

How did you deploy your system? How big of a deployment?
CoralCDN has been running 24/7 on 200-400 PlanetLab servers since March 2004. I manage the network using AppManager, built by Ryan Huebsch from Berkeley, which provides a SQL server that keeps a record of current node run state, requested run state, install state, etc. So AppManager gives me a Web interface to control the desired runstate of nodes, then all nodes “call home” to the AppManager server to determine updated runstate. You write a bunch of shell scripts to actually use these run states to start or stop nodes, manage logs, etc. This “bunch of shell scripts” eventually grew to be about 3000 lines of bash, which was somewhat unexpected. While AppManager is a single server (although nodes are configured with a backup host for failover), CoralCDN’s scripts are designed for nodes to “fail same”. That is, requested runstate is stored durably on each node, so if the management server is offline or returns erroneous data (which it has in the past), the nodes will maintain their last requested runstate until the management server comes back online and provides a valid status update.

How did you evaluate your system?
We performed all the experiments one might expect in an academic evaluation on an initial test deployment on PlanetLab. Our NSDI ‘04 paper discusses these experiments.

After that stage, CoralCDN just runs — people continue to use it, so it provides some useful functionality. My interest transitioned from providing great service to just keeping it running (while I moved onto other research).

I probably spend about 10 minutes a week “keeping CoralCDN running”, which is typically spent answering abuse complaints, rather than actually managing the system. This is largely because the system’s algorithms were designed to be completely self-organizing — as we initially thought of CoralCDN as a peer-to-peer system — as opposed to a centrally-managed system designed for PlanetLab. System membership, fault detection and recovery, etc., is all completely automated.

Unfortunately, dynamic membership and failover doesn’t extend to the primary nameservers we have registered for .nyud.net with the .net gTLD servers. These 10-12 nameservers also run on PlanetLab servers, so if one of these servers go offline, our users experience bad DNS timeouts until I manually remove that server from the list registered with Network Solutions. (PlanetLab doesn’t provide any IP-layer virtualization that would allow us to failover to alternate physical servers without modifying the registered IP addresses.) And I have to admit I’m pretty lazy about updating the DNS registry, especially given the rather painful web UI that Network Solution provides. (In fairness, the UIs for GoDaddy and other registrars I’ve used are similarly painful). I think registrars should really provide a programmatic API for updating entries, but haven’t found one for low-cost registrars yet. Anyway, offline nameservers are probably the biggest performance problem with CoralCDN, and probably the main reason it seems slow at times. This is partly a choice I made, however, in not becoming a vigilant system operator for which managing CoralCDN becomes a full-time job.

There’s a lesson to be had here, I think, for academic systems that somehow “escape the lab” but don’t become commercial services: either promote lessened expectations for your users (and accept that reality yourself), build up a full-time developer/operations staff (a funding quandary), or expect the project to soon die-off after its initial developers lose interest or incentives.

Anything you’d like to add?
My research group actually just launched a blog. In the next few weeks, I’ll be writing a series about some of the lessons I’ve learned from building and deploying CoralCDN. I want to invite all your readers to look out for those posts and really welcome any comments or discussion around them.

Systems Researchers: Justin Cappos

Justin Cappos received his PhD from the University of Arizona under the supervision of John Hartman. I met Justin several years ago at a PlanetLab Consortium meeting when he was starting to work on Stork, a system to simplify package deployment. He is currently a Post Doc at the University of Washington working with Tom Anderson and Arvind Krishnamurthy.

What did you build?
The most relevant / longest term projects are: Stork, San Fermin, and Seattle.

Stork is a package manager that has security and functionality improvements over existing Linux package managers. Some of the advances we made in Stork have been adapted by APT, YUM, YaST and other popular Linux package managers.

San Fermin is a system for aggregating large quantities of data from computers. San Fermin provides the result faster and with better fault tolerance than existing techniques.

Seattle is an educational testbed built from resources donated by universities all around the world. The universities run a safe, lightweight VM that students from other universities can run code in.

Tell us about what you built it with.
I used Java for San Fermin because I needed to leverage existing Pastry code that was in Java. I used Python for Stork and Seattle. I found Python to be far superior for large projects (other languages I’ve used are Java, C, C++, QBASIC, and Pascal). Python has been a dream come true because it’s great for prototyping, easy to learn, and the resulting code is readable (so long as you have sensible style constraints on the written code).

Perhaps the most useful thing is getting other developers involved. I like to do the initial prototyping myself, but after that it is great to have others helping out.

How did you test your system for correctness?
There are whitebox / unit tests as well as blackbox / integration tests for most parts of the systems. The time that we spent building thorough test cases really paid off because it simplifies debugging.

I like to use my systems in real world environments with outside users, so the system is never done thus correctness is an iterative process. If I’m aware of bugs, we fix them. If I’m not aware of bugs, we’re adding features based upon user requests (and therefore may be adding more bugs for us to fix later). In general, the fact that we have users that rely on the software over long time periods is a testament to its stability which is related to correctness.

To more specifically answer the question you are really asking, I usually run my code by hand and evaluate it in small / constrained environments (turning these test runs into unit tests). I also follow a philosophy where I try to “detect and fail” as much as possible. I care more about correctness than performance (at least initially) and so add many redundant checks in my code to catch errors as soon as possible.

I find that if I’m careful and thorough when writing my code, I spend very little time debugging. I probably spend about 30% of the time writing code, 40% writing comments / docs (which I normally write before / during coding), 20% of the time writing test code for individual modules, and about 10% debugging after the fact. I think part of this is I’m really careful about checking input and boundary conditions and so I can normally pin-point the exact cause of a failure.

In terms of problems when writing code, I generally only use standard libraries and code I’ve written. I don’t depend on third party code because I don’t know what level of support it will have. Also, since I usually code in Python, it’s easy for me to add functionality to do whatever I need.

How did you deploy your system?  How big of a deployment?
Stork has been deployed on PlanetLab for about 6 years. For the majority of the time we’ve been deployed on every working PlanetLab node. Stork has managed > 500K VMs and when I last checked was used daily by users on around two dozen sites. We initially used AppManager to deploy Stork, but since have been using PlanetLab’s initscript functionality.

San Fermin was deployed on PlanetLab for use in combining and managing logs from Stork. However, we found that San Fermin was dysfunctional due to difficulties in getting Pastry to start reliably and work when non-transitive connectivity occurs. As a result, we mainly ended up using San Fermin as a publication vehicle.

Seattle is currently deployed on more than 1100 computers around the world. We have done our initial deployment by encouraging educators to use our platform in networking and distributed systems classes. Our longer term plans involve using Seattle to build a research testbed of 1 million nodes.

How did you evaluate your system?
With San Fermin (and other systems research I haven’t mentioned), there is fairly clear related work to compare against. In some cases, the biggest challenge has been getting the existing research prototype code from another project to run well enough for comparison.

For the work that I’ve done where I’ve focused on impact over publication impact (Seattle / Stork), evaluation is much more difficult because these aren’t incremental improvements over existing models and systems. These systems break the mold in terms of security and / or functionality so sometimes it’s difficult to know how to compare them to existing work.

Systems Researchers Interview Series

After my last post on implementing Chord, I thought it might be insightful and educational to examine how others are building research distributed systems. I have asked a number of colleagues who have built successful research systems to answer a few questions about their implementations. Here, successful roughly means that the system has seen some broad use or adoption, or (and?) has been presented at a notable peer-reviewed conference. The questions to be answered:

  • What did you build?
  • Tell us about what you built it with. For example, what languages or 3rd party tools/libraries did you use? Why? What was your experience with them?
  • How did you test your system for correctness?
  • How did you deploy your system? How big of a deployment?
  • How did you evaluate your system?
  • Anything you’d like to add?

A few folks have agreed to participate; as I post their responses, I will update this post with their names and links to the interviews. If you are interested in participating, just send me an e-mail!

Participants (last updated 25 Mar 2009):

On implementing Chord

The Chord protocol dynamically constructs robust and scalable overlay networks that map a given key to an active node. The MIT PDOS Chord implementation has served as a reference implementation of Chord, and over the years has accumulated many tweaks and improvements. While the theoretical highlights have largely been documented in our publications, the implementation details may have been omitted. Given the relative multitude of re-implementations I know of, it seemed worthwhile to try and identify some of these lessons instead of leaving them to bit-rot inside our implementation. A basic understanding of Chord is assumed. It is also important to realize that MIT Chord was largely run over the wide-area, not in a data-center.

MIT Chord is implemented in C++ using libasync. We did not like Java because (at least a few years ago) PlanetLab did not make it easy to push a JRE to all your nodes and would also kill processes with large memory footprints (which seemed to often be Java VMs). So, using C++ made deployment to PlanetLab easier and the code more efficient. Today, I might use the STL and select bits from Boost, which would make the implementation accessible to a broader audience. I’d probably also use threads.

Because of its libasync roots, MIT Chord uses XDR and SunRPC for its inter-node communcation. It would be better today to use something like Protocol Buffers or Thrift; the front-end should definitely support XML-RPC for maximum ease-of-application development. Other than NFS, basically no one uses SunRPC.

Chord uses its own transport called the Striped Transport Protocol (STP). Read the section, think about the problem, decide if STP is right for you. For actual data transfers, depending on your object size, it may be simpler to go with TCP. UDP without flow-control may be just fine for the routing RPCs, though managing retransmission timers and mapping it to node liveness will be lots of fun. Also read about Accordion’s approach to this issue.

Much of a DHT implementation revolves around keeping track of other nodes (e.g., your neighbors). The Chord paper describes the abstract successor list and finger table; our implementation keeps a bucket of nodes (the locationtable) that calculates successors and fingers on the fly from this table. The Chord stabilization protocol runs periodically to keep this bucket filled with relevant live nodes. This approach also let us easily implement Accordion in our codebase. A single reference-counted object is used to proxy remote nodes: this keeps all state about a remote node consistent, even if it is part of multiple in-flight operations.

Nodes often fail and MIT Chord works hard to stop using failed nodes, despite some Chord operations that pass node lists around, by tagging the node wire representation with the time since the node was last contacted directly. If node A has discovered for itself that node B has failed, it will ignore any mention of B from its neighbors that have not directly heard from B more recently. (This doesn’t quite work right in the case of asymmetrically reachable nodes, but it is unclear what to do with such nodes anyway.)

Definitely include tools to view a live node’s state. If you’re feeling fancy, your nodes will speak HTTP on a special port and produce pretty graphics (MIT Chord doesn’t). This will significantly improve your ability to debug weird routing issues (remotely).

These thoughts are a bit haphazard but hopefully they contain some tidbits of interest. I’ve actually had relatively little interaction with other implementors of Chord so if you’re one of them, I’m curious how your experiences compare.

Mendel Rosenblum on Virtualization in Modern Computing Environments

Today, VMware co-founder Mendel Rosenblum gave a Deuterzos Lecture at MIT titled The Impact of Virtualization on Modern Computing Environments. His talk outlined the general function of virtualization as an interposing layer between the hardware and the operating system, and the evolution of functionality in this layer over the past decade—in short, the talk was largely a super high-level summary of VMware’s greatest hits, from server integration to VMotion to high-availability via record replay to virtual appliances. Overall, I was not particularly impressed, though I thought it was well delivered and probably educational for a non-VMware-employee audience.

The most interesting observation Mendel made was in regarding the space between packaged software and software-as-a-service. In the former, the user is responsible for everything outside of software development: from hosting, to configuration, to maintenance. In the latter, everything other than the actual use is handled by the developers. The in-between space of deployment is enabled by virtualization: some of the maintenance/configuration can be handled by virtual appliances, and with the advent of vClouds and virtual data centers, you can transparently host things either internally or externally.

Anant Agarwal closed the QA section by asking Mendel to take his years of successful academic and industry work as a single sentence of advice. To paraphrase, Mendel said to jump on opportunities that present themselves, for even if you land badly, most people find that the experience was worthwhile.

Mobile virtualization demonstration at VMworld

At today’s VMworld keynote, CTO Steve Herrod included a brief demonstration of the project that I have been working on at VMware since last May: the Mobile Virtualization Platform (MVP). One of the downsides of working in industry as opposed to academia is that you have to wait for big release dates such as this before being able to talk about your work, so I’m excited to have the opportunity to blog a little bit about MVP today.

Steve chatted with Jerry Chen, who showed a Nokia N800 running the MVP bare-metal hypervisor which was hosting a Windows CE guest and an Android guest. The relevant part of the keynote, including the N800 loading and launching the Android VM can be found on Vimeo. The demo was also shown in a booth: Senior R&D Director Julia Austin walked through the demo for ITPro who has posted a 3 minute video of the full demo, which I’ve embedded at the end of this post.

Running Android and WinCE on an N800 is pretty cool in and of itself, but our demo also highlights one of the key use cases for mobile virtualization, namely the separation of work and personal settings. With your “work phone” as a virtual machine, it will be possible to easily backup and transfer your work settings from one phone to another, while keeping these settings cleanly partitioned from your personal settings. Similarly, when you upgrade a device to the next generation, you will be able to port your phone over with relatively little hassle.

By providing a stable virtual platform, MVP will also hopefully simplify application development by allowing developers to target MVP instead of a myriad different devices. Defining the virtual platform and mapping it into different physical platforms is an interesting challenge.

Some of the other challenges and benefits of virtualization in the mobile space are discussed in an eWeek article by one of the founders of VirtualLogix, a competitor in this space. At VMware, we are working hard on addressing these challenges with the aim of providing benefits to all stakeholders, from phone OEMs to enterprises to end-users. Hopefully, you’ll be able to see the results of this work in the palm of your hand in the near future!

The afterlife of systems research code

When a graduate student completes their PhD, the software they wrote for that degree begins an almost inexorable decline into obscurity. Other things become more important and, unless that code serves as a platform for further research or has spun off into a startup, there is precious little time to maintain and further develop code written for the degree. The code has already served a purpose—namely, advancing the state of the art in the form of research papers and a dissertation.

For example, one student I know wanted to test a new Byzantine Fault Tolerant system against an earlier one and was told that the old one only worked on one specific machine in the corner of lab (running some ancient release of RedHat). Another student in my group was trying to compare file system synchronization tools and found that the published source for some tools was hopelessly incompatible with modern compilers and libraries; he wound up finding some old RedHat isos and testing in a virtual machine.

I’m thinking about this today in the context of some recent activity on the Chord mailing list about code I wrote for my PhD. The entirety of my time as a graduate student was spent hacking on the PDOS Chord/DHash implementation, which has served as the reference implementation of Chord in numerous papers. While I hesitate to declare this implementation dead, it has become clear from numerous mailing list posting that it will be increasingly hard for others to use the implementation if it is not more actively maintained.

Chord and DHash are implemented using the libasync C++ library, an obscure and mostly undocumented creation of Prof. David Mazieres, who wrote it as part of his thesis on a Self-Certifying File System (SFS) ten years ago, back when C++ compilers sucked and the STL didn’t work very portably. libasync offers a lot of nice features such as managing TCP buffering for nonblocking I/O and a comprehensive framework for writing user-level NFS servers; this came in handy for early Chord applications such as the Cooperative File System. As a result, the libasync toolkit was adopted by many PDOS research projects.

Unfortunately, as Prof. Mazieres moved on to other research projects, SFS releases and the libasync codebase languished. The libasync library was fortunately adopted by Max Krohn in the form of sfslite. I switched Chord to sfslite, so that it would be possible to point people at a real release instead of asking them to checkout some recent version of SFS from its (now defunct?) CVS server.

As part of sfslite, Max went on to develop tame—a language designed to simplify writing event driven programs with callbacks using libasync, and as part of my thesis, I started to write code in tame to help test it. When sfslite 1.x was released, tame had evolved from its v1 to v2 which changed the syntax. Since I never had the time to take the potential stability hit involved in upgrading, Chord still relies on sfslite 0.8, which Max no longer has time to maintain.

Which brings us back to recent activity on the chord mailing list: as people try to build Chord with ever newer compilers and Linux versions, they run into problems finding a version of sfslite and Chord that build successfully. I still don’t have the time to update the Chord tree to use newer versions of sfslite (though it might not be too hard), nor do I have access to all the resources I used to test it in deployment. So, people are left to patch their own source trees and cobble together working bits.

I wish I had better news for people interested in Chord—there are a lot of useful lessons embedded in our implementation, and many reimplementations that might benefit from those lessons. Perhaps one day soon I will try to write up some of those implementation lessons. Or maybe someone would be willing to take over maintaining Chord. If you volunteer, I’d be happy to help you get started!

Capacity planning for cell phone networks

The New York Times has an article today about how the inauguration crowd will test cellphone networks. They wrote:

Sprint Nextel, which said it had been planning for the inauguration since April, has also increased capacity of its cell sites and terrestrial transmission lines to prepare the network to sustain 10 to 15 times the number of users it would serve on its networks during a normal day.

Web services (such as Twitter, also cited in the article) also have to deal with scalability and planning to provide adequate (network) capacity. It’s interesting to see how difficult (and expensive!) it can be to deal with flash crowds in other systems.