Thoughts on Systems

Emil Sit

Mar 25, 2009 - 11 minute read - Research academia cdn coral implementation interviews planetlab tarzan

Systems Researchers: Mike Freedman

Mike Freedman and I have known each other since we were Master's students at MIT, working on things like the Tarzan anonymizing network (a parallel precursor to Tor). He went on to build the hugely successful (“as seen on Slashdot”) Coral content distribution network, which figured largely in his dissertation. It’s a great treat to have him talk here about how Coral was built and deployed. Be sure also to check out his research group blog for more interesting thoughts from him and his students!

What did you build?
CoralCDN is a semi-open content distribution network. Our stated goal with CoralCDN was to “democratize content publication”: namely, allow websites to scale by demand, without requiring special provisioning or commercial CDNs to provide the “heavy lifting” of serving their bits. Publishing with CoralCDN is as easy as slightly modifying a URL to include .nyud.net in its hostname (e.g., http://www.cnn.com.nyud.net/), and the content is subsequently requested from and served by CoralCDN’s network of caching web proxies.
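
To make that rewrite concrete, here is a minimal sketch of the transformation (illustrative C++ only, not CoralCDN code; it ignores ports, https, and other edge cases the real system handles):

```cpp
// Illustrative only: a minimal sketch of "Coralizing" a URL by appending
// .nyud.net to its hostname. This is not CoralCDN source code.
#include <iostream>
#include <string>

// Rewrite http://host/path to http://host.nyud.net/path.
// Ports and https are ignored in this sketch.
std::string coralize(const std::string& url) {
    const std::string scheme = "http://";
    if (url.compare(0, scheme.size(), scheme) != 0)
        return url;  // only plain http URLs are handled here

    // Find the end of the hostname component.
    std::size_t host_end = url.find('/', scheme.size());
    if (host_end == std::string::npos)
        host_end = url.size();

    return url.substr(0, host_end) + ".nyud.net" + url.substr(host_end);
}

int main() {
    std::cout << coralize("http://www.cnn.com/index.html") << "\n";
    // Prints: http://www.cnn.com.nyud.net/index.html
}
```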

Our initial goal for deploying CoralCDN was a network of volunteer sites that would cooperate to provide such “automated mirroring” functionality, much like sites do somewhat manually with open-source software distribution. As we progressed, I also imagined that small groups of users could cooperate in a form of time-sharing for network bandwidth: each would provide some relatively constant amount of upload capacity, with the goal of being able to handle sudden traffic spikes (from the Slashdot effect, for example) to any participant. This model fits well with how 95th-percentile billing works for hosting and network providers, as it then becomes very important to flatten out bandwidth spikes. We started a deployment of CoralCDN on PlanetLab, although it never really migrated off that network. (We did have several hundred users, and even some major Internet exchange points, initially contact us to run CoralCDN nodes, but we didn’t go down that path, both for manageability and security reasons.)

CoralCDN consists of three main components, all written from scratch: a special-purpose web proxy, nameserver, and distributed hash table (DHT) indexing node. CoralCDN’s proxy and nameserver are what they sound like, although they have some differences given that they are specially designed for our setting. The proxy has a number of design choices and optimizations well-suited for interacting with websites that are on their last legs (CoralCDN is designed for dealing with “Slashdotted” websites, after all), as well as for being part of a big cooperative caching network. The nameserver, on the other hand, is designed to dynamically synthesize DNS names (of the form .nyud.net), provide some locality and load-balancing properties when selecting proxies (the address records it returns), and ensure that the returned proxies are actually alive (as the proxy network itself is composed of unreliable servers). The indexing node forms a DHT-based structured routing and lookup structure that exposes a put/get interface for finding other proxies caching a particular web object. Coral’s indexing layer differs from traditional DHTs (such as MIT’s Chord/DHash) in that it creates a hierarchy of locality-based clusters, each of which maintains a separate DHT routing structure and put/get table, and it provides weaker consistency properties within each DHT structure. These weaker guarantees are possible because Coral only needs to find some proxies (preferably nearby ones) caching a particular piece of content, not all such proxies.
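
As a rough illustration, the indexing interface looks something like the following sketch (a hypothetical C++ rendering of my own, not Coral's actual API or semantics): a proxy announces itself with put() after caching an object, and calls get() before going to the origin server.

```cpp
// Hypothetical sketch of a Coral-like put/get indexing interface. The actual
// Coral API and semantics differ (locality-based clusters, sloppy storage, etc.).
#include <cstdint>
#include <string>
#include <vector>

struct ProxyAddr {
    std::string ip;
    uint16_t port;
};

class IndexNode {
public:
    // Announce that this proxy caches the object named by `key`
    // (for example, a hash of the URL), for `ttl_secs` seconds.
    virtual void put(const std::string& key, const ProxyAddr& self, int ttl_secs) = 0;

    // Return *some* proxies believed to cache the object, preferably nearby
    // ones from the local cluster, not necessarily all of them.
    virtual std::vector<ProxyAddr> get(const std::string& key) = 0;

    virtual ~IndexNode() = default;
};
```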

Tell us about what you built it with.
CoralCDN is built in C++ using David Mazieres’ libasync and libarpc libraries, originally built for the Self-certifying File System (SFS). This came out of my own academic roots in MIT’s PDOS group, where SFS was developed by David and its libraries are widely used. (David was my PhD advisor at NYU/Stanford, and I got my MEng degree in PDOS.) Some of the HTTP parsing libraries used by CoralCDN’s web proxy were from OKWS, Max Krohn’s webserver, which was also written using the SFS libraries. Max was research staff with David at NYU during part of the time I was there. It’s always great to use libraries written by people you know and can bother when you find a bug (although for those two, that was a rare occurrence indeed!).

When I started building CoralCDN in late 2002, I initially attempted to build its hierarchical indexing layer on top of the MIT Chord/DHash implementation, which also used SFS libraries. This turned out to be a mistake (dare I say nightmare?), as there was a layering mismatch between the two systems: I wanted to build distinct, localized DHT clusters in a certain way, while Chord/DHash sought to build a single, robust, global system. It was thus rather promiscuous in maintaining group membership, and I was really fighting the way it was designed. Plus, MIT Chord was still research-quality code at the time, so bugs naturally existed, and it was really difficult to debug the resulting system with large portions of complex, distributed systems code that I hadn’t written myself. Finally, we initially thought that the “web proxy” part of the system would be really simple, so our original proxy implementation was just in Python. CoralCDN’s first implementation was scrapped after about 6 months of work, and I restarted by writing my own DHT layer and proxy (in C++ now) from scratch. It turns out that the web proxy has actually become the largest code base of the three, continually expanded during the system’s deployment to add security, bandwidth shaping and fair-sharing, and various other robustness mechanisms.

Anyway, back to development libraries. I think the SFS libraries provide a powerful development library that makes it easy to build flexible, robust, fast distributed services…provided that one spends time overcoming their higher learning curve. Once you learn them, however, they make it really easy to program in an event-based style, and the RPC libraries prevent many of the silly bugs normally associated with writing your own networking protocols. I think Max’s tame libraries significantly improve the readability and (hopefully) lessen the learning curve of doing such event-based programming, as tame removes the “stack-ripping” that one normally sees associated with events. Perhaps I’ll use tame in future projects, but as I’ve already climbed the learning curve of libasync myself, I haven’t yet.
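
To give a flavor of the “stack-ripping” problem, here is a generic callback-style fragment in plain C++ (this is not libasync or tame syntax, and the async primitives are invented for the example): a single logical operation gets split across nested callbacks, with any shared state threaded through by hand.

```cpp
// Generic illustration of callback-style "stack ripping" (not libasync/tame code):
// one logical task -- resolve, connect, send request -- is split across callbacks.
#include <functional>
#include <iostream>
#include <string>

// Hypothetical async primitives standing in for a real event library; they
// invoke their callbacks immediately here just so the example runs.
void async_resolve(const std::string& host,
                   std::function<void(std::string /*ip*/)> cb) { cb("10.0.0.1"); }
void async_connect(const std::string& ip,
                   std::function<void(int /*fd*/)> cb) { cb(42); }

void fetch(const std::string& host, const std::string& path) {
    // Each step lives in a separate callback; `host` and `path` must be
    // captured and carried along explicitly instead of sitting on one stack.
    async_resolve(host, [=](std::string ip) {
        async_connect(ip, [=](int fd) {
            std::cout << "GET " << path << " via fd " << fd
                      << " (" << host << " -> " << ip << ")\n";
        });
    });
}

int main() { fetch("example.com", "/index.html"); }
```

Tame's contribution, roughly, is letting you write that kind of sequence as straight-line code while still running it on an event loop.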

That said, one of my PhD students at Princeton, Jeff Terrace, is building a high-throughput, strongly-consistent object-based (key/value) storage system called CRAQ using tame. He seems to really like it.

How did you test your system for correctness?
I deploy it? In all seriousness, it’s very difficult to test web proxies, especially ones deployed in chaotic environments and interacting with poorly-behaving clients and servers.

I did most of my testing during initial closed experiments on about 150-300 servers on PlanetLab, a distributed testbed deployed at a few hundred universities and other institutions that each operate two or more servers. Testing that the DHT “basically” worked was relatively easy: see if you actually get() what you put(). There are a lot of corner cases here, however, especially when one encounters weird network conditions, some of which only became apparent after we moved Coral from the network-friendly North American and European nodes to PlanetLab servers in China, India, and Australia. Always be suspicious of systems papers that describe the authors’ “wide deployment” on “selected” (i.e., cherry-picked) U.S. PlanetLab servers.
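
A basic round-trip check of that sort looks roughly like the sketch below, reduced here to a trivial single-node, in-memory stand-in for the real distributed index (the real tests involved many PlanetLab nodes, TTL expiry, and hostile network conditions):

```cpp
// Minimal sketch of a put()/get() round-trip check against a single
// hypothetical index node; not CoralCDN's actual test harness.
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Trivial in-memory stand-in for the distributed index.
class FakeIndex {
    std::map<std::string, std::vector<std::string>> table_;
public:
    void put(const std::string& key, const std::string& proxy) {
        table_[key].push_back(proxy);
    }
    std::vector<std::string> get(const std::string& key) {
        auto it = table_.find(key);
        return it == table_.end() ? std::vector<std::string>{} : it->second;
    }
};

int main() {
    FakeIndex idx;
    idx.put("hash-of-url", "proxy-a:8080");
    auto found = idx.get("hash-of-url");
    assert(!found.empty() && found[0] == "proxy-a:8080");  // got what we put
    assert(idx.get("unknown-key").empty());                // and nothing else
}
```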

Much of the testing was just writing the appropriate level of debug information so we could trace requests through the system. I got really tired of staring at routing table dumps at that time. Last year I worked with Rodrigo Fonseca to integrate X-Trace into CoralCDN, which would have made it significantly easier to trace transactions through the DHT and the proxy network. I’m pretty excited about such tools for debugging and monitoring distributed systems in a fine-grained fashion.

Testing all the corner cases for the proxy turned out to be another level of frustration. There’s really no good way to completely debug these systems without rolling them out into production deployments, because there’s no good suite of possible test cases: the potential “space” of inputs is effectively unlimited. You constantly run into clients and servers that completely break the HTTP spec, and you just need to write your server to deal with these appropriately. Writing a proxy thus becomes a bit of an exercise in learning to “guess” what developers mean. I think this actually has become worse with time. Your basic browser (Firefox, IE) or standard webserver (Apache) is going to be quite spec-compliant. The problem is that you now have random developers writing client software (like podcasters, RSS readers, etc.) or generating Ajax-y XMLHttpRequests. Or casual developers dynamically generating HTTP on the server side via some scripting language like PHP. Because who needs to generate vaguely spec-compliant HTTP if you are writing both the client and server? (Hint: there might be a middlebox on the path.) And as it continues to become even easier to write Web services, you’ll probably continue to see lots of messy inputs and outputs from both sides.
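
As a small illustration of the kind of leniency this forces on a proxy, here is a sketch (my own example, not CoralCDN's parser) that tolerates bare-LF line endings and missing whitespace after a header colon:

```cpp
// Illustrative sketch of lenient header parsing, the sort of tolerance a
// deployed proxy ends up needing. Not CoralCDN code.
#include <iostream>
#include <string>
#include <utility>

std::pair<std::string, std::string> parse_header_line(std::string line) {
    // Strip a trailing CR if the client sent CRLF; some send bare LF.
    if (!line.empty() && line.back() == '\r') line.pop_back();

    std::size_t colon = line.find(':');
    if (colon == std::string::npos) return {line, ""};  // malformed; keep going anyway

    std::string name = line.substr(0, colon);
    std::string value = line.substr(colon + 1);
    // Trim optional leading whitespace in the value ("Host:example.com" happens).
    value.erase(0, value.find_first_not_of(" \t"));
    return {name, value};
}

int main() {
    auto h = parse_header_line("Host:example.com\r");
    std::cout << h.first << " = " << h.second << "\n";  // Host = example.com
}
```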

So while I originally tested CoralCDN using its own controlled PlanetLab experiments, after the system went live, I started testing new versions by just pushing them out to one or a few nodes in the live deployment. Then I monitor these new versions carefully and, if things seem to work, slowly push them out across the entire network. Coral nodes include a shared secret in their packet headers, which excludes random people from joining our deployment. I also use these shared secrets to deploy new (non-backwards-compatible) versions of the software, as the new version (with a new secret nonce) won’t link up with DHT nodes belonging to previous versions.
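
The version-isolation trick amounts to something like the following sketch (the field and constant names are hypothetical; the real Coral wire format differs): a node simply drops messages whose secret doesn't match its own, so incompatible releases never link up.

```cpp
// Hypothetical sketch of isolating deployments with a shared secret: a node
// rejects DHT messages whose secret doesn't match its own, so nodes running a
// new (incompatible) release never join the old network. Not Coral's format.
#include <cstdint>
#include <string>

struct DhtHeader {
    std::string deployment_secret;  // shared secret / version nonce
    uint32_t    msg_type;
    // ... other protocol fields elided in this sketch ...
};

// Secret compiled into (or configured for) this release of the node.
static const std::string kOurSecret = "coral-v2-<nonce>";

bool accept_message(const DhtHeader& hdr) {
    // Reject peers from other deployments or older releases.
    return hdr.deployment_secret == kOurSecret;
}

int main() {
    DhtHeader h{kOurSecret, 1};
    return accept_message(h) ? 0 : 1;
}
```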

How did you deploy your system? How big of a deployment?
CoralCDN has been running 24/7 on 200-400 PlanetLab servers since March 2004. I manage the network using AppManager, built by Ryan Huebsch from Berkeley, which provides a SQL server that keeps a record of current node run state, requested run state, install state, etc. So AppManager gives me a Web interface to control the desired runstate of nodes, and all nodes then “call home” to the AppManager server to determine their updated runstate. You write a bunch of shell scripts to actually use these run states to start or stop nodes, manage logs, etc. This “bunch of shell scripts” eventually grew to be about 3000 lines of bash, which was somewhat unexpected. While AppManager is a single server (although nodes are configured with a backup host for failover), CoralCDN’s scripts are designed for nodes to “fail same”. That is, requested runstate is stored durably on each node, so if the management server is offline or returns erroneous data (which it has in the past), the nodes will maintain their last requested runstate until the management server comes back online and provides a valid status update.
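
The "fail same" behavior amounts to something like the sketch below, written in C++ for brevity although the real logic lives in those bash scripts (the file path, runstate names, and helper functions are made up for illustration):

```cpp
// Sketch of "fail same": the requested runstate is persisted locally, so a
// node keeps its last valid state if the management server is unreachable or
// returns garbage. Illustrative only; not the actual CoralCDN scripts.
#include <fstream>
#include <iostream>
#include <optional>
#include <string>

const char* kStateFile = "/var/coral/requested_runstate";  // hypothetical path

// Fetch the requested runstate from the management server; nullopt on failure.
// (Stubbed out here; the real check queries AppManager and validates its reply.)
std::optional<std::string> fetch_runstate_from_manager() { return std::nullopt; }

bool is_valid_runstate(const std::string& s) {
    return s == "run" || s == "stop" || s == "upgrade";
}

std::string decide_runstate() {
    if (auto fresh = fetch_runstate_from_manager();
        fresh && is_valid_runstate(*fresh)) {
        std::ofstream(kStateFile) << *fresh;  // persist the new request
        return *fresh;
    }
    // Manager offline or answer invalid: fail same, i.e. keep the last
    // durably stored request rather than changing behavior.
    std::ifstream in(kStateFile);
    std::string last;
    if (in >> last && is_valid_runstate(last)) return last;
    return "run";  // conservative default if nothing is stored yet
}

int main() { std::cout << decide_runstate() << "\n"; }
```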

How did you evaluate your system?
We performed all the experiments one might expect in an academic evaluation on an initial test deployment on PlanetLab. Our NSDI ‘04 paper discusses these experiments.

After that stage, CoralCDN just runs – people continue to use it, so it provides some useful functionality. My interest transitioned from providing great service to just keeping it running (while I moved on to other research).

I probably spend about 10 minutes a week “keeping CoralCDN running”, time that typically goes to answering abuse complaints rather than actually managing the system. This is largely because the system’s algorithms were designed to be completely self-organizing – as we initially thought of CoralCDN as a peer-to-peer system – as opposed to a centrally-managed system designed for PlanetLab. System membership, fault detection and recovery, etc., are all completely automated.

Unfortunately, dynamic membership and failover don’t extend to the primary nameservers we have registered for .nyud.net with the .net gTLD servers. These 10-12 nameservers also run on PlanetLab servers, so if one of these servers goes offline, our users experience bad DNS timeouts until I manually remove that server from the list registered with Network Solutions. (PlanetLab doesn’t provide any IP-layer virtualization that would allow us to fail over to alternate physical servers without modifying the registered IP addresses.) And I have to admit I’m pretty lazy about updating the DNS registry, especially given the rather painful web UI that Network Solutions provides. (In fairness, the UIs for GoDaddy and other registrars I’ve used are similarly painful.) I think registrars should really provide a programmatic API for updating entries, but I haven’t found a low-cost registrar that offers one yet. Anyway, offline nameservers are probably the biggest performance problem with CoralCDN, and probably the main reason it seems slow at times. This is partly a choice I made, however, in not becoming a vigilant system operator for whom managing CoralCDN becomes a full-time job.

There’s a lesson to be had here, I think, for academic systems that somehow “escape the lab” but don’t become commercial services: either set lowered expectations for your users (and accept that reality yourself), build up a full-time developer/operations staff (a funding quandary), or expect the project to die off soon after its initial developers lose interest or incentives.

Anything you’d like to add?
My research group actually just launched a blog. In the next few weeks, I’ll be writing a series about some of the lessons I’ve learned from building and deploying CoralCDN. I want to invite all your readers to look out for those posts and really welcome any comments or discussion around them.