Mike Freedman and I have known each other since we were Masters students at MIT, working on things like the Tarzan anonymizing network (a parallel, pre-cursor to Tor). He went on to build the hugely successful (“as seen on Slashdot”) Coral content distribution network, which figured largely in his dissertation. It’s a great treat to have him talk here about how Coral was built and deployed. Be sure also to check out his research group blog for more interesting thoughts from him and his students!
What did you build?
CoralCDN is a semi-open content distribution network. Our stated goal with CoralCDN was to
“democratize content publication”: namely, allow websites to scale by
demand, without requiring special provisioning or commercial CDNs to
provide the “heavy lifting” of serving their bits. Publishing with
CoralCDN is as easy as slightly modifying a URL to include .nyud.net
in its hostname (e.g., http://www.cnn.com.nyud.net/),
and the content is subsequently requested from and served by
CoralCDN’s network of caching web proxies.
Our initial goal for deploying CoralCDN was a network of volunteer
sites that would cooperate to provide such “automated mirroring”
functionality, much like sites do somewhat manually with open-source
software distribution. As we progressed, I also imagined that small
groups of users could also cooperate in a form of time-sharing for
network bandwidth: they each would provide some relatively constant
amount of upload capacity, with the goal of being able to then handle
any sudden spikes (from the Slashdot effect, for example) to any
participant. This model fits well with how 95th-percentile billing
works for hosting and network providers, as it then becomes very
important to flatten out bandwidth spikes. We started a deployment of
CoralCDN on PlanetLab, although it never really migrated off that
network. (We did have several hundred users, and even some major
Internet exchange points, initially contact us to run CoralCDN nodes,
but we didn’t go down that path, both for manageability and security
reasons.)
CoralCDN consists of three main components, all written from scratch:
a special-purpose web proxy, nameserver, and distributed hash table
(DHT) indexing node. CoralCDN’s proxy and nameserver are what they
sound like, although they have some differences given that they are
specially designed for our setting. The proxy has a number of design
choices and optimizations well-suited for interacting with websites
that are on their last legs—CoralCDN is designed for dealing with
“Slashdotted” websites, after all—as well as being part of a big
cooperative caching network. The nameserver, on the other hand, is
designed to dynamically synthesize DNS names (of the form .nyud.net),
provide some locality and load balancing properties when selecting
proxies (address records it returns), and ensure that the returned
proxies are actually alive (as the proxy network itself is comprised
of unreliable servers). The indexing node forms a DHT-based
structured routing and lookup structure that exposes a put/get
interface for finding other proxies caching a particular web object.
Coral’s indexing layer differs from traditional DHTs (such as
MIT’s Chord/DHash) in that it creates a hierarchy of locality-based
clusters, each which maintains a separate DHT routing structure and
put/get table, and it provides weaker consistency properties within
each DHT structure. These latter guarantees are possibly because
Coral only needs to find some proxies (preferably nearby ones)
caching a particular piece of content, not all such proxies.
Tell us about what you built it with.
CoralCDN is built in C++ using David Mazieres’
libasync and libarpc
libraries, originally built for the Self-certifying File System (SFS).
This came out of my own academic roots in MIT’s PDOS group, where SFS
was developed by David and its libraries are widely used. (David was
my PhD advisor at NYU/Stanford, and I got my MEng degree in PDOS.)
Some of the HTTP parsing libraries used by CoralCDN’s web proxy were
from OKWS, Max Krohn’s webserver also written using SFS libraries.
Max was research staff with David at NYU during part of the time I was
there. It’s always great to use libraries written by people you know
and can bother when you find a bug (although for those two, that was a
rare occurrence indeed!).
When I started building CoralCDN in late 2002, I initially attempted
to build its hierarchical indexing layer on top of the MIT Chord/DHash
implementation, which also used SFS libraries. This turned out to be
a mistake (dare I say nightmare?), as there was a layering mismatch
between the two systems: I wanted to build distinct, localized DHT
clusters in a certain way, while Chord/DHash sought to build a single,
robust, global system. It was thus rather promiscuous in maintaining
group membership, and I was really fighting the way it was designed.
Plus, MIT Chord was still research-quality code at the time, so bugs
naturally existed, and it was really difficult to debug the resulting
system with large portions of complex, distributed systems code that I
hadn’t written myself. Finally, we initially thought that the “web
proxy” part of the system would be really simple, so our original
proxy implementation was just in python. CoralCDN’s first
implementation was scrapped after about 6 months of work, and I
restarted by writing my own DHT layer and proxy (in C++ now) from
scratch. It turns out that the web proxy has actually become the
largest code base of the three, continually expanded during the
system’s deployment to add security, bandwidth shaping and
fair-sharing, and various other robustness mechanisms.
Anyway, back to development libraries. I think the SFS libraries
provide a powerful development library that makes it easy to build
flexible, robust, fast distributed services…provided that one spends
time overcoming their higher learning curve. Once you learn them,
however, they make it really easy to program in an event-based style,
and the RPC libraries prevent many of the silly bugs normally
associated with writing your own networking protocols. I think Max’s
tame libraries significantly improve the readability and (hopefully)
lessen the learning curve of doing such event-based programming, as
tame removes the “stack-ripping” that one normally sees associated
with events. Perhaps I’ll use tame in future projects, but as I’ve
already climbed the learning curve of libasync myself, I haven’t yet.
That said, one of my PhD students at Princeton, Jeff Terrace, is
building a high-throughput, strongly-consistent object-based
(key/value) storage system called CRAQ using tame. He’s seemed to
really like it.
How did you test your system for correctness?
I deploy it? In seriousness, it’s very difficult to test web proxies,
especially ones deployed in chaotic environments and interacting with
poorly-behaving clients and servers.
I did most of my testing during initial closed experiments on about
150-300 PlanetLab servers, which is a distributed testbed deployed at
a few hundred universities and other institutions that each operate
two or more servers. Testing that the DHT “basically” worked was
relatively easy: see if you actually get() what you put(). There are
a lot of corner cases here, however, especially when one encounters
weird network conditions, some of which only became apparent after we
moved Coral from the network-friendly North American and European
nodes to those PlanetLab servers in China, India, and Australia.
Always be suspicious with systems papers that describe the authors’
“wide deployment” on “selected” (i.e., cherry-picked) U.S. PlanetLab
servers.
Much of the testing was just writing the appropriate level of debug
information so we could trace requests through the system. I got
really tired of staring at routing table dumps at that time. Last
year I worked with Rodrigo Fonseca to integrate X-Trace into CoralCDN, which would have
made it significantly easier to trace transactions through the DHT
and the proxy network. I’m pretty excited about such tools for
debugging and monitoring distributed systems in a fine-grained
fashion.
Testing all the corner cases for the proxy turned out to be another
level of frustration. There’s really no good way to completely debug
these systems without rolling them out into production deployments,
because there’s no good suite of possible test cases: The potential
“space” of inputs is effectively unlimited. You constantly run into
clients and servers which completely break the HTTP spec, and you just
need to write your server to deal with these appropriately. Writing a
proxy thus because a little bit of learning to “guess” what developers
mean. I think this actually has become worse with time. Your basic
browser (FireFox, IE) or standard webserver (Apache) is going to be
quite spec-compliant. The problem is that you now have random
developers writing client software (like podcasters, RSS readers,
etc.) or generating Ajax-y XmlHttpRequest’s. Or casual developers
dynamically generating HTTP on the server-side via some scripting
language like PHP. Because who needs to generate vaguely
spec-compliant HTTP if you are writing both the client and server?
(Hint: there might be a middlebox on path.) And as it continues to
become even easier to write Web services, you’ll probably continue to
see lots of messy inputs and outputs from both sides.
So while I originally tested CoralCDN using its own controlled
PlanetLab experiments, after the system went live, I started testing
new versions by just pushing them out to one or a few nodes in the
live deployment. Then I just monitor these new versions carefully
and, if things seemed to work, slowly push them out across the entire
network. Coral nodes include a shared secret in their packet headers,
which excludes random people from joining our deployment. I also use
these shared secrets to deploy new (non-backwards-compatible) versions
of the software, as the new version (with a new secret nonce) won’t
link up with DHT nodes belonging to previous versions.
How did you deploy your system? How big of a deployment?
CoralCDN has been running 24/7 on 200-400 PlanetLab servers since
March 2004. I manage the network using AppManager, built by Ryan
Huebsch from Berkeley, which provides a SQL server that keeps a record
of current node run state, requested run state, install state, etc.
So AppManager gives me a Web interface to control the desired runstate
of nodes, then all nodes “call home” to the AppManager server to
determine updated runstate. You write a bunch of shell scripts to
actually use these run states to start or stop nodes, manage logs,
etc. This “bunch of shell scripts” eventually grew to be about 3000
lines of bash, which was somewhat unexpected. While AppManager is a
single server (although nodes are configured with a backup host for
failover), CoralCDN’s scripts are designed for nodes to “fail same”.
That is, requested runstate is stored durably on each node, so if the
management server is offline or returns erroneous data (which it has
in the past), the nodes will maintain their last requested runstate
until the management server comes back online and provides a valid
status update.
How did you evaluate your system?
We performed all the experiments one might expect in an academic
evaluation on an initial test deployment on PlanetLab. Our NSDI
‘04 paper discusses these experiments.
After that stage, CoralCDN just runs – people continue to use it, so
it provides some useful functionality. My interest transitioned from
providing great service to just keeping it running (while I moved onto
other research).
I probably spend about 10 minutes a week “keeping CoralCDN running”,
which is typically spent answering abuse complaints, rather than
actually managing the system. This is largely because the system’s
algorithms were designed to be completely self-organizing – as we
initially thought of CoralCDN as a peer-to-peer system – as opposed
to a centrally-managed system designed for PlanetLab. System
membership, fault detection and recovery, etc., is all completely
automated.
Unfortunately, dynamic membership and failover doesn’t extend to the
primary nameservers we have registered for .nyud.net with the .net
gTLD servers. These 10-12 nameservers also run on PlanetLab servers,
so if one of these servers go offline, our users experience bad DNS
timeouts until I manually remove that server from the list registered
with Network Solutions. (PlanetLab doesn’t provide any IP-layer
virtualization that would allow us to failover to alternate physical
servers without modifying the registered IP addresses.) And I have to
admit I’m pretty lazy about updating the DNS registry, especially
given the rather painful web UI that Network Solution provides. (In
fairness, the UIs for GoDaddy and other registrars I’ve used are
similarly painful). I think registrars should really provide a
programmatic API for updating entries, but haven’t found one for
low-cost registrars yet. Anyway, offline nameservers are probably the
biggest performance problem with CoralCDN, and probably the main
reason it seems slow at times. This is partly a choice I made,
however, in not becoming a vigilant system operator for which managing
CoralCDN becomes a full-time job.
There’s a lesson to be had here, I think, for academic systems that
somehow “escape the lab” but don’t become commercial services: either
promote lessened expectations for your users (and accept that reality
yourself), build up a full-time developer/operations staff (a funding
quandary), or expect the project to soon die-off after its initial
developers lose interest or incentives.
Anything you’d like to add?
My research group actually just launched a blog. In the next few
weeks, I’ll be writing a series about some of the lessons I’ve learned
from building and deploying CoralCDN. I want to invite all your
readers to look out for those posts and really welcome any comments or
discussion around them.