Thoughts on Systems

Emil Sit

Oct 8, 2009 - 8 minute read - Technology conferences Dan Pilone devdays iPhone Joel Spolsky mercurial Ned Batchelder python stackoverflow

StackOverflow DevDays Boston 2009, Morning

Boston DevDays kicked off a month-long tour of technical talks aimed at programmers, organized by StackOverflow and Carsonified. I had the good fortune to attend, meet a few interesting people and see some fun talks. I tried to write a bit in real-time (search Twitter here) but the WiFi was pretty over-subscribed and there was no cell coverage to speak of so eventually I gave up. Here are some more detailed notes, starting with the morning sessions.

The day ran very much on schedule, and after some very loud music, DevDays opened with a funny video demonstrating the competent technical leadership of the keynote speaker…

Joel Spolsky

Joel opened the day by giving a demonstration of the tyranny of computers: you are constantly interrupted with questions asking for decisions, like “Do you want to install these 10 updates to Windows?” or “Enable caret browsing?”, that can be really hard to answer. He argued that one of the reasons that computers ask us so many questions is that programmers want to give users the power (and control) to do what they want. But that’s another way of saying that programmers don’t want to make the decisions to keep things simple. The rest of Joel’s somewhat meandering but always entertaining talk was a discussion of how programmers (us!) should approach decision making, framed as a trade-off between simplicity and power.

Decisions are ultimately hard to make—there have been many studies that demonstrate that when people have too many choices, they freeze up and choose nothing. Thus, we’ve seen a strong push towards simplicity in recent years; one clear example of that has been the Getting Real book from 37signals. Joel (jokingly?) points out that three of the four 37signals apps are in fact just one big textarea tag that you type into. Other examples, given later, of course include Apple’s products and Google.

But why do we wind up with programs with lots of options? Well, if you have a simple program, you find that most people won’t buy your product if it doesn’t have feature X (or Y or Z or …). So you wind up adding features over time, as you get more customers and more experience. Thus, the loss of simplicity often happens as a side-effect of making money.

How to balance these? If we don’t want to take away all decisions from the user, we need a rule to guide us in what to remove and what to keep. One rule to follow is that the computer does not get to set the agenda. Good decision points are those that help the user achieve what they want to do. Bad decision points are those that interrupt the user, that the user really isn’t equipped to answer (e.g., should Gmail automatically display inline images in HTML?), or that only the programmer cares about.

To decide what is good or bad, developers need a good model to understand what the user is trying to do—Joel says every user is ultimately trying to replicate their DNA, but you may have some more refined model. Joel gave the example of Amazon’s 1-Click purchasing where the user should just be able to buy something with a single click. Apparently, early drafts of 1-Click really weren’t one click: programmers kept wanting to put in things like confirmation pages. Eventually, they arrived at just one click—by not immediately starting the order processing, but holding it for a few minutes to consolidate related 1-Click orders and allow for cancellation of errors. This was more work for the developers, but simpler for the user. This is what we want to see happen.

Overall, I think Joel’s talk set a nice tone for how we should think as developers but didn’t offer anything particularly ground-breaking.

Ned Batchelder on Python

Ned Batchelder presented Python as a “Clean noise free environment to type your ideas.” Or, alternatively, “Python is awesome.” After a few lead-off bullets (e.g., the REPL, duck typing, and the “batteries included” nature of Python), we dove into code to really understand what Python is capable of doing. Ned’s slides are online if you want to take a look. I won’t cover his talk in much detail since it was largely explaining the Python language but will list some high points.

The first example was Peter Norvig’s concise spell corrector. This code makes ample use of list comprehensions and sets, plus functions like re.findall, collections.defaultdict, and parameterized max, which were all new to me.
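
To give a flavor of what that looks like, here is my own condensed sketch in the spirit of Norvig’s corrector (not his exact code, and trained on a toy corpus); it exercises re.findall, collections.defaultdict, set comprehensions, and max with a key function:

    import re
    from collections import defaultdict

    def words(text):
        return re.findall(r'[a-z]+', text.lower())

    # Toy corpus; Norvig trains on a large body of text.
    MODEL = defaultdict(int)
    for w in words("the quick brown fox jumps over the lazy dog the fox"):
        MODEL[w] += 1

    ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(word):
        """All strings one delete, transpose, replace, or insert away."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {a + b[1:] for a, b in splits if b}
        transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
        replaces = {a + c + b[1:] for a, b in splits for c in ALPHABET if b}
        inserts = {a + c + b for a, b in splits for c in ALPHABET}
        return deletes | transposes | replaces | inserts

    def correct(word):
        # Parameterized max: pick the known candidate with the highest count.
        candidates = {w for w in edits1(word) if w in MODEL} or {word}
        return max(candidates, key=MODEL.get)

    print(correct('teh'))  # -> 'the', given the toy corpus above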

The second example was a custom build-up of a simple Python templating engine, loosely based on Django’s template styles. This example demonstrated simple formatting with the percent operator, but then quickly moved into more advanced features like duck-typing (by implementing __getitem__) and callable.
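
Very roughly, the progression looked something like the sketch below (my reconstruction, not Ned’s actual code): plain “%(name)s” substitution, then a duck-typed context object that implements __getitem__ so it can resolve dotted names and call callables lazily.

    class Context:
        """Quacks like a dict for the % operator, but evaluates lookups lazily."""
        def __init__(self, **values):
            self.values = values

        def __getitem__(self, key):
            obj = self.values
            for part in key.split('.'):   # support dotted names: user.name
                obj = obj[part] if isinstance(obj, dict) else getattr(obj, part)
                if callable(obj):         # invoke zero-argument callables
                    obj = obj()
            return obj

    template = "Hello, %(user.name)s! You have %(inbox.count)s new messages."
    print(template % Context(user={'name': 'Ned'},
                             inbox={'count': lambda: 3}))
    # -> Hello, Ned! You have 3 new messages.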

One of the nice things about Ned’s presentation is that he demonstrated the power of Python in two short but extremely powerful examples that left people who didn’t know Python thinking, “Wow! That is really amazing!”

Ned recommended reading Dive into Python to learn more, and if you are in the Boston-area, checking out the Cambridge Python meetup.

Dan Pilone

Dan Pilone’s talk was an overview of iPhone development, first giving a quick market overview, then giving a broad overview of the nitty-gritty of developing in Objective C, and finally diving into the practical economics and realities of selling iPhone apps. The iPhone app market (as of Apple’s September numbers) consists of something like 85,000 applications, of which 75% are pay applications. Some 35% of applications are games, whereas about 6% are more social applications. There are 5 or 6 different iPhone/iPod Touch hardware platforms, and something like 20 million plus devices sold. The iPhone has a great user experience and comes with a great delivery model (the iTunes app store). This, combined with some of the numbers, makes it a good platform to develop for, with significant amounts of money that can be earned. (Dan emphasized that you really have to work to develop and market the application, hence the “earned”.)

The iPhone development environment is very shiny. I’m sure Ars Technica has a much better overview but in short Dan demonstrated some tools like Xcode, Core Data (a graphical SQLite data modeler), reference counting support (aka retain), and Instruments (a memory profiler). Dan suggested that this shininess is to make up for some of the oddities of Objective C that you’ll have to live with. He also demonstrated some of the Interface Builder tools and how they link up.

Testing turns out to be quite interesting; the simulator is okay but limited and often your app will work in the simulator but fail on real devices. For example, your app on a real device might “run so slow you wish it had crashed”. The simulator also doesn’t enforce sandboxing as strictly as real devices, where each app has its own uid and single directory where it can store data. There are also many different hardware variants that you have to support that limit you in different ways: for example, early iPhones only give you 40MB of memory to play with whereas newer ones give you almost 120MB. This is not reflected well in the simulator either.

Shipping an iPhone app on the app store requires approval, a process that can take two weeks per round-trip. There’s no way to get around it, so you must budget time for it in development. The approval process helps guarantee a minimum level of app quality—they will verify that your app indeed works on all the different hardware, and they will (eventually) catch any licensing violations, but they’re overall pretty reasonable.

Once you get approval, you show up in the recent releases section of the app store, and there you have about 24 hours to get popular or else you will fade into the long tail of oblivion. In fact, if you are in the top 50 apps, you will easily get double the sales (presumably relative to the 51st app); if you are in the top 10, you’ll be getting an order-of-magnitude more sales. So, make sure you get your approval/release-date lined up with your marketing blitz. The alternative is to charge a bit more than $0.99, and go for slow but steady sales.

As a Blackberry owner and Linux user, I found Dan’s talk to be a great introduction to iPhone development. Presumably his new ORA book, Head First iPhone Development, would be a good buy if you are into that sort of thing.

Joel on FogBugz

Before lunch, Joel took the opportunity to give a pitch for his company’s FogBugz product, and announce some new features. He gave us a walk-through of its capabilities, from organizing documentation, to project planning, to bug tracking, to user support. The new features announced were a rich plugin architecture, plus support for Mercurial and code review in a new hosted add-on to FogBugz called Kiln. He spent a fair amount of time on that, demonstrating calling hg push from the command line. He also demonstrated the evidence-based scheduling features of FogBugz.

Nothing too exciting for me working in a big company using Perforce, but a good marketing opportunity for FogCreek and a nice chance to see how some other people do scheduling and bug tracking. I was a bit disappointed that there’s no direct way to do pre-commit (i.e. pre-push) reviews a la Gerrit, but @jasonrr says you can set up a branch repo, push to that and then review there before merging to main. I expect this means that GitHub will be getting code review support soon.

Lunch!

With that, we broke for lunch. @jldio has the scoop on lunch, and his own write-up of the day too. More to come later, thanks for reading.

Mar 25, 2009 - 11 minute read - Research academia cdn coral implementation interviews planetlab Research tarzan

Systems Researchers: Mike Freedman

Mike Freedman and I have known each other since we were Masters students at MIT, working on things like the Tarzan anonymizing network (a parallel precursor to Tor). He went on to build the hugely successful (“as seen on Slashdot”) Coral content distribution network, which figured largely in his dissertation. It’s a great treat to have him talk here about how Coral was built and deployed. Be sure also to check out his research group blog for more interesting thoughts from him and his students!

What did you build?
CoralCDN is a semi-open content distribution network. Our stated goal with CoralCDN was to “democratize content publication”: namely, allow websites to scale by demand, without requiring special provisioning or commercial CDNs to provide the “heavy lifting” of serving their bits. Publishing with CoralCDN is as easy as slightly modifying a URL to include .nyud.net in its hostname (e.g., http://www.cnn.com.nyud.net/), and the content is subsequently requested from and served by CoralCDN’s network of caching web proxies.

Our initial goal for deploying CoralCDN was a network of volunteer sites that would cooperate to provide such “automated mirroring” functionality, much like sites do somewhat manually with open-source software distribution. As we progressed, I also imagined that small groups of users could cooperate in a form of time-sharing for network bandwidth: they each would provide some relatively constant amount of upload capacity, with the goal of being able to then handle any sudden spikes (from the Slashdot effect, for example) to any participant. This model fits well with how 95th-percentile billing works for hosting and network providers, as it then becomes very important to flatten out bandwidth spikes. We started a deployment of CoralCDN on PlanetLab, although it never really migrated off that network. (We did have several hundred users, and even some major Internet exchange points, initially contact us to run CoralCDN nodes, but we didn’t go down that path, both for manageability and security reasons.)

CoralCDN consists of three main components, all written from scratch: a special-purpose web proxy, nameserver, and distributed hash table (DHT) indexing node. CoralCDN’s proxy and nameserver are what they sound like, although they have some differences given that they are specially designed for our setting. The proxy has a number of design choices and optimizations well-suited for interacting with websites that are on their last legs—CoralCDN is designed for dealing with “Slashdotted” websites, after all—as well as being part of a big cooperative caching network. The nameserver, on the other hand, is designed to dynamically synthesize DNS names (of the form .nyud.net), provide some locality and load balancing properties when selecting proxies (address records it returns), and ensure that the returned proxies are actually alive (as the proxy network itself is composed of unreliable servers). The indexing node forms a DHT-based structured routing and lookup structure that exposes a put/get interface for finding other proxies caching a particular web object. Coral’s indexing layer differs from traditional DHTs (such as MIT’s Chord/DHash) in that it creates a hierarchy of locality-based clusters, each of which maintains a separate DHT routing structure and put/get table, and it provides weaker consistency properties within each DHT structure. These weaker guarantees are possible because Coral only needs to find some proxies (preferably nearby ones) caching a particular piece of content, not all such proxies.
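
To make that put/get contract concrete, here is a toy, single-process sketch of the weaker semantics being described (purely illustrative; the real index is a distributed, clustered DHT with locality awareness, and all names here are made up): get() only has to return some of the proxies that advertised a given object, and entries age out via TTLs.

    import hashlib, random, time

    class SloppyIndex:
        def __init__(self):
            self.table = {}   # key -> list of (proxy_addr, expiry_time)

        def put(self, key, proxy_addr, ttl=3600):
            self.table.setdefault(key, []).append((proxy_addr, time.time() + ttl))

        def get(self, key, want=3):
            live = [(p, exp) for p, exp in self.table.get(key, [])
                    if exp > time.time()]
            self.table[key] = live
            # Return only a few live entries: callers just need *some*
            # proxy caching the object, not all of them.
            return [p for p, _ in random.sample(live, min(want, len(live)))]

    def url_key(url):
        return hashlib.sha1(url.encode()).hexdigest()

    index = SloppyIndex()
    index.put(url_key("http://www.cnn.com/"), "proxy-a.example.net")
    index.put(url_key("http://www.cnn.com/"), "proxy-b.example.net")
    print(index.get(url_key("http://www.cnn.com/")))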

Tell us about what you built it with.
CoralCDN is built in C++ using David Mazieres’ libasync and libarpc libraries, originally built for the Self-certifying File System (SFS). This came out of my own academic roots in MIT’s PDOS group, where SFS was developed by David and its libraries are widely used. (David was my PhD advisor at NYU/Stanford, and I got my MEng degree in PDOS.) Some of the HTTP parsing libraries used by CoralCDN’s web proxy were from OKWS, Max Krohn’s webserver, also written using the SFS libraries. Max was research staff with David at NYU during part of the time I was there. It’s always great to use libraries written by people you know and can bother when you find a bug (although for those two, that was a rare occurrence indeed!).

When I started building CoralCDN in late 2002, I initially attempted to build its hierarchical indexing layer on top of the MIT Chord/DHash implementation, which also used SFS libraries. This turned out to be a mistake (dare I say nightmare?), as there was a layering mismatch between the two systems: I wanted to build distinct, localized DHT clusters in a certain way, while Chord/DHash sought to build a single, robust, global system. It was thus rather promiscuous in maintaining group membership, and I was really fighting the way it was designed. Plus, MIT Chord was still research-quality code at the time, so bugs naturally existed, and it was really difficult to debug the resulting system with large portions of complex, distributed systems code that I hadn’t written myself. Finally, we initially thought that the “web proxy” part of the system would be really simple, so our original proxy implementation was just in Python. CoralCDN’s first implementation was scrapped after about 6 months of work, and I restarted by writing my own DHT layer and proxy (in C++ now) from scratch. It turns out that the web proxy has actually become the largest code base of the three, continually expanded during the system’s deployment to add security, bandwidth shaping and fair-sharing, and various other robustness mechanisms.

Anyway, back to development libraries. I think the SFS libraries provide a powerful development library that makes it easy to build flexible, robust, fast distributed services…provided that one spends time overcoming their higher learning curve. Once you learn them, however, they make it really easy to program in an event-based style, and the RPC libraries prevent many of the silly bugs normally associated with writing your own networking protocols. I think Max’s tame libraries significantly improve the readability and (hopefully) lessen the learning curve of doing such event-based programming, as tame removes the “stack-ripping” that one normally sees associated with events. Perhaps I’ll use tame in future projects, but as I’ve already climbed the learning curve of libasync myself, I haven’t yet.

That said, one of my PhD students at Princeton, Jeff Terrace, is building a high-throughput, strongly-consistent object-based (key/value) storage system called CRAQ using tame. He seems to really like it.

How did you test your system for correctness?
I deploy it? In all seriousness, it’s very difficult to test web proxies, especially ones deployed in chaotic environments and interacting with poorly-behaving clients and servers.

I did most of my testing during initial closed experiments on about 150-300 servers on PlanetLab, a distributed testbed deployed at a few hundred universities and other institutions that each operate two or more servers. Testing that the DHT “basically” worked was relatively easy: see if you actually get() what you put(). There are a lot of corner cases here, however, especially when one encounters weird network conditions, some of which only became apparent after we moved Coral from the network-friendly North American and European nodes to those PlanetLab servers in China, India, and Australia. Always be suspicious of systems papers that describe the authors’ “wide deployment” on “selected” (i.e., cherry-picked) U.S. PlanetLab servers.

Much of the testing was just writing the appropriate level of debug information so we could trace requests through the system. I got really tired of staring at routing table dumps at that time. Last year I worked with Rodrigo Fonseca to integrate X-Trace into CoralCDN, which would have made it significantly easier to trace transactions through the DHT and the proxy network. I’m pretty excited about such tools for debugging and monitoring distributed systems in a fine-grained fashion.

Testing all the corner cases for the proxy turned out to be another level of frustration. There’s really no good way to completely debug these systems without rolling them out into production deployments, because there’s no good suite of possible test cases: the potential “space” of inputs is effectively unlimited. You constantly run into clients and servers which completely break the HTTP spec, and you just need to write your server to deal with these appropriately. Writing a proxy thus becomes a bit of an exercise in learning to “guess” what developers mean. I think this actually has become worse with time. Your basic browser (Firefox, IE) or standard webserver (Apache) is going to be quite spec-compliant. The problem is that you now have random developers writing client software (like podcasters, RSS readers, etc.) or generating Ajax-y XMLHttpRequests. Or casual developers dynamically generating HTTP on the server-side via some scripting language like PHP. Because who needs to generate vaguely spec-compliant HTTP if you are writing both the client and server? (Hint: there might be a middlebox on path.) And as it continues to become even easier to write Web services, you’ll probably continue to see lots of messy inputs and outputs from both sides.

So while I originally tested CoralCDN using its own controlled PlanetLab experiments, after the system went live, I started testing new versions by just pushing them out to one or a few nodes in the live deployment. Then I monitor these new versions carefully and, if things seem to work, slowly push them out across the entire network. Coral nodes include a shared secret in their packet headers, which excludes random people from joining our deployment. I also use these shared secrets to deploy new (non-backwards-compatible) versions of the software, as the new version (with a new secret nonce) won’t link up with DHT nodes belonging to previous versions.

How did you deploy your system? How big of a deployment?
CoralCDN has been running 24/7 on 200-400 PlanetLab servers since March 2004. I manage the network using AppManager, built by Ryan Huebsch from Berkeley, which provides a SQL server that keeps a record of current node run state, requested run state, install state, etc. So AppManager gives me a Web interface to control the desired runstate of nodes, then all nodes “call home” to the AppManager server to determine updated runstate. You write a bunch of shell scripts to actually use these run states to start or stop nodes, manage logs, etc. This “bunch of shell scripts” eventually grew to be about 3000 lines of bash, which was somewhat unexpected. While AppManager is a single server (although nodes are configured with a backup host for failover), CoralCDN’s scripts are designed for nodes to “fail same”. That is, requested runstate is stored durably on each node, so if the management server is offline or returns erroneous data (which it has in the past), the nodes will maintain their last requested runstate until the management server comes back online and provides a valid status update.
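
The real management logic lives in those bash scripts, but the “fail same” idea is simple enough to sketch in a few lines of Python (illustrative only; the file path, URL, and state names below are made up): cache the last valid runstate on disk and keep using it whenever the management server is unreachable or returns garbage.

    import json, os, urllib.request

    STATE_FILE = "/var/coral/runstate.json"                  # hypothetical path
    MANAGER_URL = "http://appmanager.example.net/runstate"   # hypothetical URL
    VALID = {"run", "stop", "upgrade"}                       # hypothetical states

    def desired_runstate():
        try:
            with urllib.request.urlopen(MANAGER_URL, timeout=10) as resp:
                state = json.load(resp)
            if state.get("runstate") not in VALID:
                raise ValueError("erroneous runstate from manager")
            with open(STATE_FILE, "w") as f:   # durably remember what we were told
                json.dump(state, f)
            return state["runstate"]
        except Exception:
            # Manager offline or returned bad data: "fail same", i.e. keep
            # doing whatever we were last asked to do.
            if os.path.exists(STATE_FILE):
                with open(STATE_FILE) as f:
                    return json.load(f)["runstate"]
            return "stop"                      # conservative default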

How did you evaluate your system?
We performed all the experiments one might expect in an academic evaluation on an initial test deployment on PlanetLab. Our NSDI ‘04 paper discusses these experiments.

After that stage, CoralCDN just runs – people continue to use it, so it provides some useful functionality. My interest transitioned from providing great service to just keeping it running (while I moved on to other research).

I probably spend about 10 minutes a week “keeping CoralCDN running”, which is typically spent answering abuse complaints, rather than actually managing the system. This is largely because the system’s algorithms were designed to be completely self-organizing – as we initially thought of CoralCDN as a peer-to-peer system – as opposed to a centrally-managed system designed for PlanetLab. System membership, fault detection and recovery, etc., are all completely automated.

Unfortunately, dynamic membership and failover doesn’t extend to the primary nameservers we have registered for .nyud.net with the .net gTLD servers. These 10-12 nameservers also run on PlanetLab servers, so if one of these servers goes offline, our users experience bad DNS timeouts until I manually remove that server from the list registered with Network Solutions. (PlanetLab doesn’t provide any IP-layer virtualization that would allow us to failover to alternate physical servers without modifying the registered IP addresses.) And I have to admit I’m pretty lazy about updating the DNS registry, especially given the rather painful web UI that Network Solutions provides. (In fairness, the UIs for GoDaddy and other registrars I’ve used are similarly painful). I think registrars should really provide a programmatic API for updating entries, but I haven’t found one among low-cost registrars yet. Anyway, offline nameservers are probably the biggest performance problem with CoralCDN, and probably the main reason it seems slow at times. This is partly a choice I made, however, in not becoming a vigilant system operator for whom managing CoralCDN becomes a full-time job.

There’s a lesson to be had here, I think, for academic systems that somehow “escape the lab” but don’t become commercial services: either promote lessened expectations for your users (and accept that reality yourself), build up a full-time developer/operations staff (a funding quandary), or expect the project to soon die off after its initial developers lose interest or incentives.

Anything you’d like to add?
My research group actually just launched a blog. In the next few weeks, I’ll be writing a series about some of the lessons I’ve learned from building and deploying CoralCDN. I want to invite all your readers to look out for those posts and really welcome any comments or discussion around them.

Mar 16, 2009 - 5 minute read - Research academia implementation interviews planetlab san fermin seattle stork testbed

Systems Researchers: Justin Cappos

Justin Cappos received his PhD from the University of Arizona under the supervision of John Hartman. I met Justin several years ago at a PlanetLab Consortium meeting when he was starting to work on Stork, a system to simplify package deployment. He is currently a Post Doc at the University of Washington working with Tom Anderson and Arvind Krishnamurthy.

What did you build?
The most relevant / longest term projects are: Stork, San Fermin, and Seattle.

Stork is a package manager that has security and functionality improvements over existing Linux package managers. Some of the advances we made in Stork have been adopted by APT, YUM, YaST and other popular Linux package managers.

San Fermin is a system for aggregating large quantities of data from computers. San Fermin provides the result faster and with better fault tolerance than existing techniques.

Seattle is an educational testbed built from resources donated by universities all around the world. The universities run a safe, lightweight VM that students from other universities can run code in.

Tell us about what you built it with.
I used Java for San Fermin because I needed to leverage existing Pastry code that was in Java. I used Python for Stork and Seattle. I found Python to be far superior for large projects (other languages I’ve used are Java, C, C++, QBASIC, and Pascal). Python has been a dream come true because it’s great for prototyping, easy to learn, and the resulting code is readable (so long as you have sensible style constraints on the written code).

Perhaps the most useful thing is getting other developers involved. I like to do the initial prototyping myself, but after that it is great to have others helping out.

How did you test your system for correctness?
There are whitebox / unit tests as well as blackbox / integration tests for most parts of the systems. The time that we spent building thorough test cases really paid off because it simplifies debugging.

I like to use my systems in real world environments with outside users, so the system is never done; thus, correctness is an iterative process. If I’m aware of bugs, we fix them. If I’m not aware of bugs, we’re adding features based upon user requests (and therefore may be adding more bugs for us to fix later). In general, the fact that we have users that rely on the software over long time periods is a testament to its stability, which is related to correctness.

To more specifically answer the question you are really asking, I usually run my code by hand and evaluate it in small / constrained environments (turning these test runs into unit tests). I also follow a philosophy where I try to “detect and fail” as much as possible. I care more about correctness than performance (at least initially) and so add many redundant checks in my code to catch errors as soon as possible.
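
As a concrete (and entirely hypothetical, not Justin’s code) illustration of that “detect and fail” style: validate arguments aggressively and raise immediately, so a failure points at the exact bad input rather than surfacing as corruption much later.

    def split_bandwidth(total_kbps, weights):
        """Divide total_kbps among nodes in proportion to weights."""
        if not isinstance(total_kbps, (int, float)) or total_kbps <= 0:
            raise ValueError("total_kbps must be positive, got %r" % (total_kbps,))
        if not weights:
            raise ValueError("weights must be a non-empty sequence")
        if any(w < 0 for w in weights):
            raise ValueError("weights must be non-negative, got %r" % (weights,))
        total_weight = float(sum(weights))
        if total_weight == 0:
            raise ValueError("at least one weight must be positive")
        return [total_kbps * w / total_weight for w in weights]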

I find that if I’m careful and thorough when writing my code, I spend very little time debugging. I probably spend about 30% of the time writing code, 40% writing comments / docs (which I normally write before / during coding), 20% of the time writing test code for individual modules, and about 10% debugging after the fact. I think part of this is I’m really careful about checking input and boundary conditions and so I can normally pin-point the exact cause of a failure.

In terms of problems when writing code, I generally only use standard libraries and code I’ve written. I don’t depend on third party code because I don’t know what level of support it will have. Also, since I usually code in Python, it’s easy for me to add functionality to do whatever I need.

How did you deploy your system?  How big of a deployment?
Stork has been deployed on PlanetLab for about 6 years. For the majority of the time we’ve been deployed on every working PlanetLab node. Stork has managed > 500K VMs and when I last checked was used daily by users on around two dozen sites. We initially used AppManager to deploy Stork, but since have been using PlanetLab’s initscript functionality.

San Fermin was deployed on PlanetLab for use in combining and managing logs from Stork. However, we found that San Fermin was dysfunctional due to difficulties in getting Pastry to start reliably and work when non-transitive connectivity occurs. As a result, we mainly ended up using San Fermin as a publication vehicle.

Seattle is currently deployed on more than 1100 computers around the world. We have done our initial deployment by encouraging educators to use our platform in networking and distributed systems classes. Our longer term plans involve using Seattle to build a research testbed of 1 million nodes.

How did you evaluate your system?
With San Fermin (and other systems research I haven’t mentioned), there is fairly clear related work to compare against. In some cases, the biggest challenge has been getting the existing research prototype code from another project to run well enough for comparison.

For the work that I’ve done where I’ve focused on real-world impact over publication impact (Seattle / Stork), evaluation is much more difficult because these aren’t incremental improvements over existing models and systems. These systems break the mold in terms of security and / or functionality so sometimes it’s difficult to know how to compare them to existing work.

Mar 14, 2009 - 1 minute read - Research academia implementation interviews tools

Systems Researchers Interview Series

After my last post on implementing Chord, I thought it might be insightful and educational to examine how others are building research distributed systems. I have asked a number of colleagues who have built successful research systems to answer a few questions about their implementations. Here, successful roughly means that the system has seen some broad use or adoption, or (and?) has been presented at a notable peer-reviewed conference. The questions to be answered:

  • What did you build?
  • Tell us about what you built it with. For example, what languages or 3rd party tools/libraries did you use? Why? What was your experience with them?
  • How did you test your system for correctness?
  • How did you deploy your system? How big of a deployment?
  • How did you evaluate your system?
  • Anything you’d like to add?

A few folks have agreed to participate; as I post their responses, I will update this post with their names and links to the interviews. If you are interested in participating, just send me an e-mail!

Participants (last updated 25 Mar 2009):

  • Mike Freedman (CoralCDN)
  • Justin Cappos (Stork, San Fermin, and Seattle)

Mar 1, 2009 - 3 minute read - Research Technology chord dht implementation

On implementing Chord

The Chord protocol dynamically constructs robust and scalable overlay networks that map a given key to an active node. The MIT PDOS Chord implementation has served as a reference implementation of Chord, and over the years has accumulated many tweaks and improvements. While the theoretical highlights have largely been documented in our publications, the implementation details may have been omitted. Given the relative multitude of re-implementations I know of, it seemed worthwhile to try and identify some of these lessons instead of leaving them to bit-rot inside our implementation. A basic understanding of Chord is assumed. It is also important to realize that MIT Chord was largely run over the wide-area, not in a data-center.

MIT Chord is implemented in C++ using libasync. We did not like Java because (at least a few years ago) PlanetLab did not make it easy to push a JRE to all your nodes and would also kill processes with large memory footprints (which seemed to often be Java VMs). So, using C++ made deployment to PlanetLab easier and the code more efficient. Today, I might use the STL and select bits from Boost, which would make the implementation accessible to a broader audience. I’d probably also use threads.

Because of its libasync roots, MIT Chord uses XDR and SunRPC for its inter-node communication. It would be better today to use something like Protocol Buffers or Thrift; the front-end should definitely support XML-RPC for maximum ease-of-application development. Other than NFS, basically no one uses SunRPC.
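
For what it’s worth, Python’s standard library makes an XML-RPC front-end nearly free; a sketch (assumed, not MIT code, with an in-memory dict standing in for the real DHT node) of exposing put/get that way might look like:

    from xmlrpc.server import SimpleXMLRPCServer

    store = {}   # stand-in for the real DHT node behind the front-end

    def dht_put(key_hex, value):
        store.setdefault(key_hex, []).append(value)
        return True

    def dht_get(key_hex):
        return store.get(key_hex, [])

    server = SimpleXMLRPCServer(("localhost", 5851), allow_none=True)
    server.register_function(dht_put, "put")
    server.register_function(dht_get, "get")
    # server.serve_forever()   # left commented so the sketch stays inert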

Chord uses its own transport called the Striped Transport Protocol (STP). Read the section, think about the problem, decide if STP is right for you. For actual data transfers, depending on your object size, it may be simpler to go with TCP. UDP without flow-control may be just fine for the routing RPCs, though managing retransmission timers and mapping it to node liveness will be lots of fun. Also read about Accordion’s approach to this issue.

Much of a DHT implementation revolves around keeping track of other nodes (e.g., your neighbors). The Chord paper describes the abstract successor list and finger table; our implementation keeps a bucket of nodes (the locationtable) that calculates successors and fingers on the fly from this table. The Chord stabilization protocol runs periodically to keep this bucket filled with relevant live nodes. This approach also let us easily implement Accordion in our codebase. A single reference-counted object is used to proxy remote nodes: this keeps all state about a remote node consistent, even if it is part of multiple in-flight operations.
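
Here is a hedged sketch of that “bucket of nodes” idea (in Python rather than the actual libasync C++, and greatly simplified): keep a single sorted table of known-live node IDs and derive successors and fingers from it on demand.

    import bisect

    class LocationTable:
        def __init__(self, my_id, ring_bits=160):
            self.my_id = my_id
            self.ring = 1 << ring_bits
            self.ids = []                 # sorted IDs of known live nodes

        def insert(self, node_id):
            i = bisect.bisect_left(self.ids, node_id)
            if i == len(self.ids) or self.ids[i] != node_id:
                self.ids.insert(i, node_id)

        def successor(self, key):
            """First known-live node at or after key, computed on the fly."""
            if not self.ids:
                return None
            i = bisect.bisect_left(self.ids, key)
            return self.ids[i % len(self.ids)]

        def fingers(self):
            """Chord-style fingers: successor(my_id + 2^k) for each k."""
            return [self.successor((self.my_id + (1 << k)) % self.ring)
                    for k in range(self.ring.bit_length() - 1)]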

Nodes often fail and MIT Chord works hard to stop using failed nodes, despite some Chord operations that pass node lists around, by tagging the node wire representation with the time since the node was last contacted directly. If node A has discovered for itself that node B has failed, it will ignore any mention of B from its neighbors that have not directly heard from B more recently. (This doesn’t quite work right in the case of asymmetrically reachable nodes, but it is unclear what to do with such nodes anyway.)
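
A small sketch of that freshness rule (illustrative only; the real code tags the node’s wire representation, which isn’t modeled here): gossip about a node carries how long ago the sender heard from it directly, and a node that has locally observed a failure discards any staler hearsay.

    import time

    class NodeView:
        def __init__(self):
            self.failed_at = {}           # node_id -> when we saw it fail

        def mark_failed(self, node_id):
            self.failed_at[node_id] = time.time()

        def worth_trying(self, node_id, age_seconds):
            """age_seconds: sender's time since last direct contact with node_id."""
            last_heard = time.time() - age_seconds
            died = self.failed_at.get(node_id)
            if died is not None and last_heard <= died:
                return False              # hearsay is older than our failure evidence
            return True                   # fresher than our knowledge; try it again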

Definitely include tools to view a live node’s state. If you’re feeling fancy, your nodes will speak HTTP on a special port and produce pretty graphics (MIT Chord doesn’t). This will significantly improve your ability to debug weird routing issues (remotely).
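
A cheap version of such a tool (a sketch of the idea, not how MIT Chord does it) is to serve a read-only JSON snapshot of routing state on a debug port, so you can poke at a remote node with curl or a browser:

    import json, threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def start_status_server(get_state, port=8421):
        """get_state: callable returning a JSON-serializable dict of node state."""
        class StatusHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                body = json.dumps(get_state(), indent=2).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

        server = HTTPServer(("", port), StatusHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        return server

    # e.g. start_status_server(lambda: {"successors": [...], "fingers": [...]})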

These thoughts are a bit haphazard but hopefully they contain some tidbits of interest. I’ve actually had relatively little interaction with other implementors of Chord so if you’re one of them, I’m curious how your experiences compare.

Feb 26, 2009 - 2 minute read - Research mendel rosenblum virtualization vmware

Mendel Rosenblum on Virtualization in Modern Computing Environments

Today, VMware co-founder Mendel Rosenblum gave a Dertouzos Lecture at MIT titled The Impact of Virtualization on Modern Computing Environments. His talk outlined the general function of virtualization as an interposing layer between the hardware and the operating system, and the evolution of functionality in this layer over the past decade—in short, the talk was largely a super high-level summary of VMware’s greatest hits, from server consolidation to VMotion to high-availability via record replay to virtual appliances. Overall, I was not particularly impressed, though I thought it was well delivered and probably educational for a non-VMware-employee audience.

The most interesting observation Mendel made was regarding the space between packaged software and software-as-a-service. In the former, the user is responsible for everything outside of software development: from hosting, to configuration, to maintenance. In the latter, everything other than the actual use is handled by the developers. The in-between space of deployment is enabled by virtualization: some of the maintenance/configuration can be handled by virtual appliances, and with the advent of vClouds and virtual data centers, you can transparently host things either internally or externally.

Anant Agarwal closed the Q&A section by asking Mendel to distill his years of successful academic and industry work into a single sentence of advice. To paraphrase, Mendel said to jump on opportunities that present themselves, for even if you land badly, most people find that the experience was worthwhile.

Feb 25, 2009 - 2 minute read - Technology android mobile virtualization mvp n800 nokia vmware vmworld

Mobile virtualization demonstration at VMworld

At today’s VMworld keynote, CTO Steve Herrod included a brief demonstration of the project that I have been working on at VMware since last May: the Mobile Virtualization Platform (MVP). One of the downsides of working in industry as opposed to academia is that you have to wait for big release dates such as this before being able to talk about your work, so I’m excited to have the opportunity to blog a little bit about MVP today.

Steve chatted with Jerry Chen, who showed a Nokia N800 running the MVP bare-metal hypervisor, which was hosting a Windows CE guest and an Android guest. The relevant part of the keynote, including the N800 loading and launching the Android VM, can be found on Vimeo. The demo was also shown in a booth: Senior R&D Director Julia Austin walked through the demo for ITPro, which has posted a 3-minute video of the full demo, which I’ve embedded at the end of this post.

Running Android and WinCE on an N800 is pretty cool in and of itself, but our demo also highlights one of the key use cases for mobile virtualization, namely the separation of work and personal settings. With your “work phone” as a virtual machine, it will be possible to easily back up and transfer your work settings from one phone to another, while keeping these settings cleanly partitioned from your personal settings. Similarly, when you upgrade a device to the next generation, you will be able to port your phone over with relatively little hassle.

By providing a stable virtual platform, MVP will also hopefully simplify application development by allowing developers to target MVP instead of myriad different devices. Defining the virtual platform and mapping it onto different physical platforms is an interesting challenge.

Some of the other challenges and benefits of virtualization in the mobile space are discussed in an eWeek article by one of the founders of VirtualLogix, a competitor in this space. At VMware, we are working hard on addressing these challenges with the aim of providing benefits to all stakeholders, from phone OEMs to enterprises to end-users. Hopefully, you’ll be able to see the results of this work in the palm of your hand in the near future!

Feb 15, 2009 - 4 minute read - Research chord dhash dht history sfs sfslite

The afterlife of systems research code

When a graduate student completes their PhD, the software they wrote for that degree begins an almost inexorable decline into obscurity. Other things become more important and, unless that code serves as a platform for further research or has spun off into a startup, there is precious little time to maintain and further develop code written for the degree. The code has already served a purpose—namely, advancing the state of the art in the form of research papers and a dissertation.

For example, one student I know wanted to test a new Byzantine Fault Tolerant system against an earlier one and was told that the old one only worked on one specific machine in the corner of the lab (running some ancient release of RedHat). Another student in my group was trying to compare file system synchronization tools and found that the published source for some tools was hopelessly incompatible with modern compilers and libraries; he wound up finding some old RedHat ISOs and testing in a virtual machine.

I’m thinking about this today in the context of some recent activity on the Chord mailing list about code I wrote for my PhD. The entirety of my time as a graduate student was spent hacking on the PDOS Chord/DHash implementation, which has served as the reference implementation of Chord in numerous papers. While I hesitate to declare this implementation dead, it has become clear from numerous mailing list postings that it will be increasingly hard for others to use the implementation if it is not more actively maintained.

Chord and DHash are implemented using the libasync C++ library, an obscure and mostly undocumented creation of Prof. David Mazieres, who wrote it as part of his thesis on a Self-Certifying File System (SFS) ten years ago, back when C++ compilers sucked and the STL didn’t work very portably. libasync offers a lot of nice features such as managing TCP buffering for nonblocking I/O and a comprehensive framework for writing user-level NFS servers; this came in handy for early Chord applications such as the Cooperative File System. As a result, the libasync toolkit was adopted by many PDOS research projects.

Unfortunately, as Prof. Mazieres moved on to other research projects, SFS releases and the libasync codebase languished. The libasync library was fortunately adopted by Max Krohn in the form of sfslite. I switched Chord to sfslite, so that it would be possible to point people at a real release instead of asking them to checkout some recent version of SFS from its (now defunct?) CVS server.

As part of sfslite, Max went on to develop tame—a language designed to simplify writing event driven programs with callbacks using libasync, and as part of my thesis, I started to write code in tame to help test it. When sfslite 1.x was released, tame had evolved from v1 to v2, which changed the syntax. Since I never had the time to take the potential stability hit involved in upgrading, Chord still relies on sfslite 0.8, which Max no longer has time to maintain.

Which brings us back to recent activity on the chord mailing list: as people try to build Chord with ever newer compilers and Linux versions, they run into problems finding a version of sfslite and Chord that build successfully. I still don’t have the time to update the Chord tree to use newer versions of sfslite (though it might not be too hard), nor do I have access to all the resources I used to test it in deployment. So, people are left to patch their own source trees and cobble together working bits.

I wish I had better news for people interested in Chord—there are a lot of useful lessons embedded in our implementation, and many reimplementations that might benefit from those lessons. Perhaps one day soon I will try to write up some of those implementation lessons. Or maybe someone would be willing to take over maintaining Chord. If you volunteer, I’d be happy to help you get started!

Jan 19, 2009 - 1 minute read - Technology Thoughts cellphone flash crowds networks scalability

Capacity planning for cell phone networks

The New York Times has an article today about how the inauguration crowd will test cellphone networks. They wrote:

Sprint Nextel, which said it had been planning for the inauguration since April, has also increased capacity of its cell sites and terrestrial transmission lines to prepare the network to sustain 10 to 15 times the number of users it would serve on its networks during a normal day.

Web services (such as Twitter, also cited in the article) also have to deal with scalability and planning to provide adequate (network) capacity. It’s interesting to see how difficult (and expensive!) it can be to deal with flash crowds in other systems.

Jan 10, 2009 - 2 minute read - Technology agpl hosting social networks tools web2.0

Web tools I wish I had at work

Most companies keep resources associated with on-going but as-yet-unreleased projects hidden from public view. Working at a big company has made me realize that many of the tools for organizing various bits of data that I took almost for granted as a grad student are not available inside the firewall. Here are just a few:

  • delicious, for social information management;
  • iGoogle, to allow per-user customizable homepages, organizing info from other services;
  • Google Reader, a web-based RSS reader (to aggregate internal RSS feeds);
  • Facebook/LinkedIn/Twitter, to develop internal/professional social networks, gaining awareness into what others in the company are working on and what their skills might be;
  • QuickTopic, to create quick impromptu archived discussions;
  • GMane, a central archive of all internal mailing lists, searchable, linkable, easy to use;
  • GitHub, high quality, self-service hosting for source code;
  • SlideShare/Scribd et al., for document sharing (instead of mailing around incompatible PPTX and DOCX files) and management.

Some of these services, like Yammer or GitHub, allow you to host functionality externally. However, how much do you trust them with your company’s secrets? Can you convince corporate IT to go along with this? If you can’t review the source code, how do you know it isn’t vulnerable to trivial password guessing attacks?

Open-source clones exist for many of these services but most are not at the level of polish that one can expect from the public services. Still, this can work well, for example with laconica or ReviewBoard.

Similarly, commercial alternatives exist but set a higher bar for internal deployment because money must be allocated to deploy them (in addition to people). On the plus side, commercial alternatives (e.g., ClearSpace) have support. On the minus side, commercial alternatives are generally less amenable to you hacking on the source for your own purposes.

In any event, the space of open-source/commercial alternatives simply doesn’t cover the space of cool web tools that exist today. The AGPL—which copylefts software used to provide services over a network—can help in the future but its impact is still some years off. I hope that, in time, tools like the ones I’ve highlighted here are made open-source or otherwise self-hostable, because without them some things at work are just a bit harder than I’d like them to be.