Thoughts on Systems

Emil Sit

May 13, 2006 - 6 minute read - Research conferences

NSDI 2006, Day 2 Morning Sessions

The second day of NSDI was the longest, with four technical sessions; the last was an extra-long two-hour session. This post summarizes the first two sessions.

Wide Area Network Services

Mike Freedman, author of the popular Coral content distribution service, opened the day by talking about OASIS. OASIS is designed to answer the following question: if you are building an Internet service with many distributed servers, how can you direct a given user to the closest (or fastest) server? This may involve geographic proximity (minimizing transmission latency) or selecting the least loaded server (minimizing service time). Your service might have broader goals as well, such as maximizing the number of clients served. OASIS solves this using two mechanisms.

  • First, using a set of core OASIS nodes and the machines from your service, it constructs a database mapping network prefixes (as advertised by BGP) to the closest server participating in OASIS. It uses the known latitude and longitude of the core nodes to help make this information shareable between services. OASIS also monitors load/availability information of the service nodes to know which nodes are useful.
  • Second, it provides a DNS (and programmatic RPC) interface so that clients can query the OASIS core to find a close node.
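
The two mechanisms above can be sketched roughly as follows. This is a toy illustration, not OASIS code: the prefix, host names, and load numbers are all invented, and proximity is simply encoded as list order rather than computed from coordinates.

```python
import ipaddress

# Hypothetical database built by the OASIS core: BGP prefixes mapped to
# nearby service replicas (nearest first), annotated with monitored load.
PREFIX_DB = {
    ipaddress.ip_network("18.0.0.0/8"): [
        ("replica-east.example.com", 0.9),     # nearest, but heavily loaded
        ("replica-central.example.com", 0.2),
        ("replica-west.example.com", 0.4),
    ],
}

def resolve(client_ip, max_load=0.8):
    """DNS-style lookup: nearest replica whose load is acceptable."""
    addr = ipaddress.ip_address(client_ip)
    for prefix, replicas in PREFIX_DB.items():
        if addr in prefix:
            for server, load in replicas:      # replicas are nearest-first
                if load < max_load:
                    return server
            return replicas[0][0]              # all busy: fall back to nearest
    return None

print(resolve("18.26.4.9"))  # -> replica-central.example.com
```

The point of the split is that the expensive part (mapping prefixes to nearby replicas) is precomputed and shared across services, while the per-query work is just a table lookup plus a load check.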

OASIS works very well in reducing end-to-end transfer times compared to other coordinate systems like Meridian (Cornell) and Vivaldi (MIT). I think his end-to-end performance graph is slightly misleading in that it magnifies the impact of poor coordinates; additionally, Vivaldi (for example) is solving a different problem: it requires no central infrastructure.

Jeremy Stribling spoke next; perhaps best known for his work in computer-generated papers, he was also the first person to record latency information between PlanetLab sites. Jeremy described OverCite, a system that spreads the load of CiteSeer over a set of cooperating nodes. CiteSeer indexes research papers, so distributing it involves solving three problems: storage of the index and cached papers, searching the index, and crawling the web for new papers. Storage of data and meta-data is handled using a DHT (which I am working on); the index is partitioned and replicated over the nodes in the system. Searching thus involves querying a node in each partition and coalescing the results. Crawling was not discussed. Performance measurements suggest that OverCite can scale linearly as more servers act as web front-ends, but more scalability testing is needed.
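
The query path can be sketched in a few lines. This is illustrative only (not OverCite's actual code): each partition is replicated across nodes, so a real search would pick one live replica per partition; here a dict stands in for each partition's inverted index.

```python
# Each entry stands in for one index partition; a real deployment would
# query one replica node per partition. Titles and scores are made up.
INDEX_PARTITIONS = [
    {"chord": [("Chord: A Scalable Peer-to-Peer Lookup Service", 0.95)]},
    {"chord": [("Serving DNS Using a Peer-to-Peer Lookup Service", 0.70)]},
]

def search(term, k=10):
    hits = []
    for partition in INDEX_PARTITIONS:            # one query per partition
        hits.extend(partition.get(term, []))
    hits.sort(key=lambda h: h[1], reverse=True)   # coalesce by score
    return [title for title, _ in hits[:k]]

print(search("chord"))
```

Because the partitions are disjoint, each node only searches a fraction of the index, and the coalescing step is a cheap merge of small per-partition result lists.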

The last talk of the session focused on Colyseus, a system for distributing the work of managing online multiplayer games in a peer-to-peer fashion. Ashwin Bharambe presented the work. The main goal of Colyseus is to handle the rapid update of location information for players and objects: players have to discover other players and objects in their vicinity and update state. How can this all be kept consistent in a peer-to-peer environment, while minimizing traffic and latency? Colyseus achieves this by using primary-backup replication and placing replicas on nodes that are near the primary in the game world. It takes advantage of game physics and knowledge of the game map to prefetch replicas of objects that are likely to become relevant. Colyseus is at the point where it can support Quake II-style maps but not yet World of Warcraft-scale systems.
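
The prefetching idea can be illustrated with a toy example. This is not Colyseus code; it just shows the flavor of using game physics (position plus velocity) to predict which objects will soon enter a player's area of interest, so their replicas can be fetched ahead of time. All constants and objects are invented.

```python
AOI_RADIUS = 50.0   # player's area of interest (game-world units)
LOOKAHEAD = 2.0     # seconds of physics prediction

def predicted_position(obj, dt=LOOKAHEAD):
    x, y = obj["pos"]
    vx, vy = obj["vel"]
    return (x + vx * dt, y + vy * dt)

def objects_to_prefetch(player_pos, world_objects):
    """Objects whose predicted position will fall inside the player's AOI."""
    px, py = player_pos
    prefetch = []
    for obj in world_objects:
        fx, fy = predicted_position(obj)
        if ((fx - px) ** 2 + (fy - py) ** 2) ** 0.5 <= AOI_RADIUS:
            prefetch.append(obj["id"])
    return prefetch

rocket = {"id": "rocket-1", "pos": (120.0, 0.0), "vel": (-40.0, 0.0)}
crate  = {"id": "crate-7",  "pos": (300.0, 0.0), "vel": (0.0, 0.0)}
print(objects_to_prefetch((0.0, 0.0), [rocket, crate]))  # -> ['rocket-1']
```

The incoming rocket is fetched before it arrives; the distant, stationary crate is not, which keeps replication traffic proportional to what the player will actually see.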

End System Design

The second session of the day focused on end-system designs. The first talk was about Na Kika, a system for moving complex application logic from central servers to edge servers. It is designed to enable cooperative applications, such as mashups. Nikolaos Michalakis described the basic architecture and experiences with converting a service to use Na Kika. Users publish scripts to edge servers, which are treated like static content as far as the distribution system is concerned. The trick is that these scripts are then executed and potentially composed on the edge servers in a sandbox. The script is given access to an incoming HTTP request as well as outgoing HTTP responses; it can choose to modify or handle either or both. By allowing the scripts to affect HTTP responses, Na Kika scripts can be essentially chained together. The questions about the talk focused on the security model: how can it manage potentially antagonistic scripts that are competing for resources? Na Kika's basic approach is to treat overload as congestion and shed requests, but more work is likely needed. However, at least one system has been converted to use Na Kika with a significant increase in performance.
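
The chaining model can be sketched as follows. This is my own minimal rendering of the idea, not Na Kika's API: each script sees the request and the current response and may rewrite the response, so running scripts in sequence composes them. All function and header names are invented.

```python
# Each "script" takes (request, response) and returns a possibly
# modified response; chaining them mimics the composition described above.
def add_cache_header(request, response):
    response["headers"]["Cache-Control"] = "max-age=60"
    return response

def rewrite_title(request, response):
    response["body"] = response["body"].replace("<title>hi</title>",
                                                "<title>HI</title>")
    return response

def run_chain(scripts, request, response):
    for script in scripts:          # each script sees the previous output
        response = script(request, response)
    return response

req = {"path": "/index.html"}
resp = {"body": "<title>hi</title>", "headers": {}}
final = run_chain([add_cache_header, rewrite_title], req, resp)
print(final["headers"]["Cache-Control"], final["body"])
```

Since a script only needs the request/response pair, the distribution system can keep treating scripts as opaque content and compose them at whichever edge server handles the request.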

Closing out the session were two talks by experienced faculty: Vivek Pai spoke first about connection conditioning, a way to use filters to capture common web request handling functionality; Tom Anderson closed the session by discussing a new approach to congestion control, PCP. These talks differed in feel from the other talks at the conference, perhaps due to greater familiarity between the audience and the speakers. For example, Matt Welsh and Vivek sparred over the performance of their respective well-known web servers.

The first observation of connection conditioning (CC) is that performance of web servers has gotten pretty good; most of the bottlenecks (e.g., linear time algorithms, heavy memory copies) that used to exist in the operating system have been removed. Further, most people stick with Apache, which isn't the fastest of all servers. The second observation is that many servers have to solve the same problems: for example, security bugs or request scheduling/prioritizing. Thus, maybe we should sacrifice some performance in order to make servers easier to write. CC implements phases of web request handling using short, simple, independent, and reusable filters; it connects them together using Unix domain sockets and file descriptor passing. Because the OS does most of the heavy lifting, these simple programs can get pretty good performance while modularizing solutions to common problems and simplifying the core server design. The short version? Flash with connection conditioning kicks butt.
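
The plumbing that makes this work is descriptor passing over Unix domain sockets (SCM_RIGHTS). Here is a self-contained sketch of that mechanism, using Python 3.9+'s `socket.send_fds`/`recv_fds` wrappers on a Unix-like system; a pipe stands in for an accepted client connection, and the two socketpair ends stand in for two filter processes.

```python
import os
import socket

# Two ends of a Unix domain socket, standing in for two filter stages.
filter_a, filter_b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Stand-in for an accepted client connection: a pipe with a request in it.
read_end, write_end = os.pipe()
os.write(write_end, b"GET / HTTP/1.0\r\n")

# "Filter A" hands the connection's file descriptor to "filter B";
# the kernel duplicates the descriptor into the receiver.
socket.send_fds(filter_a, [b"conn"], [read_end])

# "Filter B" receives the fd and keeps handling the same connection.
msg, fds, _, _ = socket.recv_fds(filter_b, 16, 1)
data = os.read(fds[0], 64)
print(msg, data)
```

In CC the stages are separate processes rather than one script, but the effect is the same: each filter does its one job and then passes the live connection along, with the kernel doing the heavy lifting.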

Tom Anderson closed out the session by trying to sell the audience on PCP, billed as the last congestion control paper you'll need to read. PCP, the Probe Control Protocol, is designed to replace TCP; TCP's congestion control is well known to be inefficient (e.g., poor utilization under load) and many, many papers have attempted to fix various parts of it in the past. Dina Katabi's XCP, for example, has much better performance than TCP but unfortunately requires explicit support in the network to tell each flow the correct rate to send at (to maximize throughput without loss). Since core network routers don't get new features too often, PCP emulates the network support to achieve the same results. It does this by probing the network and looking for queues: as Dina and her student Sachin Katti have observed in the past, queues often manifest themselves as dispersion in formerly equally spaced packets. Thus, PCP sends probes at a given rate (i.e., a particular spacing) and looks for changes in the spacing at the receiver. Tom showed many graphs (too many, perhaps), but all basically showing that PCP gets good performance and good utilization, and can interact effectively with existing TCP flows. While doubtless not the last word (Nick Feamster asked where the stability analysis was, for example), it sounds pretty cool and seems a more comprehensive, deployable solution than has been available in the past.
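
The dispersion test at the heart of the probing can be illustrated with a toy calculation. This is not PCP itself, just the intuition: probes leave the sender equally spaced, and if they arrive noticeably farther apart, they were queued, so the probed rate is too high. The timestamps and tolerance below are invented.

```python
def dispersed(send_times, recv_times, tolerance=0.10):
    """True if inter-packet gaps grew noticeably in transit (queueing)."""
    send_gaps = [b - a for a, b in zip(send_times, send_times[1:])]
    recv_gaps = [b - a for a, b in zip(recv_times, recv_times[1:])]
    avg_send = sum(send_gaps) / len(send_gaps)
    avg_recv = sum(recv_gaps) / len(recv_gaps)
    return avg_recv > avg_send * (1 + tolerance)

# Probes sent 10 ms apart (a candidate rate of 100 packets/sec).
sent = [0.000, 0.010, 0.020, 0.030]
idle_path = [0.050, 0.060, 0.070, 0.080]   # spacing preserved: no queue
busy_path = [0.050, 0.064, 0.079, 0.095]   # spacing grows: queueing

print(dispersed(sent, idle_path), dispersed(sent, busy_path))  # False True
```

A sender using this signal would back off when probes come back dispersed and ramp up when they do not, approximating what XCP's in-network feedback tells flows explicitly.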

Update: Corrected URL typo and name misspelling.