Thoughts on Systems

Emil Sit

Sep 22, 2006 - 4 minute read - Rants design transportation

MBTA Bus Idiocy

Today’s post is a cautionary tale for usability testing. The MBTA in Boston has been in the process of upgrading the entire T infrastructure to support automated fare collection, in the form of Charlie Tickets. The stated goal of these upgrades, including the new fare boxes being installed on buses, is to provide faster and simpler service. Public transportation is hard enough without slow boarding times. Unfortunately, the current impact of the new fare boxes (pdf) is the exact opposite of the goal: it makes boarding slower.

The old fare collection box allowed monthly pass holders to quickly swipe their card and board, taking less than half a second. The new box treats bus passes as Charlie Tickets which means they need to be inserted, read and then returned. This takes at least one second, maybe two. This means that during rush hour and at crowded stops, it will take at least twice as long to board all the passengers. I don’t take the T (subway) regularly but it may be a problem in subway stations as well. Bus drivers have noticed that this rapidly becomes ridiculous and have begun simply waving through monthly pass holders without requiring that the pass be read by the machine. I hope they weren’t relying on the fare box to measure ridership.

For the occasional rider, the new fare collection boxes accept only one coin at a time. The bus fare today is 90 cents, which takes at least five coins. The old boxes had a little funnel that let a handful of coins fall in all at once; now you have to carefully insert each coin, one at a time. Clever, no?

The main benefit of the new boxes today is their automated bill ingesters, which make life easier for the bus drivers, who previously used special “stuffing” implements to jam folded-up dollar bills into a tiny little slot. Unfortunately, that doesn’t really make up for the delays in getting people on the bus.

This latest upgrade only compounds the idiocy of the new buses that the MBTA has acquired and deployed on popular routes. Never mind that their brakes still squeal; the seat layout of these new buses is simply awful. The best way to see why the new layout sucks is to compare it to the old one with regard to how well it lets passengers board the bus. The old buses (some still in use today) consist of a single level reached by climbing a few steps from the curb. There is a section of facing seats, followed by a section of forward-facing seats, and then a section of seats arranged in a U at the back of the bus. The middle seats form two columns: a single seat separated from a double seat by a wide corridor.

The new buses are bi-level. The boarding area and the front two-thirds of the bus (up to the rear door) are level with the curb; the rear third is up several steps. The first set of seats is about the same, but the bulk of the seats are now all forward facing, forming two columns of double seats separated by a narrow corridor.

When the bus is crowded, say at rush hour, there are two problems. First, the corridor up the middle of the bus is so narrow that it is difficult to squeeze past people with their backpacks, shopping bags, and baby strollers. People stand near their friends or near the rear door and clog up the front of the bus. This is exacerbated by the second problem: the steps in the middle of the bus act as a major disincentive to move to the standing/seating area in the elevated rear third. The result? The front of the bus is densely and uncomfortably packed while the rear is relatively empty.

Perhaps there is hope that when the Charlie Card becomes available in 2007, it will restore fast and efficient boarding for regular commuters. That is what some Charlie supporters noted in response to a similar article on Bad Transit. Unfortunately, I suspect the buses will be around a lot longer. The MBTA ought to hire usability experts and run usability tests before approving bus (and subway) changes. It’s probably cheaper in the long run than upsetting and losing customers.

Jul 27, 2006 - 4 minute read - Research firefox mozilla programming tools web2.0

Robert O'Callahan visits MIT

Today, Robert O’Callahan stopped by MIT as part of his US Tour. He works for Novell and is one of the “super reviewers” at the Mozilla Foundation. If you use Firefox (like 63% of my visitors this week), you probably run code he’s touched. He also wrote TTSSH, an SSH client that I linked to from my homepage for many years. Robert came to give a talk about current fronts in the “browser wars.” It was less about Mozilla versus the world, though, and more about how to develop the web for the future, along with some insight into the development of Mozilla.

His thoughts on evolving the web with standards seem most similar to Mike Davidson’s view: instead of developing “shiny and new” standards with a focus on validation, he would rather embrace, document, and extend existing best practices. New functionality should be developed in a backwards-compatible manner and take into account problems (like security issues) that can arise when different standards interact in unexpected ways. He disparaged the compulsion of many researchers and parts of the W3C to design things from scratch; nothing winds up as clean as originally imagined (see SVG).

The impression I got is that standards writers operate in their own fiefdoms and don’t implement the full specs, so they don’t have the view of the world that a browser company like Mozilla or Opera might have. For example, validation and specs don’t help browsers deal with invalid input. Thus browser makers have banded together to form the WHAT Working Group to document de facto standards and produce specifications with developers in mind.

Some interesting nuggets from his talk:

  • Randomized “fuzz testing” is more effective than current tools like those from Coverity; it finds more bugs than they have time to fix. This is perhaps because the code base has very complex heap invariants that Coverity does not check for; more “custom” static analysis tools would be needed to check for those things. (A rough sketch of the fuzzing idea follows this list.)
  • The project’s regression tests are slow and not too complete. When asked about actual coverage, he wasn’t too sure but also indicated that there is a lot of dead code so strict coverage might not be as indicative. They have no automated UI testing.
  • Talkback is a tool that can send stack traces to the Mozilla Foundation when the browser crashes. It turns out that these traces are rarely used to debug individual problems; instead, they usually correspond to known bugs and help developers prioritize fixes based on how frequently the crashes happen in the wild.
  • Existing profiling tools aren’t good enough for them to discover the source of performance problems; they vary significantly from run-to-run and there is no good way to compare different runs, i.e., in order to find a change that introduced a performance problem. I wonder if DTrace would be at all useful.
  • The best external contributors to the project tend to be in quality assurance. Developers fall into two camps it seems: those that send one e-mail about helping but don’t make it very far, and those that show up with large patches solving real problems. Rob speculated that the build system served as a filter: if you could get past the build, you can probably produce useful code to address real issues.
  • In response to a question, he plugged microformats over RDF as being more likely to enable the Semantic Web. I saw Sir Tim Berners-Lee in the audience, but he didn’t say anything during the talk.
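
To make the first bullet concrete, here is what a bare-bones randomized fuzzer looks like in Python. The parse function standing in for the component under test, the mutation strategy, and the “bug” it trips over are all invented for illustration; none of this reflects Mozilla’s actual harness.

```python
import random

def parse(data: bytes) -> None:
    """Stand-in for the component under test (e.g., an HTML parser)."""
    text = data.decode("utf-8", errors="replace")
    if text.count("<") != text.count(">"):
        raise ValueError("unbalanced angle brackets")  # the pretend bug

def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Randomly flip, insert, or delete a few bytes of a seed input."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 8)):
        op = rng.choice(("flip", "insert", "delete"))
        if op == "flip" and data:
            data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
        elif op == "insert":
            data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
        elif op == "delete" and data:
            del data[rng.randrange(len(data))]
    return bytes(data)

def fuzz(seed: bytes, iterations: int = 10_000) -> list:
    """Throw mutated inputs at the parser and collect the ones that crash it."""
    rng = random.Random(0)
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            parse(candidate)
        except Exception:
            crashes.append(candidate)
    return crashes

if __name__ == "__main__":
    found = fuzz(b"<html><body>hello</body></html>")
    print(f"{len(found)} crashing inputs found")
```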

It seems that Mozilla will soon have improved forms (?), 2D and 3D graphics, and better font rendering (ligatures, etc.). I was interested to hear that they will be moving JavaScript more towards a Python-like flavor (as opposed to Java). He also mentioned the tantalizing possibility of adding type inference to JS2. Since he was speaking at MIT, he closed with a number of research directions; his suggestions were largely in the area of tool development: better profilers, better static analysis, better IDEs.

Jul 17, 2006 - 5 minute read - Rants hosting web2.0

Choosing Online Services

Declining storage costs, Web 2.0, and other trends have led to a profusion of online services clamoring to host your data. At this point, even if you are the most conservative user and a stalwart late adopter of online services, you have likely heard about a wide range of them: storing and sharing calendars, lists, photos, bookmarks, full system backups, and more. Accessible anywhere and offering ever more functionality, the choices are tempting. How can you evaluate a service to decide if it is for you?

First, you should treat it like any other product you might be committing to. Do you like the interface of the application? You may be using this every day. You don’t have to explore all of its features (though the primary one should be clear and easy to try), but if this is an application you will be using regularly, make sure you can figure out how to do the most basic tasks in just a few minutes. I personally prefer companies that provide log-in free demos of their software up front; for example, calendaring app 30boxes has one of the best examples of a quick demo. Free trials are the next best thing. Least useful are screenshots, though they are better than nothing.

As you evaluate the interface, you will also have started to evaluate its feature set. One way to gauge the features it has (or intentionally omits) is to compare against competing or similar services; each may have a slightly different focus. For example, photo host SmugMug offers image hosting with unlimited storage, generous bandwidth limits and customizable presentation. Flickr is a hugely popular photo host because of its focus on tagging and sharing. IcicleLanding’s main feature is its extremely comprehensive access control system, allowing you to set access to each individual image in a gallery. All of these services work great but one may suit your needs better. Spend some time trying to find the competition: chances are an enterprising blogger has already made up a list comparing the apps you’re considering.

Apart from these traditional concerns, online services raise a new one: how much control do you exercise over the data you store with the service? Can you request that your account and all associated data be deleted? Can you somehow extract a copy of all the data you’ve input (i.e., data export)? While RSS feeds are good for sharing, they aren’t especially suited for export. This may be an important consideration if you ever want to migrate to another service, at which point you might also be curious about the new service’s ability to import data. This may require the existence of a standard data format: easy enough for pictures or calendaring but harder for things like project management data or billing records. The best services offer an API that allows third parties (you!) to develop their own tools for accessing the data, which may solve the import/export problem if done well.
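
As a rough illustration of why an API matters for export, here is a minimal sketch of pulling your own data out of a photo host. The endpoint, parameters, and JSON shape are hypothetical; a real service (Flickr, SmugMug, etc.) documents its own API and authentication.

```python
import json
import urllib.request

API_BASE = "https://api.example-photo-host.com/v1"  # hypothetical service
API_KEY = "your-api-key"                             # hypothetical credential

def fetch_page(page: int) -> dict:
    """Fetch one page of photo metadata from the (hypothetical) export API."""
    url = f"{API_BASE}/photos?page={page}&per_page=100&api_key={API_KEY}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def export_all(path: str = "photos-export.json") -> None:
    """Walk every page of the API and dump the records to a local file."""
    photos, page = [], 1
    while True:
        batch = fetch_page(page)
        photos.extend(batch["photos"])
        if not batch.get("has_more"):
            break
        page += 1
    with open(path, "w") as f:
        json.dump(photos, f, indent=2)

if __name__ == "__main__":
    export_all()
```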

If you are aiming to store lots of data on a service, you may wonder how reliable their servers and disks are. Unfortunately, it is hard to find real reports from people who actually suffered catastrophic disk failures and were able to restore from an online service. You can settle for learning how they maintain your data. Does the service store more than one copy of your data in geographically distributed locations? (SmugMug keeps four copies of each picture in three different states.) What redundancy technology is used? Something well known like RAID? Or something cooked up internally, like the Google File System or Mozy’s proprietary technology based on “lots of hard math”? Do they rely on a single ISP or are they multi-homed? More redundancy will probably cost more money.

You should also be concerned with the reliability of the company itself. After all, you’re about to entrust them with potentially very personally relevant information. If it fails one day (or gets bought!) your data may disappear. How do you evaluate the character of a company? A quick trip to Google Blogsearch or Icerocket will reveal what others think, though often these are very superficial reports, rebroadcasting buzz.

If you are lucky, the company will host its own web forum or mailing list archives: you can quickly discover if there are any major problems, and the tone and responsiveness of the customer service staff. Beware that often the people with no problems may not frequent the forums; mostly, you will see three classes of people: new users, users with problems, and very experienced fans of the service answering questions. If a company has its own blog, you can see how they write and behave first hand; blogs are more informal and fun than traditional press releases but also often serve to announce important new features and document how the company responds to feedback when the inevitable bugs are found in these new features. Most exciting is to see how a company responds to a catastrophic failure of their own: fastmail.fm posted a complete description of what happened, why, and comped everyone a whole month of service.

New online services have a lot to offer; do your homework so you can best take advantage of them. Do you have your own techniques for evaluating a service? Please share!

Jun 3, 2006 - 2 minute read - Research conferences

Usenix 2006

Though I wasn’t really paying attention, it looks like the annual Usenix Technical Conference is well underway just down the street. If you happen to be in town and want to meet up, send me an e-mail.

Virtualization and large distributed systems continue to be hot topics. Larry Peterson gave the keynote about PlanetLab, the ever popular, sometimes frustrating, Internet-wide test bed. He was followed by a session on virtualization and performance. Andy Oram has some coverage with observations about the first day. Unlike NSDI, Usenix is a multi-track event, so he couldn’t see all the talks and events; his post touches on Larry’s talk, the following open-source panel, and a few technical highlights from the afternoon.

Today (Friday) there was an interesting-looking panel on whether university teaching is relevant to industry: this is a point I discussed in response to the interview Werner Vogels gave at ACM Queue. At least one blogger didn’t think much of the panel discussion, though; he raises a good point about fundamentals. There was also an invited talk about work being done at DE Shaw—I have detailed notes from a talk Dr. Shaw himself gave at MIT, which I could write up and post if there’s interest (leave a comment!). I’m also very curious about the Pixar talk from the closing plenary session; please post any links you might have to posts about that.

Saturday, there will be a talk about liblog, which I unfortunately won’t be able to attend. However, the paper looks interesting and similar in spirit to DTrace and Pip. I can’t use Pip or DTrace immediately—Pip requires explicit annotation and DTrace requires Solaris (or maybe FreeBSD-CURRENT). However, liblog is based on LD_PRELOAD tricks and might be quite useful; the download page has some binary-only packages that I may take a look at.

There are a few other talks and papers that look interesting—mostly those related to my current and past research. Trevor Blackwell and Paul Graham are also giving talks, which I imagine will receive more blog coverage than the rest.

Jun 1, 2006 - 4 minute read - Research conferences

NSDI 2006, Day 3

Finally, closing out my NSDI 2006 summaries: sessions from the last day. I hope you’ve found these useful; maybe they’ll inspire some interesting NSDI 2007 submissions. Get started today!

Wireless and Sensor Networks

Ming Li had the unenviable position of giving the first talk on the last day; much like my post-lunch talk on the first day, people drifted in during his talk. (On a personal note, Ming also had the unenviable task of taking a red-eye to Newark, NJ, followed by a connection to Hartford, CT, and a drive back to Amherst.)

Sensor network research usually focuses on transferring data using minimal power while giving up the least amount of accuracy and freshness. Ming’s talk on PRESTO showed his contribution to the area: instead of putting intelligence about what to send or request solely at the collection sites or in the sensors themselves, PRESTO splits that intelligence. Both the collection proxy and the sensors maintain a (simple) model to predict what is expected: the sensor need only send data to the proxy when it is anomalous. When anomalies are detected, both the proxy and the sensor can update their models if necessary; the proxy can also back-interpolate old data to increase accuracy based on the new data point. As a result, PRESTO is able to provide better accuracy using fewer transmissions and thus less power.
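
A minimal sketch of the split-intelligence idea, with a trivial two-point extrapolation standing in for PRESTO’s actual prediction model and an arbitrary error threshold; both sides track the same value sequence, so the proxy can reconstruct unreported readings on its own.

```python
THRESHOLD = 0.5  # acceptable prediction error; an arbitrary assumption

def predict(history):
    """Extrapolate the next reading from the last two (a stand-in model)."""
    if len(history) < 2:
        return history[-1] if history else 0.0
    return history[-1] + (history[-1] - history[-2])

def simulate(readings):
    """The sensor transmits only readings the shared model would get wrong."""
    shared_hist, proxy_view, transmissions = [], [], 0
    for actual in readings:
        expected = predict(shared_hist)
        if abs(actual - expected) > THRESHOLD:
            transmissions += 1
            value = actual        # anomalous: push the real value to the proxy
        else:
            value = expected      # proxy reconstructs the prediction locally
        shared_hist.append(value) # both sides append the same value, so the
        proxy_view.append(value)  # sensor and proxy models never diverge
    return proxy_view, transmissions

if __name__ == "__main__":
    trace = [20.0, 20.1, 20.2, 20.3, 25.0, 25.1, 25.2]  # a temperature spike
    reconstructed, sent = simulate(trace)
    print(f"sent {sent} of {len(trace)} readings; proxy sees {reconstructed}")
```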

Cheng Tien Ee spoke next about an algorithm called PathDCS. It’s also designed for efficiency in sensor networks in a model where data about specific types of observations (e.g., presence of elephants or tanks) are forwarded to a well-known rendezvous sensor-node for aggregation. Queries can then be sent directly to the rendezvous node. PathDCS makes use of reference points called beacons; data is forwarded towards beacons along existing paths. The details of the work are in how these paths are able to minimize the stretch (number of hops required relative to a direct path, which would be hard to calculate and disseminate).

The third talk focused on geographic routing, where each node attempts to route packets greedily to a neighbor closer to the destination, based on the known physical/geographic coordinates of its neighbors and the destination. Ben Leong, soon to be a professor at the National University of Singapore, described an algorithm for getting around the problem of dead ends in the topology. In a greedy algorithm, you might try to cross a lake by going to the end of a pier, but at that point you cannot get any closer without backtracking. Ben showed an algorithm for finding the shorter path around the outside of the lake without having to do an expensive technique called graph planarization. Instead, two opposing spanning trees are constructed; from the dead end, you walk towards the root of the closer spanning tree until you can see your destination. When you are close enough, you revert to greedy forwarding. This has worked extremely well in simulation and he hopes to validate it on real networks soon.
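
Here is a small sketch of the greedy forwarding step and the dead-end detection that motivates the recovery mode; the spanning-tree walk itself is only noted in a comment, and the toy topology is made up.

```python
import math

def dist(positions, a, b):
    """Euclidean distance between two nodes given their coordinates."""
    (x1, y1), (x2, y2) = positions[a], positions[b]
    return math.hypot(x1 - x2, y1 - y2)

def greedy_next_hop(positions, links, cur, dst):
    """Pick the neighbor strictly closer to the destination than we are;
    return None at a local minimum (the 'end of the pier')."""
    best, best_d = None, dist(positions, cur, dst)
    for n in links[cur]:
        d = dist(positions, n, dst)
        if d < best_d:
            best, best_d = n, d
    return best

def route(positions, links, src, dst, max_hops=50):
    """Greedily forward hop by hop until we reach dst or hit a dead end."""
    path, cur = [src], src
    for _ in range(max_hops):
        if cur == dst:
            return path
        nxt = greedy_next_hop(positions, links, cur, dst)
        if nxt is None:
            # This is where Ben Leong's algorithm would walk toward the root
            # of the closer of two spanning trees instead of giving up.
            path.append("<dead end>")
            return path
        cur = nxt
        path.append(cur)
    return path

if __name__ == "__main__":
    positions = {"A": (0, 0), "B": (1, 0), "C": (2, 1), "D": (3, 0)}
    links = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
    print(route(positions, links, "A", "D"))
```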

File and Storage Systems

The theme of the last session was revision trees for managing multiple versions of filesystems, disk images and directory trees.

Ben Pfaff gave the talk about Ventana, a Virtualization Aware File System. Ventana provides an NFS interface to multiple versions of filesystems in order to support versioning across virtual machines. The connection seems a little tenuous; however, the paper does explore to some extent the question of how to protect different versions of filesystems by introducing a number of different access control lists.

Olive is the latest extension to FAB, a system for building high-performance disk storage without using RAID. Olive allows administrators to branch disk volumes easily, allowing them to play with changes without worrying about messing up known-good volumes. By operating at the volume level, Olive can be filesystem agnostic. Volumes can be cloned into read-write copies or frozen into read-only snapshots. The cloning mechanism winds up being cheap and provides linearizable crash-consistency (probably the best you can do at the volume level, without knowledge of what FS is running over the volume).

Alex Yip closed the conference by introducing Pastwatch; Pastwatch is a system that provides distributed version control, much like Mercurial, git, and a cast of dozens more. Pastwatch uses revtrees, which are essentially tree snapshots, to automatically track and manage simultaneous writes and conflicts. It’s my understanding that the ideas used in Pastwatch are similar to those used in Mercurial; I suspect that Mercurial is significantly better tested, given that it has been adopted by OpenSolaris. However, a small group of researchers at PDOS do use Pastwatch regularly for production work. I can’t really keep up with all the different features and benefits of the different systems to do a better comparison. An interesting aspect of Pastwatch is Aqua, a system for highly-available hosting of Pastwatch projects. This area might be ultimately more relevant than the version control aspects of Pastwatch.

May 25, 2006 - 5 minute read - Research conferences

NSDI 2006, Day 2 Afternoon Sessions

Following up my summaries of the morning sessions, this post reviews the second half of the second day of NSDI 2006. (This post was updated to correct a bogus timestamp and add links to papers in the first session.)

Measurement and Analysis

Haifeng Yu began the afternoon session with his award-winning paper on the availability of multi-object operations (PDF). The main concern is that prior work has largely focused on the availability of single objects. However, it is often the case that many objects stored in a distributed system are related—in fact, an application-level operation likely requires accessing multiple objects at the storage layer. This paper looks at how object replica placement policies affect the ability of such operations to complete successfully. The basic trade-off is whether related objects are clustered on the same hosts (partitioning) or spread across multiple hosts (random placement). The paper discusses when one or the other may be more appropriate.

Suman Nath followed up with a related paper on correlated failures (PDF). The key points in this paper concern the accuracy and utility of failure prediction models and what can be done about them. Based on the analysis of some traces, Nath first argued that existing failure predictors work well enough but don’t necessarily help: correlated failures of small numbers of nodes tend not to have much impact anyway, so prediction doesn’t matter, whereas correlated failures of large numbers of nodes are quite hard to predict but tend to have a huge impact on availability or durability. This means, for example, that additional replicas or fragments quickly offer only diminishing returns. He introduced a model for generating failures based on a bi-exponential distribution that better matched the traces that were analyzed and argued that such considerations are important to keep in mind during design. The question naturally raised by this was how generalizable these results are, as different node populations will have very different failure distributions: choose your traces carefully.
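
To show what a bi-exponential failure model looks like (the actual parameterization and fitting in the paper differ), here is a sketch that samples correlated-failure event sizes from a mixture of a common small-scale exponential and a rare large-scale one; the parameters are illustrative only.

```python
import random

def biexponential(p_large, scale_small, scale_large, rng=random):
    """Sample a correlated-failure event size (number of nodes failing
    together) from a mixture of two exponentials: frequent small events
    plus rare large ones."""
    scale = scale_large if rng.random() < p_large else scale_small
    return max(1, round(rng.expovariate(1.0 / scale)))

if __name__ == "__main__":
    rng = random.Random(42)
    sizes = [biexponential(p_large=0.05, scale_small=2, scale_large=40, rng=rng)
             for _ in range(10_000)]
    big = sum(1 for s in sizes if s >= 20)
    print(f"mean event size {sum(sizes)/len(sizes):.1f}, "
          f"{big} of {len(sizes)} events hit 20+ nodes")
```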

Bianca Schroeder closed the session by presenting her experiences with various benchmarking tools (PDF). She observed that despite identical load parameters (e.g., requests per second), two different benchmarking tools could show very different results. The reason turns out to be whether requests are generated using an open or a closed system model. An open system model is one where “users” (and hence requests) arrive continuously: there is no linkage from the completion of one request to the arrival of the next. Such a model might accurately capture the effect of Slashdot flash crowds. A closed system model simulates some fixed number of users executing a browsing workload: here request completions spawn new requests. The paper goes into detail about different web benchmarking tools, the models they use, and when it may be appropriate to use one type of model or the other.
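
A toy pair of load generators illustrating the distinction, not the actual benchmarking tools from the paper; note that the closed-loop sketch uses a fixed service time, whereas in a real closed system completion times, and hence the request rate, depend on server load.

```python
import random

def open_loop(rate, duration, rng):
    """Open system: request arrivals follow a Poisson process and are
    completely independent of when earlier requests finish."""
    t, arrivals = 0.0, []
    while t < duration:
        t += rng.expovariate(rate)          # exponential inter-arrival gaps
        arrivals.append(t)
    return arrivals

def closed_loop(num_users, think_time, service_time, duration, rng):
    """Closed system: a fixed population of users; each user issues its next
    request only after the previous one completes plus some think time."""
    arrivals = []
    for _ in range(num_users):
        t = rng.uniform(0, think_time)      # stagger user start times
        while t < duration:
            arrivals.append(t)
            t += service_time + rng.expovariate(1.0 / think_time)
    return sorted(arrivals)

if __name__ == "__main__":
    rng = random.Random(1)
    print(len(open_loop(rate=10, duration=60, rng=rng)), "open-loop requests")
    print(len(closed_loop(num_users=20, think_time=2.0, service_time=0.1,
                          duration=60, rng=rng)), "closed-loop requests")
```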

Architectures and Abstractions (and Email)

The final session of the day had two talks about Internet architectures and two about dealing with spam.

Niraj Tolia spoke about data-oriented transfer, an architecture for Internet data transfer designed to make innovation in data transfer protocols easier. Currently many protocols (e.g., HTTP, SMTP, SFTP) involve two basic phases: content negotiation and then content transfer. Negotiation involves selection of content (perhaps based on language, supported encoding or encryption algorithms, etc.); transfer is just moving the actual bits around. There are many mechanisms for moving bits around, from HTTP chunked encoding to straight TCP writes to BitTorrent. However, adding support for a new bit moving mechanism to each protocol is difficult. The paper proposes an architecture that separates negotiation from delivery and shows how it can be easily incorporated. The concept seems similar to that of GSSAPI or PAM, which tries to abstract out security/authentication protocols. I suspect it will be hard for this idea to gain traction though.

Dilip Joseph spoke about Ocala, which is designed to facilitate interoperability between varying overlays. It does this by (naturally) introducing a layer of abstraction between applications and the underlying overlay. Additional naming tricks allow users to cleanly specify which overlay they want to use. The implementation is complete and runs on the major operating systems; Dilip gave a live demo during the talk.

Cleverly, the program committee placed two alternative proposals for handling spam together at the end of the longest day: everyone has a stake in spam, and anti-spam proposals always face harsh criticism, so two competing proposals were sure to generate some discussion.

Mike Walfish started off with the grand vision of distributed quota enforcement (DQE): a distributed mechanism for ensuring that no one (spammer or not) sends more than an allocated amount of mail per day. Since there is a huge disparity between the number of mail messages a spammer sends per day and the number that you and I send, this concept should work out quite nicely. It would avoid perhaps the worst problem of all: legitimate mail that gets mis-filed as spam and subsequently ignored. Mike carefully deflected criticism of the concept by focusing on the technical aspects of quota enforcement: stamps (not necessarily associated with money) are generated based on the quota and canceled in a loosely managed distributed system. The simple requirements allow for a relatively efficient solution.
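
A minimal sketch of the stamp idea under simplifying assumptions: stamps are just random tokens, the quota is a made-up constant, and the cancellation set lives in one process rather than in DQE’s loosely managed distributed store.

```python
import hashlib
import os

DAILY_QUOTA = 100  # stamps a sender may mint per day; an illustrative number

class Sender:
    def __init__(self):
        self.issued = 0

    def mint_stamp(self):
        """Mint a fresh stamp, or None once the daily quota is exhausted."""
        if self.issued >= DAILY_QUOTA:
            return None
        self.issued += 1
        return hashlib.sha256(os.urandom(32)).hexdigest()

class QuotaEnforcer:
    """Stand-in for the distributed 'canceled stamp' store: remembers which
    stamps have been spent and rejects any reuse."""
    def __init__(self):
        self.canceled = set()

    def accept(self, stamp):
        if stamp is None or stamp in self.canceled:
            return False           # over quota, or stamp already spent
        self.canceled.add(stamp)   # cancel it so it cannot be spent again
        return True

if __name__ == "__main__":
    alice, enforcer = Sender(), QuotaEnforcer()
    s = alice.mint_stamp()
    print(enforcer.accept(s))  # True: a fresh stamp
    print(enforcer.accept(s))  # False: already canceled
```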

Michael Kaminsky followed with a whitelist proposal based on social networks, also with the goal of avoiding false positives. Re allows mail through without spam filtering if it is from a friend or a friend of a friend. The details of Re are in how it provides friend lists and friend-of-friend checking without sacrificing privacy: friend lists should stay private and queries should reveal only information about the person sending mail (do we have a friend in common?). The authors use a relatively new cryptographic mechanism to perform this matching while protecting confidentiality; Michael briefly discussed the evaluation against real e-mail traces, where false positives are inferred based on other known patterns. Unlike DQE, Re can be deployed at smaller scale but may not be as comprehensive; it also requires more work on the part of users (specifying friend lists).
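
Ignoring the privacy-preserving cryptography entirely, the underlying whitelist check is just a friend-of-friend test; here is a sketch with an invented friend graph.

```python
def allowed_without_filtering(sender, recipient, friends):
    """friends maps each user to the set of users they list as friends.
    Accept mail unfiltered if the sender is a friend or a friend of a friend.
    (Re does this check with cryptographic private matching so neither side
    learns the other's full friend list; that machinery is omitted here.)"""
    my_friends = friends.get(recipient, set())
    if sender in my_friends:
        return True
    sender_friends = friends.get(sender, set())
    return bool(my_friends & sender_friends)   # do we share a friend?

if __name__ == "__main__":
    friends = {
        "alice": {"bob"},
        "bob": {"alice", "carol"},
        "carol": {"bob"},
        "mallory": set(),
    }
    print(allowed_without_filtering("carol", "alice", friends))    # True, via bob
    print(allowed_without_filtering("mallory", "alice", friends))  # False
```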

May 23, 2006 - 2 minute read - Technology debugging tools

First steps with DTrace

Sometime last week I came across a reference to DTrace; since then I have spent a not insignificant amount of time reading up on it, culminating in downloading the VMware image of Nexenta, the Ubuntu of OpenSolaris, just to play with it. DTrace is quite exciting to me—it is an extension of tools like truss/strace/ktrace and ltrace that I have long used to trace system activity and debug otherwise mysterious problems. (If you’re not familiar with strace, Self-Service Linux is a book that gives a nice introduction to system debugging using strace and other such tools; you can buy it or download a PDF for free.) DTrace can also do basic profiling (much like pct) and is programmable (like Linux’s inotify).

Like ktrace and friends, DTrace allows you to observe operating system user/kernel behavior but it has at least two major benefits: it has a simple programming language that lets you specify (perhaps speculatively!) and aggregate information exactly how you want it, and its extensible architecture allows events to be defined for kernel internals up through user-space middleware frameworks and high-level languages.

After reading some documentation and articles describing the cool things you could do with DTrace, I really wanted to try it: I had visions of hooking it into SFS and Chord to track down the sources of our disk and memory load with great precision. Running Solaris turned out to be easy: VMware’s Server product is “free” and Nexenta offers a pre-installed VMware image. Naturally, the problem turned out to be in getting SFS to compile on Solaris, which I have not yet done successfully.

In the meantime, I’ve just worked through some bits of a DTrace tutorial. The basic language is easy to learn; my only concern so far is that the /self->follow/ idiom for tracing “between” actions seems somewhat onerous. The hard part is internalizing the capabilities of providers and the common D idioms. The authors themselves point out other DTrace tips and caveats. However, these downsides are minor given what you can do with the architecture.

The power DTrace offers is quite appealing. It is exciting enough that it makes me consider using Solaris, at least on a VMware Server. And perhaps more exciting, in the long-term, DTrace will get incorporated into more operating systems as evidenced by efforts to port DTrace to FreeBSD. Combined with distributed tools like Pip, debugging complex systems might just be getting easier.

May 17, 2006 - 3 minute read - Research academia programming workflow

Werner Vogels on Systems Research

In his interview with Jim Gray, Werner Vogels talks about how Amazon.com structures and builds its internal systems. While many others have noted his comments on web technologies and development methods, I am more interested in a few points he raised at the end about building and testing distributed systems and what those of us in academic systems research can do to help.

Building distributed systems is not easy; Vogels notes:

I have recently seen a few papers of students detailing experiences with building and operating distributed systems in a planet-lab environment. When analyzing the experiences in these papers, the main point appears to be that engineering distributed systems is an art that Ph.D. students in general do not yet possess. And I don’t mean reasoning about hardcore complex technologies—students are very good at that—but building production-style distributed services requires a whole set of different skills that you will never encounter in a lab.

He’s referring to papers written by graduate students around the world as we try to build systems on the largest distributed network we have access to, PlanetLab. We have each spent a huge amount of time developing the infrastructure and algorithms necessary to run and test systems such as Coral, Codeen, CoDoNs or OpenDHT (to highlight a few of the more successful ones). This infrastructure includes tools to monitor and control nodes, rapidly distribute binary updates (Shark, Stork, CoBlitz), and select appropriate nodes (closestnode and Oasis). This work has wound up being a necessary precursor to the real research needed to get a PhD.

Despite the existence of these and other tools, the knowledge is not so well documented and understood that it is easy to build out and test a new distributed system! You merely have to scan the archives of the PlanetLab Users mailing list to see how many researchers are struggling to install basic software into their PlanetLab experiments. Hopefully some of these now-built infrastructure services will help out newer students.

We have probably made less progress in testing. Vogels responds to Jim Gray’s question of “What are the things that are driving you crazy?” by saying:

How do you test in an environment like Amazon? Do we build another Amazon.test somewhere, which has the same number of machines, the same number of data centers, the same number of customers, and the same data sets? […] Testing in a very large-scale distributed setting is a major challenge.

For research, it is in fact difficult to reproduce results on PlanetLab. We are fortunate to have Emulab, which gives us a controlled environment for prototyping. But researchers often lack any idea of what real application workloads might look like; we are only just publishing papers that show the impact of different synthetic workload generators and developing trace archives (like UCRchive, the datapository, or availability traces). The measurement community is relatively young and I think it has been hard to get real traces of real workloads, especially from successful commercial sites that tend to be pretty secretive about their special sauce and workloads. So, I’m excited to read that:

We’re building data sets here at Amazon, however, to provide to academics so that we can get interactions going on some of the issues where they can contribute.

There have been some limited analyses of data from eBay and Akamai, but hopefully Amazon will make their trace data more publicly available.

The next few years will likely be very exciting in this field, as academia and industry are both facing real problems and working hard to solve them. Any sharing of information and experience between researchers and industry engineers will doubtless play a big role in making progress.

May 13, 2006 - 6 minute read - Research conferences

NSDI 2006, Day 2 Morning Sessions

The second day of NSDI was the longest, with four technical sessions, the last of which was an extra-long two-hour session. This post summarizes the first two sessions.

Wide Area Network Services

Mike Freedman, author of the popular Coral content distribution service, opened the day by talking about OASIS. OASIS is designed to answer the following question: if you are building an Internet service with many distributed servers, how can you direct a given user to the closest (or fastest) server? This may involve geographic proximity (minimizing transmission latency) or selecting the least loaded server (minimizing service time). Your service might have broader goals as well, such as maximizing the number of clients served. OASIS solves this using two mechanisms (a toy sketch of the resulting lookup follows the list).

  • First, using a set of core OASIS nodes and the machines from your service, it constructs a database mapping network prefixes (as advertised by BGP) to the closest server participating in OASIS. It uses the known latitude and longitude of the core nodes to help make this information shareable between services. OASIS also monitors load/availability information of the service nodes to know which nodes are useful.
  • Second, it provides a DNS (and programmatic RPC) interface so that clients can query the OASIS core to find a close node.
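
The toy lookup mentioned above might look like this: a longest-prefix match from network prefixes to a “closest” server, with made-up prefixes and server names and none of the real system’s BGP feeds, coordinates, or load monitoring.

```python
import ipaddress

# Toy version of the OASIS database: network prefix -> best server for
# clients in that prefix. Prefixes and server names are invented.
PREFIX_TO_SERVER = {
    ipaddress.ip_network("18.0.0.0/8"): "server-boston.example.org",
    ipaddress.ip_network("128.112.0.0/16"): "server-princeton.example.org",
    ipaddress.ip_network("0.0.0.0/0"): "server-default.example.org",
}

def closest_server(client_ip: str) -> str:
    """Longest-prefix match of the client address against the database,
    mimicking what the OASIS DNS front end would answer."""
    addr = ipaddress.ip_address(client_ip)
    matches = [net for net in PREFIX_TO_SERVER if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return PREFIX_TO_SERVER[best]

if __name__ == "__main__":
    print(closest_server("18.26.4.9"))      # -> server-boston.example.org
    print(closest_server("203.0.113.50"))   # -> server-default.example.org
```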

OASIS works very well in reducing end-to-end transfer times compared to other coordinate systems like Meridian (Cornell) and Vivaldi (MIT). I think his end-to-end performance graph is slightly misleading in that it magnifies the impact of poor coordinates; additionally, Vivaldi (for example) seeks to solve a different problem: Vivaldi requires no central infrastructure.

Jeremy Stribling spoke next; perhaps best known for his work on computer-generated papers, but also the first person to record latency information between PlanetLab sites, Jeremy described OverCite, a system that spreads the load of CiteSeer over a set of cooperating nodes. CiteSeer indexes research papers, so distributing it involves solving three problems: storing the index and cached papers, searching the index, and crawling the web for new papers. Storage of data and meta-data is handled using a DHT (which I am working on); the index is partitioned and replicated over the nodes in the system. Searching thus involves querying a node in each partition and coalescing the results. Crawling was not discussed. Performance measurements suggest that OverCite can scale linearly as more servers act as web front-ends, but more scalability testing is needed.

The last talk of the session focused on Colyseus, a system for distributing the work of managing online multiplayer games in a peer-to-peer fashion. Ashwin Bharambe presented the work. The main goal of Colyseus is to handle the rapid update of location information for players and objects: players have to discover other players and objects in their vicinity and update state. How can this all be kept consistent in a peer-to-peer environment while minimizing traffic and latency? Colyseus achieves this by using primary-backup replication and placing replicas on nodes that are near the primary in the game world. It takes advantage of game physics and knowledge of the game map to prefetch replicas of objects that are likely to become relevant. Colyseus is at the point where it can support Quake II-style maps, but not yet World of Warcraft-scale games.

End System Design

The second session of the day focused on end-system designs. The first talk was about Na Kika, a system for moving complex application logic from central servers to edge servers. It is designed to enable cooperative applications, such as mashups. Nikolaos Michalakis described the basic architecture and experiences with converting a service to use Na Kika. Users publish scripts to edge servers, which are treated like static content as far as the distribution system is concerned. The trick is that these scripts are then executed, and potentially composed, on the edge servers in a sandbox. A script is given access to an incoming HTTP request as well as outgoing HTTP responses; it can choose to modify or handle either or both. By allowing scripts to affect HTTP responses, Na Kika scripts can essentially be chained together. The questions about the talk focused on the security model: how can it manage potentially antagonistic scripts that are competing for resources? Na Kika’s basic approach is to treat overload as congestion and shed requests, but more work is likely needed. However, at least one system has been converted to use Na Kika with a significant increase in performance.
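
A toy version of the chaining idea: each edge script sees the request and the response so far and may modify either. The request/response shapes and the example scripts are invented for illustration, not Na Kika’s actual interface.

```python
def add_cache_header(request, response):
    """Example script: mark the response cacheable at the edge."""
    response["headers"]["Cache-Control"] = "max-age=60"
    return request, response

def rewrite_images(request, response):
    """Example script: substitute a resized image for large originals."""
    if request["path"].endswith(".jpg"):
        response["body"] = b"<resized image bytes>"
    return request, response

def run_chain(request, scripts):
    """Run each script in order; later scripts see earlier scripts' edits."""
    response = {"status": 200, "headers": {}, "body": b""}
    for script in scripts:
        request, response = script(request, response)
    return response

if __name__ == "__main__":
    req = {"path": "/photos/cat.jpg", "headers": {}}
    print(run_chain(req, [rewrite_images, add_cache_header]))
```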

Closing out the session were two talks by experienced faculty: Vivek Pai spoke first about connection conditioning, a way to use filters to capture common web request handling functionality; Tom Anderson closed the session by discussing a new approach to congestion control, PCP. These talks differed in feel from the rest of the conference, perhaps due to greater familiarity between the audience and the speakers. For example, Matt Welsh and Vivek sparred over the performance of their respective well-known web servers.

The first observation behind connection conditioning (CC) is that the performance of web servers has gotten pretty good; most of the bottlenecks (e.g., linear-time algorithms, heavy memory copies) that used to exist in the operating system have been removed. Further, most people stick with Apache, which isn’t the fastest of all servers. The second observation is that many servers have to solve the same problems: for example, security bugs or request scheduling/prioritizing. Thus, maybe we should sacrifice some performance in order to make servers easier to write. CC implements phases of web request handling using short, simple, independent, and reusable filters; it connects them together using Unix domain sockets and file descriptor passing. Because the OS does most of the heavy lifting, these simple programs can get pretty good performance while modularizing solutions to common problems and simplifying the core server design. The short version? Flash with connection conditioning kicks butt.
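
To show the plumbing that makes such filters cheap, here is a sketch of handing an accepted connection from a filter process to a server process over a Unix domain socket (Python 3.9+ on a Unix-like OS). The filter’s “check” and the handoff message are made up, and a rejected request would simply leave this toy server waiting.

```python
import os
import socket

def filter_process(handoff):
    """Accept a client, apply a trivial check, pass good connections onward."""
    listener = socket.create_server(("127.0.0.1", 8080))
    client, _ = listener.accept()
    peek = client.recv(4, socket.MSG_PEEK)       # look without consuming
    if peek.startswith(b"GET") or peek.startswith(b"POST"):
        # Hand the still-open connection (as a raw fd) to the server process.
        socket.send_fds(handoff, [b"conn"], [client.fileno()])
    client.close()  # our copy; the in-flight duplicate stays open

def server_process(handoff):
    """Receive the already-accepted, already-vetted connection and serve it."""
    _, fds, _, _ = socket.recv_fds(handoff, 1024, 1)
    conn = socket.socket(fileno=fds[0])
    conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello from the real server\r\n")
    conn.close()

if __name__ == "__main__":
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    if os.fork() == 0:      # child plays the filter
        filter_process(a)
        os._exit(0)
    server_process(b)       # parent plays the real server
```

Run it and point curl at http://127.0.0.1:8080/ to watch the vetted connection get answered by the server process.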

Tom Anderson closed out the session by trying to sell the audience on PCP, billed as the last congestion control paper you’ll need to read. PCP, the Probe Control Protocol, is designed to replace TCP; TCP’s congestion control is well known to be inefficient (e.g., poor utilization under load) and many, many papers have attempted to fix various parts of it in the past. Dina Katabi’s XCP, for example, has much better performance than TCP but unfortunately requires explicit support in the network to tell each flow the correct rate to send at (to maximize throughput without loss). Since core network routers don’t get new features often, PCP emulates that network support to achieve the same results. It does this by probing the network and looking for queues: as Dina and her student Sachin Katti have observed in the past, queues often manifest themselves as dispersion in formerly equally spaced packets. Thus, PCP sends probes at a given rate (i.e., with a particular spacing) and looks for changes in the spacing at the receiver. Tom showed many graphs (too many, perhaps), all basically showing that PCP gets good performance and good utilization and can interact effectively with existing TCP flows. While doubtless not the last word (Nick Feamster asked where the stability analysis was, for example), it sounds pretty cool and a more comprehensive, deployable solution than has been available in the past.
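
A back-of-the-envelope sketch of the dispersion signal: push equally spaced probes through a single FIFO bottleneck and look at the spacing on the far side. The link model is deliberately simplistic and the rates are arbitrary.

```python
def receiver_spacings(probe_rate, link_rate, n_probes=10):
    """Send n equally spaced probes (probe_rate packets/sec) through a single
    FIFO bottleneck draining at link_rate packets/sec; return the gaps between
    arrivals at the receiver. If probe_rate exceeds the available rate, a
    queue builds and the gaps stretch: the dispersion PCP looks for."""
    send_gap = 1.0 / probe_rate
    service = 1.0 / link_rate
    arrivals, link_free = [], 0.0
    for i in range(n_probes):
        t_sent = i * send_gap
        start = max(t_sent, link_free)   # wait behind any queued probes
        link_free = start + service
        arrivals.append(link_free)
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

if __name__ == "__main__":
    print(receiver_spacings(probe_rate=100, link_rate=200))  # gaps preserved (0.01 s)
    print(receiver_spacings(probe_rate=100, link_rate=50))   # gaps stretched (0.02 s)
```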

Update: Corrected a URL typo and a name misspelling.