Thoughts on Systems

Emil Sit

May 17, 2006 - Research, academia, programming, workflow

Werner Vogels on Systems Research

In his interview with Jim Gray, Werner Vogels talks about how Amazon.com structures and builds its internal systems. While many others have noted his comments on web technologies and development methods, I am more interested in a few points he raised at the end about building and testing distributed systems and what those of us in academic systems research can do to help.

Building distributed systems is not easy; Vogels notes:

I have recently seen a few papers of students detailing experiences with building and operating distributed systems in a planet-lab environment. When analyzing the experiences in these papers, the main point appears to be that engineering distributed systems is an art that Ph.D. students in general do not yet possess. And I don’t mean reasoning about hardcore complex technologies—students are very good at that—but building production-style distributed services requires a whole set of different skills that you will never encounter in a lab.

He’s referring to papers written by graduate students around the world as we try to build systems on the largest distributed testbed we have access to, PlanetLab. We have each spent a huge amount of time developing the infrastructure and algorithms necessary to run and test systems such as Coral, CoDeeN, CoDoNS, and OpenDHT (to highlight a few of the more successful ones). This infrastructure includes tools to monitor and control nodes, rapidly distribute binary updates (Shark, Stork, CoBlitz), and select appropriate nodes (closestnode and Oasis). This work has wound up being a necessary precursor to the real research needed to earn a PhD.
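
To make concrete the kind of plumbing this involves, here is a minimal sketch of the sort of deployment helper many PlanetLab groups ended up hand-rolling before tools like Stork and CoBlitz matured: push a freshly built binary to every node in a slice over ssh, in parallel. The slice name, node list, and paths are all hypothetical, and this is not the interface of any of the tools named above.

```python
# Hypothetical sketch: parallel binary distribution to PlanetLab nodes.
# None of the names below refer to real slices, hosts, or services.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SLICE = "mit_example"  # hypothetical slice login, used as the ssh user
NODES = [
    "planetlab1.example.edu",  # placeholder hostnames
    "planetlab2.example.org",
]
BINARY = "./myservice"  # locally built binary to distribute

def deploy(node: str) -> str:
    host = f"{SLICE}@{node}"
    # Copy the new binary into the slice's home directory...
    subprocess.run(["scp", "-q", BINARY, f"{host}:myservice"],
                   check=True, timeout=120)
    # ...then kill any old instance and restart it detached from the session.
    subprocess.run(["ssh", host,
                    "pkill -x myservice; nohup ./myservice >service.log 2>&1 &"],
                   check=True, timeout=60)
    return node

# Fan out across the slice; a real tool would also retry and report failures.
with ThreadPoolExecutor(max_workers=32) as pool:
    for done in pool.map(deploy, NODES):
        print("updated", done)
```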

Despite the existence of these and other tools, the knowledge is not yet well enough documented and understood to make building and testing a new distributed system easy. You need only scan the archives of the PlanetLab Users mailing list to see how many researchers struggle to install even basic software into their PlanetLab experiments. Hopefully some of these now-built infrastructure services will smooth the way for newer students.

We have probably made less progress in testing. Vogels responds to Jim Gray’s question of “What are the things that are driving you crazy?” by saying:

How do you test in an environment like Amazon? Do we build another Amazon.test somewhere, which has the same number of machines, the same number of data centers, the same number of customers, and the same data sets? […] Testing in a very large-scale distributed setting is a major challenge.

For research, it is in fact difficult to reproduce results on PlanetLab. We are fortunate to have Emulab, which gives us a controlled environment for prototyping. But researchers often have little idea what real application workloads look like; we are only just publishing papers that show the impact of different synthetic workload generators (a minimal sketch of such a generator appears below) and only just developing trace archives (like UCRchive, the datapository, or availability traces). The measurement community is relatively young, and I think it has been hard to get traces of real workloads, especially from successful commercial sites, which tend to be pretty secretive about their special sauce. So I’m excited to read that:

We’re building data sets here at Amazon, however, to provide to academics so that we can get interactions going on some of the issues where they can contribute.

There have been some limited analyses of data from eBay and Akamai, but hopefully Amazon will make its trace data more broadly available.
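
Absent real traces, a synthetic generator is often as simple as the following minimal sketch: Poisson request arrivals over a Zipf-distributed object popularity, a common stand-in in the web caching and DHT literature. All parameters here are illustrative, not drawn from any published trace.

```python
# Hypothetical sketch of a synthetic request-trace generator:
# Poisson arrivals, Zipf-skewed key popularity. Parameters are illustrative.
import random
from itertools import accumulate

NUM_KEYS = 10_000   # size of the simulated object population
ZIPF_ALPHA = 0.9    # popularity skew; larger means more skewed
RATE = 50.0         # mean requests per second (Poisson arrivals)
DURATION = 60.0     # seconds of simulated trace

# Precompute the cumulative Zipf weights over key ranks once.
weights = [1.0 / rank ** ZIPF_ALPHA for rank in range(1, NUM_KEYS + 1)]
cum_weights = list(accumulate(weights))

def generate_trace():
    """Yield (timestamp, key) pairs until DURATION is exhausted."""
    t = 0.0
    while True:
        t += random.expovariate(RATE)  # exponential inter-arrival gaps
        if t >= DURATION:
            return
        key = random.choices(range(NUM_KEYS), cum_weights=cum_weights)[0]
        yield (t, key)

for timestamp, key in generate_trace():
    print(f"{timestamp:.3f} GET key{key:05d}")
```

The papers mentioned above make exactly this point: choices like the skew parameter or the arrival process can change an experiment’s conclusions, which is why real traces matter so much.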

The next few years will likely be very exciting in this field, as academia and industry are both facing real problems and working hard to solve them. Any sharing of information and experience between researchers and industry engineers will doubtless play a big role in making progress.