
Difficulties in data analysis


In the course of my research, I tend to do a fair amount of data analysis and reduction. This ranges from simple statistics to in-depth examination of traces. While working on the camera-ready for our NSDI paper, I found myself thinking about Vern Paxson’s IMC paper on Strategies for Sound Internet Measurement (PDF). The paper is filled with good material, and I long for the kinds of generalized tools he describes for managing an analysis workflow.

A typical analysis might go something like this:

  • Track down or collect some data.
  • Store it someplace accessible: a local directory or SQL database.
  • Hack up some script or program to grab some numbers.
  • Throw together some plots in gnuplot (see the quick sketch after this list).
  • Discover something odd about the data, resulting perhaps in a change to the collection or analysis programs.
  • Repeat, as your data and analysis tools grow in complexity…
  • Assemble your results into a paper.
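
To make the middle steps concrete, here is a minimal sketch of the kind of throwaway script I have in mind, in Python: it reads a hypothetical trace file, computes a few percentiles, and writes a two-column data file for gnuplot. The file names, record format, and column names are all invented.

```python
#!/usr/bin/env python
# Hypothetical sketch of the "hack up a script" step: read raw trace
# records, compute a few percentiles, and dump columns gnuplot can plot.
# File names and the record format are made up for illustration.
import csv
import statistics
import sys


def main(raw_path="traces/run-01.csv", out_path="plots/latency.dat"):
    latencies = []
    with open(raw_path, newline="") as f:
        for row in csv.DictReader(f):
            latencies.append(float(row["latency_ms"]))

    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    with open(out_path, "w") as out:
        for p in (50, 90, 99):
            out.write(f"{p} {cuts[p - 1]:.3f}\n")  # two columns: percentile, value
    print(f"wrote {out_path} from {len(latencies)} records", file=sys.stderr)


if __name__ == "__main__":
    main()
```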

If you’re disciplined, you’ll be writing down what you do and why in a notebook. You’ll learn never to manually edit any data file to fix problems (because six months from now, you won’t have a clue what edits you made and why). Hopefully, you are storing your source code in a version control system like CVS or darcs.

Section 4 of Vern’s paper talks about this in some detail, recommending that all of your analysis flow from data through to results, mediated by a single master script and with plenty of caching to speed up repeated runs. This is an excellent idea, but so far it involves a fair amount of hacking on your own, tailored to your particular problems.
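
As a rough illustration of the idea, here is a sketch of such a master script in Python: it drives two hypothetical stages (a reduction script and a gnuplot run) and skips any stage whose output is already newer than all of its inputs. The stage names, commands, and paths are invented, and a real version would track far more dependencies.

```python
#!/usr/bin/env python
# Sketch of a single master script in the spirit of Vern's suggestion:
# every result flows from data through named stages, and a stage is
# skipped when its output is newer than all of its inputs (make-style
# caching). Stage names, commands, and paths are hypothetical.
import os
import subprocess


def stale(output, inputs):
    """True if `output` is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)


def run_stage(name, cmd, output, inputs):
    if stale(output, inputs):
        print(f"[{name}] rebuilding {output}")
        subprocess.run(cmd, check=True)
    else:
        print(f"[{name}] up to date, skipping")


def main():
    run_stage("reduce",
              ["python", "reduce.py", "traces/run-01.csv", "results/summary.dat"],
              output="results/summary.dat",
              inputs=["traces/run-01.csv", "reduce.py"])
    run_stage("plot",
              ["gnuplot", "plots/latency.gp"],
              output="plots/latency.eps",
              inputs=["results/summary.dat", "plots/latency.gp"])


if __name__ == "__main__":
    main()
```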

And there are still problems. For example:

  • Which version of the tools did I use to generate a particular set of results? I didn’t have any uncommitted changes in my scripts when I ran it, did I? (A small provenance sketch follows this list.)
  • Datasets can get pretty large. You probably don’t want to include them in the version control repository you are using for your actual paper; how do you relate versions of the paper to versions of your analysis toolkit and results?
  • What if I am preparing my camera-ready copy and run some updated scripts? How will I remember to update all the things derived from the old results? (Even supposing that the conclusions didn’t change much.)
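
For the first of these, one partial fix is to have the master script stamp every batch of results with the repository revision and warn about uncommitted changes. Here is a minimal sketch; it uses git commands for concreteness, though the same idea applies to CVS or darcs, and the output path is invented.

```python
#!/usr/bin/env python
# Sketch of recording "which version of the tools produced these results":
# write the repository revision next to the results and warn if the
# working copy has uncommitted changes. Uses git for concreteness; the
# PROVENANCE path is invented.
import subprocess
import sys


def repo_state():
    rev = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True, check=True).stdout.strip()
    status = subprocess.run(["git", "status", "--porcelain"],
                            capture_output=True, text=True, check=True).stdout
    return rev, bool(status.strip())


def main():
    rev, dirty = repo_state()
    if dirty:
        print("warning: uncommitted changes in the analysis scripts", file=sys.stderr)
    with open("results/PROVENANCE", "w") as f:
        f.write(f"revision: {rev}\n")
        f.write(f"uncommitted-changes: {dirty}\n")


if __name__ == "__main__":
    main()
```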

I can see some hope for solving these problems. Systems and network researchers tend to work in Unix environments that have the tools and infrastructure available to make this easy; I wouldn’t have a clue how to manage a complex paper with multiple authors using MS Word and Excel, though perhaps it could be done. I’ve heard of nice hacks: for example, for one Roofnet paper, the authors set up the Makefiles to query a centralized SQL database in order to generate graphs. Nick Feamster and David Andersen have started a project called the Datapository (paper) that works in this general direction.
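
As a rough sketch of that style of hack (with a local SQLite file standing in for the centralized database, and an invented schema and file names; the Roofnet setup itself was driven from Makefiles), a script like the following could regenerate a gnuplot data file straight from the measurement database whenever the paper is rebuilt:

```python
#!/usr/bin/env python
# Sketch of "graphs pull their data straight from the database": query
# the measurement database and emit a data file for gnuplot, so that
# rebuilding the paper regenerates the figures from the current data.
# SQLite, the schema, and the file names are stand-ins for illustration.
import sqlite3


def export_throughput(db_path="measurements.db", out_path="plots/throughput.dat"):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT hop_count, AVG(throughput_kbps) "
        "FROM links GROUP BY hop_count ORDER BY hop_count")
    with open(out_path, "w") as out:
        for hops, avg_tput in rows:
            out.write(f"{hops} {avg_tput:.1f}\n")
    conn.close()


if __name__ == "__main__":
    export_throughput()
```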

However, I’m not aware of any pre-built, general toolkit or framework for hooking tools together to simplify data analysis. But there must be hundreds of little hacks people have developed to get things done. What data analysis and paper-writing toolchain have you built that you are most proud of? What was the biggest disaster? Send me e-mail or post a comment. I have some thoughts of my own on how to build such a system, which I’ll try to post soon.