Thoughts on Systems

Emil Sit

Nov 3, 2004 - 10 minute read - Research conferences

Internet Measurement Conference 2004

Some notes from the technical sessions at the 2004 Internet Measurement Conference; the conference was fun, not only for the good papers that I saw, but also for the food and travel opportunity.

Talks that I particularly liked:

  • Walter Willinger’s talk on “A pragmatic approach to dealing with high-variability in network measurements”.
  • Darryl Veitch’s talk about “Robust Synchronization of Software Clocks Across the Internet”.
  • Vern Paxson’s “Strategies for Sound Internet Measurement”.

Jeffrey Pang: Availability, Usage, and Deployment of DNS

I thought it was interesting that PlanetLab showed up on the very first slide of the very first talk of the conference. Jeff’s related work slide made me think that I needed to read a SIGCOMM paper about DNS robustness.

I think there was interesting material in this talk, but it was hard to figure out what was going on. They analyzed many different data sets: some raw data, some filtered raw data, and some active measurements derived from raw data. Different methodologies were used for different parts as well. The amount of data presented made it hard for me to immediately believe the conclusions presented on the slides. The paper also wandered away from strict measurement or characterization and talked for a bit about the implications of their study for “federated systems” such as DHTs. That seemed a little strange.

David Malone: Hints or Slaves?

An interesting question: would massively replicating the root/TLD zone prevent bogus queries to the root?

I wondered why he didn’t directly analyze one trace to calculate the answer; instead, he took two traces with different configurations and looked at the difference. The talk hardly presented any numbers, though.

He also observed a large number of IPv6 queries in his trace, similar to what Jaeyeon and I observed.

Shaikh: Responsiveness of DNS-based network control

More data from Akamai, courtesy of Bruce Maggs. In particular, data for major sports news hosts, where they measured specific changes of A and NS records. They observed that many clients appear to ignore TTLs on A and NS records, often holding on to them for hours after they should have expired.
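
To make that concrete, here’s a toy Python sketch of how one might flag TTL violators from server-side logs; the log format and every name here are my invention, not anything from the paper.

    from collections import defaultdict

    # Hypothetical input: (timestamp, client, value) tuples observed after an
    # A record changed away from `old_value` at `change_time` with TTL `ttl`.
    def stale_clients(requests, old_value, change_time, ttl, slack=3600):
        """Flag clients still using the old record well past its expiry."""
        stale = defaultdict(list)
        expiry = change_time + ttl
        for ts, client, value in requests:
            if value == old_value and ts > expiry + slack:
                stale[client].append(ts - expiry)  # seconds past expiry
        return stale

    requests = [
        (1000, "10.0.0.1", "203.0.113.5"),   # before the change: fine
        (90000, "10.0.0.2", "203.0.113.5"),  # hours after expiry: stale
    ]
    print(stale_clients(requests, "203.0.113.5", change_time=2000, ttl=300))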

Vinod Yegneswaran: Characteristics of Internet Background Radiation

Measurements from three different networks, using an iSink (a new stateless, Click-based network listener/responder; see RAID 2004) and also honeyd. Vinod talked about the tricks they used to discern different kinds of attacks on their networks. Their measurements show that attack behavior varies across networks.

Bruce Maggs: An analysis of live streaming workloads

Here we got to learn a little bit about Akamai’s streaming network, which was of some interest to me given my work at Cisco (née SightPath). Akamai uses approximately 1000 edge servers, and it sounded like each could serve about 70 clients. They have a simple distribution-tree system for live streaming. Each client-facing machine speaks only one protocol (WMV, QT, Real) at a time.

Some interesting observations such as:

  • Event popularity follows a bimodal Zipf distribution.
  • Session durations are heavy-tailed.
  • Many clients now use TCP transport; almost as many as use UDP.
  • Significant client diversity (>12 time zones) and varying lifetimes; in particular, 10-20% of each day’s users are new.

Steve Jobs is a big user of their network, biannually making an on-line speech to the faithful.

Joel Sommers: Harpoon and MACE

Joel gave two talks about traffic generation; from hanging out with him later, it became clear that he knows a lot about NetFlow.

Harpoon is designed to reproduce specific features of measured traffic; it self-configures based on NetFlow data and is thus application independent. It models file transfer sessions but leaves TCP dynamics to the test-bed: you’ll get the same sort of transfers executed, but how they behave will depend on your test-bed.
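
My guess at the self-configuration idea, as a toy Python sketch (the flow-record format and all names here are invented; this is not Harpoon’s code): distill NetFlow records into empirical distributions, then drive the generator by sampling from them.

    import random

    flows = [  # hypothetical NetFlow summary: (bytes, inter-arrival seconds)
        (12000, 0.5), (800, 2.0), (450000, 0.1), (3000, 1.2), (95000, 0.8),
    ]
    sizes = [b for b, _ in flows]
    gaps = [g for _, g in flows]

    def next_transfer():
        """Sample one transfer request; TCP dynamics are left to the test-bed."""
        return random.choice(sizes), random.choice(gaps)

    for _ in range(3):
        nbytes, wait = next_transfer()
        print(f"transfer {nbytes} bytes, then idle {wait}s")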

MACE generates malicious traffic. It uses a Python-based language to allow users to specify different kinds of attacks or traffic to generate.

Walter Willinger: Pragmatic Approach to Dealing with High Variability

This talk seemed ideal for someone (like me) who observes heavy-tailed traffic but doesn’t really have a good formal basis for understanding it. One thing he said was “a straight line on a log-log plot isn’t enough proof”. Hmm!

His basic theme was to take some lessons from Mandelbrot and to distinguish approximately right from certifiably wrong. To do that you could:

  • Seek internal consistency in your model:
    • Borrow strength from a large data set by ensuring your model works at different sampling levels of your data, e.g., increasingly large samples.
    • You want the parameters of your model to converge and the confidence intervals to nest (see the sketch below).
  • Seek external verification of your model:
    • Look across layers to make sure that what you get makes sense (e.g. IP -> TCP -> HTTP sessions)

The paper also has formal definitions of high variability, heavy-tailed, etc. A good overall reference that was easy to follow, and I should do the work to understand it in depth.
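
As a toy illustration of the internal-consistency check (entirely my own construction, not the paper’s): fit a tail-index estimator on nested, increasingly large samples and check that the estimates converge rather than drift.

    import math, random

    def hill_estimator(data, k):
        """Hill estimate of the tail index from the k largest observations."""
        top = sorted(data, reverse=True)[: k + 1]
        return k / sum(math.log(x / top[k]) for x in top[:k])

    random.seed(1)
    population = [random.paretovariate(1.5) for _ in range(100_000)]  # true index 1.5

    # Estimates on nested samples should settle down near the true value.
    for n in (1_000, 10_000, 100_000):
        sample = population[:n]
        print(n, round(hill_estimator(sample, k=n // 20), 2))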

Darryl Veitch: Robust Synchronization of SW Clocks

Darryl presented a talk with lots of technical graphs that I didn’t understand, yet I was very excited about it. The main observation of the work is that NTP was designed in a world where local clocks were not very good, and hence a master was needed to instruct local clocks what to think. Now you can build an accurate local clock by looking at the TSC register, which they show is fairly stable. This is also useful for measuring differences in time, which can be hard to do precisely with NTP, since NTP will slew the clock.

The paper (and his talk) presents algorithms for maintaining a difference clock (easily derived from the difference in TSC values and the known tick period p) and an absolute clock, which combines the difference clock with a known offset. The former is easier to maintain in the absence of external synchronization; the latter requires a nearby NTP source to limit error.
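
A minimal sketch of the two clocks as I understood them, in Python with made-up numbers for the tick period and offset:

    # Assumed values: p from calibrating a 2.4 GHz counter, K from NTP.
    p = 1.0 / 2.4e9   # seconds per TSC tick
    K = 1.099e9       # offset mapping TSC time to UTC seconds

    def elapsed(tsc_start, tsc_end):
        """Difference clock: robust even without external synchronization."""
        return (tsc_end - tsc_start) * p

    def absolute(tsc):
        """Absolute clock: the difference clock plus an NTP-disciplined offset."""
        return tsc * p + K

    t0, t1 = 1_000_000, 1_240_000_000
    print(f"elapsed: {elapsed(t0, t1):.6f} s")
    print(f"wall clock at t1: {absolute(t1):.6f} s since the epoch")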

G. Varenni: Toward 10Gbps with Commodity Hardware

The key idea of this work is that current systems process incoming packets sequentially and are thus unable to take advantage of today’s multiprocessor machines. They propose an architecture involving a ring buffer, a monitor, and a scheduler to multiplex packets to different user-level applications efficiently.
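
My loose reading of that architecture, as a toy Python sketch (all details here are invented): a shared ring buffer absorbs captured packets, and a scheduler hands them out to per-application consumers rather than each application walking the packets sequentially.

    from collections import deque

    RING_SIZE = 8
    ring = deque(maxlen=RING_SIZE)        # ring buffer of captured packets
    apps = {"flowstats": [], "ids": []}   # user-level consumers

    def capture(pkt):
        ring.append(pkt)

    def schedule():
        """Round-robin the buffered packets across consumers."""
        names = list(apps)
        for i, pkt in enumerate(ring):
            apps[names[i % len(names)]].append(pkt)
        ring.clear()

    for n in range(6):
        capture(f"pkt{n}")
    schedule()
    print(apps)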

David Maltz: Structure Preserving Router Config anonymization

This is a tool, based on a list of safe words and some 26-odd hand-tuned heuristics, for taking Cisco router configurations and producing anonymized versions. Numbers and strings such as IP addresses and AS numbers are anonymized. A large number (>20k) of AT&T router configs, spanning some 200 different versions of Cisco IOS, have been processed with this tool.
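
Here’s a much-simplified Python sketch of the idea (mine, not their tool): rewrite each address consistently, so structure such as shared addresses is preserved, while leaving safe words alone.

    import re

    SAFE_WORDS = {"0.0.0.0", "255.255.255.255"}  # assumed safe-word list
    mapping = {}

    def anonymize_ip(match):
        ip = match.group(0)
        if ip in SAFE_WORDS:
            return ip
        if ip not in mapping:
            n = len(mapping) + 1
            mapping[ip] = f"10.0.{n // 256}.{n % 256}"  # same IP, same pseudonym
        return mapping[ip]

    config = ("interface Serial0\n"
              " ip address 192.0.2.7 255.255.255.255\n"
              " ip ospf neighbor 192.0.2.7\n")
    print(re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", anonymize_ip, config))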

The goal is to allow companies to feel safe releasing router configurations to the research community so that we can figure out how routers are really being used.

H. Dreger: Packet Trace Manipulation Framework for Test Labs

Sounded like a set of tools to extract flows, merge traces, and time-dilate/compress packet traces for replaying onto test networks. There are some tricks involved in avoiding artifacts (such as exceeding real link capacities) that these operations can introduce.
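
A rough sketch of the time-dilation step as I imagine it (my code, not the framework’s): scale inter-packet gaps by a factor, then sanity-check the resulting rate against the replay link’s capacity.

    LINK_BPS = 100_000_000  # assumed 100 Mbit/s replay link

    def dilate(trace, factor):
        """trace: list of (timestamp_s, length_bytes); factor < 1 compresses."""
        t0 = trace[0][0]
        out = [(t0 + (ts - t0) * factor, ln) for ts, ln in trace]
        for (t_prev, _), (t_next, ln) in zip(out, out[1:]):
            gap = t_next - t_prev
            if gap > 0 and ln * 8 / gap > LINK_BPS:
                raise ValueError(f"trace exceeds link capacity at t={t_next}")
        return out

    trace = [(0.0, 1500), (0.01, 1500), (0.02, 1500)]
    print(dilate(trace, factor=0.5))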

Vladimir Brik: Debugging DHCP Performance

A tool that monitors DHCP servers, mirroring their state internally and observing client behavior, to find problems and potential optimizations.

R. Kompella: Scalable Attack Detection

If only we could detect attacks deep in the core of the network, we could do something about them… well, it’s hard to detect them at high rates. This work proposes calculating the ratio of TCP SYN to FIN packets to observe something that looks like an attack.
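
A bare-bones sketch of that idea in Python (my simplification): in benign TCP traffic SYNs and FINs roughly balance, so a SYN:FIN ratio far above one over a measurement window hints at a SYN flood.

    from collections import Counter

    def syn_fin_ratio(packets):
        """packets: iterable of TCP flag strings, e.g. 'S', 'F', 'SA'."""
        c = Counter(packets)
        syns = sum(v for k, v in c.items() if "S" in k and "A" not in k)
        fins = sum(v for k, v in c.items() if "F" in k)
        return syns / max(fins, 1)

    window = ["S"] * 950 + ["F"] * 30 + ["SA"] * 20
    ratio = syn_fin_ratio(window)
    print(f"SYN:FIN ratio = {ratio:.1f}", "-> possible attack" if ratio > 3 else "")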

Anukool Lakhina: Characterization of network-wide traffic flows

The interesting part of this talk (to me) was the subspace method used to identify anomalies; unfortunately, that work was presented at SIGCOMM, which I didn’t see. The talk actually focused on some interesting results that came from using the method. They had access to a large traffic matrix and broke it down by flows, bytes, and packets per OD pair. They were able to see shifts in traffic, among other things.

Their tool identified the presence of anomalies, but they hand-inspected all of the results.
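
Here’s a compressed sketch of the subspace method as I understand it, on synthetic data (not their code or data): model “normal” traffic with the top principal component(s) of the OD-flow matrix, then flag timesteps whose residual energy outside that subspace is unusually large.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(200)
    diurnal = 50 + 10 * np.sin(2 * np.pi * t / 100)    # shared daily pattern
    X = np.outer(diurnal, rng.uniform(0.5, 1.5, 20))   # 200 steps x 20 OD pairs
    X += rng.normal(0, 1, X.shape)                     # measurement noise
    X[150, 3] += 40                                    # injected anomaly

    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:1].T                     # "normal" subspace (a few PCs in practice)
    residual = Xc - Xc @ P @ P.T     # traffic outside the normal subspace
    energy = (residual ** 2).sum(axis=1)

    threshold = energy.mean() + 3 * energy.std()
    print("anomalous timesteps:", np.where(energy > threshold)[0])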

Robert Schweller: Reversible sketches

Sketches (k-ary sketches) have been proposed as a way to find interesting events compactly, but there isn’t a good way to map from the sketch (and an identified event) back to the host(s) that caused it. Robbie presented a technique using modular hashing (i.e., hashing byte by byte) to reduce the complexity of reversing the sketch and recovering the IP address. To preserve good hashing properties, an IP-mangling function (a bijection that shuffles the bits of the IP address) is needed.
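
A toy illustration of modular hashing in Python (my own, and it omits the mangling step): because each octet is hashed independently, reversing a bucket means intersecting four small per-octet candidate sets rather than searching all 2^32 addresses.

    import random

    random.seed(0)
    TABLES = [[random.randrange(16) for _ in range(256)] for _ in range(4)]

    def modular_hash(ip):
        """Map a dotted-quad IP to a 16-bit bucket, 4 bits per octet."""
        h = 0
        for table, octet in zip(TABLES, map(int, ip.split("."))):
            h = (h << 4) | table[octet]
        return h

    def preimages(bucket):
        """Reversing step: per-octet candidate sets for a given bucket."""
        nibbles = [(bucket >> s) & 0xF for s in (12, 8, 4, 0)]
        return [[o for o in range(256) if t[o] == n]
                for t, n in zip(TABLES, nibbles)]

    b = modular_hash("192.0.2.7")
    cands = preimages(b)
    print(b, [len(c) for c in cands])   # four small sets, ~16 octets each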

Vern Paxson: Sound Internet Measurement

This is a talk that I wish I had gotten when I started working on DNS analysis in Hari’s class. Vern gave all of us advice on what to be aware of and tips for achieving sound measurement.

Mostly, it boils down to discipline.

You should:

  • Keep meta-data. Lots of meta-data, more than you think you need, so that your future self will be able to know what you did in the past. For example, what filter was used with tcpdump? That’s not recorded in the trace itself and can affect conclusions that you draw.
  • Check your methodology with someone else, before you start. Better they tell you what to fix than the reviewers of your paper. Are you measuring what you want?
  • Calibrate your measurements. Understand the outliers.
  • Automate all production of output from raw data. This ensures that you can reproduce something later when you are working on the camera-ready — you’ll have lost all your mental context but hopefully, it’ll be in your source code. Also, cache intermediate results: this gives you performance normally but lets you be sure of your results by blowing away your caches and starting over. (A sketch of this follows the list.)
  • Have a way to visualize changes in your results after your scripts change. (This seems hard.)
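
Here’s my own small take on the automate-and-cache advice in Python: memoize each pipeline stage to disk, keyed by the function and its inputs, so that deleting the cache directory forces a full, trustworthy re-run from raw data.

    import hashlib, os, pickle

    def cached(fn):
        def wrapper(*args):
            key = hashlib.sha1(repr((fn.__name__, args)).encode()).hexdigest()
            path = os.path.join("cache", key)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fn(*args)
            os.makedirs("cache", exist_ok=True)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper

    @cached
    def parse_trace(filename):
        print("parsing", filename)   # expensive step, runs once per input
        return {"packets": 12345}    # stand-in for real analysis output

    print(parse_trace("trace.pcap"))
    print(parse_trace("trace.pcap"))  # second call is served from the cache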

Vern gave some other examples too.

Estimating Interdomain Web Traffic

This talk is not really about the amount of HTTP traffic that transits between Level 3 and BBN; it’s about traffic from end-points (e.g., people clicking on links) to publishers.

More tricks using Akamai data. With logs covering three 2-hour periods from (almost) all of Akamai, and complete traces from the border of a large (German) organization, they can calculate the ratio of data served by Akamai (per Akamai customer) to data served by each Akamai customer directly. Assuming the ratio holds for other sites, they can use Akamai’s logs of what data it served to estimate the amount of traffic from origin servers to client machines.
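
A back-of-the-envelope version of the estimate, with invented numbers:

    # Per Akamai customer, bytes seen at the instrumented org's border:
    site = {"sports-news.example": {"via_akamai": 4e9, "via_origin": 1e9}}
    # Bytes served by Akamai overall for that customer (from Akamai's logs):
    akamai_global = {"sports-news.example": 2e12}

    for customer, seen in site.items():
        ratio = seen["via_origin"] / seen["via_akamai"]   # 0.25 here
        estimate = akamai_global[customer] * ratio
        print(f"{customer}: ~{estimate:.2e} bytes served directly by the origin")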

They need more calibration, i.e. confirmation that the ratios hold at other sites.

A. Medina: Interactions between Transport Protocols and Middleboxes

An interesting examination of which servers seem to support SACK, IP options, ECN, path MTU discovery, etc. They use a prober that crafts special packets, sends them to a lot of servers, and observes the results.
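
My guess at the flavor of such a prober, sketched with scapy (this is not their tool, and the target address and port are placeholders). For example, testing ECN support:

    from scapy.all import IP, TCP, sr1

    def probe_ecn(host):
        syn = IP(dst=host) / TCP(dport=80, flags="SEC")  # SYN + ECE + CWR
        reply = sr1(syn, timeout=3, verbose=False)
        if reply is None or TCP not in reply:
            return "no TCP answer (possibly filtered)"
        if reply[TCP].flags & 0x40:                      # ECE set in SYN-ACK
            return "ECN negotiated"
        return "plain SYN-ACK (no ECN)"

    print(probe_ecn("192.0.2.10"))  # placeholder; needs raw-socket privileges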

Lots of things were presented, but the big take-home message is that many places filter or drop ICMP and unknown IP options. So there’s not much hope for deploying cool tricks that use IP options or ICMP.

K. K. Ramakrishnan: Characterizing IP VPNs

At some point, AT&T moved from forcing the customer to characterize the precise needs of their VPNs (which made provisioning easy) to making things easy for the customer but hard to provision. How to provision, then? They try to estimate something per VPN, solve some equations, and figure out where the traffic is going. It seems to work better for large or busy links.

Jussara Almeida: Characterizing Spam

As part of a greater effort to understand spam traffic and thus be able to build better tools to detect and stop spam, they analyzed a lot of mail at their ingress SMTP point.

I found it amazing (given the work Jaeyeon and I are doing) that they don’t have the IP address of the sender; instead, they characterize based on the domain of the From address. Also, they focus (with little choice, I suppose) on things that are caught by SpamAssassin. And if mail was blacklisted at the ingress, they don’t analyze it either.

With that in mind: spam has less diurnal variability than non-spam, and it appears that 50% of spam is sent by 3% of spam-only senders.