Thoughts on Systems

Emil Sit

Dec 21, 2005 - 4 minute read - Hacking hosting

Colophon, the beginning

I’ve been thinking about having a blog for a while and now here I have one with my own domain name running on outsourced hosting. That doesn’t really seem like me–I usually like to have more control–but I think Dan Sandler has really summarized the whole argument: it’s a lot of work to have featureful blog software and if someone else has done it, well, that’s less time spent building scaffolding.

This blog is powered by Typo, blog software that supports multiple XML-RPC APIs, can store data in multiple database backends, and seems pretty easy to theme. I came across it while searching for non-PHP blog software that supported SQLite. While trying to learn a little bit about it, I entered the Typo Theme Contest and wound up being one of the first ten entrants, scoring myself a free year of hosting. A few rounds with GoDaddy and here we are!

Some early impressions of the outsourced hosting process:

  • PlanetArgon’s sign-up process could use some more documentation. It wasn’t immediately clear to me whether or not they would handle acquiring a domain name for me (they didn’t) and what the management of those domains might be like.
  • GoDaddy’s ordering process and management UI are not great for beginners.
    • They advertised a free proxy feature (for privacy) when registering three domains but it didn’t automatically get added to my cart. (I’ve decided not to try and argue or fight about this.)
    • The normal search process for domains does not seem to allow adding multiple domain names to your cart without going through and entering some customer information. In contrast, their quick-search feature does allow you to just add a name to the cart. This made things a bit confusing when I was trying to sign up for multiple names.
    • The management UI has no fewer than three ways to get at different features: the top, green nav bar, which includes links to both product descriptions and management options via dropdowns; the white nav bar, which has no dropdowns despite having the same visual arrow cue as the green one; and a text-based list on the bottom third of the first page. I could write more about the inconsistencies of the UI, but I was able to figure out how to change my nameservers and such to point to PlanetArgon’s servers. What amazed me was that the time between purchasing the names and having them show up in the root zone was basically instantaneous.
  • It’s nice not to have to worry about maintaining my own hardware and software but perhaps next time I should go for a Xen-based hosting solution. If the machines don’t have some tool I want (e.g. darcs or adnsresfilter), I have to ask someone instead of just running apt-get or portinstall. And while having IRC access to PlanetArgon’s staff during business hours is convenient, it’d also be nice if they had a support ticket system.
  • The boundaries between what the hosting staff will do and what I will do inside my hosting domain are not entirely clear to me. I haven’t asked yet, but I don’t know whether they will handle upgrading Typo (which appears to be from svn), especially if I customize the themes and other stuff. I don’t think I have the bits to change the top level of my site to be something other than Typo.
  • Does VHCS2 (PlanetArgon’s management system) offer a non-FTP method of getting a dump of my account’s files?

It’s all part of the learning process. We’ll see what happens.

Update: I ran some experiments overnight to learn a bit more about backups. The nightly backup runs at around 2am ET and appears to tar the contents of the virtual host directory (not the entire home directory), excluding the backups and logs directories. The script does not delete other files in the backups directory but only keeps a single nightly backup. The MySQL database does not appear to be backed up in a user accessible way. You don’t need to use FTP (though they do run ProFTPd); sftp works fine.

Nov 3, 2004 - 10 minute read - Research conferences

Internet Measurement Conference 2004

Some notes from the technical sessions at the 2004 Internet Measurement Conference; the conference was fun, not only for the good papers that I saw, but also for the food and travel opportunity.

Talks that I particularly liked:

  • Walter Willinger’s talk on “A pragmatic approach to dealing with high-variability in network measurements”.
  • Darryl Veitch’s talk about “Robust Synchronization of Software Clocks Across the Internet”.
  • Vern Paxson’s “Strategies for Sound Internet Measurement”

Jeffrey Pang: Availability, Usage, and Deployment of DNS

I thought it was interesting that PlanetLab showed up on the very first slide of the very first talk of the conference. Jeff’s related work slide made me think that I needed to read a SIGCOMM paper about DNS robustness.

I think there was interesting material in this talk, but it was hard to figure out what was going on. They analyzed many different data sets: some raw data, some filtered raw data, and some active measurements derived from raw data. Different methodologies were used for different parts as well. The amount of data presented made it hard for me to immediately believe the conclusions presented on the slides. The paper also wandered away from strict measurement or characterization and talked for a bit about the impact of their study on “federated systems” such as DHTs. That seemed a little strange.

David Malone: Hints or Slaves?

An interesting question: would massively replicating the root/TLD zone prevent bogus queries to the root?

I wondered why he didn’t directly analyze one trace to calculate the answer; instead he took two traces with different configurations and looked at the difference. The talk hardly presented any numbers, though.

He also observed a large number of IPv6 queries in his trace, similar to what Jaeyeon and I observed.

Shaikh: Responsiveness of DNS-based network control

More data from Akamai, courtesy of Bruce Maggs. In particular, data for major sports news hosts, where they measured specific changes of A and NS records. They observed that many clients appear to ignore TTLs on A and NS records, often holding on to them for hours after they should have expired.

Vinod Yegneswaran: Characteristics of Internet Background Radiation

Measurements from three different networks, using iSink (a new stateless Click-based network listener/responder; see RAID 2004) and also honeyd. Vinod talked about the tricks they used to discern different kinds of attacks on their networks. Their measurements showed that attack behavior varies across networks.

Bruce Maggs: An analysis of live streaming workloads

Here we got to learn a little bit about Akamai’s streaming network, which was of some interest to me following my work at Cisco (née SightPath). Akamai uses approximately 1000 edge servers and it sounded like they could serve about 70 clients on each. They have a simple distribution tree system for live streaming. Each client-facing machine speaks only one protocol (WMV, QT, Real) at a time.

Some interesting observations such as:

  • Events are bimodal Zipf.
  • Session durations are heavy-tailed.
  • Many clients now use TCP transport; almost as many as use UDP.
  • Significant client diversity (>12 timezones) and varying lifetimes. In particular, more than 10-20% new users per day.

Steve Jobs is a big user of their network, biannually making an on-line speech to the faithful.

Joel Sommers: Harpoon and MACE

Joel gave two talks about traffic generation; from hanging out with him later, it became clear that he knows a lot about NetFlow.

Harpoon is designed to reproduce specific features of measured traffic, self-configuring based on NetFlow data and is thus app independent. It models file transfer sessions but leaves TCP dynamics to the test-bed: you’ll get the same sort of transfers executed but how it behaves will depend on your test-bed.

MACE generates malicious traffic. It uses a Python-based language to allow users to specify different kinds of attacks or traffic to generate.

Walter Willinger: Pragmatic Approach to Dealing with High Variability

This talk seemed ideal for someone (like me) who observes heavy-tailed traffic but doesn’t really have a good formal basis for understanding it. One thing he said was that “a straight line on a log-log plot isn’t enough proof”. Hmm!

His basic theme was to take some lessons from Mandelbrot and to distinguish approximately right from certifiably wrong. To do that you could:

  • Seek internal consistency in your model:
    • Borrow strength from a large data set by ensuring your model works at different sampling levels of your data. e.g. increasingly larger samples.
    • You want the parameters of your model to converge and the confidence intervals to nest.
  • Seek external verification of your model:
    • Look across layers to make sure that what you get makes sense (e.g. IP -> TCP -> HTTP sessions)
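The internal-consistency checks above can be sketched concretely. The snippet below is an illustration of my own, not from the talk: it fits a Pareto tail (xmin = 1) to increasingly large samples using the standard maximum-likelihood exponent estimator and watches whether the estimate converges — the kind of "borrowing strength" check Willinger described.

```python
import math
import random

def pareto_alpha_mle(xs, xmin=1.0):
    """Maximum-likelihood estimate of the Pareto tail exponent alpha."""
    return len(xs) / sum(math.log(x / xmin) for x in xs)

# Synthetic heavy-tailed data with a known exponent, so we can see convergence.
rng = random.Random(0)
data = [rng.paretovariate(1.5) for _ in range(100_000)]  # true alpha = 1.5

# Fit the same model at different sampling levels; the estimates should
# converge (and, with confidence intervals, those intervals should nest).
for n in (1_000, 10_000, 100_000):
    est = pareto_alpha_mle(data[:n])
    print(f"n={n:>7}: alpha estimate {est:.3f}")
```

With real measurements you would not know the true exponent, of course; divergence of the estimates as the sample grows is the warning sign that the model is "certifiably wrong".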

The paper also has formal definitions of high variability, heavy-tailed, etc. A good overall reference that was easy to follow, and I should do the work to understand it in depth.

Darryl Veitch: Robust Synchronization of SW Clocks

Darryl presented a talk with lots of technical graphs that I didn’t understand but yet was very excited about. The main observation of the work is that NTP was designed in a world where local clocks were not very good, and hence a master was needed to instruct local clocks what to think. Now, you can build an accurate local clock by looking at the TSC register, which they show is fairly stable. This is also useful for measuring differences in time, which can be hard to do precisely when using NTP since NTP will slew the clock.

The paper (and his talk) presents algorithms for maintaining a difference clock (easily derived from the difference in TSC values and the known period p), and an absolute clock, which combines the difference clock with a known offset. The former is easier to maintain in the absence of external synchronization; the latter requires a close NTP source in order to limit error.
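The arithmetic behind the two clocks is simple; the hard part of the paper is estimating the period and offset robustly. A minimal sketch, with made-up values for the TSC period p and the NTP-derived offset K:

```python
# Hypothetical constants: the real algorithms estimate these robustly
# from NTP exchanges rather than assuming them.
P = 1.0 / 2.4e9   # seconds per TSC tick, assuming a 2.4 GHz counter
K = 1.7e9         # absolute offset in seconds (would come from NTP)

def difference_clock(tsc_a, tsc_b):
    """Elapsed time between two raw TSC readings; only the period P is needed."""
    return (tsc_b - tsc_a) * P

def absolute_clock(tsc):
    """Wall-clock time: the difference clock combined with the offset K."""
    return tsc * P + K

# Measuring an interval needs no external synchronization at all:
interval = difference_clock(1_000_000, 1_002_400)  # 2400 ticks elapsed
```

This also shows why the difference clock is the easy case: an error in K cancels out of any interval measurement, whereas the absolute clock inherits it directly.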

G. Varenni: Toward 10Gbps with Commodity Hardware

The key idea of this work was that current systems process incoming packets sequentially and are thus unable to take advantage of today’s multiprocessor machines. They propose an architecture involving a ring buffer, a monitor, and a scheduler to multiplex packets to different user-level apps efficiently.

David Maltz: Structure Preserving Router Config anonymization

This is a tool — based on a list of safe words and a set of 26-odd hand-tuned heuristics — for taking Cisco router configurations and producing anonymized versions. Numbers and strings such as IP addresses or AS numbers are anonymized. A large number (> 20k) of AT&T router configs, spanning some 200 different versions of Cisco IOS, have been processed using this tool.

The goal is to allow companies to feel safe releasing router configurations to the research community so that we can figure out how routers are really being used.

H. Dreger: Packet Trace Manipulation Framework for Test Labs

Sounded like a set of tools to extract flows, merge traces, time-dilate/compress packet traces for replaying onto test networks. There are some tricks involved in avoiding artifacts (such as exceeding real link capacities) as a result of these operations.

Vladimir Brik: Debugging DHCP Performance

A tool that monitors DHCP servers, mirroring their state internally and observing client behavior, in order to find problems and potential optimizations.

R. Kompella: Scalable Attack Detection

If only we could detect attacks deep in the core of the network, we could do something… well, it’s hard to detect it at high rates. This work proposes calculating the ratio of TCP SYN/FIN packets to observe something that looks like an attack.
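The SYN/FIN ratio idea can be illustrated with a toy sketch (mine, not the paper's mechanism — the paper's contribution is doing this scalably at core rates). In normal TCP traffic each connection contributes roughly one SYN and one FIN, so a destination whose ratio diverges far above 1 looks like a SYN-flood victim. The record format and thresholds below are illustrative assumptions.

```python
from collections import defaultdict

def syn_fin_suspects(packets, threshold=3.0, min_syns=100):
    """Flag destinations whose SYN/FIN ratio suggests an attack.

    packets: iterable of (dst_ip, tcp_flags) tuples, e.g. ("10.0.0.1", {"SYN"}).
    threshold and min_syns are made-up illustrative parameters.
    """
    syns = defaultdict(int)
    fins = defaultdict(int)
    for dst, flags in packets:
        if "SYN" in flags:
            syns[dst] += 1
        if "FIN" in flags:
            fins[dst] += 1
    suspects = []
    for dst, s in syns.items():
        # Guard against division by zero: a flood victim may see no FINs at all.
        if s >= min_syns and s / max(fins[dst], 1) > threshold:
            suspects.append(dst)
    return suspects
```

A real core-network implementation cannot keep exact per-destination counters like this; that is precisely the scaling problem the talk addressed.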

Anukool Lakhina: Characterization of network-wide traffic flows

The interesting part of this talk (to me) was the subspace method that was used to identify anomalies; unfortunately, that was work presented at SIGCOMM that I didn’t see. The talk actually focused on some interesting results that came about from using the method. They had access to a large traffic matrix, broke it down by flows, bytes, and packets per OD-pair. They were able to see shifts in traffic, and other things.

Their tool identified the presence of anomalies, but they hand-inspected all of the results.

Robert Schweller: Reversible sketches

Sketches (k-ary sketches) have been proposed as a way to find interesting events compactly but there isn’t a good way to map from the sketch (and identified event) back to the host(s) that caused it. Robbie presented a technique using modular hashing (i.e. hashing byte by byte) to reduce the complexity of reversing the sketch and obtaining the IP address. To preserve good hashing properties, an IP mangling function (a bijection that shuffles bits of IP address) is needed.
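A toy illustration of the modular-hashing idea: hash each byte of the IP address independently, so reversing requires inverting four small 8-bit mappings rather than one 32-bit hash. This sketch is my own simplification — the per-byte tables here are random bijections, and the paper additionally applies the IP-mangling bijection first so the composed hash still mixes well.

```python
import random

rng = random.Random(42)

# One random bijection on [0, 255] per byte position of the IPv4 address.
BYTE_TABLES = []
for _ in range(4):
    table = list(range(256))
    rng.shuffle(table)
    BYTE_TABLES.append(table)

# Because each table is a bijection, each one is trivially invertible.
INVERSE_TABLES = [{v: i for i, v in enumerate(t)} for t in BYTE_TABLES]

def modular_hash(ip_bytes):
    """Hash an IP address byte by byte, e.g. (10, 0, 0, 1)."""
    return tuple(BYTE_TABLES[i][b] for i, b in enumerate(ip_bytes))

def reverse_hash(hashed):
    """Recover the IP address by inverting each 8-bit hash independently."""
    return tuple(INVERSE_TABLES[i][h] for i, h in enumerate(hashed))
```

The point is the search-space reduction: reversing four 8-bit functions costs at most 4 × 256 table entries, versus 2^32 candidates for a monolithic hash.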

Vern Paxson: Sound Internet Measurement

This is a talk that I wish I had gotten when I started working on DNS analysis in Hari’s class. Vern gave all of us advice on what to be aware of and tips for achieving sound measurement.

Mostly, it boils down to discipline.

You should:

  • Keep meta-data. Lots of meta-data, more than you think you need, so that your future self will be able to know what you did in the past. For example, what filter was used with tcpdump? That’s not recorded in the trace itself and can affect conclusions that you draw.
  • Check your methodology with someone else, before you start. Better they tell you what to fix than the reviewers of your paper. Are you measuring what you want?
  • Calibrate your measurements. Understand the outliers.
  • Automate all production of output from raw data. This ensures that you can reproduce something later when you are working on the camera-ready: you’ll have lost all your mental context by then, but hopefully it will be in your source code. Also, cache intermediate results. Caching gives you speed in the common case, while still letting you verify your results by blowing away the caches and starting over.
  • Have a way to visualize changes in your results after your scripts change. (This seems hard.)
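The caching advice above can be sketched in a few lines. This is a minimal pattern of my own, not from Vern's talk: derived results are keyed by a hash of the raw input, so re-running a pipeline is cheap, and deleting the cache directory forces a full, trustworthy rebuild.

```python
import hashlib
import os
import pickle

def cached(step_name, raw_bytes, compute, cache_dir="cache"):
    """Return compute(raw_bytes), caching the result keyed by the raw input.

    Deleting cache_dir re-runs everything from scratch, which is exactly
    the "blow away your caches and start over" verification step.
    """
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(step_name.encode() + raw_bytes).hexdigest()
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute(raw_bytes)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Keying on the raw input (rather than just the step name) also means a changed trace file silently invalidates the stale cached result.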

Vern gave some other examples too.

Estimating Interdomain Web Traffic

This talk is not really about the amount of HTTP traffic that transits between Level 3 and BBN. It’s about end-points (e.g. people clicking on links) to publishers.

More tricks using Akamai data. With logs for three 2-hour periods covering (almost) all of Akamai, and complete traces from the border of a large (German) organization, they can calculate the ratio of data served by Akamai (per Akamai customer) to data served by each Akamai customer directly. Assuming the ratio holds for other sites, they can use the Akamai logs of what data Akamai sent to estimate the amount of traffic from the origin servers to the client machines.
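The estimation arithmetic is straightforward; here it is with made-up numbers (the real per-customer ratios and volumes are in the paper, not reproduced here):

```python
# At the calibration site, both halves of one customer's traffic are visible
# in the border trace. All byte counts below are invented for illustration.
akamai_bytes_at_site = 40e9   # bytes this site fetched via Akamai
origin_bytes_at_site = 10e9   # bytes this site fetched from the origin directly

ratio = origin_bytes_at_site / akamai_bytes_at_site   # origin : Akamai

# Akamai-wide logs show how much Akamai served for that customer overall.
akamai_bytes_global = 2.0e12

# Assuming the calibration ratio holds everywhere, the unobservable
# origin-to-client traffic can be estimated from the observable Akamai side.
estimated_origin_bytes = ratio * akamai_bytes_global
```

The whole method rests on that "assuming the ratio holds" step, which is why the calibration concern below matters.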

They need more calibration, i.e. confirmation that the ratios hold at other sites.

A. Medina: Interactions between Transport Protocols and Middleboxes

Interesting examination as to what servers seem to support SACK, IP options, ECN, PathMTU, etc. Use a special prober that crafts special packets to a lot of servers and observes the results.

Lots of things presented but the big take-home message is that many places filter/drop ICMP and also unknown IP options. So, not much hope for deploying cool tricks that use IP options or ICMP.

K. K. Ramakrishnan: Characterizing IP VPNs

At some point, AT&T moved away from forcing the customer to characterize the precise needs of their VPNs (which made provisioning easy) to an approach that is hard to provision but easy on the customer. How to provision, then? They try to estimate something per VPN, solve some equations, and figure out where the traffic is going. It seems to work better for large or busy links.

Jussara Almeida: Characterizing Spam

As part of a greater effort to understand spam traffic and thus be able to build better tools to detect and stop spam, they analyzed a lot of mail at their ingress SMTP point.

I found it amazing (given the work Jaeyeon and I are working on) that they don’t have the IP address of the sender — instead, they characterize based on the domain of the From address. Also, they (with little choice, I suppose) focus on things that are caught by SpamAssassin. Also, if mail was black-listed at the ingress, they don’t analyze that either.

With that in mind: spam has less diurnal variability than non-spam, and it appears that 50% of spam is sent by 3% of spam-only senders.
