Thoughts on Systems

Emil Sit

Aug 10, 2008 - 4 minute read - Technology authentication bruce schneier cookies e-mail encryption FastMail GMail ING Kim Cameron openid password randall stross security session Vanguard

Improving web authentication

You use passwords, possibly dozens of passwords, to authenticate to websites daily. Passwords are a useful authentication tool because they function as a “thing-you-know” (a shared secret between you and the server) and because passwords can be changed (in case of loss, unlike say, your fingerprints).

In a diatribe against OpenID titled “Goodbye, Passwords. You Aren’t a Good Defense,” Randall Stross argues that the time for passwords has gone and that (password-based) single sign-on systems like OpenID are not going to fly. Let’s ignore the fact that while he claims “no security expert [he] could reach” thought passwords were a good idea, he names no actual experts in his column—could he not, for example, get a comment from Bruce Schneier, who has written extensively about the subject? By ignoring that, we would be less likely to conclude that his column is just a front piece fed to him by the source he does cite, Kim Cameron. (The OpenID blog contains a somewhat more objective defense of the issue.) OpenID is still a long way from mainstream, however, and a site can do many things to improve its authentication security without it.

Session management allows users to participate in detecting password theft. For example, Google Mail now lets you manage authenticated sessions. Not only does GMail now explicitly inform you when and from where your account has been accessed recently (which many banking websites do as well and Unix login has done for years), it also lets you explicitly log out those other sessions. This is great news for detecting and then dealing with password compromises. Google Mail also added a setting to ensure that your mail itself always goes over an encrypted connection. This lets you trade off the computational overhead of secrecy via encryption against performance—for users, the computational overhead of a single SSL connection is minimal and easily amortized over a day-long GMail connection.

Google’s approaches, however, still rely on your password. What if you are at an Internet café and want to check your e-mail but not risk losing the password that protects your AdSense account? FastMail has the solution: One-time and SMS passwords. This is a brilliant feature that I am surprised is not more widely available. Basically, FastMail offers a variety of options for generating temporary, disposable passwords. You can pre-create a list of single-use passwords that you keep safely in your wallet: even if the password is captured by a key-logger or shoulder-surfer, it can never be used again to authenticate you. You can also create, on-demand, a single-use password that is sent via SMS to your cellphone. These are great ways to protect your account while still being able to access it from anywhere.
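FastMail does not publish its implementation, but the pre-generated list variant is easy to sketch: store only hashes server-side and cross each password off on first use. The function and variable names below are my own, not FastMail’s:

```python
import hashlib
import secrets

def generate_otp_list(n=10):
    """Pre-generate n single-use passwords. The plaintext list goes
    in the user's wallet; the server keeps only the hashes."""
    passwords = [secrets.token_hex(4) for _ in range(n)]
    stored = {hashlib.sha256(p.encode()).hexdigest() for p in passwords}
    return passwords, stored

def check_otp(password, stored):
    """Verifying a password removes its hash, so each one
    authenticates at most once--a key-logged copy is useless."""
    h = hashlib.sha256(password.encode()).hexdigest()
    if h in stored:
        stored.discard(h)
        return True
    return False
```

The single-use property comes entirely from the `discard` on successful verification; the SMS variant is the same idea with a one-entry list generated on demand.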

Any web system must also deal with the inevitable forgotten password. ING Direct demonstrates how this can be handled efficiently and safely. Instead of the moronic question/answer systems that demand that you remember exact, case-sensitive answers to short answer questions, ING appears to ask you specific questions about your current billing address, and then some things from your credit report (like where you used to live or work). I remember the addresses of places I used to live and work; I can’t remember if I used to own a Subaru or an Outback. ING often feeds you wrong information to clue you in to a potential problem—enter an invalid saver ID and it will happily make up a name for you that’s not yours. And after jumping through these questions, you also have to prove access to a verified e-mail address. These are familiar, repeatable tasks that I feel work quite well.

Finally, the display of personalized phrases and images at login time helps reduce the risk of phishing attacks by authenticating the website to you. Yahoo! sets a cookie for this purpose, displaying text in a color of your choosing, for each computer you use—the browser policy of returning cookies only to the domains that set them ensures that you are connecting to the proper site. Vanguard and ING both link the custom image to your username. They trade off the convenience of not having to worry about cookies on every computer against the potential risk of a man-in-the-middle attack. I’d imagine they’ve done the risk analysis studies to determine that this works out best.

It would greatly improve the security of most websites if they supported user session management, forced SSL, provided one-time passwords/SMS passwords, authenticated users using intelligent questions, and authenticated to users explicitly. While they may be foreign to users today, as they become more common and uniformly adopted, they will become as familiar as captchas but infinitely more useful.

Aug 9, 2008 - 2 minute read - Personal lifehacks mastery selfimprovement

Become a master

Masters make things look easy. A master photographer can pick up a disposable camera and take a beautiful picture; a master bodyworker sees patterns that cause pain in your body and efficiently corrects them; a master programmer rapidly produces working systems and debugs existing ones. There’s something appealing about being a master of any skill.

Tim Ferriss argues that we can all approach mastery of many areas, that mastery can be achieved in less time than we think. He writes:

Generalists recognize that the 80/20 principle applies to skills: 20% of a language’s vocabulary will enable you to communicate and understand at least 80%, 20% of a dance like tango (lead and footwork) separates the novice from the pro, 20% of the moves in a sport account for 80% of the scoring, etc.

Of course, it takes more than 21 days to master a skill and daily practice is critical. You must constantly challenge yourself to do something difficult and to learn something new.

I find these challenges incredibly rewarding. Since I graduated college, I’ve taken up hobbies and skills outside my professional work—from yoga to photography; I continue to learn about areas within my field, from tools like the hot version control system of the day to how to write an operating system. Some things I still want to learn include massage, menu planning, haircutting, jazz piano, … not to mention continuing to improve upon what I’ve already learned. But it is hard to find time to learn new things and keep up with old ones. Time management is something I have yet to fully master. I’m working on it.

Are you a master? No? Why not?

Jun 25, 2008 - 1 minute read - Hacking howto programming tools workflow

Reduce your context switch delay

Sometimes, simple shell scripts can save a lot of time. Recently, I noticed myself waiting for various unit tests to complete by surfing the web: a surefire way to be distracted for more than the time it takes for the tests to complete (or fail). Enter the following script, which I call notify:

#!/bin/sh
"$@"; status=$?
xmessage -center "$(basename $1) done, status $status"
exit $status

You run it like this:

notify make all

at which point make runs along merrily. Of course, you replace make with whatever command you run that takes a long time to complete. When make exits, a small window appears in the middle of your screen that says “make done, status 0”. This immediately notifies you to stop surfing and get back to work.

So… get back to work!

Apr 30, 2008 - 1 minute read - Research Technology failures hosting

Characterizing failures in a data center

Part of my research has been investigating how to build storage systems that can provide availability and durability despite failures. It’s been interesting to see recent papers that characterize failures, such as Ethan Katz-Bassett’s NSDI paper about Hubble, or last year’s papers about drive failure characteristics from Google and from several high performance computing facilities. Today, while catching up on reading High Scalability, I came across a Jeff Dean presentation about scalability at Google, which includes fascinating anecdotal tidbits about failures over a year in a typical Google data center, with frequency, impact and recovery time, such as:

  • 0.5 overheating events: power down most machines in <5 mins, ~1–2 days to recover
  • 1 PDU failure: 500–1000 machines suddenly disappear, ~6 hours to recover.
  • 1 network rewiring: rolling 5% of machines down over a 2 day span.
  • 5 racks go wonky: 40–80 machines see 50% packet loss.

His presentation includes several other classes of rack, network, and machine failures that you can expect to see with real hardware, and that scalable, distributed systems have to cope with and hopefully mask. For the full list of failures, you can view the presentation (1.2 Mbyte PDF) or the video. I wonder how well Chord/DHash would fare in such an environment…
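For a rough sense of scale, the bulleted figures invite a back-of-envelope calculation. The fleet size and per-event midpoints below are my own guesses, not numbers from the presentation:

```python
# Assumptions (mine): a 10,000-machine data center; a PDU failure
# takes out the midpoint of 500-1000 machines for ~6 hours; the
# rewiring rolls 5% of machines down for ~1 hour each.
machines = 10_000
hours_per_year = 24 * 365

pdu_loss = 750 * 6                    # one PDU failure per year
rewiring_loss = int(0.05 * machines)  # 500 machines, ~1 hour each

total_loss = pdu_loss + rewiring_loss
fraction = total_loss / (machines * hours_per_year)
print(total_loss, "machine-hours/year lost,",
      "%.4f%% of capacity" % (100 * fraction))
```

Even with these two events alone, the lost capacity is a tiny fraction of the total—the hard part is not the aggregate loss but masking the sudden correlated disappearance of hundreds of machines.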

Apr 1, 2008 - 3 minute read - Photography Technology camera dslr Photography tools

Choosing a camera for your small business

If you are in a small business that needs the occasional picture—for record-keeping, documenting events, or including in promotional material—having a digital camera on hand is definitely useful. A friend recently wrote:

> We want something that isn’t too complex, but takes decent shots for print and web. Possibly with a good zoom so we can get wide-angle […] as well as portraits. Nothing crazy on either end, we don’t need to have multiple lenses and all that.
>
> Would you recommend an slr?

For my friend’s uses, I have no hesitation recommending a digital SLR over a compact, ultra-compact, or superzoom. Normally, the main reasons for getting a smaller camera are price and portability. While DSLRs have come down significantly in price—for almost the same cost as a high-end superzoom, you can purchase a quality, entry-level DSLR—there are definitely functional compact options available for under $200. DSLRs do not fit into your pants pocket, unlike ultra-compacts; then again, a more functional compact or superzoom will also not fit in your pocket. A small business, however, is unlikely to require a camera that can be carried off in someone’s pants pocket.

Digital SLRs currently have numerous advantages, including:

  • high quality sensors,
  • fast focus and shutter response time, and
  • flexibility in terms of lens, lighting and processing.

This means excellent image quality, never missing a shot because of the camera, and room to grow. Much has already been written about the importance of a quality sensor—pixel for pixel, DSLRs will have better sensors, capable of taking clearer pictures with less noise in low-light conditions (e.g., the interior of a business). Similarly, even a basic lens will have higher quality optics than the average compact camera. The ergonomics and usability of digital SLRs are excellent: they are comfortable to hold, turn on instantly, and take pictures when you click the shutter. Modern DSLRs have excellent auto-exposure (“program”) modes, allowing them to function as point-and-shoots, but with the option of additional photographer control.

The reason I highly recommend a digital SLR for a small business in particular, however, is flexibility. First, digital SLRs almost always offer the option of RAW capture, which allows for great latitude in image processing after the fact. Second, with a known brand like Canon or Nikon, it is easy to incrementally improve the capability of the camera by using additional lenses and off-camera lighting equipment. You may not want to own a plethora of lenses, but you may occasionally want to rent the highest quality professional gear. With rates starting from $25/weekend (from a local store like Calumet) or $50/week (from a mail-order shop), you can get what you need for a specific project, while having a quality camera around for regular use.

How to pick the camera for you? For a small business, don’t worry about perusing the many specs. An entry-level (or one-generation-old medium-level) camera from Canon or Nikon, purchased at a reputable store like Amazon, B&H Photo or Adorama, will serve you well. If you know any photographers, choose the brand that they own, in case you have any questions or want to borrow lenses or flashes. A more expensive camera is generally unnecessary—you will know when you are ready to use one.

Mar 17, 2008 - 2 minute read - Hacking Technology pipes tools twitter web2.0

Clean up a Twitter feed with a Yahoo Pipe

Twitter provides RSS/Atom feeds of your posts; with these feeds, your posts can be easily tracked in news readers like Google Reader, monitored in aggregators like FriendFeed or SocialThing!, and cross-posted into other blog services such as Tumblr. This idea works fine, except for the fact that Twitter has been co-opted to be not only an ambient intimacy service, but also a chat service. This can create noise in other people’s view of your feed—consider the chat versus status/micro-blog updates on, for example, Adam Darowski’s blog:

Twitter Microblog example

Lacking context for the replies, the individual messages may be hard to follow. Using Yahoo! Pipes, we can generate a clean RSS feed that can be used in FriendFeed or Tumblr. Seeing that a service that filters #hashtags out of a Twitter feed was built in part with Pipes, I built a simple pipe called Twitter Feed without Replies that anyone can parameterize and use to filter their replies. Simply visit the pipe’s information page, enter your username, and get the results as RSS (under “More options”). The main downside of the current implementation is that the feed description and title cannot be parameterized as well.
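For those who prefer code to Pipes’ visual editor, the same filter is a few lines of Python over a fetched copy of the feed. This sketch assumes the standard RSS `<channel>/<item>/<title>` layout and Twitter’s “username: text” title format:

```python
import xml.etree.ElementTree as ET

def strip_replies(rss_text):
    """Drop <item>s whose tweet text (after the 'username: ' prefix
    Twitter puts in the title) starts with '@', i.e. @replies."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    for item in list(channel.findall("item")):
        title = item.findtext("title", "")
        text = title.split(": ", 1)[-1]  # strip the "username: " prefix
        if text.startswith("@"):
            channel.remove(item)
    return ET.tostring(root, encoding="unicode")
```

The Pipes version is the same idea expressed as a Fetch Feed module followed by a Filter module on the item title.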

Incidentally, Yahoo! Pipes was really easy to use and seems nicely designed for easy integration with other services. The above pipe took an hour to build and it was my first experience with the service. With a little more work it would probably be possible to build a pipe that parses and generates JSON, for use in programs such as the WordPress Twitter Widget, as well as RSS for feed readers. On the other hand, for those cases it is probably easier to take Twitter’s JSON output and filter that directly.

Do @replies in microblogs bother you? Would you care enough to remove them?

Mar 14, 2008 - 3 minute read - Technology howto question security ssh tools vpn

How to use ssh to securely access the net

Public wireless networks can be scary; you never know who might be sniffing your traffic, recording your GMail authentication cookies, or worse. Ideally, all of your net activity would be end-to-end authenticated and encrypted. Since this is not always feasible, ssh fortunately makes it easy to use an untrusted network by routing your traffic through a trusted end-point. All you need is an ssh client (OpenSSH, standard on most Linux/Mac systems, or PuTTY for Windows), an HTTP/HTTPS proxy (optional), and clients that support SOCKS5 (most software these days). These techniques aren’t new, but I didn’t really learn them until I started working at cafés, so it may be worth re-summarizing them.

The steps are pretty straightforward.

  1. Enable dynamic port forwarding for ssh. This creates a SOCKS proxy on your localhost at a port you specify; this proxy will handle the connection forwarding, over the secure (authenticated and encrypted) ssh connection.

I connect to our trusted server at work; if you don’t have a trusted server, you can try getting a free shell account. You can automatically enable dynamic port forwarding by setting DynamicForward in your ssh_config file (or creating a PuTTY profile) for your shell host.

  2. (Optional) Set up Polipo with a configuration file that points its parent proxy at the port you used for dynamic forwarding. I like using a separate web proxy so I can easily switch between tunneling through ssh and connecting directly, just by swapping out the web proxy configuration instead of reconfiguring all my applications individually. A proxy also ensures that your DNS requests are not visible to the local insecure network.

  3. Configure all of your network applications to use the SOCKS proxy (or HTTP proxy). For application-specific instructions, you can view the Torify HOWTO; the “anonymizing” Tor network’s interface also uses an HTTP or SOCKS proxy, so the same instructions apply. (Unfortunately, Tor is neither secure (it has untrusted exit points) nor really anonymous (see any of Steven Murdoch’s papers about Tor), so I can’t recommend it. It’s slow, too.) I tunnel Firefox, my Twitter client, and my IM client through the web proxy. If you choose not to use an HTTP proxy, Firefox and Pidgin both support talking directly to the SOCKS proxy.
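Concretely, the dynamic-forwarding step can live in your ssh_config so the SOCKS proxy comes up with every connection. The host name, server, and port here are hypothetical:

```shell
# ~/.ssh/config -- "work" and shell.example.com are placeholders
# for your own trusted host.
Host work
    HostName shell.example.com
    DynamicForward 1080   # SOCKS5 proxy on localhost:1080

# Equivalent one-off invocation, without the config entry:
#   ssh -D 1080 user@shell.example.com
# Then point your applications (or Polipo's parent proxy) at
# localhost:1080 as a SOCKS v5 proxy.
```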

Also, if you do not use a webmail service like GMail, make sure you configure your mail client both to read mail over SSL/TLS (e.g., secure IMAP) and to authenticate the outgoing mail server as well. I have been in a hotel that transparently redirected all outgoing mail traffic (port 25) into the void.

The result: all traffic to and from your laptop is secure from prying eyes. A side benefit is knowing that your traffic is exiting the Internet from a trusted host.

Mar 12, 2008 - 4 minute read - Technology im tools twitter usability web2.0 zephyr

Twitter needs better message tracking options

Twitter is the hot messaging platform of choice for many discerning technologists and early adopters. (If you don’t know what Twitter is, check out the CommonCraft intro video for a quick overview.) In short, Twitter provides laconic insight into what people are doing, with a diversity of client interfaces to satisfy (almost) every need. While Twitter is nominally for providing ambient intimacy, recent research shows that many are using Twitter as a way to publish information—as an op-ed/news venue—and subsequent discussion forum. This usage is mixed together with more “traditional” updates of a personal nature. While personal updates provide insight into people’s non-work activities, sometimes it may be more than desired. Twitter would benefit from the ability to isolate such messages and subscribe to updates more selectively.

I propose taking a lesson from the Zephyr instant messaging system. In college, Zephyr was where ambient intimacy and information publication/discussion occurred. Like Twitter, Zephyr was originally intended for one purpose—notifying users of upcoming outages—but was co-opted for another. In its heyday at MIT, Zephyr was the place to catch up with your friends, chat with the community about the latest on current events, and get help with esoteric technical problems (or homework) from the top people in the community. Zephyr is still in use, though the community of users has declined since the growth of AIM and ICQ.

Unlike Twitter, Zephyr provides an advanced subscription mechanism, instead of forcing you to follow all messages of a given user. Each Zephyr message has a class, an instance and a recipient. The class acts as a namespace, typically used by different communities to separate their messages from others. Instances act like tags, marking the subject of conversation. For private messages, a specific recipient can be specified as well. Subscriptions are required to specify a class/namespace but can match any tag or any recipient (though the system would only deliver messages to the intended recipient).

The result of this mechanism is tremendous flexibility in usage and community isolation. Classes (namespaces) can be established to partition discussion by student groups (e.g., -c sipb) or by function (-c help). Discussions in those groups will not be delivered to anyone not specifically subscribed to the class. Instances/tags further sub-divide discussions—for example, SIPB members seeking dinner might coordinate a food order over the food instance (-c sipb -i food). A default class (called message) is provided for general simple discussions, where users can subscribe to all instances/tags. Typically, discussions about particular classes (e.g., -i 6.033 for the instance about the intro computer systems class), news and politics (-i white-magic), or local technical issues (-i network) are on this default class. Zephyr’s default client also allows for customization of formatting and placement of messages based on the <class,instance,recipient> tuple.
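The subscription mechanism described above can be sketched as a match on `<class,instance,recipient>` triples. This is a simplification of my own (real Zephyr also handles case-folding, un-classes, and authentication):

```python
ANY = "*"

def matches(sub, msg, me):
    """sub and msg are (class, instance, recipient) triples.
    A subscription must name the class, may wildcard the instance
    and recipient, and personal messages are only delivered to
    their intended recipient."""
    s_cls, s_inst, s_rcpt = sub
    m_cls, m_inst, m_rcpt = msg
    if s_cls != m_cls:                # class/namespace must match
        return False
    if s_inst not in (ANY, m_inst):   # instance acts like a tag
        return False
    if m_rcpt and m_rcpt != me:       # addressed to someone else
        return False
    return s_rcpt in (ANY, m_rcpt)
```

So `matches(("sipb", "food", ANY), ...)` delivers only the dinner coordination traffic, while `("sipb", ANY, ANY)` delivers everything on the class—exactly the selectivity Twitter lacks.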

This sort of isolation and organization is difficult to achieve on Twitter currently. Imagine how nice it would be if all Twitter messages about South by Southwest were isolated to a sxsw class, with rich differentiation for those present (e.g., <sxsw,zuckerberg,*>) and completely ignorable for those uninterested; at the same time, you could still get Scoble’s non-SXSW-related tweets.

The best proposal I am aware of uses hashtags, which look a little like #channels from IRC, to tag tweets. Such tweets are indexed by various third-party services. Unfortunately, these proposals require embedding metadata in the Twitter message itself, which can be aesthetically unappealing enough to warrant removal. Further, it is difficult to track tags (or even @replies from strangers) from within API clients, since the track command is limited to IM and SMS, and appears to ignore punctuation.

There have been some demands for groups in the Twitter support forums, so it does appear that a feature like this may be in our future. As a simple start, if Twitter’s track command caused tracked tweets to appear in your timeline (for the API) and also supported stemming, I think I would be happy. However, even better would be to take recipient and tagging information out of the tweet and make it explicit metadata (much like d is used for direct messages). Whatever is implemented, I hope it will allow for Zephyr-like organization of discussions. Until then, #hashtags and third-party extensions will have to do.

Feb 26, 2008 - 3 minute read - Technology friendfeed web2.0

What value does FriendFeed add?

FriendFeed, which launched today, starts with a simple premise: it aggregates content you produce from various popular sites. On top of that, it allows you, your friends, and possibly strangers, to comment on your content. As a further social aspect, it provides some basic discovery/social-networking services. This is all viewable via a (currently) basic web interface or RSS. Given the various hype about this service, what’s the new value?

Discovery sounds much like what StumbleUpon offers. This feature is probably aimed towards virally spreading FriendFeed but is not part of its core functionality. Perhaps it can introduce you to content that you can’t already get by simply subscribing to your friends’ link blogs or feeds.

Commenting is a slightly troublesome feature: most existing sites allow users to submit comments and hold discussions directly on the site itself. In this sense, comments and likes on FriendFeed are more of a hassle than a convenience. FriendFeed doesn’t import discussions that appear in comments on, for example, your imported Flickr images. It also doesn’t provide any mechanism for taking comments made on FriendFeed back to the original content host. Jeremy Zawodny called for an API to extract comments, but it would be better if FriendFeed pushed comments back into services on your behalf (perhaps via trackbacks, at least for blogs), rather than creating yet another place where you have to moderate comments. An excellent example of comment integration is PhotoPhlow, an IRC-style chat for Flickr that makes it extremely easy to comment on Flickr images without forcing you to load Flickr’s web interface. On the other hand, FriendFeed’s model is perhaps no worse than reddit’s, and it allows you to chat with a community of people you know.

Simple feed aggregation is not new: you can aggregate feeds without FriendFeed by sharing a tag in Google Reader, making a new Yahoo! Pipe, or using some other feed blending service. However, FriendFeed makes this tremendously easy to do—aggregation and display are its core functions, not an add-on. The community of feeds is what makes FriendFeed possibly addictive, in the way that Facebook’s News Feed can be. I can imagine FriendFeed becoming even more powerful, by adding filtering capabilities, so that you can exclude Twitter @replies or only import Flickr photos from a particular set.

FriendFeed may turn out to simply be an aggregation mechanism dominated by A-list bloggers, or it may turn out to be a place where you can track and chat with your friends. They will do best if they can make their site a place you visit every day, or even many times per day—will you replace your feed reader (and Facebook and Twitter client and …) with FriendFeed? The service definitely has the potential to be very popular, especially if they improve their integration with data sources. It will be interesting to see what happens; for now, you can follow some of my activities on my FriendFeed.

Feb 23, 2008 - 2 minute read - Technology howto planetlab python question tools

How to extract PlanetLab geographic data

During the course of a given week, I answer a lot of technical questions. They range from the friend asking, “What laptop should I buy?” to strangers with very specific questions about the source code used in my research. I rather enjoy solving technical questions and taking a line from Jon Udell’s “Too busy to blog?” post, I’m going to start posting some useful answers on this blog. If you have a question, please send it along!

This week’s question comes by way of the planetlab-users mailing list. PlanetLab is “an open platform for developing, deploying, and accessing planetary-scale services.” It now consists of over 800 distributed nodes, (over)used by systems and networking researchers to approximate real wide-area deployments and validate research ideas. For example, if you’ve ever used the Coral content distribution network or the CoDeeN web proxy, then you have used PlanetLab.

Jeff Sedayao asked:

> I’d like geographical data on nodes - I know that there is lat long data in the PLC, but I don’t seem to be able to find an API for getting it out. Ideally I like it to be queryable through comon so I can do queries like “find nodes that are usable within the following geographical areas” but I don’t see a queryable hook for that. If anyone can help with a pointer at getting this data out, I’d appreciate it.

The following code (here shortened for the web) does not use the PLC API; rather it makes use of an XML dump of the node database that PlanetLab publishes periodically. The node database dump groups nodes by site: this code extracts the geographic coordinates for each site using Python’s included XML parser, calculates the distance from center using geopy, filters out those that are further than range miles away. It then prints out the hosts at any remaining sites.

#!/usr/bin/env python
import xml.dom.minidom
from urllib import urlopen
# Requires geopy
from geopy import distance

# Print the hosts at sites within range miles of the center.
center = (32.877, -117.237)
range  = 100

fh = urlopen("")
sites = xml.dom.minidom.parse(fh)

for e in sites.childNodes[1].childNodes:
    # Skip child nodes (e.g., text nodes) without parseable coordinates.
    try:
        lat = float(e.getAttribute("LATITUDE"))
        long = float(e.getAttribute("LONGITUDE"))
    except: continue
    a = (lat, long)

    d = distance.distance(center, a).miles
    if d > range: continue

    for h in e.childNodes:
        try: name = h.getAttribute("NAME").lower()
        except: continue
        print name

The resulting list is suitable for passing to tools like vxargs.

PlanetLab’s list archives have the full thread with some other options presented as well.