Thoughts on Systems

Emil Sit

Jan 18, 2008 - 3 minute read - Personal distractions

Finding distractions that really distract

Once in a while, someone will tell me that there’s nothing good on the Internet. However, while “there’s nothing good on TV” may be true, the long tail on the Internet means there are plenty of time-wasting opportunities available, if you take a few minutes to look. For example, for any hobby, there is undoubtedly a thriving Internet community to while away your work hours. Do you knit? There are knitting blogs, e-zines, and community sites. Collect flashlights? More candle power to you. Want to improve your productivity? File away.

The trick to finding good distractions is to find something that can occupy you for a long time, for those days when there’s just no good celebrity gossip or political scandal to follow, no stimulating intellectual articles to read, and no award winning websites you haven’t visited. Hobby how-tos only get you so far, since you need to put in the time to practice your hobby in order to benefit from what you read (and you can’t do that at work). What you need is the web equivalent of finding a TV show you hadn’t heard of, renting the DVDs and watching the entire series in a single sleep-deprived week. This means something that not only has on-going new content, but also archival content that you can spend hours perusing.

Fortunately, over the last few years, people have created a cornucopia of content that’s fun to read: intellectually stimulating enough to draw you in but not so much that it tires you out. I’ve found three main categories: personal writing, fiction, and web comics. I haven’t found a lot of personal writing, though I do enjoy dooce’s stories of family life and hearing the anonymous adventures of a waiter. Many kinds of fiction can be found if you look: fan fiction can be good, and short stories give you good stimulation in less time.

Web comics are a bit newer to me but I’m finding them quite enjoyable. Comics can make anything exciting, even an intellectual board game. A recent Washington Post article observed that web comics are “a little edgier, a little quirkier and much much funnier” than your typical Sunday paper strip. Imagine…

These are all beautifully illustrated and in some cases, you can really see the artist grow along with the comic as you peruse the archives. If short and not-PC is more your style, check out A Softer World or Perry Bible Fellowship. I haven’t even begun to explore all the entries listed at Top Web Comics; surely there is something there for you, from romance to zombies. Comics often have thousands of past panels for you to read. Hours of fun!

If nothing here appeals to you, you may really be in a surfing slump. Try having someone else review your distraction list; they may see things you’re missing. Ask others for recommendations: photographer Chase Jarvis got many answers to a request for a new favorite blog, and Howard French uses contacts to find cool photos on Flickr. Link blogs (like mine) are another great way to take advantage of others to filter content for you. With so many options, there’s really no need to be in a slump. All you have to do is look. Maybe you can take a minute to share your distractions in a comment. But wait, shouldn’t you be working?

Note to employers: I work as hard and thoroughly as I procrastinate.

Jan 2, 2008 - 3 minute read - Personal facebook privacy

A tale of Facebook: privacy and community

New year’s day was my 98th birthday, at least according to my Facebook profile. It was a good day to learn a lesson about privacy and community.

For a long time, I simply ignored Facebook’s requirement that you enter your birthday “as a security measure.” How can Facebook knowing your birthday make anyone more secure? Perhaps it is related to the Child Online Protection Act (which is of questionable constitutionality). Or perhaps it is meant as some sort of weak verifier in case of a forgotten password? I remain skeptical; I try to put on Facebook only information that can be found elsewhere, which precludes putting my birthday there. Information incontinence and identity theft are bad enough without entrusting additional parties with personal information, much less a party with Facebook’s terrible privacy record.

However, someone at Facebook obviously takes age information very seriously, and a new feature in December asked for my birthday in a content-obscuring pop-up. So, I entered a fake birthday with the intention of simply hiding it afterward. Unfortunately, Facebook does not allow you to change basic profile preferences, including hiding your birthday, if it deems your birthday fake: you cannot change your birth year without engaging customer support, even though the pop-up keeps insisting that you enter a real birthday and never mentions that policy. (Oddly, Facebook treats any birth date in 1910 as fake: it offers 1910 as a possible birth year, but apparently doesn’t believe that anyone born that year (e.g., a grandmother) could possibly want to use Facebook.)

As a result of this mix-up, I spent much of yesterday receiving wall posts and e-mails from wonderful friends offering me new year’s and birthday wishes (though some were rightly somewhat suspicious). Though Facebook reminders are not quite the same as people remembering your birthday themselves, it is still a nice gesture and I appreciated every one. With some embarrassment, I wrote back to them to explain the situation.

Some improved design on Facebook’s part could have helped avoid my embarrassment—a clearer explanation of their “requirements”, a more unified and logical handling of birthday verification, or simply displaying people’s upcoming age in the birthday notification area (e.g., “Emil Sit is turning 98 tomorrow”). Design notwithstanding, Facebook raises difficult privacy and usage questions (the subject of much research). But it also provides surprising utility. So, until they (and I) figure out how best to manage online social interactions, such mistakes will happen. For those of you who were misled about my age, I apologize. Happy new year!

Mar 9, 2007 - 3 minute read - Personal privacy

Privacy, the Internet, and me

I’ve always been careful not to reveal much personal information online and often distrust online vendors. Some of my friends are thus surprised that I have a homepage, a blog and now a tumblelog. However, compared to other friends just a few years younger than me, my online self is decidedly modest; those friends enthusiastically fill out their profiles on the latest social networking platforms and blog about their everyday experiences in great detail.

This difference was recently discussed in a New York Magazine article titled “Say Everything”; describing the fear of having personal information online, the author Emily Nussbaum notes:

[…] the standard response I’ve gotten when I’ve spoken about this piece with anyone over 39: “But what about the perverts?” For teenagers, who have grown up laughing at porn pop-ups and the occasional instant message from a skeezy stranger, this is about as logical as the question, “How can you move to New York? You’ll get mugged!”

A comment thread over on Bruce Schneier’s security blog is filled with more Constitutional concerns over the erosion of privacy; an anonymous commenter writes:

People lose their privacy because they have no idea how much they used to have. I bet less than one in ten (and I think I’m being extremely generous) are even aware of any of the above threats, and thus don’t see the encroachment. Or worse, know but think the authorities are fully justified in taking advantage of them, and have no idea how much they’ve lost.

Of course, having a blog or a homepage is not what makes it possible to find out more about me. Unlike the John Smiths of the world, I am already easy to look up: I participate in many public activities that leave their mark online, from mailing lists to research conferences to the occasional ultimate team. By publishing my own information, I can choose what people see first and what I consider important. This is better than the alternatives: letting other people decide what is important, or doing nothing at all (just as the most secure computer is one that is powered off and locked in a vault).

Putting up a homepage or a blog, and participating in social networks, are just examples of how we use technology to make connections that we may have lost in Western society. In past centuries, I would have had the privacy spoken of by the commenter on Bruce Schneier’s blog; my information wouldn’t be aggregated in a giant database. But I also would have had stronger relationships with my neighbors and community.

I worry sometimes that the people with the tubes aren’t passing laws that comprehensively and uniformly protect the use of our personal information, and that there aren’t yet economic incentives for credit card companies to develop protections against identity theft. But people are working on the legal front, as well as exploring technical ideas like examining our online surface area, managing our attention, and managing and sharing data anonymously. And things will improve when today’s teenagers become senators and professors.

Until then, we experiment with managing our online selves, learning to be careful about what we share and how we share it. We have learned etiquette for using mobile phones and made them the new garden fence. The use of blogs to document and share personal experience and practical knowledge with others is becoming more mainstream. Those on the cutting edge use Twitter and Campfire to provide ambient intimacy and virtual context.

This blog is part of my explorations of this spectrum. Say hello!

Mar 7, 2007 - 1 minute read - Technology openid security

OpenID: the future lies in consumption

OpenID has been generating a lot of buzz this past month. It is a decentralized authentication mechanism that allows a consuming website to verify that “you” can authenticate to a particular identity provider (keyed by a URL). Big names from AOL to SmugMug to WordPress have recently announced that they are becoming OpenID providers.

Why so many providers? For one, it is pretty easy to become an OpenID provider; I am, for example, my own provider using phpMyID. But I suspect that a major motivation for these big players is that OpenID allows them to position themselves as the URL of choice for their users. It’s a great way for them to get mindshare of the http://openid.aol.com/screenname kind and retain customers.

However, provider mindshare is just a nice incentive to get the chicken-and-egg problem solved. Now we must wait for more OpenID consumers to make the vision of OpenID a reality and really benefit you and me. 37signals getting on board is a good step, and it’ll be even better when WP, MT, Blogger, and more support commenting with OpenID!

Feb 6, 2007 - 2 minute read - Technology customer-service hosting Photography

Excellent customer service from SmugMug

Excellent customer service speaks for itself. SmugMug’s CEO, Don MacAskill, gave an interview last month in which he said:

We also provide really great customer service, which is sort of unheard of on the net. Not only is it unheard of, it’s almost expected that you get the opposite.

Don is not kidding. Here’s an example of great customer service: in December, a customer and SmugMug community member posted a wedding photo horror story where the paid photographer had failed to meet expectations but was willing to provide the original from-camera images. This customer asked if anyone would be willing to help him try to re-touch and salvage any of the images from his wedding. In three hours, Andy Williams, SmugMug’s in-house pro responded with an offer to manually re-touch the images and comp an initial order of proofs. The work was done in two weeks and completed on New Year’s Day. I find that pretty amazing. Read or skim the thread for yourself.

In addition to excellent customer support, SmugMug is a solid, feature-rich service. While the perfectionist in me could find things to improve (mostly around its already very useful access control), I don’t have to worry about a thing: my images are safe, I can control who can see what and how much, and people who prefer prints to computers can order them, satisfaction guaranteed. Beat that.

Jan 31, 2007 - 3 minute read - Research mercurial programming tools

Tools for moving from CVS to Mercurial

When switching to a new version control system, it is important to be able to bring along all the past history of a project. There are several tools capable of converting a CVS repository to Mercurial; I have considered cvs20hg, tailor and Mercurial’s own convert-repo. While these all do the conversion, careful testing of the results is necessary. My experience was not so bad, but I did uncover a problem much later than I would have liked.

The Chord CVS conversion was done using cvs20hg, a tool that parses the ,v files used internally by CVS to reconstruct a changeset-oriented history out of the individual file modifications. To test cvs20hg, I set up a nightly sync of the CVS tree to a read-only Mercurial repository; this ran without problems. Thus emboldened, I took advantage of a pause in development along a CVS branch to make the official conversion.

Unfortunately, as I later discovered, cvs20hg seems to have a problem with branches (at least, with our branch). While I was attempting to trace back the history of a particular file, I found that the history abruptly terminated at the point where the file had been moved between two directories. Older checkouts did not include the file at all! After spending a little time debugging, I found that the problem seemed to be that cvs20hg incorrectly identified the branch start date as the initial repository creation date and, as a result, ignored the history of all files that had ever been deleted.

After a few hours of experimentation, I was able to re-migrate the repository cleanly by first migrating up to the actual branch creation date and then continuing along the branch for the remainder of the migration. The new transplant extension was able to move over a month’s worth of commits from the old migration into the new one without any problems.
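The transplant step is straightforward enough to sketch. The repository paths and the revision range below are placeholders, and the two-stage cvs20hg run itself takes arguments specific to our repository, so I’ve omitted it:

```
# Enable the bundled transplant extension once in ~/.hgrc:
#   [extensions]
#   transplant =

# Then, inside the freshly re-migrated repository, copy the commits that
# had accumulated in the old (broken) migration onto the new history.
cd chord-new
hg transplant --source ../chord-old 480:tip   # revision range is illustrative
```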

I did not wind up choosing convert-repo or tailor, though I did consider them. convert-repo in the Mercurial tip now supports CVS using cvsps to generate the changesets. When deciding how to recover from my initial mis-migration, I experimented briefly with this tool. However, it makes use of the new named branches feature of Mercurial which perhaps is still a bit young; it seemed premature and overly complex to include named branches in my conversion.

tailor is a more mature tool and I have used it in the past to convert to other systems with success; however, compared to the purpose-built cvs20hg, tailor is harder to use. On the other hand, tailor has many more features and can be used in more configurations. It is also actively maintained. Others have reported success using it to mirror repositories where cvs20hg would not work.

While these tools are all functional, there is room for improvement: they are certainly not yet ready to handle migrating complex projects like Mozilla. Fortunately for me, they do work well enough for simple projects. Many thanks to those early adopters who came before and got these tools working.

Jan 23, 2007 - 3 minute read - Research mercurial programming tools

Choosing Mercurial for Chord

Over the past few years, distributed version control systems have flourished; there are now so many that it is hard to choose between them. Each offers an evolution beyond CVS including, among other things, whole-tree views with atomic commits, complete and transparent offline operation, and excellent branching support. Always on the lookout for better tools, I have played with many of these systems over the past few years and watched them evolve. Last month, I chose one: I migrated the source for Chord, my research project, from CVS to Mercurial.

Why Mercurial and not another distributed version control tool like darcs, git, or Arch (tla/bzr)? Since these tools all have similar features, the distinguishing factors for me were Mercurial’s ease of use, performance, and modular design. To me, this means it won’t be hard for me or others to use, and that it is likely to gain additional functionality cleanly as it matures.

First, Mercurial is easy to use. The interface (e.g., hg ci) is readily accessible to long-time CVS users. You don’t have to learn new names for familiar commands (e.g., darcs record). There aren’t any arcane incantations or odd naming conventions required to keep the repository compact and manageable. Offline access is completely transparent and quite convenient for working at airports or cafes without free wireless. In addition, it is easy to set up both anonymous read-only access (via HTTP using the supplied hgweb CGI scripts) and shared read-write access (via ssh). Not that other systems haven’t improved over the years (e.g., the Bazaar version of Arch), but Mercurial makes a very favorable first impression for usability.
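To give a flavor of it, here is a minimal day-to-day session; the host name and repository path are made up:

```
hg clone ssh://hg.example.org//repos/chord chord   # shared read-write access over ssh
cd chord
# ... edit, build, test ...
hg ci -m "tighten up stabilization timers"         # commit locally; works fine offline
hg pull -u                                         # later, fetch and apply upstream changes
hg push                                            # publish your changesets to the shared repo
```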

Usability is enhanced by the fact that Mercurial is not slow; it’s actually pretty fast. It doesn’t really matter whether or not it is the fastest; Mercurial is fast enough while being quite space efficient. Choice of repository format has a big impact on this and Mercurial’s authors have clearly thought about how to achieve functionality along with performance: Mercurial’s design (PDF) is laid out in a paper that explains why and how it achieves scalability and performance.

Finally, Mercurial’s design includes an API for extension modules. A modular design (with documentation) means that innovation can be cleanly added on by external developers. There is already a nice set of extensions that provides features like GPG-signing commits, managing patch queues, and cherry-picking commits from one repository into another. A commonly requested feature is the ability to check out only portions of a repository; the forest extension is one approach to doing so.
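Enabling the bundled extensions is just a matter of editing your ~/.hgrc, along these lines (a sketch; the forest extension is distributed separately and not shown):

```
cat >> ~/.hgrc <<'EOF'
[extensions]
# sign changesets with GPG
gpg =
# maintain a queue of patches (quilt-style)
mq =
# cherry-pick changesets between repositories
transplant =
EOF
```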

Overall, I am quite pleased with Mercurial and have recommended it to several people. In my next post, I’ll talk about how I managed the conversion from CVS and some early experiences with using Mercurial in my research group.

Oct 25, 2006 - 3 minute read - Research grid-computing

Observations on SunGrid Customer Care

I haven’t used the SunGrid this week. In fact, no one has: there was a four-day outage from last Saturday morning through this morning. I received a notification about this last Wednesday evening. In compensation, Sun has credited me (and presumably everyone) with 100 additional free CPU hours, which was thoughtful. However, if the IPTPS submission deadline were this week instead of next, the 100 free hours probably wouldn’t do me much good.

What company gives three days’ notice for a complete four-day outage of a utility? NStar, my electricity company, occasionally (though rarely) needs to shut down power to do infrastructure work, so they send out a notice at least two weeks in advance by mail, call and leave a message on my phone, and schedule the work for a three- or four-hour window in the middle of the night. And people in the next town (or block) still have power. See the difference? If Sun wants the Grid to be used as a utility and seen as truly dependable, they should act more like one; the fact that they can afford a four-day outage plus $100 per user suggests they are not yet ready to be one.

It turns out that Sun was consolidating some data centers, which I deduced from some SSL error messages delivered by their login front-end over the weekend. This is not something you decide to do three days before the event. (Is it?) Send out a notice in August! And deploy one of those fancy Blackbox machines somewhere and run with reduced capacity for a few days. A more “Web 2.0” company would tell you what it is doing up front (maybe after a bit of prodding), and customers appreciate that. How open a company is about these details matters a great deal when choosing any online service.

In the meantime, I did have an opportunity to interact with Sun’s customer service over e-mail, and they were very responsive and effective. I tried to get a job in on the Friday evening before the outage and had some issues, which were acknowledged and handled very quickly, even though it was after business hours. (Maybe they had more people on call because of the outage, maybe not; the help pages indicate that normal hours exclude weekends and holidays.) I like responsive and competent customer service; too often, I understand a problem better than the first-line support and it takes forever to find someone who can actually observe and fix it. Sun so far has had good customer service and open access to their engineers.

As an idea, the SunGrid is a fast and easy way to get parallelism and performance flexibly. But Sun has to continue to improve the user interface (e.g., beyond the clever hack for job monitoring suggested to me by a Sun engineer) and reliability of their infrastructure. Unless they do, people without CPU grants are going to start looking at alternatives like using Amazon’s hosted EC2 or running their own DigiPede.

Oct 24, 2006 - 2 minute read - Rants security

Nexenta insecure by default

The concept of providing operating systems that are secure by default should be second nature to OS vendors. All major operating system vendors have been affected by exploits that allow remote attackers to take over a computer, and have realized that this is a bad thing: it is much better to reduce the possible avenues of attack as much as possible than to rely on the user to do the right thing. This practice has been adopted by vendors from Apple to Debian. Even Microsoft has a secure-by-default story called SD3+C. Unfortunately, the Nexenta GNU Solaris developers don’t pay as much attention to security.

In May, I submitted a high-priority ticket indicating that it is possible to remotely log in to the Nexenta VMware image without a password, using ssh or telnet. This seemed especially risky to me given the prevalence of attacks aimed at ssh. The ticket was ignored for five months, then recently closed and marked as “wontfix”.

This reflects poorly on Nexenta. Though I’m excited about the possibility of a DTrace-enabled system with Debian-style package maintenance, I am skeptical of a development team that lets a security bug submitted as high priority sit for five months and then summarily dismisses it.

A simple solution would be to disable SSH and telnet by default in all installs of Nexenta. Further, sshd could be configured to disallow root logins and passwordless logins; a sketch of what that might look like is below. Now, if only I could figure out how to append a comment to my ticket…
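On a Solaris-derived system, the change would amount to something like the following. This is only a sketch; I haven’t verified the exact service names or configuration paths that Nexenta ships.

```
# Turn off remote login services until the user explicitly enables them.
svcadm disable svc:/network/telnet:default
svcadm disable svc:/network/ssh:default

# If ssh is left enabled, at least lock it down in /etc/ssh/sshd_config:
#   PermitRootLogin no
#   PermitEmptyPasswords no
# and then restart the service:
svcadm restart svc:/network/ssh:default
```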

Oct 15, 2006 - 4 minute read - Research grid-computing tools

First steps with the SunGrid

The SunGrid is an on-demand grid computing infrastructure: you pay per CPU-hour as you need it, and Sun provides the hardware. I recently got access to the SunGrid as part of a generous grant of CPU hours by Sun to my research lab, CSAIL, and I’m mostly quite pleased with it.

John Powers rightly notes that it is not trivial to adapt most applications to run on the SunGrid:

It appears to me that there’s a long hard hill to climb to get applications onto SunGrid, and until that problem is fixed, few will care if the price is a buck or a penny per CPU-hour, even if the racks are full of nice hardware.

However, it is pretty easy to get started with embarrassingly parallelizable problems (like parameter space exploration) that run the same basic code with different inputs. The machines are better equipped and faster than the ones I normally have access to. I’ve used several hundred CPU hours so far running simulations to explore some new research ideas and the degree of parallelism available is quite gratifying.

Access to the SunGrid is via a web interface. You package up your application and its data files in a set of compressed ZIP files (called “resources”) and upload them to the Grid. You create a job by selecting which resources to unpack and telling the Grid which executable or script to run. All of the nodes you use share a network file system that holds the freshly unpacked contents of the resources you picked for the job. After the job completes, the Grid collects any new files created and packages them into a new ZIP file for you to download. It really is that simple, and you can see it in action in Jon Udell’s SunGrid screencast.
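Concretely, preparing a run looks something like this; the file names are made up, and the upload and job definition themselves happen through the web interface:

```
# Package the executable and a small driver script as one resource...
zip -r code.zip sim run.sh
# ...and the parameter files for the sweep as another.
zip -r inputs.zip params/

# run.sh is the script the job is told to execute; both resources are
# unpacked into a shared working directory before it runs, and any new
# files it leaves behind are zipped up for download when the job finishes.
```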

My application is written in C++ with no external library dependencies (e.g., no Boost or libasync or even the STL). This means it is easy to compile with g++ and ship over to the Grid. The trick, of course, is that I needed an x64 machine to build on. Fortunately, Sun will cleverly give you one almost for free; in my case, one was provided for me by the Infrastructure Group.
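The build itself is about as simple as it gets; the file names here are illustrative:

```
# Build a self-contained x64 binary on the Sun box; with no external
# libraries, there is nothing else to chase down before uploading it.
g++ -O2 -o sim *.cc
```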

I have two gripes about SunGrid right now. First, there is no way to obtain job-specific status beyond the number of CPU hours consumed. If you have an infinite loop sitting in some seldom-exercised code path, you might not notice it until you’ve consumed quite a few hours. If you suspect a job has gone rogue, there is no way to inspect its state by logging in to a machine somewhere: you have to cancel the job and download the output. It would be much better if you could specify some sort of status to be displayed in the UI, much like the existing running CPU-hour usage. Even a single integer could be useful (e.g., the number of sub-jobs remaining), though obviously a short text string would be more flexible.

Second, you must interact with the SunGrid via a JavaScript-heavy web interface. This is not always convenient: for example, you may generate large input files on a well-connected server while working remotely. To load such a resource, you are forced to transfer the inputs to your local machine and then upload them to the Grid. I would much prefer some sort of API (e.g., XML-RPC over HTTPS) that would allow me to submit resources, define jobs, and manage runs. For larger corporations, an API would also take humans out of the loop for periodic tasks.

That aside, my SunGrid experience has been rather enjoyable. If you are a grad student and find yourself needing CPU power for simulations, buying an x64 box from Sun and getting a SunGrid account for the heavy lifting is probably far cheaper than buying a cluster and powering it, not to mention maintaining all those machines. It’s also probably less time-consuming than writing your own tools to cannibalize spare cycles on the workstations of your fellow students. Give it a try!