Thoughts on Systems

Emil Sit

May 12, 2006 - 4 minute read - Research policy privacy

Alan Davidson on Internet Regulation and Design

As part of the Technology and Policy Program’s 30th Anniversary Celebration, Alan Davidson gave a talk today titled “Internet Regulation and Design” from the point of view of Google, where he works as the Washington Policy Counsel (aka Chief Lobbyist). Google currently has a small (three person) office in DC, representing their interests.

The first part of Davidson’s talk was the standard Google spiel: much like faculty on a common research grant, it seems senior Google staffers draw from a common pool of slides. Davidson’s talk focused more on people than Udi Manber’s NSDI keynote (which I’ve already summarized), which was more technical. He included things like the number of employees (6000ish, doubling every 18 months or so), the number of offices (20ish), and the number of languages supported (100ish). The push for Google now goes beyond web pages: the next frontier is making the world’s offline information available. While Book Search is still in the 20% of their 70-20-10 rule, they are investing not-insignificant effort in that direction.

The bulk of his talk consisted of examples of how Google tries to keep two basic goals (principles?) in sight with respect to policy: first, they like the free end-to-end net with no gatekeepers, and second, they generally just want the freedom to innovate (and thus don’t (yet) seek any special favors from regulation). Davidson discussed this by summarizing three basic issues: content regulation, net neutrality, and copyright. His presentation of these issues was at a basic level, introducing the concepts to the audience and Google’s basic approach. Nothing too new here.

The more interesting points came out in the 20 minutes of discussion following. When asked about the broadcast/webcast rights treaty, he suggested that this would probably be a bad thing for the net, making yet another hurdle that had to be cleared before using certain media.

There were a number of questions about net neutrality. One person asked whether Google had considered forming a coalition with other companies to try to prevent telcos and last-mile providers from taking advantage of them; Google hasn’t thought about that and would probably rather prevent the law allowing telcos to charge providers in the first place. David Clark asked a question about market power and historical precedent for regulation with respect to spectrum scarcity; I think this was one of the more interesting questions, but I lacked the policy background to fully understand it and the answer.

Not specifically related to the talk, one person raised privacy as another concern: how can we trust that Google is really not secretly giving your data to the government (for example)? Davidson replied on several fronts. First, Google strives not to keep personal information unless you explicitly allow it to. If you don’t like cookies, don’t enable them. If you don’t like Google Mail or Google Search History, don’t use them. Second, he argued that the law (e.g., the Fourth Amendment) hasn’t kept pace: your computer in your own house or your paper calendar is protected and requires that a warrant be shown to you, but if you keep your data on an ISP’s server, you don’t necessarily get to see that warrant or challenge it. Finally, he noted that Google in general does not base its business model on constructing profiles of users; things like targeted ads are based on your search keywords (and your IP address proxying for your geographical location, I imagine).

With regards to China, Davidson restated Google’s standard answer (e.g., notification of removal and no personal information on Chinese soil). When asked what he would like the government to do with respect to China, he said that it would be nice if the US laid out ground rules or best practices for US corporations to follow. For example, in any country where Google removes search results to comply with local laws (e.g., hate speech in Germany or regulations in China), it would be nice if they only had to comply if formally served with a notice indicating the law that was being violated and the content to be removed. As one company, they have no power to demand or request this: China would probably be happy to see Google leave. But, if 100 companies followed this policy and it had the backing of the US government, it would carry more weight.

I asked him afterwards whether Google will be increasing its lobbying efforts; a single Bell has more lobbyists working for it than Google, Microsoft, and Yahoo combined. He was very interested in getting more people with a good technology background down in DC. Google will definitely be expanding their efforts there. Let’s hope it helps.

May 12, 2006 - 5 minute read - Research conferences

NSDI 2006, Day 1

This week was the Third Symposium on Networked Systems Design and Implementation, sponsored by Usenix and held this year in San Jose. As with any conference, there were many opportunities for networking and meeting other researchers: I met and caught up with students (mostly) from CMU, Cornell, NYU, UCSB, UCLA, UMass Amherst, and UT Austin, as well as old MIT colleagues who are now out taking over the world.

There were a lot of talks that were interesting to me. I liked talks that presented systems with a neat underlying idea, such as Evan Cooke’s system for expanding the view of honeypots or Kyoung-Soo Park’s way of leveraging OS optimizations for web server performance while cleanly factoring out common functionality (presented by Vivek Pai); there were also impressive systems that resulted from clean design and solid engineering, like Dilip Joseph’s overlay network layer (which he demoed live during his talk!) and Ashwin Bharambe’s multi-player game system. This post summarizes the talks from the first day; I’ll try to post my notes from the next few days soon.

Udi Manber, Google: Keynote

The conference opened with a keynote by Udi Manber, now of Google. He recycled many standard slides about the growth of Google and their goals. He presented a summary of the Google architecture: a dispatch server integrating results from index servers and document servers, and a number of services on the side like spell-checking and AdSense. The main point of his talk was to show why search is still a hard and interesting problem. He gave three main reasons: first, users ask questions using one or two words that are hard to interpret; second, websites don’t always make it easy to find answers; third, there is no real curriculum nor research agenda for “search” in academia.

Google works very hard on the first two. He gave some examples of how hard it is to understand users (e.g., you need domain knowledge to correctly interpret a query for “Cofee Anan” and some luck to handle a request for “Briney Spears” from someone looking for pickles) and how websites don’t always make it easy to index information (e.g., a website that’s just a scanned image of product brochures). Google’s index and spell-checker try to take into account context and domain knowledge to solve the first case, and they’ve produced some technologies like sitemaps for the second. He did say that Google does not yet try OCR on all images. The third point is something that conference attendees could help with. However, he did not offer any concrete suggestions about curricula or research directions in the talk or in response to questions. (The talk was mostly advertising Google and focusing on Google’s features.)

In terms of teaching users how to search, there are books like O’Reilly’s Google Hacks. Off-hand, there are a few areas I would focus on as a graduate student if I wanted to build large search systems: natural language processing, sub-linear algorithms, and distributed systems; Mr. Manber’s point is that there’s probably no school that explicitly groups those areas together for a single degree!


Unfortunately, I missed the first session of the conference because I was doing last minute polishing of my own talk. I opened the second session by presenting our paper on efficient replica maintenance.

Jiandan Zheng from UT Austin gave a talk about PRACTI replication, a framework that can capture a wide range of replication algorithms. PRACTI captures three dimensions of distributing and maintaining up-to-date replicas: partial replication (as opposed to full replication), arbitrary consistency (allowing clients to see stale or out-of-order updates), and topology independence (allowing any structure of node synchronizations). They have an implementation of basic replication algorithms, and one of their goals is to implement the algorithms of existing systems like Bayou or Coda within their framework.

James Mickens gave a talk about predicting node availability based on past observations and whether it was possible to take advantage of any predictability. He analyzed a number of traces and classified the types of node availability: for example, unstable nodes versus always-on nodes. The talk did not discuss the possibility of automatic classification, perhaps using basic machine learning. PlanetLab appeared to be relatively hard to predict.


The last session of the first day presented three tools. Time dilation was introduced as a technique for evaluating the potential impact of future trends using current hardware: virtual machines are told that time is passing more slowly than it really is, so more real work completes in the same amount of virtual time.
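The arithmetic behind the idea is simple; this is a sketch under an assumed time-dilation factor (TDF) of 10, where the guest perceives one second for every ten real seconds, so a fixed real-world resource looks ten times faster from inside the VM:

```python
# Sketch of the time-dilation idea (illustration only, not the paper's code).
# With TDF = 10, 10 real seconds of I/O fit into 1 perceived second,
# so bandwidth appears 10x larger to the guest.

def perceived_bandwidth(real_bps: int, tdf: int) -> int:
    # Resources appear scaled up by the dilation factor.
    return real_bps * tdf

def perceived_duration(real_seconds: float, tdf: int) -> float:
    # Elapsed real time appears scaled down by the dilation factor.
    return real_seconds / tdf

assert perceived_bandwidth(1_000_000_000, 10) == 10_000_000_000  # 1 Gbps looks like 10 Gbps
assert perceived_duration(10.0, 10) == 1.0                       # 10 real seconds feel like 1
```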

Evan Cooke presented the Dark Oracle which is a clever idea for expanding the view of honeynets and network telescopes. These have traditionally relied on measuring traffic to small and contiguous unused address spaces. However, these may be easy for attackers to avoid. Fortunately, based on their analysis of announced eBGP routes and looking internally at some iBGP/OSPF routes, there are many addresses unused within allocated and active address spaces. Dark Oracle allows dynamic discovery of these addresses and redirects incoming traffic to those unused/dark addresses to a honeynet or analysis tool. It can also be used to trap outbound traffic for unused addresses as a way to detect malicious hosts on a local network.

Finally, Patrick Reynolds from Duke closed the day by presenting Pip, a system for detecting anomalies in distributed systems. It works by allowing the programmer to annotate the system with events (or extending the middleware to annotate automatically, as I’d probably want to do for dm’s libasync); events are then collected from nodes in the system and analyzed centrally. The programmer can specify expectations for system behavior in a domain-specific language; Pip identifies when these expectations are violated and can present a trace of actual behavior. It also includes a graphical interface for exploring traces. Pip has been successfully used to find bugs in the implementations of a number of published systems. I’m excited about the idea of this tool, but it looks like it would be a bit of work to get it usable with my current development toolchain.

May 2, 2006 - 3 minute read - Personal academia television

Grad School Television

What’s on prime time television these days? Looking at the dramas, offhand there are crime dramas (Law and Order, CSI), medical dramas (House, Grey’s Anatomy), mystery thrillers (24, Alias), family stories (Gilmore Girls, 7th Heaven), and people having sex (The OC, One Tree Hill). Maybe there’s something missing.

How about a drama about the lives of graduate students and professors? After all, Jorge Cham has shown there’s plenty of interesting material there with a hit comic strip about the highs and lows of academic life. But, while grad school has an obvious parallel to the world of hot, up-and-coming doctors, the life of a typical grad student lacks the same kind of fast-paced and will-the-patient-die scenarios that these fictional doctors see. What then might make a good format?

Maybe try to identify a high excitement event: a major paper deadline, structured into a 24-style show. Errors in data or experimental method discovered hours or days before the deadline, frantic re-runs of experiments, re-plotting of graphs, adviser re-writes of major sections of text, and running pdflatex up until the last minute. Unfortunately, there’s not a really satisfying conclusion there. You don’t find out what happens for a few months, and by then it’s just Season 2: do the same thing all over to address all the reviewer comments and submit the camera-ready.

What about a more crime-drama approach? In most crime dramas, each episode compresses the weeks or months of real time needed to catch the bad guy into a neat 42-minute story. Shows like CSI and Bones have shown that the science-driven approach is just as appealing as the traditional detective-oriented approach (like Columbo). Each episode could focus on a particular scientific result and the ups and downs in validating it.

A more Veronica Mars-style approach might also work well. Each season of our hypothetical academia drama could focus on the study of an important research question. As the series opens, a hypothetical lab at a hypothetical top school has just won a major grant to investigate this question; we follow the progress of a few new students in the group and their interactions with older students and faculty. As the seasons progress, individual experiments are completed, hypotheses are created and tested, papers are submitted, rejected, re-submitted, and published, theses are signed, tenure is granted, and the series can end with the heroes and heroines going off to top academic and industry jobs with their fresh new PhDs.

From X-Files to Alias, long-running mysteries can work if done right. That Numb3rs was renewed for a second season shows that complex technical material can be explained in a way that’s accessible enough for the public to understand and find cool.

Character development is also important in any show. Fortunately, academia is full of interesting, quirky people. They do interesting things on the side: they bike, do yoga, play ultimate, dance, travel, take photographs, do pottery, and even automatically generate research papers. PhdComics demonstrates ample opportunity for developing such characters. Developing the characters’ personal lives would help give a show depth and realism (and of course, allow for the opportunity for sex on TV).

This is looking more promising. There are numerous spin-off and cross-over potentials, each with their own unique flavor: imagine the “Grad School: CS” spinning off “Grad School: Biology” or “Grad School: Astrophysics”. Combine the excitement of each discipline with a mystery driven plot and interesting characters and we could get a fun, high-energy show about the lives of professors and grad students. On a serious note, such a show might demystify grad school and academia and make it more accessible, just as CSI has done for forensics.

What do you think?

Update: The May 2nd re-print of PhD Comics in The Tech reveals that this idea has already been explored.

Apr 22, 2006 - 4 minute read - Research cryptography e-mail encryption privacy security

Proxy cryptography

Susan Hohenberger defended her thesis Friday at MIT. Susan’s thesis work is on developing secure algorithms for proxy cryptography. These are new cryptographic constructions that are designed to allow a third party, the proxy, to take a cryptographic object produced for (or by) a particular key and transform it so that it is a valid object for (or by) a different key. Susan presented new definitions for security of proxy re-encryption and re-signatures and algorithms that meet these definitions.

Why are these constructions interesting? Susan gave an example where proxy re-encryption might help with secure distributed storage systems; proxy re-signatures could be used to provide proof of flow in, say, an immigration and customs scenario. I’ll present another example here where proxy re-encryption could help improve encrypted mailing lists.

Encrypted e-mail presents mailing lists with one basic problem: regular encrypted e-mail requires that the sender (usually called Alice) know the public key of the recipient (Bob). A mailing list would require that Alice know the public keys of all its members: a membership that might fluctuate over time and may even contain people Alice has never heard of. For example, suppose Alice has a sensitive bug that she wishes to report to the security team at CERT. CERT today has a single key for receiving incoming encrypted mail, and presumably multiple people have online access to the secret key in order to read encrypted messages. This structure lets Alice worry about only a single key for reaching CERT, while CERT as an organization has the flexibility (and burden) of managing which of its employees have access to that key.

Of course, each of CERT’s employees presumably has their own personal encryption key. An alternative to allowing every employee access to the master secret key would be to have a single person decrypt an incoming message, re-encrypt it with the keys of the people on call, and then send these newly encrypted messages to the final recipients. This would limit exposure of the master key to a single person.

In 2002, I supervised David Chau in a UROP project to develop a program that acts as this trusted person. While David built a functioning prototype, I never took it to completion or published it. In the interim, at least two alternative (but functionally near-identical) solutions have been developed: one for ezmlm and one for simple /etc/aliases-style lists. However, these software solutions require that a computer have programmatic access to the decryption key. They also expose the unencrypted message to the computer performing the transformation. For sensitive data, this may be an unacceptable risk. Proxy re-encryption can remove this risk.

In the world of proxy re-encryption, Alice (with her secret key ka) can create a special proxy key kab that the proxy can use to transform a message encrypted for ka to a message encrypted for kb. The proxy does not ever get to see the unencrypted message, and Alice’s real key does not need to be available during any transformation. In the CERT example, a trusted person would create proxy keys to allow messages encrypted for the master CERT key to be proxy re-encrypted to each person actually authorized to view those sensitive messages. These proxy keys do not compromise the security of any of the end recipients, and compromise of the mail server and loss of the proxy keys would not result in a compromise of the master CERT key.
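To make the mechanics concrete, here is a toy Python sketch of the classic ElGamal-based bidirectional scheme of Blaze, Bleumer, and Strauss. Note the hedges: the group parameters are deliberately tiny and insecure, the function names are invented for illustration, and this is not Susan’s construction; her key contribution is precisely that re-encryption can be made unidirectional, whereas here the re-encryption key is derived from both secret keys and works in both directions.

```python
import random

# Toy BBS98-style bidirectional proxy re-encryption (illustration only).
# p is a tiny safe prime (p = 2q + 1); g generates the order-q subgroup.
p, q, g = 467, 233, 4

def keygen():
    sk = random.randrange(1, q)
    return sk, pow(g, sk, p)                  # (secret key, public key g^sk)

def encrypt(pk, m):
    r = random.randrange(1, q)
    return (m * pow(g, r, p) % p, pow(pk, r, p))   # (m * g^r, pk^r)

def rekey(sk_a, sk_b):
    # Re-encryption key rk = b / a mod q.  Computing it needs both secret
    # keys (or a trusted dealer) -- the bidirectional scheme's weakness.
    return sk_b * pow(sk_a, -1, q) % q

def reencrypt(rk, ct):
    c1, c2 = ct
    return (c1, pow(c2, rk, p))               # g^(a*r) -> g^(b*r); plaintext untouched

def decrypt(sk, ct):
    c1, c2 = ct
    gr = pow(c2, pow(sk, -1, q), p)           # recover g^r
    return c1 * pow(gr, -1, p) % p

a, pk_a = keygen()
b, pk_b = keygen()                            # pk_b unused; Bob only needs his secret
m = pow(g, 42, p)                             # message must lie in the subgroup
ct = encrypt(pk_a, m)                         # encrypted for Alice (the list key)
ct_b = reencrypt(rekey(a, b), ct)             # proxy transforms it for Bob
assert decrypt(b, ct_b) == m                  # Bob decrypts with his own key
```

The proxy only ever sees `ct` and the re-encryption exponent; it never recovers `m` or either secret key, which is exactly the property the encrypted-mailing-list scenario needs.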

This idea has been implemented by Himanshu Khurana at UIUC: his Secure E-mail List Server does not make use of the algorithms Susan helped develop but uses a separate proxy re-encryption scheme with slightly different features and requirements. For example, it requires that the list server generate a user/list-specific decryption key. Susan’s thesis shows how proxy re-encryption can be achieved efficiently using users’ regular keys and improves on previous results by demonstrating a construction that allows only unidirectional re-encryption.

This work is new and is still maturing. For example, it is not yet available for day-to-day use in programs like GnuPG. Susan’s defense also highlighted additional important research questions in this field that should be investigated to improve the understanding of and confidence in these algorithms. However, it sounds like there are already parties adopting the algorithms into domain-specific applications and significant interest in exploring the theory underlying her work. I suspect that in the next few years, we will be seeing some more significant applications of these ideas. Congratulations to Susan!

Apr 19, 2006 - 2 minute read - Personal abstraction education math programming

Exploring math curricula

There are some articles making the rounds today on reddit about math education.

Seattle allows great diversity in its math curricula. This is not without risk:

> In Seattle, schools have a lot of autonomy in how they teach math. The district has adopted textbooks and provides guidelines and timelines for teachers to follow, but doesn’t require them to do so. […] As a result, math education across the district is a patchwork of reform and traditional math, varying sometimes even grade to grade within the same school. […] That, coupled with the district’s school-choice system, means there’s no guarantee that students will be able to pick up where they left off in math if they transfer to a new school.

Consistency in methodology per student seems important; the problem here sounds like one of too many cooks. The trick is to allow flexibility in conjunction with standards without the problem of teaching to the test, an issue often raised in Massachusetts.

I still think teaching the ability to abstract is what’s ultimately needed. Mark Shuttleworth has the same idea; he blogged yesterday about his project to help develop curricula that enable students to learn analytical skills through programming.

That sounds like the goal of the Seattle system:

> The goal is to give students a better grasp of the underlying mathematical concepts, make lessons more relevant and help build a better foundation for understanding more advanced math. […] Reform math also emphasizes estimating and being able to analyze whether the answer derived is correct and reasonable.

This part at least sounds pretty reasonable, though I wouldn’t have chosen to call the whole thing “reform math.” However, the article also mentions things like having students “reason it out themselves” and encouraging the use of calculators because that’s what adults do. These sound less reasonable and remind me of the parable of Myron Aub, by Isaac Asimov.

I’m very curious what will come of the “where’s the math” conference, where educators and parents will have a chance to discuss this.

Apr 17, 2006 - 2 minute read - Hacking dokuwiki tools

Saving bandwidth with DokuWiki

I recently installed DokuWiki on NearlyFreeSpeech; while I love DokuWiki’s features, I quickly noticed that I was being charged for more bandwidth than seemed necessary for the few pages I was viewing and editing.

A quick check of access logs revealed two things. First, DokuWiki does not compress its output using gzip. Second, it does not send appropriate cache control headers to allow essentially static data (e.g. style sheets) to be cached.

Google reveals that it’s easy to actually compress output from PHP. For example, Jan-Piet Mens added one line to doku.php to turn on gzip output compression. I borrowed a snippet from WordPress’s gzip_compression function and added it to inc/init.php (after the init session code):

// Hack: enable gzip output compression -ES
if ( extension_loaded('zlib') ) {
    ob_start('ob_gzhandler');
}

This has the benefit of affecting any file that generates output, including CSS and JS files. (DokuWiki recently introduced its own bizarre CSS/JS compression scheme that breaks Monobook for DokuWiki; gzip compression seems simpler and less error prone.)

I also observed that my browser was repeatedly requesting lib/exe/css.php and lib/exe/js.php; it turns out that others have raised this issue in just the past few weeks. On 10 April 2006, a set of patches was committed that properly generates ETag and Last-Modified headers and allows the resulting output to be cached for up to one hour without rechecking. I manually applied these patches (with this helper patch); where I used to transfer 11k of CSS and 70k of JS for each page view, I now send about 2k of CSS and 17k of JS once an hour. My pages load quicker too!
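The validator logic those patches add can be sketched as a small decision function; this is a hypothetical helper in Python (not DokuWiki’s actual PHP code) showing when a server may answer 304 instead of resending the body:

```python
from email.utils import formatdate, parsedate_to_datetime

def conditional_status(req_headers, etag, last_modified_ts):
    """Return 304 if the client's cached copy is still valid, else 200.

    A minimal sketch of ETag / Last-Modified revalidation; the function
    name and argument shapes are invented for illustration."""
    # Strong validator: the client's If-None-Match must equal our ETag.
    if req_headers.get('If-None-Match') == etag:
        return 304
    # Weak validator: the client's copy is at least as new as the resource.
    ims = req_headers.get('If-Modified-Since')
    if ims and parsedate_to_datetime(ims).timestamp() >= last_modified_ts:
        return 304
    return 200

mtime = 1144627200                                   # resource last changed here
fresh = {'If-Modified-Since': formatdate(mtime, usegmt=True)}
assert conditional_status(fresh, '"v1"', mtime) == 304
assert conditional_status({'If-None-Match': '"v1"'}, '"v1"', mtime) == 304
assert conditional_status({'If-None-Match': '"v0"'}, '"v1"', mtime) == 200
```

A 304 response carries no body, which is where the bandwidth savings come from: the client re-uses its cached CSS/JS instead of transferring it again.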

Apr 12, 2006 - 3 minute read - Research cryptography security tools

Automatically verifying security properties

Today a few of us had lunch with Yoshi Kohno who is visiting MIT and gave a talk about his research on Monday. An important aspect of Yoshi’s research is the problem of translating theoretical security results into secure implementations. He gave an example of how the way that WinZip employed the theoretically secure encrypt-then-MAC paradigm of authenticated encryption resulted in a system that was actually insecure. At lunch, we discussed how this problem might have been avoided: should WinZip’s consultant have looked beyond the encryption component to see how it was used? Should it have been the product manager’s responsibility?

Wouldn’t it be powerful if we could statically check that a piece of software met a particular system-wide security goal? You’d run a “compiler” on your software, along with some sort of annotation describing the security requirements; the tool would combine static analysis with knowledge of relevant theoretical results to validate the security of the program.

This sounds a lot like Coverity, a Stanford spin-off that provides tools for static checking of software for a variety of bugs. Coverity and its products have garnered a fair amount of press coverage as they have been auditing open source projects for various security and concurrency problems. However, while Coverity’s approach is able to find a large class of problems, they focus more on low-level problems like buffer overflows or SQL injection. Their current tools would not catch problems like the one Yoshi found in WinZip, which required an understanding of the whole system.

To build this tool, more work would need to go into understanding the difficulty of precisely specifying the desired security properties and how real implementations can be parsed into a format amenable to automated analysis. Are all security properties easy to verify? We know that it is possible to check confidentiality and to detect or prevent information leaks. For example, there is ongoing work at MIT, building on prior information-flow work:

  • Stephen McCamant gave a talk at our group meeting today about dynamically tracking information leakage; his prototype is based on the memory checker in Valgrind and is able to estimate the number of bits of information leaked, given appropriate annotations.
  • A group of my advisor’s current and former students are working on Asbestos, an operating system that enforces information flow rules.
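To give a rough flavor of label-based information flow as enforced by systems in this family, here is a toy Python sketch (the class and function names are invented, not Asbestos’s actual API): derived values accumulate the labels of their inputs, and a channel only accepts data it is cleared for.

```python
# Toy label-based information-flow tracking (illustration only).

class Labeled:
    """A value carrying a set of taint labels."""
    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def combine(self, other, op):
        # Any value derived from tainted inputs inherits all their labels.
        return Labeled(op(self.value, other.value), self.labels | other.labels)

def send(channel_labels, data):
    # A channel may only receive data whose labels it is cleared for.
    if not data.labels <= frozenset(channel_labels):
        raise PermissionError("information-flow violation")
    return data.value

secret = Labeled("ssn=123-45-6789", {"secret"})
public = Labeled("hello", frozenset())
mixed = public.combine(secret, lambda a, b: a + " " + b)

# A channel cleared for "secret" may see the mixed value...
assert send({"secret"}, mixed) == "hello ssn=123-45-6789"
# ...but an unlabeled (public) channel may not: the leak is blocked.
try:
    send(frozenset(), mixed)
    blocked = False
except PermissionError:
    blocked = True
assert blocked
```

The dynamic tracking in Stephen’s Valgrind-based prototype and the OS-level enforcement in Asbestos are far more sophisticated, but both rest on this same propagate-then-check discipline.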

But what about properties like replay resistance? Or perfect forward secrecy? Or resistance to data injection? How can we test for these? Daniel Jackson’s group has developed a tool called Alloy that can help model protocols at an abstract level and find problems; he and his students have also done work on applying lightweight formal methods to find correctness and security problems in real implementations.

Coverity has shown that static analysis can be successfully realized to detect security defects in real systems in an automated fashion. Work like Yoshi’s and the projects I’ve highlighted will hopefully lead to further tools that reduce the reliance on human oversight to provide correctness and security. I wouldn’t mind working towards that goal one day.

Apr 3, 2006 - 2 minute read - Research tools workflow

Tools for repeatable research

Tim Daly, one of the developers of Axiom, has a vision for solving the following problem:

> Computational science seems to be in a state where people work independently. They develop whole systems from scratch just to support their research work. Many make an effort to distribute their system but fail for lack of interest. Worse yet, the research that gets published often cannot be used by others or verified for correctness because the supporting code is based on a specialized system and gets lost. There is currently no community expectation that software should accompany research results. The end result is a loss of significant scientific wealth.

He is focusing on mathematical research, where the tools to do the research and present it could ideally be packaged up onto a single CD and distributed with conference proceedings. Their prototype implementation of this is called Doyen.

In systems research and network measurement, this might be a bit more difficult. For example, data sets are too large to fit on CDs and probably have privacy implications. Real world experiments (e.g. on PlanetLab) are difficult to reproduce. However, it would be nice to repeat research in systems, as students of Jeanna Matthews have done with “Xen and the Art of Repeated Research”.

These thoughts line up nicely with the need to script the data analysis workflow. My advisor has also been a proponent of releasing working systems, which enables repeated research. For example, you can track the development of my work in DHash by viewing our CVS. But that is not so easy to build and deploy. Unfortunately, there’s rarely the time to make the data and analysis tools presentable enough to use when paper deadlines are near, and DHash’s code is no exception. I’m hoping that better discipline and experience will lead to the refinement of my own tools to the point where they will eventually be useful to others.

Apr 2, 2006 - 3 minute read - Hacking analytics usability

Design priorities for Performancing

I came across Performancing while browsing the various winners of the Web 2.0 awards: they are ranked second in Web Development and Design. Their description sounded interesting for a new blogger like myself so I paid them a visit. I was surprised at how hard it was to find out about Performancing’s goals and products. Inspired perhaps by Jakob Nielsen’s recent Alertbox about basic design priorities, I thought I would do a little usability analysis of Performancing’s site.

On my 1024x768 display, the Performancing homepage looks like this:

[Screenshot: Performancing front page]

A few things stand out:

  • They offer a free service called Performancing Metrics that provides real time statistics aimed at blogs.
  • You can login.
  • There appears to be a Slashdot like news section.
  • There is a link to a plugin for Firefox.
  • There are no visible links to more information like About, FAQ, or contact; those require scrolling to find.

From what’s there, I infer that the primary goal of this company is to help bloggers succeed by providing them with metrics. Yet, it is surprisingly difficult to get more information about their Metrics product. The only thing you can really get to from the front page is their sign up page, which wastes over half of their screen real-estate on showing you the ad that you just clicked on.

[Screenshot: Performancing sign-up page]

Wouldn’t it help to link directly to the Metrics start page? Unfortunately, the most expedient way to that page is to scroll down to the 6th sidebar box for Search, enter “metrics” and then follow the first hit from the results page.

Even after I found this page, I don’t really have a feel for exactly what their system looks like. The Guided Tour requires a separate click to go to each of five short pages with screenshot fragments. There’s no live demo. It’s unlikely they thought about the design of their blank slate.

Contrast this to the third-place entry, Mint. Mint is a similar-looking service for web site statistics analysis. However, its front page presents, on a single page, the same level of detail that takes a search and multiple clicks to find from the Performancing home page. Mint even offers a link on the left sidebar to jump to a demo of Mint. (Well, it’s not a direct link like HipCal’s, but better than nothing.)

It sounds like Metrics is pretty successful (more than 10,000 users) despite the difficulty of finding information. I wonder how well they’d be doing if they improved their website or even just clarified their tagline. For now, it’s not so much their website that keeps me away, but my lack of traffic to analyze; Awstats more than satisfies my curiosity about my visitors. Perhaps one day, I’ll need something more advanced. That might be nice.

Mar 26, 2006 - 3 minute read - Hacking bug hosting wordpress

WordPress ETag bug

My hosting provider charges by the byte and so that motivates me to try and keep track of my bandwidth usage. Right now, most of my traffic comes from search engines (like MSNbot) and RSS aggregators (like Bloglines). The former could be managed probably by improving my URL structure and judicious instructions in my robots.txt; the latter ultimately requires a more intelligent dissemination mechanism, perhaps the way Usenet does things or with something like FeedTree. However, in the interim, we rely on the If-Modified-Since and If-None-Match HTTP headers to ensure that polling at least only transfers data when something has changed.

In perusing my access logs, I realized that Bloglines was always retrieving the full contents of my RSS feed, even when it hadn’t changed. Quick manual testing revealed that if only If-Modified-Since was specified, the data was correctly suppressed. However, Bloglines (rightly) uses both headers to detect changes. The problem appears to be one of quoting.

Quoting is used to keep potentially dangerous characters from being interpreted: for example, the right-hand side of the If-None-Match header is a string called the entity tag and is provided by the HTTP client (such as the Bloglines poller). There is a risk that this string could somehow be fed into a database or shell command. If the string contains characters that have special meaning to the database or shell, an attacker could use them to gain access to the system. Thus, WordPress takes care to escape dangerous characters, such as quotation marks, in the string to prevent this from happening.

Unfortunately, a change made in 2005 that handles quoting appears to interact poorly with send_headers, the code that checks whether the feed has changed relative to what the HTTP client (Bloglines) last knew about. In particular, entity tags are quoted strings in the sense that they are strings of characters enclosed in quotation marks. PHP already quotes these strings (in the sense of escaping dangerous characters), which is why send_headers took care to call stripslashes. However, the quoting introduced in change 2699 causes the already-quoted string to be quoted again, so the match fails and the 304 (Not Modified) response code is never sent.
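The bug is easy to reproduce in miniature. This sketch uses Python stand-ins for PHP’s addslashes/stripslashes (invented helpers, not WordPress’s actual code) to show how escaping an already-escaped entity tag defeats a single stripslashes:

```python
import re

# Toy reproduction of the double-escaping bug (illustration only).

def addslashes(s):
    # Escape backslashes and quotation marks, as PHP's addslashes does.
    return s.replace('\\', '\\\\').replace('"', '\\"')

def stripslashes(s):
    # Undo one layer of escaping.
    return re.sub(r'\\(.)', r'\1', s)

client_etag = '"abc123"'                  # value sent in If-None-Match
escaped_once = addslashes(client_etag)    # PHP's input escaping
escaped_twice = addslashes(escaped_once)  # the extra escaping from change 2699

# One stripslashes (the old send_headers behavior) no longer recovers the tag,
# so the comparison against the feed's ETag fails and no 304 is sent:
assert stripslashes(escaped_twice) != client_etag
# Stripping twice recovers the original tag:
assert stripslashes(stripslashes(escaped_twice)) == client_etag
```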

The hackish fix is to simply call stripslashes twice, which is what I’ve done for now. The more permanent fix probably involves something about how WordPress deals with quoting. I wanted to submit a ticket to the WordPress trac server but their login database hasn’t yet been updated with my new account. I’ll update this post with a link to the ticket when it gets created.

Update: Someone else noticed and was able to file a bug. Their comments led me to realize it was a more recent change, part of the 2.0.2 upgrade, that probably caused the problem. Where are the regression tests?