Thoughts on Systems

Emil Sit

WordPress ETag Bug

My hosting provider charges by the byte, which motivates me to keep track of my bandwidth usage. Right now, most of my traffic comes from search engines (like MSNbot) and RSS aggregators (like Bloglines). The former could probably be managed by improving my URL structure and adding judicious instructions to my robots.txt; the latter ultimately requires a more intelligent dissemination mechanism, perhaps the way Usenet does things or something like FeedTree. In the interim, we rely on the If-Modified-Since and If-None-Match HTTP request headers to ensure that polling transfers data only when something has changed.

In perusing my access logs, I realized that Bloglines was always retrieving the full contents of my RSS feed, even when it hadn’t changed. Quick manual testing revealed that if only If-Modified-Since was specified, the feed body was correctly suppressed. However, Bloglines (rightly) sends both headers to detect changes, and with both present the full feed came back every time. The problem appears to be one of quoting.
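To make the behavior concrete, here is a minimal Python sketch of the kind of check a server performs for a conditional GET. It is not WordPress's code: the function name should_send_304 and the plain-dictionary header lookup are assumptions for illustration, and the date is compared as an exact string rather than parsed. The rule it follows is that every validator the client supplies must match before the body can be skipped.

```python
def should_send_304(request_headers, last_modified, etag):
    """Return True if the client's cached copy is still current.

    Hypothetical helper: request_headers is a plain dict of header
    names to values; last_modified and etag are the server's current
    validators for the feed.
    """
    client_etag = request_headers.get("If-None-Match")
    client_modified = request_headers.get("If-Modified-Since")

    # With no validators at all this is an ordinary GET: send the body.
    if client_etag is None and client_modified is None:
        return False

    # Each validator the client did send has to match.
    if client_etag is not None and client_etag != etag:
        return False
    if client_modified is not None and client_modified != last_modified:
        return False
    return True


etag = '"abc123"'
last_modified = "Sat, 01 Apr 2006 12:00:00 GMT"

# Only If-Modified-Since: the body is suppressed.
print(should_send_304({"If-Modified-Since": last_modified}, last_modified, etag))

# Both headers, but the tag the server compares carries stray backslashes:
# the comparison fails and the full feed is sent.
mangled = '\\"abc123\\"'
print(should_send_304(
    {"If-Modified-Since": last_modified, "If-None-Match": mangled},
    last_modified, etag))
```

Under this rule a client that sends only If-Modified-Since is unaffected by anything that goes wrong with the entity tag, which is consistent with what the manual test showed.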

Quoting is used to keep potentially dangerous characters from being interpreted. For example, the right-hand side of the If-None-Match header is a string called the entity tag, and it is provided by the HTTP client (such as the Bloglines poller). There is a risk that this string could somehow be fed into a database query or shell command. If it contains characters that have special meaning to the database or shell, an attacker could use that to gain access to the system. Thus, WordPress takes care to escape dangerous characters, such as quotation marks, in the string to prevent this from happening.

Unfortunately, a change made in 2005 that handles quoting appears to interact poorly with send_headers, the code that checks whether the feed has changed relative to what the HTTP client (Bloglines) last knew about. In particular, entity tags are quoted strings in the literal sense: a string of characters enclosed in quotation marks. PHP already quotes these strings (in the sense of escaping dangerous characters), which is why send_headers took care to call stripslashes. However, the quoting introduced in change 2699 causes the already-escaped string to be escaped a second time, so a single stripslashes no longer recovers the original tag, the match fails, and the 304 (Not Modified) response is not sent.
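The double-escaping failure is easy to reproduce outside of WordPress. The sketch below uses rough Python stand-ins for PHP's addslashes and stripslashes (they ignore NUL bytes and other edge cases, so treat them as approximations), and the entity tag value is made up for illustration.

```python
import re


def addslashes(s):
    """Rough stand-in for PHP's addslashes: escape quotes and backslashes."""
    return re.sub(r"(['\"\\])", r"\\\1", s)


def stripslashes(s):
    """Rough stand-in for PHP's stripslashes: undo one level of escaping."""
    return re.sub(r"\\(.)", r"\1", s, flags=re.DOTALL)


server_etag = '"abc123"'      # what the server compares against (made-up value)
from_client = '"abc123"'      # what the client sent in If-None-Match

escaped_once = addslashes(from_client)     # first layer of escaping
escaped_twice = addslashes(escaped_once)   # second layer added on top

# A single stripslashes undoes the first layer only.
print(stripslashes(escaped_once) == server_etag)

# After double escaping, one pass leaves backslashes behind and the match fails.
after_one_pass = stripslashes(escaped_twice)
print(after_one_pass == server_etag)

# A second pass gets back to the original tag.
print(stripslashes(after_one_pass) == server_etag)
```

The point is only that escaping and unescaping have to be balanced: each extra layer of addslashes needs its own stripslashes before the comparison can succeed.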

The hackish fix is simply to call stripslashes twice, which is what I’ve done for now. The more permanent fix probably involves changing how WordPress deals with quoting. I wanted to submit a ticket to the WordPress Trac server, but their login database hasn’t yet been updated with my new account. I’ll update this post with a link to the ticket when it gets created.

Update: Someone else noticed and was able to file a bug. Their comments led me to realize it was a more recent change, part of the 2.0.2 upgrade, that probably caused the problem. Where are the regression tests?
