My hosting provider charges by the byte, which motivates me to
keep track of my bandwidth usage. Right now, most of my
traffic comes from search engines (like MSNbot) and RSS
aggregators (like Bloglines). The former could probably be managed
by improving my URL structure and adding judicious instructions
to my robots.txt; the latter ultimately requires a more intelligent
dissemination mechanism, perhaps the way Usenet does things or
with something like FeedTree. In the interim, we rely on
the If-Modified-Since and If-None-Match HTTP headers
to ensure that polling transfers data only when something has
changed.
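To make the mechanism concrete, here is a minimal Python sketch of the decision a server makes on each poll. It is a simplification and not WordPress's actual code: the function name `should_send_304` and its parameters are made up for illustration, and it compares both headers as exact strings, whereas a real server would normally parse the date.

```python
def should_send_304(request_headers, current_etag, last_modified):
    """Return True when the client's cached copy still looks current.

    Hypothetical helper: exact string matching only, no date parsing.
    """
    client_etag = request_headers.get("If-None-Match")
    client_modified = request_headers.get("If-Modified-Since")

    # A plain GET with neither header always gets the full feed.
    if client_etag is None and client_modified is None:
        return False
    # Every validator the client sent has to match what the server has now.
    if client_etag is not None and client_etag != current_etag:
        return False
    if client_modified is not None and client_modified != last_modified:
        return False
    return True


feed_etag = '"a1b2c3"'
feed_modified = "Sat, 01 Apr 2006 12:00:00 GMT"

# The client echoes back both validators unchanged, so nothing is transferred.
poll = {"If-None-Match": feed_etag, "If-Modified-Since": feed_modified}
print(should_send_304(poll, feed_etag, feed_modified))  # True
```

If either validator the client sends fails to match, the full feed goes out again, which is why a mangled entity tag is enough to defeat the whole scheme.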
In perusing my access logs, I realized that Bloglines was always retrieving the full contents of my RSS feed, even when it hadn’t changed. Quick manual testing revealed that if only If-Modified-Since was specified, the data was correctly suppressed. However, Bloglines (rightly) uses both headers to detect changes. The problem appears to be one of quoting.
Quoting, in this sense, means escaping characters that could be dangerous if interpreted. For example, the right-hand side of the If-None-Match header is a string called the entity tag, and it is provided by the HTTP client (such as the Bloglines poller). There is a risk that this string could somehow be fed into a database or shell command. If it contains characters that have special meaning to the database or shell, an attacker could use that to gain access to the system. Thus, WordPress takes care to escape dangerous characters, such as quotation marks, in the string to prevent this from happening.
Unfortunately, a change made in 2005 that handles
quoting appears to interact poorly with send_headers, the
code that checks whether the feed has changed relative to what
the HTTP client (Bloglines) last knew about. In particular,
entity tags are quoted strings, in the sense that each
is a string of characters that appears in quotation marks.
PHP already quotes these strings (in the sense of
escaping dangerous characters), which is why send_headers took
care to call stripslashes. However, the quoting introduced in change
2699 causes the quoted string to be quoted again, so the
match fails and the 304 response code (no change)
is not sent.
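The double quoting is easy to reproduce outside WordPress. The sketch below uses Python stand-ins for PHP's addslashes and stripslashes (approximations that ignore NUL bytes) and a made-up entity tag; it illustrates the effect only and does not reproduce the WordPress code path itself.

```python
def addslashes(value):
    # Rough stand-in for PHP's addslashes: put a backslash in front of
    # every backslash, single quote, and double quote.
    out = []
    for ch in value:
        if ch in ("\\", "'", '"'):
            out.append("\\")
        out.append(ch)
    return "".join(out)


def stripslashes(value):
    # Rough stand-in for PHP's stripslashes: remove one level of escaping.
    out = []
    i = 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value):
            out.append(value[i + 1])  # keep the escaped character
            i += 2
        elif value[i] == "\\":
            i += 1  # drop a lone trailing backslash
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


server_etag = '"a1b2c3"'          # what the server computed for the feed
once = addslashes(server_etag)     # escaped one time
twice = addslashes(once)           # escaped again by a second layer

print(stripslashes(once) == server_etag)   # True: one strip undoes one escape
print(stripslashes(twice) == server_etag)  # False: a layer of slashes remains
```

With one layer of escaping, a single strip recovers the tag and the comparison succeeds. With two layers, the same strip leaves backslashes in front of the quotation marks, so the comparison against the server's tag fails.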
The hackish fix is to simply call stripslashes twice, which is
what I’ve done for now. The more permanent fix probably involves
changing how WordPress deals with quoting.
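Here is a small Python sketch of why stripping twice works and why I call it hackish. `strip_once` is a regex approximation of PHP's stripslashes, `strip_twice` is a hypothetical helper, and the tag values are invented for illustration.

```python
import re


def strip_once(value):
    # Approximation of PHP's stripslashes: remove one level of backslash
    # escaping by replacing each backslash-plus-character pair with the character.
    return re.sub(r"\\(.)", r"\1", value, flags=re.S)


def strip_twice(value):
    # The workaround: undo two layers of escaping.
    return strip_once(strip_once(value))


server_etag = '"a1b2c3"'
doubly_escaped = r'\\\"a1b2c3\\\"'   # the tag after two rounds of escaping
singly_escaped = r'\"a1b2c3\"'       # the tag after one round

print(strip_twice(doubly_escaped) == server_etag)  # True
print(strip_twice(singly_escaped) == server_etag)  # True

# The catch: a tag that legitimately contains a backslash loses it
# when it was only escaped once but gets stripped twice.
print(strip_twice(r'\"a\\b\"'))  # "ab"
```

For ordinary tags the extra strip is harmless, which is why it serves as a stopgap. It is still guessing at the number of escaping layers rather than fixing the place where the extra layer gets added.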
I wanted to submit a ticket to the WordPress
trac server but their login
database hasn’t yet been updated with my new account. I’ll update this post
with a link to the ticket when it gets created.
Update: Someone else noticed and was able to file a bug. Their comments led me to realize it was a more recent change, part of the 2.0.2 upgrade, that probably caused the problem. Where are the regression tests?