RSS and Conditional GETs

Trying to clear up some of the info on Drupal, RSS and Conditional GETs.

Gary has been having some problems with his Drupal RSS feeds over on Teledyn. He has blogged about it several times:

Now I get mentioned a bit in the comments, and there are a few things I feel I should clarify for Gary and others. Since I wrote the If-Modified-Since and If-None-Match logic in Drupal, this is a subject I hopefully know a little about.

The current implementation in Drupal simply matches the If-Modified-Since header against the timestamp saved in the database. Note that this applies not just to RSS feeds but to all content that Drupal serves. Most (all?) web browsers in common use store the value of the Last-Modified header and send the same value back in If-Modified-Since; apparently most RSS aggregators don't. Using other timestamps is valid according to RFC 2616, but it is problematic, so even the RFC recommends using the timestamp specified by the server rather than one generated on the client side. The quick reasoning for this is:

  1. Timezones and daylight savings don't need to be factored in.
  2. Clock desync isn't an issue. Is your PC synced to an accurate time server? Is the server?
  3. Less chance of malformed timestamps. The server should be well equipped to parse what it sent out itself.

RFC 2616 gives more fleshed-out reasons in section 14.25.

So to avoid the issues above, it was simply convenient not to convert the If-Modified-Since header back to a timestamp for a numeric comparison, and to do a string comparison instead. No browser I have tested has any problems with this. I don't use any aggregators myself, however, and they don't seem to have the same level of robust support for old standards. I have since modified the logic in Drupal for Gary so that it converts the header to a numeric timestamp and does the proper checks. This will probably end up in Drupal itself; it just needs more error checking and tuning. Since RSS traffic is likely to rise, it will probably be necessary.
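
To make the difference between the two checks concrete, here is a rough sketch in Python rather than Drupal's PHP. It is not the actual Drupal code; the function names and the idea of passing the stored timestamp as Unix seconds are assumptions made for illustration.

```python
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime


def not_modified_strict(if_modified_since, last_modified_ts):
    """String comparison: only an exact echo of our own Last-Modified matches."""
    return if_modified_since == formatdate(last_modified_ts, usegmt=True)


def not_modified_lenient(if_modified_since, last_modified_ts):
    """Numeric comparison: any parseable date at or after our timestamp matches."""
    if not if_modified_since:
        return False
    try:
        parsed = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Malformed header: play it safe and send the full response.
        return False
    if parsed.tzinfo is None:
        # No usable zone in the header; assume GMT, as HTTP dates should be.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp() >= int(last_modified_ts)
```

The strict version never has to parse anything, which is why it sidesteps the three issues listed above. The lenient version accepts client-generated dates, at the price of the extra error handling.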

I would still recommend that aggregators store the Last-Modified or ETag header values and send those back on the next GET instead of creating their own, as most server implementations I have seen in PHP and Python seem to do the string comparison check. It isn't as if storing 30 extra bytes per feed will eat up all the disk space. Odds are that not doing so will cause a lot more bandwidth to be sucked up from RSS providers.
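
On the aggregator side, the bookkeeping could look roughly like the sketch below. The function names and the plain dict used as a cache are hypothetical, and the response headers are assumed to arrive as a dict keyed by their canonical names.

```python
def remember_validators(cache, url, response_headers):
    """Store the server's validators verbatim, without reparsing them."""
    entry = {}
    for name in ("Last-Modified", "ETag"):
        if name in response_headers:
            entry[name] = response_headers[name]
    if entry:
        cache[url] = entry


def conditional_headers(cache, url):
    """Build request headers that echo back exactly what the server sent."""
    entry = cache.get(url, {})
    headers = {}
    if "Last-Modified" in entry:
        headers["If-Modified-Since"] = entry["Last-Modified"]
    if "ETag" in entry:
        headers["If-None-Match"] = entry["ETag"]
    return headers
```

Because the stored strings are sent back untouched, they pass a server-side string comparison just as well as a numeric one.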

The RSS distribution model will probably need to undergo some form of change if it continues to grow. Even if all the conditional GET stuff is sorted out, the number of useless requests will grow. The problem, though, is that RSS feeds are usually collected on the hour, so traffic will have really high spikes at those moments. Normal web browsing is smoother, as users generally don't wait until the clock reads 00 before going to your site. Maybe as the tools mature this will be less of a problem, but I don't have any real data on the matter, so time will tell.

Speaking of data, I wonder if any popular sites have analyzed the requests for their RSS feeds? Not just the number of requests, but how many result in a 304 response, when a feed is requested on average, whether HTTP 1.0 or 1.1 is used, If-None-Match or If-Modified-Since, etc. Maybe I'll get around to making a Drupal patch that stores this type of data, and ask nicely whether some of the more popular Drupal sites will run it for a month's time to gather data from several places. Could be an interesting experiment.
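
As a rough idea of what such a tally might compute, here is a small Python sketch. No such patch exists yet, so the record layout (the keys "status", "protocol", "validator" and "minute") is entirely made up for illustration.

```python
from collections import Counter


def summarize(records):
    """Tally conditional-GET behaviour from a list of per-request records."""
    total = len(records)
    not_modified = sum(1 for r in records if r["status"] == 304)
    return {
        "total": total,
        # Fraction of requests answered with 304 Not Modified.
        "share_304": not_modified / total if total else 0.0,
        # HTTP/1.0 versus HTTP/1.1 clients.
        "by_protocol": Counter(r["protocol"] for r in records),
        # Which validator header the client sent, if any.
        "by_validator": Counter(r["validator"] for r in records),
        # Minute of the hour, to spot on-the-hour spikes.
        "by_minute": Counter(r["minute"] for r in records),
    }
```

Collected over a month on a few sites, counts like these would show both how well conditional GETs are working and how sharply requests cluster around minute 00.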

Comments

304 statistics

I run diveintomark.org and I get 25,000 hits a day on my feed, which is a static file served and negotiated by Apache. The feed usually changes once a day. All told these hits consume about 8 MB of bandwidth a day. Over 75% of my hits are 304, which saves about 30 MB a day. Obviously these numbers fluctuate depending on how often I post, but these were the numbers for last Friday (when I posted once) and they are representative of a normal day for me.
