Monday, September 8, 2008

HTTPS should not require special treatment by browsers

Google Chrome was recently criticized for indexing information transferred from HTTPS pages, such as Internet banking sites.

While the concern that private information could be indexed is valid, the best solution is not to exclude HTTPS from indexing by default. Many useful sites that contain no private data are served over HTTPS, and users benefit from having this information easily available. Some browsers even exclude HTTPS from caching by default.

The HTTP standard specifies a much better way to ensure that certain data is excluded from caching (it is probably a good idea to exclude it from indexing as well in such cases).

The HTTP 1.1 standard states "Unless specifically constrained by a cache-control directive, a caching system MAY always store a successful response as a cache entry..."

Unless a response from a server is specifically marked not to be cacheable, any browser (or proxy for normal HTTP) should try to cache the response in order to improve the user experience.

How sensitive data should be protected
Even though caches improve the user experience, some data should never be stored. The data mentioned in the linked articles falls within that category. This data is usually transferred over HTTPS in order to ensure its privacy and integrity while being transported between the server and user.

HTTP 1.1 provides a mechanism to ensure that this data is protected at the end points (and at caches, for plain HTTP): the "Cache-Control" header. This header allows data to be tagged with several levels of cacheability. Anything marked with a Cache-Control header other than no-store should be expected to be cached, at least in a limited way, by the browser. (The other directives are intended to ensure that a cache does not return out-of-date data, not to ensure its privacy on the user's computer.)
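As a sketch of this rule (the function and its name are illustrative, not part of any standard), a browser deciding whether it may store a response only needs to check the directives for no-store; anything else leaves storing permitted:

```python
# Sketch (not from the post): decide whether a browser may store a
# response, following the HTTP 1.1 rule quoted above -- a cache MAY
# store a successful response unless a directive forbids it.
def may_store(cache_control):
    if cache_control is None:
        return True  # no constraint given: storing is allowed
    directives = {d.strip().lower() for d in cache_control.split(",")}
    return "no-store" not in directives
```

Note that directives such as no-cache or max-age=0 still permit storing; they only force revalidation before reuse, which is exactly the distinction the post draws.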

Most browsers since the days of Internet Explorer 4 support enough of HTTP 1.1 to understand Cache-Control headers.

Banks and other sites should therefore ensure that they include the correct headers in the responses from their servers. They should not prevent non-sensitive content such as static style sheets, scripts and images from being cached, since reloading this data each time degrades the user experience and wastes bandwidth. Depending on browsers to be more paranoid than the standards require them to be is irresponsible.
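A minimal sketch of what this means server-side (the helper, its name, and the specific max-age values are assumptions for illustration, not recommendations from any bank): sensitive pages get no-store, while static assets stay cacheable.

```python
# Hypothetical helper: pick a Cache-Control value per resource.
# Extensions and lifetimes here are illustrative assumptions.
STATIC_EXTENSIONS = (".css", ".js", ".png", ".gif")

def cache_header_for(path, sensitive):
    if sensitive:
        # Account statements, balances, etc.: never stored anywhere.
        return "no-store"
    if path.endswith(STATIC_EXTENSIONS):
        # Style sheets, scripts and images: safe to cache and reuse.
        return "public, max-age=86400"
    # Everything else: cache privately but revalidate before reuse.
    return "private, max-age=0, must-revalidate"
```

The point of the split is the one made above: only the genuinely sensitive responses carry no-store, so the cheap static resources are not reloaded on every visit.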

If sensitive data leaks, the party responsible for the disclosure should be held accountable. This can be the user, if his/her system's security was breached due to his/her negligence; the browser vendor, if the browser does not follow the standards and caches data that is marked no-store; or the party serving the content, if they do not mark their content properly.

Interoperability with HTTP 1.0
HTTP 1.0 does not provide the Cache-Control header. In most such cases a Pragma: no-cache header over HTTPS should be enough to exclude the page from caching entirely. (This seems to be the common behaviour.) When HTTP 1.1 is used, the finer-grained Cache-Control header should be used, if present. (Falling back to HTTP 1.0-like behaviour in its absence is probably a safe option.)
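The fallback described above can be sketched as follows (a hypothetical client-side check, with names invented for illustration): honour Cache-Control when the server sent one, and only then fall back to the older Pragma header.

```python
# Sketch of the fallback the post describes, from the cache's point
# of view: prefer the HTTP 1.1 Cache-Control header if present,
# otherwise treat Pragma: no-cache as "do not store" (the common
# HTTP 1.0-era behaviour over HTTPS).
def client_may_store(headers):
    cc = headers.get("Cache-Control")
    if cc is not None:
        # Finer-grained 1.1 directive wins when available.
        return "no-store" not in cc.lower()
    # HTTP 1.0-like fallback in its absence.
    return headers.get("Pragma", "").lower() != "no-cache"
```

Note how a response carrying both headers is governed by Cache-Control alone, which is the "should be used, if present" behaviour argued for above.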

Deja Vu
The outcry over Chrome indexing pages transferred over HTTPS is reminiscent of the reaction after Google started indexing pages hosted on HTTPS in 2002.

An article written then sums it up nicely: "The misconception that Google is going where it shouldn't comes partly from the somewhat vague definition of "secure." The SSL protocol is simply a transmission protocol. It has nothing to do with whether an individual page should be considered "secure" or not."