There have been a number of (possibly still theoretical) attacks on HTTPS in which an adversary can guess which pages are being visited merely by looking at the response length. Other websites have mitigated this in various ways; e.g. Twitter pads profile pictures to a few specific size buckets.
We are particularly affected because our pages' content is a) all public and explorable via dumps, making it easier for the attacker to experiment and precompute, b) static and identical in most cases (anonymous users), c) text, assets and images are served from separate IPs and hence over different TLS sessions, which both removes a randomizing factor and creates even more unique combinations of traffic patterns.
To mitigate this kind of attack we would have to pad our responses up to certain (unguessable) sizes. There are a number of considerations that need to be explored before doing so:
- As @csteipp points out, even a bucket classification won't be enough, as there are still enough bits of information there to make educated guesses based on click path behavior.
- Padding the HTML with e.g. zeros will be ineffective, as gzip compression will remove most of it. We could, however, pad the HTML with random garbage, which wouldn't be defeated by gzip.
- Padding the HTML means that we'd have to pad other resources separately, some of which aren't even being served from MediaWiki (e.g. images/Swift).
- Padding to specific bucket sizes removes the precompute-from-dumps factor but does not insert any randomness into the process. A padded text page plus its associated padded images could still provide enough bits of information to identify the page visited.
- Padding increases the content size and comes with obvious performance costs; it's essentially a security/performance tradeoff. Depending on which piece of infrastructure it actually happens in, it might also increase the storage and/or cache size needed.
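The compression point above is easy to demonstrate: a run of zeros collapses under gzip, while random bytes survive almost at full size. A minimal sketch (pad size is an illustrative assumption):

```python
import gzip
import os

# Hypothetical 2 KiB of padding appended to a body before compression.
PAD_SIZE = 2048

zero_pad = b"\x00" * PAD_SIZE      # highly compressible
random_pad = os.urandom(PAD_SIZE)  # effectively incompressible

# gzip collapses the zero run to a handful of bytes, so zero padding
# contributes almost nothing to the on-wire response size; random
# padding keeps (roughly) its full length after compression.
print(len(gzip.compress(zero_pad)))
print(len(gzip.compress(random_pad)))
```

Running this shows the zero padding compressing down to a few dozen bytes, while the random padding stays close to 2 KiB on the wire.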
So far, it seems that padding applied at the edge (either Varnish or nginx), potentially bucket-based but with random per-request placement, would be the best strategy. It remains unknown a) whether it's possible to pad a gzip response with zeros or garbage and still have it parsed properly by UAs, b) whether it's feasible to pad with HTTP headers instead, and how many/how lengthy these would need to be.
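The bucket-plus-randomness idea could look something like the following sketch. The bucket size, jitter range, and the `padded_size` helper are all illustrative assumptions, not a proposed configuration:

```python
import os

# Sketch of bucket-based padding with per-request randomness, as might
# run at the edge (Varnish/nginx). Values below are made up for the example.
BUCKET = 4096      # pad every response up to a multiple of 4 KiB...
MAX_JITTER = 512   # ...then add a random 0-512 extra bytes, per request

def padded_size(body_len: int) -> int:
    # Round up to the next bucket boundary, defeating precomputation
    # of exact sizes from dumps.
    bucket_len = -(-body_len // BUCKET) * BUCKET
    # Add random jitter so repeated fetches of the same page produce
    # different on-wire sizes, removing the deterministic fingerprint.
    jitter = int.from_bytes(os.urandom(2), "big") % (MAX_JITTER + 1)
    return bucket_len + jitter
```

With this scheme an observer only learns the bucket, and even the bucket boundary is blurred by the jitter, at the cost of up to BUCKET + MAX_JITTER wasted bytes per response.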