Per @jeremyroman, it appears that Chromium uses a significantly different algorithm than the spec's "prescan a byte stream to determine its encoding". In particular, Chromium uses the HTML tokenizer results, not the raw byte stream.
Two concrete cases he noticed:
- Chromium will ignore `<meta>` tags in the `<body>` for encoding scanning. (I'm not sure how this works with implicit-`<body>` documents off the top of my head; does the implicit open-body tag get taken care of in the tokenizer stage?)
- Chromium will ignore cases like the following (see the sketch after this list for how a purely byte-based scan misreads both cases):

  ```html
  <script>
  // It would be terrible if this document contained <meta charset=shift_jis> but fortunately it does not.
  </script>
  ```
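To make the divergence concrete, here is a minimal byte-level scan (a purely illustrative sketch, far simpler than the spec's actual prescan algorithm) that is fooled by both cases, since it has no notion of script content or document structure:

```python
# Toy byte-level charset scan; NOT the spec's prescan, just a sketch of why
# context-free byte matching misfires on the two cases above.
import re

META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([a-zA-Z0-9_-]+)",
                          re.IGNORECASE)

case1 = b"<body><meta charset=shift_jis></body>"  # <meta> inside <body>
case2 = (b"<script> // It would be terrible if this document contained "
         b"<meta charset=shift_jis> but fortunately it does not. </script>")

for doc in (case1, case2):
    match = META_CHARSET.search(doc)
    # Prints b'shift_jis' both times: a byte scan with no tokenizer context
    # "finds" a declaration that a tokenizer-based scan would ignore.
    print(match.group(1))
```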
Apparently, at a high level, Chromium works via a feedback loop between the tokenizer and the byte-to-string decoder. The byte-to-string decoder dynamically switches its encoding as it goes, and to do so, it runs the tokenizer code over the bytes it has decoded so far. This lets it make more sophisticated decisions than the spec's purely byte-based prescan.
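Here is a minimal, self-contained sketch of that feedback-loop shape, assuming a toy tokenizer pass; the names (`SniffingDecoder`, `scan_for_meta_charset`) are hypothetical illustrations, not Chromium's actual classes, and the real logic is far more involved:

```python
import codecs
import re

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")
CHARSET_ATTR = re.compile(r"charset\s*=\s*[\"']?([a-zA-Z0-9_-]+)", re.IGNORECASE)

def scan_for_meta_charset(text):
    """Toy tokenizer pass: skip <meta> inside <script> or an explicit <body>.
    (An implicit <body>, as the issue asks about, is not modeled here.)"""
    in_script = in_body = False
    for m in TAG.finditer(text):
        closing, name, attrs = m.group(1) == "/", m.group(2).lower(), m.group(3)
        if name == "script":
            in_script = not closing
        elif name == "body":
            in_body = not closing
        elif name == "meta" and not in_script and not in_body:
            cm = CHARSET_ATTR.search(attrs)
            if cm:
                return cm.group(1)
    return None

class SniffingDecoder:
    """Decode bytes with a provisional encoding; when the tokenizer pass finds
    a disagreeing <meta charset>, switch and re-decode everything seen so far."""
    def __init__(self, provisional="windows-1252"):
        self.encoding = codecs.lookup(provisional).name
        self.seen = b""

    def feed(self, chunk):
        self.seen += chunk
        text = self.seen.decode(self.encoding, errors="replace")
        declared = scan_for_meta_charset(text)
        if declared and codecs.lookup(declared).name != self.encoding:
            self.encoding = codecs.lookup(declared).name
            text = self.seen.decode(self.encoding, errors="replace")
        return text

d = SniffingDecoder()
d.feed(b"<head><meta charset=utf-8></head>")
print(d.encoding)  # utf-8: a real declaration triggers a switch

d2 = SniffingDecoder()
d2.feed(b"<script>// <meta charset=shift_jis></script>")
print(d2.encoding)  # still cp1252: the meta inside the script is ignored
```

The key point of the design, as described above, is that the decision is made on tokenized, decoded text rather than raw bytes, so content the tokenizer classifies as script text or body content never looks like an encoding declaration.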
Technically this is all allowed due to the escape clause in the calling encoding sniffing algorithm:
> The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream.
However, folks (such as @hsivonen) might be interested in better interop/spec structure for this. It also impacts the specification of other `<meta>`-related opt-ins that might be parsed early, e.g. https://github.com/jeremyroman/alternate-loading-modes#declaration.