Charset scanner does not match at least Chromium

@jeremyroman

Per @jeremyroman, it appears that Chromium uses a significantly different algorithm from the spec's "prescan a byte stream to determine its encoding". In particular, Chromium works from the HTML tokenizer's results, not the raw byte stream.

Two concrete cases he noticed:

Chromium will ignore <meta> tags in the <body> for encoding scanning. (I'm not sure how this works with implicit-<body> documents off the top of my head; does the implicit open-body tag get taken care of in the tokenizer stage?)

Chromium will ignore cases like

<script>
// It would be terrible if this document contained <meta charset=shift_jis> but fortunately it does not.
</script>
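To illustrate why a purely byte-based scan trips over this case, here is a minimal Python sketch. It is a simplified stand-in, not the spec's full prescan algorithm: the name `naive_byte_prescan`, the regular expression, and the 1024-byte limit are assumptions made for illustration. What it shares with the spec's prescan is that it has no notion of script content.

```python
import re

# Hypothetical, simplified byte-level scan: it looks for "<meta ... charset=VALUE"
# in the raw bytes and knows nothing about <script> content.
META_CHARSET = re.compile(
    rb'<meta[\t\n\f\r /][^>]*?charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)

def naive_byte_prescan(data: bytes, limit: int = 1024):
    """Return the first charset label found in the first `limit` bytes, or None."""
    match = META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None

doc = (
    b"<script>\n"
    b"// It would be terrible if this document contained "
    b"<meta charset=shift_jis> but fortunately it does not.\n"
    b"</script>\n"
)

print(naive_byte_prescan(doc))  # shift_jis
```

The scan reports `shift_jis` even though the `<meta>` text sits inside a script comment, which is the result a tokenizer-based approach avoids.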

Apparently how Chromium works, at a high level, is that there's a feedback loop between the tokenizer and the byte-to-string decoder. The byte-to-string decoder dynamically switches its encoding as it goes, and to do so, it runs the tokenizer code on the decoded bytes it has seen so far. This lets it make more sophisticated decisions than the spec's purely byte-based prescan.
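A rough sketch of the tokenizer-driven idea, using Python's standard `html.parser` as a stand-in tokenizer. This is not Chromium's code: `MetaCharsetFinder`, `sniff_with_tokenizer`, the provisional `latin-1` decoding, and the chunk size are all assumptions for illustration. It only finds the declared label and does not model re-decoding after a switch. Because `html.parser` does no tree construction, it also does not handle an implicit `<body>`.

```python
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    """Hypothetical helper: records the first <meta charset> seen before <body>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.charset = None
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True  # later <meta> tags are ignored
        elif tag == "meta" and not self.in_body and self.charset is None:
            value = dict(attrs).get("charset")
            if value:
                self.charset = value.strip().lower()

def sniff_with_tokenizer(data: bytes, chunk_size: int = 256, provisional: str = "latin-1"):
    """Decode bytes provisionally, feed them to the tokenizer, stop at the first hit."""
    finder = MetaCharsetFinder()
    for start in range(0, len(data), chunk_size):
        # latin-1 maps every byte to one character, so chunk boundaries are safe.
        finder.feed(data[start:start + chunk_size].decode(provisional))
        if finder.charset:
            return finder.charset  # a real decoder would switch encodings here
    finder.close()
    return finder.charset

in_script = b"<script>\n// <meta charset=shift_jis>\n</script>\n"
in_head = b"<head><meta charset=utf-8></head>"
in_body = b"<body><meta charset=shift_jis></body>"

print(sniff_with_tokenizer(in_script))  # None
print(sniff_with_tokenizer(in_head))    # utf-8
print(sniff_with_tokenizer(in_body))    # None
```

Since the tokenizer treats script content as raw text, the `<meta>` inside the script never surfaces as a start tag, and tracking the `<body>` start tag covers the first case noted above.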

Technically this is all allowed, because the encoding sniffing algorithm (the prescan's caller) has an escape clause:

The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream.

However, folks (such as @hsivonen) might be interested in better interop and spec structure for this. It also impacts the specification of other meta-related opt-ins which might be parsed early, e.g. https://github.com/jeremyroman/alternate-loading-modes#declaration.

Charset scanner does not match at least Chromium #5913

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Charset scanner does not match at least Chromium #5913

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions