Charset scanner does not match at least Chromium · Issue #5913 · whatwg/html · GitHub

Charset scanner does not match at least Chromium #5913

@domenic

Description

Per @jeremyroman, it appears that Chromium uses a significantly different algorithm than the spec's "prescan a byte stream to determine its encoding" algorithm. In particular, Chromium uses the HTML tokenizer results, not the raw byte stream.

Two concrete cases he noticed:

  • Chromium will ignore <meta> tags in the <body> for encoding scanning. (I'm not sure how this works with implicit-<body> documents off the top of my head; does the implicit open-body tag get taken care of in the tokenizer stage?)

  • Chromium will ignore cases like

    <script>
    // It would be terrible if this document contained <meta charset=shift_jis> but fortunately it does not.
    </script>
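
To make the second case concrete, here is a minimal Python sketch. It is neither the spec's prescan nor Chromium's code: `byte_level_scan` and `script_aware_scan` are hypothetical names, and the regular expressions are much cruder than either real algorithm. It only shows why a scan that ignores document structure picks up the `<meta>` text inside the script, while one that knows about `<script>` content does not.

```python
import re
from typing import Optional

# Hypothetical, heavily simplified stand-ins for illustration only.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([a-zA-Z0-9_\-]+)', re.IGNORECASE
)
SCRIPT_BLOCK = re.compile(
    rb'<script\b[^>]*>.*?</script\s*>', re.IGNORECASE | re.DOTALL
)


def byte_level_scan(data: bytes) -> Optional[str]:
    """Return the first charset label in the raw bytes, ignoring structure."""
    match = META_CHARSET.search(data)
    return match.group(1).decode("ascii").lower() if match else None


def script_aware_scan(data: bytes) -> Optional[str]:
    """Run the same search after dropping the contents of <script> elements."""
    return byte_level_scan(SCRIPT_BLOCK.sub(b"", data))


doc = (
    b"<script>\n"
    b"// It would be terrible if this document contained "
    b"<meta charset=shift_jis> but fortunately it does not.\n"
    b"</script>\n"
)

print(byte_level_scan(doc))    # shift_jis
print(script_aware_scan(doc))  # None
```

The first result corresponds to a purely byte-based scan, the second to the behavior described for Chromium in this case.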

Apparently, at a high level, Chromium works via a feedback loop between the tokenizer and the byte-to-string decoder. The byte-to-string decoder dynamically switches its encoding as it goes, and to do so, it runs the tokenizer code on the decoded bytes it has seen so far. This lets it make more sophisticated decisions than the spec's purely byte-based prescan.
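
As a rough illustration of that feedback loop, here is a toy Python model. Everything in it is an assumption made for the sketch: the names `charset_from_tokens` and `decode_with_feedback`, the `windows-1252` provisional encoding, and the regex-based "tokenizer" stand-in, which only knows to skip `<script>` contents and to stop at `<body>`. It does not reflect Chromium's actual implementation.

```python
import codecs
import re
from typing import Iterable, Optional, Tuple

# Hypothetical toy model; a real tokenizer is far more involved.
META_CHARSET = re.compile(
    r'<meta[^>]*?charset\s*=\s*["\']?([\w\-]+)', re.IGNORECASE
)
# Also swallow an unterminated <script> so partial script text is not scanned.
SCRIPT_BLOCK = re.compile(
    r'<script\b[^>]*>.*?(?:</script\s*>|\Z)', re.IGNORECASE | re.DOTALL
)
BODY_START = re.compile(r'<body\b', re.IGNORECASE)


def charset_from_tokens(text: str) -> Optional[str]:
    """Stand-in for tokenizer output: find <meta charset> outside <script>
    and before any explicit <body> start tag."""
    text = SCRIPT_BLOCK.sub("", text)
    body = BODY_START.search(text)
    head = text[: body.start()] if body else text
    match = META_CHARSET.search(head)
    return match.group(1).lower() if match else None


def decode_with_feedback(
    chunks: Iterable[bytes], provisional: str = "windows-1252"
) -> Tuple[str, str]:
    """Decode incrementally, switching encoding when the token-level scan
    of the text decoded so far finds a declaration."""
    encoding, seen = provisional, b""
    for chunk in chunks:
        seen += chunk
        text = seen.decode(encoding, errors="replace")
        declared = charset_from_tokens(text)
        if declared and declared != encoding:
            try:
                codecs.lookup(declared)  # ignore labels Python cannot decode
            except LookupError:
                continue
            encoding = declared
    return encoding, seen.decode(encoding, errors="replace")
```

For example, feeding it `[b"<meta charset=utf-8><title>caf", b"\xc3\xa9</title>"]` switches from the provisional encoding to `utf-8` after the first chunk, whereas a `<meta charset>` that appears only after `<body>` leaves the provisional encoding in place.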

Technically this is all allowed due to an escape clause in the prescan's caller, the encoding sniffing algorithm:

The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream.

However, folks (such as @hsivonen) might be interested in better interop and spec structure for this. It also impacts the specification of other meta-related opt-ins which might be parsed early, e.g. https://github.com/jeremyroman/alternate-loading-modes#declaration.
