Unicode literal in BufferedStream._readFromBuffer causes failure in HTMLBinaryInputStream · Issue #67 · html5lib/html5lib-python · GitHub
Unicode literal in BufferedStream._readFromBuffer causes failure in HTMLBinaryInputStream #67
Closed
@niklasl

In html5lib/inputstream.py, unicode_literals is imported from __future__. This causes html5lib.inputstream.BufferedStream to misbehave, specifically in its _readFromBuffer method, which ends with return "".join(rv). Because "" is a unicode literal under that import, every read after the first returns a chunk of unicode instead of a chunk of bytes.
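The effect of the literal can be reproduced outside html5lib. The following is a minimal sketch, not code from inputstream.py: the helper names join_with_text_literal and join_with_bytes_literal and the sample chunks are made up for illustration. On Python 2 the text-literal join silently yields unicode; on Python 3 the same join raises TypeError, which the sketch catches so that it runs on both.

```python
from __future__ import unicode_literals

# Hypothetical sample data standing in for the chunks held in rv.
chunks = [b"<html>", b"<body>"]


def join_with_text_literal(parts):
    """Join byte chunks with "" (a text literal under unicode_literals)."""
    try:
        # Python 2: returns a unicode object, not bytes.
        return "".join(parts)
    except TypeError:
        # Python 3: joining bytes with a str separator is rejected outright.
        return None


def join_with_bytes_literal(parts):
    """Join byte chunks with b"", which always yields bytes."""
    return b"".join(parts)


text_result = join_with_text_literal(chunks)
bytes_result = join_with_bytes_literal(chunks)

# Only the bytes-literal join would satisfy `assert isinstance(buffer, bytes)`.
print(isinstance(text_result, bytes), isinstance(bytes_result, bytes))
```

Either way, only the b"" variant produces a value that passes an isinstance(..., bytes) check.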

An example of the problem caused:

from urllib2 import Request, urlopen
from html5lib.inputstream import HTMLBinaryInputStream

req = Request(url='http://example.org/')
source = urlopen(req)
HTMLBinaryInputStream(source)

Causing:

Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File ".../html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File ".../html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError

That is, when HTMLBinaryInputStream is given a file-like object (such as the result of urllib2.urlopen), it wraps that object in a BufferedStream. The BufferedStream then hands back unicode, and the assert isinstance(buffer, bytes) at line 535 fails.

This can be fixed by using a byte literal in _readFromBuffer instead, i.e. return b"".join(rv). (There are at least three places in inputstream.py where string literals are used like this: at lines 117, 318 and 348.)
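To show why the byte literal matters for re-reads, here is a rough sketch of a buffering wrapper that joins with b"". TinyBufferedStream is a hypothetical, heavily simplified stand-in written for this example. It is not html5lib's BufferedStream and does not mirror its internals. It only illustrates that a read served partly from the buffer stays bytes when the join uses b"".

```python
import io


class TinyBufferedStream(object):
    """Hypothetical stand-in for a buffering wrapper (not html5lib's class)."""

    def __init__(self, stream):
        self.stream = stream
        self.buffer = []    # byte chunks already pulled from the stream
        self.position = 0   # current offset into the buffered data

    def seek(self, pos):
        # Only rewinding into data that is already buffered is supported.
        assert pos <= sum(len(chunk) for chunk in self.buffer)
        self.position = pos

    def read(self, n):
        buffered = b"".join(self.buffer)
        # Serve as much as possible from the buffer first.
        rv = [buffered[self.position:self.position + n]]
        missing = n - len(rv[0])
        if missing > 0:
            # Fetch the remainder from the underlying stream and remember it.
            data = self.stream.read(missing)
            self.buffer.append(data)
            rv.append(data)
        self.position += sum(len(chunk) for chunk in rv)
        # Byte literal: the result is bytes on Python 2 and Python 3 alike.
        return b"".join(rv)


source = io.BytesIO(b"<!doctype html><meta charset=utf-8>")
stream = TinyBufferedStream(source)
first = stream.read(10)   # comes entirely from the underlying stream
stream.seek(0)
again = stream.read(15)   # 10 bytes from the buffer, 5 from the stream
```

Both first and again are bytes here, so a later isinstance(buffer, bytes) check on them would pass.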
