Description
In html5lib/inputstream.py, unicode_literals is imported from __future__. This causes html5lib.inputstream.BufferedStream to misbehave, specifically the _readFromBuffer method, which ends with return "".join(rv). Because that "" is then a unicode literal, every read after the first returns a chunk of unicode instead of a chunk of bytes.
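The join behaviour described above can be reproduced in isolation. This is a minimal sketch, not code from html5lib: the names chunks, joined and fixed are made up for illustration, and u"" is written out explicitly to stand in for what a bare "" becomes under unicode_literals on Python 2.

```python
# Chunks as a binary stream would hand them back.
chunks = [b"<html>", b"<head>"]

# u"" stands in for a bare "" under unicode_literals on Python 2.
try:
    joined = u"".join(chunks)   # Python 2: silently decodes to unicode
except TypeError:
    joined = None               # Python 3: str and bytes do not mix

# Joining with a byte literal keeps the result as bytes on both versions.
fixed = b"".join(chunks)

print(type(joined), type(fixed))
```

On Python 2 the first join succeeds but changes the type, which is why the failure only shows up later, at the isinstance check.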
An example of the problem caused:
from urllib2 import Request, urlopen
from html5lib.inputstream import HTMLBinaryInputStream
req = Request(url='http://example.org/')
source = urlopen(req)
HTMLBinaryInputStream(source)
This fails with:
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File ".../html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File ".../html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError
(That is, when HTMLBinaryInputStream is used with a file-like object (such as the result of urllib2.urlopen), it wraps it in a BufferedStream, which then fails (at line 535) with the assert isinstance(buffer, bytes).)
This can be fixed by using a byte literal in _readFromBuffer instead, i.e. return b"".join(rv). (There are at least three places in inputstream.py where string literals are used like this: at lines 117, 318 and 348.)
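To illustrate why the byte literal matters for re-reads, here is a rough sketch of a buffering wrapper. TinyBufferedStream is a hypothetical, heavily simplified stand-in written for this report; it is not html5lib's BufferedStream and does not mirror its internals. It only shows that a read served from the buffer stays bytes when the chunks are joined with b"".

```python
import io


class TinyBufferedStream(object):
    """Hypothetical, simplified buffering wrapper (not html5lib's code)."""

    def __init__(self, stream):
        self.stream = stream
        self.buffer = []    # bytes chunks already pulled from the stream
        self.position = 0   # current offset into the joined buffer

    def seek(self, pos):
        self.position = pos

    def read(self, n):
        # Byte literal: the joined buffer is bytes on Python 2 and 3.
        buffered = b"".join(self.buffer)
        missing = self.position + n - len(buffered)
        if missing > 0:
            chunk = self.stream.read(missing)
            self.buffer.append(chunk)
            buffered += chunk
        data = buffered[self.position:self.position + n]
        self.position += len(data)
        return data


stream = TinyBufferedStream(io.BytesIO(b"<html><head>"))
first = stream.read(6)   # comes from the underlying stream
stream.seek(0)
again = stream.read(6)   # comes from the buffer
print(type(first), type(again))
```

With "" in place of b"" under unicode_literals on Python 2, the second read would come back as unicode, which is the condition the assert in detectEncodingMeta rejects.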