Unicode literal in BufferedStream._readFromBuffer causes failure in HTMLBinaryInputStream

In html5lib/inputstream.py, `unicode_literals` is imported from `__future__`. This causes `html5lib.inputstream.BufferedStream` to misbehave, specifically in the `_readFromBuffer` method, which ends with `return "".join(rv)`. Because `""` is a unicode literal there, every read after the first returns a chunk of unicode instead of a chunk of bytes.

An example that reproduces the problem:

``` python
from urllib2 import Request, urlopen
from html5lib.inputstream import HTMLBinaryInputStream

req = Request(url='http://example.org/')
source = urlopen(req)
HTMLBinaryInputStream(source)
```

This raises:

``` pytb
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File ".../html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File ".../html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError
```

(That is, when `HTMLBinaryInputStream` is given a file-like object, such as the result of `urllib2.urlopen`, it wraps it in a `BufferedStream`, whose reads then fail the `assert isinstance(buffer, bytes)` check at line 535.)

This can be fixed by using a byte literal in `_readFromBuffer` instead, i.e. `return b"".join(rv)`. (String literals are used like this in at least three places in inputstream.py: lines 117, 318 and 348.)
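
To illustrate the proposed fix in isolation, here is a minimal sketch. `TinyBuffer` and `read_all` are hypothetical names invented for this example; this is not the real `BufferedStream` code, only a simplified stand-in that joins previously read byte chunks the way the fix suggests.

```python
class TinyBuffer(object):
    """Hypothetical, simplified stand-in for BufferedStream (not the real class)."""

    def __init__(self, chunks):
        # Byte chunks that were read from the underlying stream earlier.
        self.chunks = list(chunks)

    def read_all(self):
        # b"" is always a bytes separator, regardless of unicode_literals,
        # so joining byte chunks yields bytes. A bare "" would be a unicode
        # separator when unicode_literals is in effect.
        return b"".join(self.chunks)


data = TinyBuffer([b"<html>", b"<head>"]).read_all()
print(isinstance(data, bytes))
```

With an explicit `b""` separator, the result should stay a byte string, so a check such as `assert isinstance(buffer, bytes)` would be expected to pass.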

