Fixes checking of declared encodings in get_encoding. · Harry0201/python-readability@386e48d · GitHub
Commit 386e48d

Fixes checking of declared encodings in get_encoding.
In Python 3, .decode() on bytes requires the encoding name to be a str. The encoding extracted from the page is bytes, so it has to be converted to str before it can be used.
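A minimal sketch of the behaviour the commit message describes, assuming Python 3. The sample page and the simplified regex are hypothetical and only illustrate the problem; they are not taken from `readability/encoding.py`.

```python
import re

# Hypothetical sample page. A bytes pattern applied to bytes yields bytes matches.
page = b'<meta charset="utf-8"><p>caf\xc3\xa9</p>'
declared = re.findall(br'charset=["\']*(.+?)["\'>]', page)[0]

# On Python 3, passing the bytes name straight to .decode() raises TypeError.
try:
    page.decode(declared)
    bytes_name_accepted = True
except TypeError:
    bytes_name_accepted = False

# Converting the name to str first, as the commit does, lets the decode proceed.
encoding_name = declared.decode('ascii', 'replace')
text = page.decode(encoding_name)
```

The `'replace'` error handler means a malformed, non-ASCII encoding name does not raise during the conversion itself; it simply turns into a name that no codec matches.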
1 parent 046d2c1 commit 386e48d

1 file changed (+13, -7 lines changed)

readability/encoding.py

Lines changed: 13 additions & 7 deletions
@@ -1,5 +1,6 @@
 import re
 import chardet
+import sys
 
 def get_encoding(page):
     # Regex for XML and HTML Meta charset declaration
@@ -12,13 +13,18 @@ def get_encoding(page):
             xml_re.findall(page))
 
     # Try any declared encodings
-    if len(declared_encodings) > 0:
-        for declared_encoding in declared_encodings:
-            try:
-                page.decode(custom_decode(declared_encoding))
-                return custom_decode(declared_encoding)
-            except UnicodeDecodeError:
-                pass
+    for declared_encoding in declared_encodings:
+        try:
+            if sys.version_info[0] == 3:
+                # declared_encoding will actually be bytes but .decode() only
+                # accepts `str` type. Decode blindly with ascii because no one should
+                # ever use non-ascii characters in the name of an encoding.
+                declared_encoding = declared_encoding.decode('ascii', 'replace')
+
+            page.decode(custom_decode(declared_encoding))
+            return custom_decode(declared_encoding)
+        except UnicodeDecodeError:
+            pass
 
     # Fallback to chardet if declared encodings fail
     text = re.sub(b'</?[^>]*>\s*', b' ', page)
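The patched loop can be sketched as a standalone function. This is an illustration under stated assumptions, not the code in the commit: `first_usable_encoding` is a hypothetical name, the `custom_decode` step is omitted because its body is not shown in this diff, the `isinstance` check stands in for the commit's `sys.version_info` test, and catching `LookupError` (raised for an unknown codec name) is an addition that the commit does not make.

```python
def first_usable_encoding(page, declared_encodings):
    """Return the first declared encoding that decodes `page`, or None.

    Hypothetical helper: the commit applies this logic inline in
    get_encoding and also passes each name through custom_decode.
    """
    for declared in declared_encodings:
        # Matches from a bytes regex are bytes, but .decode() needs a str name.
        if isinstance(declared, bytes):
            declared = declared.decode('ascii', 'replace')
        try:
            page.decode(declared)
            return declared
        except (UnicodeDecodeError, LookupError):
            # LookupError handling is an assumption of this sketch,
            # not part of the commit shown above.
            continue
    return None


# Usage: the ascii declaration fails on the non-ASCII bytes, utf-8 succeeds.
sample_page = 'caf\u00e9'.encode('utf-8')
chosen = first_usable_encoding(sample_page, [b'ascii', b'utf-8'])
```

Returning `None` when every declared encoding fails leaves room for the caller to fall back to detection, which is what the chardet branch after the loop does in the original function.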
