forked from timbertson/python-readability
Closed
Hi there!
Extraction no longer works when you pass in predecoded (unicode) strings. The fix looks pretty trivial, though: `enc` could be initialized with `None`, unless that would cause problems in other parts of the code.
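A minimal sketch of the idea, not the actual `build_doc` code: `to_unicode` and `guess_encoding` are hypothetical names used only for illustration, and the real function also builds the lxml document. The point is just that `enc` is bound before the branch, so predecoded input falls through with `None`.

```python
def guess_encoding(raw):
    """Hypothetical stand-in for the library's encoding detection step."""
    return "utf-8"


def to_unicode(page):
    """Return (text, encoding); encoding is None for predecoded input."""
    enc = None  # bound up front, so the return never hits an unassigned name
    if isinstance(page, bytes):
        enc = guess_encoding(page)
        page = page.decode(enc, "replace")
    return page, enc
```

Callers that store the result (like `self.encoding` in `_parse`) would then have to tolerate `None`, which is the part I have not checked.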
By the way, I would discourage the use of the old chardet library. The range of encodings it can detect is very limited, and it's slow on top of that. I've found cchardet to be a lot better, but really there is the excellent UnicodeDammit library in BeautifulSoup, which first tries to extract various explicit encoding specifications and only then falls back on such implicit methods. Thanks to their latest refactoring, I could even remove a number of ugly hacks I needed with the older version.
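Roughly how that could look, as a sketch rather than a patch: `decode_html` is a hypothetical helper name, and the plain UTF-8 fallback for when `bs4` is not installed is my own assumption, not something readability does.

```python
try:
    from bs4 import UnicodeDammit  # requires the beautifulsoup4 package
except ImportError:  # assumption: degrade to plain UTF-8 without bs4
    UnicodeDammit = None


def decode_html(markup):
    """Return (text, encoding); encoding is None for predecoded input."""
    if isinstance(markup, str):
        return markup, None
    if UnicodeDammit is None:
        return markup.decode("utf-8", "replace"), "utf-8"
    # is_html=True lets UnicodeDammit look for a <meta> charset declaration
    # before it falls back on guessing.
    dammit = UnicodeDammit(markup, is_html=True)
    return dammit.unicode_markup, dammit.original_encoding
```

The returned encoding is whatever UnicodeDammit settled on, so it could replace the `enc` value that `build_doc` returns today.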
/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in summary(self, html_partial)
152 ruthless = True
153 while True:
--> 154 self._html(True)
155 for i in self.tags(self.html, 'script', 'style'):
156 i.drop_tree()
/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in _html(self, force)
117 def _html(self, force=False):
118 if force or self.html is None:
--> 119 self.html = self._parse(self.input)
120 return self.html
121
/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in _parse(self, input)
121
122 def _parse(self, input):
--> 123 doc, self.encoding = build_doc(input)
124 doc = html_cleaner.clean_html(doc)
125 base_href = self.options.get('url', None)
/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/htmls.pyc in build_doc(page)
15 page_unicode = page.decode(enc, 'replace')
16 doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
---> 17 return doc, enc
18
19 def js_re(src, pattern, flags, repl):
Unparseable: local variable 'enc' referenced before assignment