Unparseable: local variable 'enc' referenced before assignment · Issue #41 · buriy/python-readability · GitHub

Unparseable: local variable 'enc' referenced before assignment #41
@Telofy

Hi there!

Extraction no longer works when you pass in strings that you have already decoded yourself. The fix looks pretty trivial, though: enc could be initialized with None, unless that would cause problems in other parts of the code.
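A minimal sketch of the suggested fix, written for Python 3 rather than the Python 2.7 shown in the traceback. The traceback only shows the tail of build_doc, so the branch structure here is an assumption; decode_page and guess_encoding are hypothetical names, and guess_encoding is a stand-in for whatever detection the library really performs.

```python
def guess_encoding(raw):
    """Hypothetical stand-in for the library's encoding detection."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def decode_page(page):
    """Return (text, encoding); encoding is None for pre-decoded input."""
    # Bind enc before branching so it exists on every code path.
    enc = None
    if isinstance(page, str):
        # The caller already decoded the page: there is nothing to detect.
        page_text = page
    else:
        enc = guess_encoding(page)
        page_text = page.decode(enc, "replace")
    return page_text, enc
```

With enc bound up front, a pre-decoded string returns None as its encoding instead of raising. Callers that use the returned encoding would then need to handle None, which is the possible side effect mentioned above.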

By the way, I would discourage the use of the old chardet library. The range of encodings it can detect is very limited, and it is slow on top of that. I’ve found cchardet to be a lot better, but really there is the excellent UnicodeDammit in BeautifulSoup, which first tries to extract the various explicit encoding declarations and only then falls back on such implicit detection methods. Thanks to their latest refactoring, I could even remove a number of ugly hacks I needed with the older version.
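The "explicit declarations first, implicit guessing second" order can be sketched with the standard library alone. This is not UnicodeDammit's implementation or API, only an illustration of the strategy; detect_encoding is a hypothetical name, and the 2048-byte search window and the windows-1252 fallback are assumptions.

```python
import codecs
import re

# Matches both <meta charset="..."> and the http-equiv form
# (content="text/html; charset=...").
_META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.IGNORECASE)


def detect_encoding(raw):
    """Guess the encoding of an HTML byte string."""
    # 1. Explicit: a UTF-8 byte order mark.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # 2. Explicit: a charset declared in a <meta> tag near the top.
    match = _META_CHARSET.search(raw[:2048])
    if match:
        name = match.group(1).decode("ascii")
        try:
            return codecs.lookup(name).name
        except LookupError:
            pass  # Unknown charset name: fall through to guessing.
    # 3. Implicit: try strict UTF-8, else assume a common legacy encoding.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "windows-1252"
```

A statistical detector such as chardet or cchardet would only be needed in step 3, for pages that declare nothing.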

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in summary(self, html_partial)
    152             ruthless = True
    153             while True:
--> 154                 self._html(True)
    155                 for i in self.tags(self.html, 'script', 'style'):
    156                     i.drop_tree()

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in _html(self, force)
    117     def _html(self, force=False):
    118         if force or self.html is None:
--> 119             self.html = self._parse(self.input)
    120         return self.html
    121 

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in _parse(self, input)
    121 
    122     def _parse(self, input):
--> 123         doc, self.encoding = build_doc(input)
    124         doc = html_cleaner.clean_html(doc)
    125         base_href = self.options.get('url', None)

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/htmls.pyc in build_doc(page)
     15         page_unicode = page.decode(enc, 'replace')
     16     doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
---> 17     return doc, enc
     18 
     19 def js_re(src, pattern, flags, repl):

Unparseable: local variable 'enc' referenced before assignment
