html5lib is a pure-Python library for parsing HTML. It is designed to conform to the HTML specification, as implemented by all major web browsers.
Python 2.6 and above as well as Python 3.0 and above are supported. Implementations known to work are CPython (as the reference implementation) and PyPy. Jython is known not to work due to various bugs in its implementation of the language. Others such as IronPython may or may not work; if you wish to try, you are strongly encouraged to run the test suite and report back!
The only required library dependency is six, which can be found packaged on PyPI.
Optionally:
- datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
- lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
- genshi has a treewalker (but not a builder); and
- chardet can be used as a fallback when the character encoding cannot be determined (note that currently this is only packaged on PyPI for Python 2, though several package managers include unofficial ports to Python 3).
html5lib is packaged with distutils. To install it use:
$ python setup.py install
Simple usage follows this pattern:
import html5lib
with open("mydocument.html", "rb") as fp:
    document = html5lib.parse(fp)
or:
import html5lib
document = html5lib.parse("<p>Hello World!")
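To show what the parsed document looks like, here is a minimal sketch that inspects the tree built from the string above. It assumes the default tree builder (xml.etree.ElementTree) and passes namespaceHTMLElements=False, an option not mentioned above, so that tag names appear without the XHTML namespace prefix; check the docstrings for your installed version.

```python
import html5lib

# Parse an incomplete snippet. As a browser would, html5lib fills in the
# missing html, head and body elements around the paragraph.
# Assumption: the default "etree" tree builder is in use.
document = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)

# With the etree builder, the returned object is the root <html> element.
root_tag = document.tag

# Standard ElementTree lookups work on the result.
paragraph = document.find("body/p")
paragraph_text = paragraph.text

print(root_tag)
print(paragraph_text)
```

Without namespaceHTMLElements=False, the element tags carry the XHTML namespace, so lookups such as find() would need the namespaced names.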
More documentation is available in the docstrings.
Please report any bugs on the issue tracker.