html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.
Simple usage follows this pattern:
    import html5lib
    with open("mydocument.html", "rb") as f:
        document = html5lib.parse(f)

or:

    import html5lib
    document = html5lib.parse("<p>Hello World!")
By default, the document will be an xml.etree element instance.
Whenever possible, html5lib chooses the accelerated ElementTree
implementation (i.e. xml.etree.cElementTree on Python 2.x).
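As a minimal sketch of what that default result looks like, the snippet below parses a fragment and looks up the paragraph with the standard ElementTree API. It assumes html5lib is installed and relies on the default behaviour of placing HTML elements in the XHTML namespace; the constant name XHTML_NS is only an illustrative choice.

```python
import html5lib

# Namespace that html5lib attaches to HTML element tags by default.
XHTML_NS = "http://www.w3.org/1999/xhtml"

document = html5lib.parse("<p>Hello World!")

# parse() returns the root <html> element; the parser supplies the
# <html>, <head> and <body> elements that the markup left out.
print(document.tag)

# Element lookups must include the namespace in the tag name.
paragraph = document.find(".//{%s}p" % XHTML_NS)
print(paragraph.text)

# Passing namespaceHTMLElements=False yields plain tag names instead.
plain = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)
print(plain.find(".//p").text)
```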
Two other tree types are supported: xml.dom.minidom and
lxml.etree. To use an alternative format, specify the name of
a treebuilder:
    import html5lib
    with open("mydocument.html", "rb") as f:
        lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
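The same keyword selects the minidom tree type, which needs no third-party package. The sketch below assumes html5lib is installed and uses the treebuilder name "dom"; the variable names are illustrative.

```python
import html5lib
from xml.dom import minidom

# treebuilder="dom" asks for an xml.dom.minidom document instead of etree.
dom_document = html5lib.parse("<p>Hello World!", treebuilder="dom")
print(isinstance(dom_document, minidom.Document))

# Standard DOM methods are then available on the result.
paragraph = dom_document.getElementsByTagName("p")[0]
print(paragraph.firstChild.data)
```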
When using html5lib with urllib2 (Python 2), the charset from the HTTP
response should be passed into html5lib as follows:
    from contextlib import closing
    from urllib2 import urlopen
    import html5lib
    with closing(urlopen("http://example.com/")) as f:
        document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))
When using html5lib with urllib.request (Python 3), the charset from the
HTTP response should be passed into html5lib as follows:
    from urllib.request import urlopen
    import html5lib
    with urlopen("http://example.com/") as f:
        document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
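The effect of transport_encoding can be tried without a network connection. The following sketch, which assumes html5lib is installed, feeds UTF-8 bytes to the parser and passes the encoding name by hand as a stand-in for the charset an HTTP header would report; without a declared encoding, the parser has to guess.

```python
import html5lib

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Bytes as they might arrive from an HTTP response body.
raw = u"<p>caf\u00e9".encode("utf-8")

# Stand-in for the charset taken from the Content-Type header.
document = html5lib.parse(raw, transport_encoding="utf-8")

paragraph = document.find(".//{%s}p" % XHTML_NS)
print(paragraph.text)
```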
To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:
    import html5lib
    with open("mydocument.html", "rb") as f:
        parser = html5lib.HTMLParser(strict=True)
        document = parser.parse(f)
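To illustrate the difference strict mode makes, the sketch below parses the same markup twice. It assumes html5lib is installed, that the exception class is importable as html5lib.html5parser.ParseError, and that a non-strict parser records problems in its errors attribute; the markup is chosen because a missing doctype counts as a parse error.

```python
import html5lib
from html5lib.html5parser import ParseError

markup = "<p>Hello World!"  # no doctype, which the parser reports as an error

# Default mode: errors are recorded and parsing continues.
lenient = html5lib.HTMLParser()
lenient.parse(markup)
print(len(lenient.errors) > 0)

# Strict mode: the first parse error raises an exception.
strict = html5lib.HTMLParser(strict=True)
raised = False
try:
    strict.parse(markup)
except ParseError as exc:
    raised = True
    print("parse error:", exc)
```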
When you’re instantiating parser objects explicitly, pass a treebuilder
class as the tree keyword argument to use an alternative document
format:
    import html5lib
    parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
    minidom_document = parser.parse("<p>Hello World!")
More documentation is available at https://html5lib.readthedocs.io/.
html5lib works on CPython 2.7+, CPython 3.4+ and PyPy. To install it, use:
$ pip install html5lib
The following third-party libraries may be used for additional functionality:
datrie can be used under CPython to improve parsing performance (though in almost all cases the improvement is marginal);
lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
genshi has a treewalker (but not builder); and
chardet can be used as a fallback when character encoding cannot be determined.
Unit tests require the pytest and mock libraries and can be
run using the py.test command in the root directory.
Test data are contained in a separate html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized:
$ git submodule init
$ git submodule update
If you have all compatible Python implementations available on your
system, you can run tests on all of them using the tox utility,
which can be found on PyPI.