html5lib Package

HTML parsing library based on the WHATWG HTML specification. The parser is designed to be compatible with existing HTML found in the wild and implements well-defined error recovery that is largely compatible with modern desktop web browsers.

Example usage:

import html5lib
with open("my_document.html", "rb") as f:
    tree = html5lib.parse(f)

For convenience, this module re-exports the following names:

  • parse()
  • parseFragment()
  • HTMLParser
  • getTreeBuilder()
  • getTreeWalker()
  • serialize()

constants Module

exception html5lib.constants.DataLossWarning

Bases: UserWarning

Raised when the current tree is unable to represent the input data
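Because DataLossWarning derives from UserWarning, it can be recorded or escalated with the standard warnings module. The following is a minimal sketch of collecting such warnings around a parse. The helper name parse_collecting_warnings is hypothetical and not part of html5lib.

```python
import warnings

import html5lib
from html5lib.constants import DataLossWarning


def parse_collecting_warnings(markup):
    """Parse markup and return (tree, list of DataLossWarning messages).

    Hypothetical helper for illustration; not part of html5lib.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Make sure repeated warnings are not hidden by the default filters.
        warnings.simplefilter("always", DataLossWarning)
        tree = html5lib.parse(markup)
    lost = [str(w.message) for w in caught
            if issubclass(w.category, DataLossWarning)]
    return tree, lost


tree, lost = parse_collecting_warnings("<p>Hello</p>")
print(lost)  # messages appear only if the tree could not represent the input
```

To treat data loss as a hard failure instead, the same filter call can use the "error" action, which turns the warning into an exception.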

html5parser Module

class html5lib.html5parser.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)

Bases: object

HTML parser

Generates a tree structure from a stream of (possibly malformed) HTML.

__init__(tree=None, strict=False, namespaceHTMLElements=True, debug=False)
Parameters:
  • tree – a treebuilder class controlling the type of tree that will be returned. Built in treebuilders can be accessed through html5lib.treebuilders.getTreeBuilder(treeType)
  • strict – raise an exception when a parse error is encountered
  • namespaceHTMLElements – whether or not to namespace HTML elements
  • debug – whether or not to enable debug mode which logs things
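Since the tree parameter expects a treebuilder class rather than a name, a parser for another tree type is usually built in two steps. A minimal sketch, assuming the built-in 'dom' treebuilder (backed by xml.dom.minidom) suits the use case:

```python
from html5lib.html5parser import HTMLParser
from html5lib.treebuilders import getTreeBuilder

# Look up the treebuilder class by name, then hand the class to the parser.
dom_builder = getTreeBuilder("dom")
parser = HTMLParser(tree=dom_builder)

document = parser.parse("<p>Hi</p>")

# The result is an xml.dom.minidom document, so DOM methods are available.
paragraph = document.getElementsByTagName("p")[0]
print(paragraph.firstChild.data)
```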

Example:

>>> from html5lib.html5parser import HTMLParser
>>> from html5lib.treebuilders import getTreeBuilder
>>> parser = HTMLParser()                                     # parser with the default etree builder
>>> parser = HTMLParser(getTreeBuilder('lxml'), strict=True)  # strict parser with the lxml builder (requires lxml)

documentEncoding

Name of the character encoding that was used to decode the input stream, or None if that is not determined yet
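A short sketch of how this attribute might be inspected before and after a parse. The input is passed as bytes so that the parser has to determine an encoding, and the meta element in the sample markup is an assumption made for the example.

```python
from html5lib.html5parser import HTMLParser

parser = HTMLParser()
before = parser.documentEncoding  # nothing has been parsed yet

# Bytes input makes the parser determine an encoding; here the meta
# element declares it.
parser.parse(b"<!DOCTYPE html><meta charset='utf-8'><p>caf\xc3\xa9</p>")
after = parser.documentEncoding

print(before, after)
```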

parse(stream, *args, **kwargs)

Parse an HTML document into a well-formed tree

Parameters:
  • stream

    a file-like object or string containing the HTML to be parsed

    The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).

  • scripting – treat noscript elements as if JavaScript was turned on
Returns:

parsed tree

Example:

>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
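The effect of the scripting parameter can be sketched by parsing the same markup twice. This assumes the default etree treebuilder and turns off namespacing only to keep the tag lookups short.

```python
from html5lib.html5parser import HTMLParser

MARKUP = "<body><noscript><p>x</p></noscript></body>"

# namespaceHTMLElements=False keeps tag names short for the lookups below.
parser = HTMLParser(namespaceHTMLElements=False)

# Default: scripting is off, so the noscript content is parsed as markup.
off = parser.parse(MARKUP).find(".//noscript")

# scripting=True: the noscript content is kept as raw text.
on = parser.parse(MARKUP, scripting=True).find(".//noscript")

print([child.tag for child in off], repr(off.text))
print([child.tag for child in on], repr(on.text))
```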
parseFragment(stream, *args, **kwargs)

Parse an HTML fragment into a well-formed tree fragment

Parameters:
  • container – name of the element whose innerHTML property is being set; if None, defaults to ‘div’
  • stream

    a file-like object or string containing the HTML to be parsed

    The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).

  • scripting – treat noscript elements as if JavaScript was turned on
Returns:

parsed tree

Example:

>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>

exception html5lib.html5parser.ParseError

Bases: Exception

Error in parsed document
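A minimal sketch of how ParseError relates to the strict parameter of HTMLParser. The sample markup is an assumption chosen because it lacks a doctype. The errors list read from the lenient parser is an attribute of the parser object that this section does not otherwise document.

```python
from html5lib.html5parser import HTMLParser, ParseError

MALFORMED = "<p>No doctype here"

# A lenient parser recovers and records the problems it saw.
lenient = HTMLParser()
lenient.parse(MALFORMED)
print(len(lenient.errors))

# A strict parser raises ParseError at the first problem instead.
strict = HTMLParser(strict=True)
try:
    strict.parse(MALFORMED)
    raised = False
except ParseError as exc:
    raised = True
    print("parse error:", exc)
```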

html5lib.html5parser.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse an HTML document as a string or file-like object into a tree

Parameters:
  • doc – the document to parse as a string or file-like object
  • treebuilder – the treebuilder to use when parsing
  • namespaceHTMLElements – whether or not to namespace HTML elements
Returns:

parsed tree

Example:

>>> from html5lib.html5parser import parse
>>> parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
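The namespaceHTMLElements parameter changes the tag names in the resulting tree, which in turn affects how elements are looked up. A short sketch, assuming the default etree treebuilder:

```python
from html5lib.html5parser import parse

DOC = "<html><body><p>This is a doc</p></body></html>"
XHTML = "{http://www.w3.org/1999/xhtml}"

namespaced = parse(DOC)  # default: namespaceHTMLElements=True
plain = parse(DOC, namespaceHTMLElements=False)

print(namespaced.tag)
print(plain.tag)

# Lookups must use the matching form of the tag name.
print(namespaced.find(".//" + XHTML + "p").text)
print(plain.find(".//p").text)
```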
html5lib.html5parser.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse an HTML fragment as a string or file-like object into a tree

Parameters:
  • doc – the fragment to parse as a string or file-like object
  • container – the container context to parse the fragment in
  • treebuilder – the treebuilder to use when parsing
  • namespaceHTMLElements – whether or not to namespace HTML elements
Returns:

parsed tree

Example:

>>> from html5lib.html5parser import parseFragment
>>> parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
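The container parameter matters when the fragment only makes sense inside a particular parent. The sketch below parses a table cell in the default 'div' context and in a 'tr' context. Namespacing is turned off only to keep the tag names short.

```python
from html5lib.html5parser import parseFragment

CELL = "<td>cell</td>"

# In the default 'div' context a table cell is out of place, so the td
# tags are dropped and only the text survives.
in_div = parseFragment(CELL, namespaceHTMLElements=False)

# In a 'tr' context the same markup produces a td element.
in_row = parseFragment(CELL, container="tr", namespaceHTMLElements=False)

print([el.tag for el in in_div], repr(in_div.text))
print([el.tag for el in in_row])
```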

serializer Module

exception html5lib.serializer.SerializeError

Bases: Exception

Error in serialized tree

html5lib.serializer.serialize(input, tree='etree', encoding=None, **serializer_opts)

Serializes the input token stream using the specified treewalker

Parameters:
  • input – the token stream to serialize
  • tree – the treewalker to use
  • encoding – the encoding to use
  • serializer_opts – any options to pass to the html5lib.serializer.HTMLSerializer that gets created
Returns:

the tree serialized as a string

Example:

>>> from html5lib.html5parser import parse
>>> from html5lib.serializer import serialize
>>> token_stream = parse('<html><body><p>Hi!</p></body></html>')
>>> serialize(token_stream, omit_optional_tags=False)
'<html><head></head><body><p>Hi!</p></body></html>'
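The encoding parameter decides whether the result is text or bytes. A minimal sketch, assuming a tree produced by the default etree treebuilder:

```python
from html5lib.html5parser import parse
from html5lib.serializer import serialize

tree = parse("<html><body><p>Hi!</p></body></html>")

# Without an encoding the result is text.
as_text = serialize(tree, omit_optional_tags=False)

# With an encoding the result is bytes in that encoding.
as_bytes = serialize(tree, encoding="utf-8", omit_optional_tags=False)

print(type(as_text).__name__, type(as_bytes).__name__)
```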
html5lib.serializer.xmlcharrefreplace_errors()

Implements the ‘xmlcharrefreplace’ error handling, which replaces an unencodable character with the appropriate XML character reference.
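The behaviour described above can be observed with the standard library alone. The sketch below uses the handler registered under the same name in the codecs module, on the assumption that the name documented here refers to that handler.

```python
import codecs

# Encoding to ASCII with this handler replaces what ASCII cannot represent.
encoded = "café ☃".encode("ascii", "xmlcharrefreplace")
print(encoded)  # b'caf&#233; &#9731;'

# The handler itself receives the UnicodeEncodeError and returns the
# replacement text plus the position at which encoding should resume.
handler = codecs.lookup_error("xmlcharrefreplace")
error = UnicodeEncodeError("ascii", "é", 0, 1, "ordinal not in range(128)")
replacement, resume_at = handler(error)
print(replacement, resume_at)  # &#233; 1
```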