html5lib Package

HTML parsing library based on the WHATWG HTML specification. The parser is designed to be compatible with existing HTML found in the wild and implements well-defined error recovery that is largely compatible with modern desktop web browsers.

Example usage:

import html5lib
with open("my_document.html", "rb") as f:
    tree = html5lib.parse(f)

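For convenience, this module re-exports the following names:

  • html5lib.html5parser.parse
  • html5lib.html5parser.parseFragment
  • html5lib.html5parser.HTMLParser
  • html5lib.treebuilders.getTreeBuilder
  • html5lib.treewalkers.getTreeWalker
  • html5lib.serializer.serialize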

html5lib.__version__ = '1.2-dev'

Distribution version number.

constants Module

exception html5lib.constants.DataLossWarning

Bases: UserWarning

Raised when the current tree is unable to represent the input data.
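Because DataLossWarning derives from UserWarning, the standard warnings module can filter it. The following is a minimal sketch, assuming you want any data loss to fail loudly (for example in a test run); the helper name parse_or_fail is hypothetical and not part of html5lib.

```python
import warnings

import html5lib
from html5lib.constants import DataLossWarning


def parse_or_fail(markup):
    """Parse markup, turning any DataLossWarning into an exception.

    Hypothetical helper for illustration; not part of html5lib.
    """
    with warnings.catch_warnings():
        # Escalate only this category; other warnings keep their filters.
        warnings.simplefilter("error", DataLossWarning)
        return html5lib.parse(markup)


root = parse_or_fail("<p>plain paragraph</p>")
print(root.tag)
```

The filter is scoped by catch_warnings(), so it does not change warning behavior for the rest of the program.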

html5parser Module

class html5lib.html5parser.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)

Bases: object

HTML parser

Generates a tree structure from a stream of (possibly malformed) HTML.

__init__(tree=None, strict=False, namespaceHTMLElements=True, debug=False)

  • tree – a treebuilder class controlling the type of tree that will be returned. Built-in treebuilders can be accessed through html5lib.treebuilders.getTreeBuilder(treeType)
  • strict – raise an exception when a parse error is encountered
  • namespaceHTMLElements – whether or not to namespace HTML elements
  • debug – whether or not to enable debug mode, which logs debugging information


>>> from html5lib import getTreeBuilder
>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()                     # generates parser with etree builder
>>> parser = HTMLParser(getTreeBuilder('lxml'), strict=True)  # strict parser with lxml builder (requires lxml)
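To make the effect of strict concrete, here is a small sketch comparing the default and strict modes on the same malformed input. It assumes the parser's errors list, an attribute of the html5lib implementation that is not described above, holds the parse errors recorded in default mode.

```python
from html5lib.html5parser import HTMLParser, ParseError

markup = "<p>No doctype, unclosed paragraph"

# Default mode: recover from errors and record them on the parser.
lenient = HTMLParser()
lenient.parse(markup)
print(len(lenient.errors), "parse error(s) recorded")

# Strict mode: the first parse error raises ParseError instead.
strict = HTMLParser(strict=True)
try:
    strict.parse(markup)
    rejected = False
except ParseError as exc:
    rejected = True
    print("rejected:", exc)
```

Strict mode is mostly useful for validation-style checks; for real-world HTML the default recovery behavior is usually what you want.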

documentEncoding

Name of the character encoding that was used to decode the input stream, or None if it has not been determined yet.
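A short sketch of reading the detected encoding, assuming the description above belongs to the parser's documentEncoding property and that the input is passed as bytes so the parser has to detect the encoding itself:

```python
from html5lib.html5parser import HTMLParser

parser = HTMLParser()
before = parser.documentEncoding  # nothing has been parsed yet

# Bytes input lets the parser sniff the encoding from the meta element.
parser.parse(b'<!DOCTYPE html><meta charset="iso-8859-2"><p>text</p>')
after = parser.documentEncoding

print(before, after)
```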

parse(stream, *args, **kwargs)

Parse an HTML document into a well-formed tree

  • stream

    a file-like object or string containing the HTML to be parsed

    The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).

  • scripting – treat noscript elements as if JavaScript was turned on

Returns: parsed tree


>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{}html' at 0x7feac4909db0>
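The namespaceHTMLElements option changes the tag names seen in the resulting tree. A minimal sketch with the default etree treebuilder, showing the root tag with and without namespacing:

```python
from html5lib.html5parser import HTMLParser

doc = "<!DOCTYPE html><title>Demo</title><p>Hello"

namespaced = HTMLParser().parse(doc)
plain = HTMLParser(namespaceHTMLElements=False).parse(doc)

print(namespaced.tag)                  # tag carries the XHTML namespace
print(plain.tag)                       # bare tag name
print([child.tag for child in plain])  # head and body are always created
```

Turning namespacing off is convenient when the tree is queried with plain tag names, for example with ElementTree's find() and iter().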

parseFragment(stream, *args, **kwargs)

Parse an HTML fragment into a well-formed tree fragment

  • container – name of the element whose innerHTML property is being set; if None, defaults to ‘div’
  • stream

    a file-like object or string containing the HTML to be parsed

    The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element)

  • scripting – treat noscript elements as if JavaScript was turned on

Returns: parsed tree


>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
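The container argument matters because the same markup parses differently depending on its context element. A small sketch, assuming the default etree treebuilder with namespacing turned off for readability:

```python
from html5lib.html5parser import HTMLParser

parser = HTMLParser(namespaceHTMLElements=False)

# In the default 'div' context a stray <td> tag is dropped; only its text survives.
in_div = parser.parseFragment("<td>cell</td>")

# In a 'tr' context the same markup yields a real table cell.
in_row = parser.parseFragment("<td>cell</td>", container="tr")

print([el.tag for el in in_div], repr(in_div.text))
print([el.tag for el in in_row])
```

Choose the container that matches where the fragment will eventually be inserted, mirroring how innerHTML behaves in a browser.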

exception html5lib.html5parser.ParseError

Bases: Exception

Error in parsed document

html5lib.html5parser.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse an HTML document as a string or file-like object into a tree

  • doc – the document to parse as a string or file-like object
  • treebuilder – the treebuilder to use when parsing
  • namespaceHTMLElements – whether or not to namespace HTML elements

Returns: parsed tree


>>> from html5lib.html5parser import parse
>>> parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{}html' at 0x7feac4909db0>
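The treebuilder argument selects the kind of tree that comes back. As a sketch, the same markup parsed with the default 'etree' builder and with the 'dom' builder, which produces an xml.dom.minidom document:

```python
import html5lib

markup = "<!DOCTYPE html><p>Hi!</p>"

etree_root = html5lib.parse(markup)                   # xml.etree element (default)
dom_doc = html5lib.parse(markup, treebuilder="dom")   # xml.dom.minidom document

paragraph = dom_doc.getElementsByTagName("p")[0]
print(type(etree_root).__name__, type(dom_doc).__name__)
print(paragraph.firstChild.data)
```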

html5lib.html5parser.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse an HTML fragment as a string or file-like object into a tree

  • doc – the fragment to parse as a string or file-like object
  • container – the container context to parse the fragment in
  • treebuilder – the treebuilder to use when parsing
  • namespaceHTMLElements – whether or not to namespace HTML elements

Returns: parsed tree


>>> from html5lib.html5parser import parseFragment
>>> parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>

serializer Module

exception html5lib.serializer.SerializeError

Bases: Exception

Error in serialized tree

html5lib.serializer.serialize(input, tree='etree', encoding=None, **serializer_opts)

Serializes the input token stream using the specified treewalker

  • input – the token stream to serialize
  • tree – the treewalker to use
  • encoding – the encoding to use
  • serializer_opts – any options to pass to the html5lib.serializer.HTMLSerializer that gets created

Returns: the tree serialized as a string


>>> from html5lib.html5parser import parse
>>> from html5lib.serializer import serialize
>>> token_stream = parse('<html><body><p>Hi!</p></body></html>')
>>> serialize(token_stream, omit_optional_tags=False)

html5lib.serializer.xmlcharrefreplace_errors()

Implements the ‘xmlcharrefreplace’ error handling, which replaces an unencodable character with the appropriate XML character reference.
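The 'xmlcharrefreplace' behavior is the standard Python codec error handler of that name, so it can be illustrated with the standard library alone. A minimal sketch:

```python
text = "caf\u00e9 \u2603"

# Characters that ASCII cannot encode become numeric character references.
encoded = text.encode("ascii", errors="xmlcharrefreplace")
print(encoded)  # b'caf&#233; &#9731;'
```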

class html5lib.serializer.HTMLSerializer(**kwargs)

Bases: object


Initialize HTMLSerializer

  • inject_meta_charset

    Whether or not to inject the meta charset.

    Defaults to True.

  • quote_attr_values

    When to quote attribute values: only when legacy browser behavior requires it ("legacy"), only when the standard requires it ("spec"), or always ("always").

    Defaults to "legacy".

  • quote_char

    Use given quote character for attribute quoting.

    Defaults to " which will use double quotes unless attribute value contains a double quote, in which case single quotes are used.

  • escape_lt_in_attrs

    Whether or not to escape < in attribute values.

    Defaults to False.

  • escape_rcdata

    Whether to escape, inside rcdata elements such as style, the characters that need to be escaped within normal elements.

    Defaults to False.

  • resolve_entities

    Whether to resolve named character entities that appear in the source tree. The XML predefined entities &lt; &gt; &amp; &quot; &apos; are unaffected by this setting.

    Defaults to True.

  • strip_whitespace

    Whether to remove semantically meaningless whitespace. (This compresses all whitespace to a single space except within pre.)

    Defaults to False.

  • minimize_boolean_attributes

    Shortens boolean attributes to give just the attribute name, for example:

    <input disabled="disabled">

    becomes:

    <input disabled>

    Defaults to True.

  • use_trailing_solidus

    Includes a close-tag slash at the end of the start tag of void elements (empty elements whose end tag is forbidden). E.g. <hr/>.

    Defaults to False.

  • space_before_trailing_solidus

    Places a space immediately before the closing slash in a tag using a trailing solidus. E.g. <hr />. Requires use_trailing_solidus=True.

    Defaults to True.

  • sanitize

    Strip all unsafe or unknown constructs from output. See html5lib.filters.sanitizer.Filter.

    Defaults to False.

  • omit_optional_tags

    Omit start/end tags that are optional.

    Defaults to True.

  • alphabetical_attributes

    Reorder attributes to be in alphabetical order.

    Defaults to False.
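The options above interact, so a side-by-side comparison can help. The following sketch serializes the same tree once with the defaults and once with several of the minimizing defaults switched off; the variable names are illustrative only.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

tree = html5lib.parse('<p class=a>Hi<br><input disabled="disabled">')
walker = html5lib.getTreeWalker("etree")

# All defaults: optional tags omitted, legacy quoting, minimized booleans.
compact = HTMLSerializer().render(walker(tree))

# Opt out of the minimizing defaults for more XML-like output.
explicit = HTMLSerializer(
    omit_optional_tags=False,
    quote_attr_values="always",
    minimize_boolean_attributes=False,
    use_trailing_solidus=True,
).render(walker(tree))

print(compact)
print(explicit)
```

The compact form is smaller, while the explicit form tends to be easier to read and to process with tools that expect fully written-out markup.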

render(treewalker, encoding=None)

Serializes the stream from the treewalker into a string

  • treewalker – the treewalker to serialize
  • encoding – the string encoding to use

Returns: the serialized tree


>>> from html5lib import parse, getTreeWalker
>>> from html5lib.serializer import HTMLSerializer
>>> token_stream = parse('<html><body>Hi!</body></html>')
>>> walker = getTreeWalker('etree')
>>> serializer = HTMLSerializer(omit_optional_tags=False)
>>> serializer.render(walker(token_stream))
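The encoding argument decides whether render() returns text or bytes. A brief sketch, assuming Python 3 and a document containing a character that ASCII cannot represent:

```python
import html5lib
from html5lib.serializer import HTMLSerializer

tree = html5lib.parse("<p>caf\u00e9</p>")
walker = html5lib.getTreeWalker("etree")
serializer = HTMLSerializer(omit_optional_tags=False)

as_text = serializer.render(walker(tree))                     # str
as_bytes = serializer.render(walker(tree), encoding="ascii")  # bytes

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes)
```

With an encoding given, characters outside that encoding are written as character references rather than raising an error, so the byte output stays valid in the target encoding.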