html5lib Package

HTML parsing library based on the WHATWG HTML specification. The parser is designed to be compatible with existing HTML found in the wild and implements well-defined error recovery that is largely compatible with modern desktop web browsers.

Example usage:

import html5lib
with open("my_document.html", "rb") as f:
    tree = html5lib.parse(f)

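For convenience, this module re-exports the following names:

  • parse() (from html5lib.html5parser)
  • parseFragment() (from html5lib.html5parser)
  • HTMLParser (from html5lib.html5parser)
  • getTreeBuilder() (from html5lib.treebuilders)
  • getTreeWalker() (from html5lib.treewalkers)
  • serialize() (from html5lib.serializer)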

html5lib.__version__ = '1.2-dev'

Distribution version number.

constants Module

exception html5lib.constants.DataLossWarning[source]

Bases: UserWarning

Raised when the current tree is unable to represent the input data
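
Because DataLossWarning derives from UserWarning, the standard warnings module can capture or escalate it. The following is a minimal sketch, assuming a hypothetical helper named collect_data_loss (it is not part of html5lib) that wraps any callable and reports the data-loss warnings it emitted:

```python
import warnings

from html5lib.constants import DataLossWarning


def collect_data_loss(func, *args, **kwargs):
    """Call func and return (result, list of DataLossWarning messages).

    collect_data_loss is a hypothetical helper, not an html5lib API.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every occurrence instead of only the first one per location.
        warnings.simplefilter("always", DataLossWarning)
        result = func(*args, **kwargs)
    messages = [str(w.message) for w in caught
                if issubclass(w.category, DataLossWarning)]
    return result, messages
```

To treat data loss as a hard failure instead, warnings.simplefilter("error", DataLossWarning) makes the warning raise as an exception.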

html5parser Module

class html5lib.html5parser.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)[source]

Bases: object

HTML parser

Generates a tree structure from a stream of (possibly malformed) HTML.

__init__(tree=None, strict=False, namespaceHTMLElements=True, debug=False)[source]
Parameters:
  • tree – a treebuilder class controlling the type of tree that will be returned. Built-in treebuilders can be accessed through html5lib.treebuilders.getTreeBuilder(treeType)
  • strict – raise an exception when a parse error is encountered
  • namespaceHTMLElements – whether or not to namespace HTML elements
  • debug – whether or not to enable debug mode, which logs diagnostic information about the parse

Example:

>>> from html5lib import getTreeBuilder
>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()                                     # parser with the default etree builder
>>> parser = HTMLParser(getTreeBuilder('lxml'), strict=True)  # strict parser with the lxml builder (tree must be a class, not a string)

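
The effect of the tree parameter can be sketched by parsing the same document with two built-in treebuilders. This is an illustration only: the "dom" tree type and the xml.dom.minidom result type are details not stated in the text above, and DOC is a made-up sample document.

```python
import xml.dom.minidom

from html5lib import HTMLParser, getTreeBuilder

DOC = "<!DOCTYPE html><title>t</title><p>Hello</p>"

# etree builder without namespacing: the result is the <html> Element.
etree_parser = HTMLParser(tree=getTreeBuilder("etree"),
                          namespaceHTMLElements=False)
root = etree_parser.parse(DOC)

# dom builder: the result is a DOM document (minidom by default).
dom_parser = HTMLParser(tree=getTreeBuilder("dom"))
document = dom_parser.parse(DOC)

print(root.tag)
print(document.documentElement.tagName)
```
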
documentEncoding

Name of the character encoding that was used to decode the input stream, or None if it has not been determined yet.
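
A small sketch of reading this attribute after parsing byte input, so that the parser has to determine the encoding itself. The sample bytes are invented for illustration, and the exact spelling of the returned name is not asserted here.

```python
from html5lib import HTMLParser

# Bytes input with a meta charset declaration (sample data for illustration).
raw = b'<!DOCTYPE html><meta charset="utf-8"><p>caf\xc3\xa9</p>'

parser = HTMLParser()
parser.parse(raw)

# After parsing, the attribute names the encoding used for decoding.
encoding_name = parser.documentEncoding
print(encoding_name)
```
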

parse(stream, *args, **kwargs)[source]

Parse an HTML document into a well-formed tree

Parameters:
  • stream

    a file-like object or string containing the HTML to be parsed

    The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).

  • scripting – treat noscript elements as if JavaScript was turned on
Returns:

parsed tree

Example:

>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>

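
The scripting parameter can be illustrated by parsing a noscript element both ways. This is a sketch under the assumption that, with scripting enabled, the content of noscript is kept as text rather than parsed into child elements. noscript_children is a hypothetical helper and DOC is sample input.

```python
from html5lib import HTMLParser

DOC = "<!DOCTYPE html><body><noscript><p>Enable JavaScript</p></noscript>"


def noscript_children(scripting):
    """Return the tag names of the element children of <noscript>."""
    # namespaceHTMLElements=False keeps tag names short for the lookup below.
    parser = HTMLParser(namespaceHTMLElements=False)
    tree = parser.parse(DOC, scripting=scripting)
    noscript = tree.find(".//noscript")
    return [child.tag for child in noscript]


without_scripting = noscript_children(False)  # content parsed as markup
with_scripting = noscript_children(True)      # content kept as raw text
print(without_scripting, with_scripting)
```
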
parseFragment(stream, *args, **kwargs)[source]

Parse an HTML fragment into a well-formed tree fragment

Parameters:
  • container – name of the element whose innerHTML property is being set; if set to None, defaults to ‘div’
  • stream

    a file-like object or string containing the HTML to be parsed

    The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element)

  • scripting – treat noscript elements as if JavaScript was turned on
Returns:

parsed tree

Example:

>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>

exception html5lib.html5parser.ParseError[source]

Bases: Exception

Error in parsed document
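
The difference between lenient and strict parsing can be sketched as follows. Note that the errors attribute used for the lenient parser is not described in the text above; it is shown here as an assumption about the parser object, and BROKEN is sample input.

```python
from html5lib import HTMLParser
from html5lib.html5parser import ParseError

BROKEN = "<p>No doctype, and an <b>unclosed tag"

# Lenient mode (the default): errors are recovered from and a tree is built.
lenient = HTMLParser()
lenient.parse(BROKEN)
error_count = len(lenient.errors)  # `errors` is not covered by the text above

# Strict mode: the first parse error raises ParseError.
strict = HTMLParser(strict=True)
try:
    strict.parse(BROKEN)
    raised = False
except ParseError as exc:
    raised = True
    print("strict parser stopped:", exc)
```
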

html5lib.html5parser.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)[source]

Parse an HTML document as a string or file-like object into a tree

Parameters:
  • doc – the document to parse as a string or file-like object
  • treebuilder – the treebuilder to use when parsing
  • namespaceHTMLElements – whether or not to namespace HTML elements
Returns:

parsed tree

Example:

>>> from html5lib.html5parser import parse
>>> parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>

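
With the default namespaceHTMLElements=True, element tags in an etree result carry the XHTML namespace shown in the repr above, so lookups need the qualified name. A brief sketch with a sample document (DOC and XHTML_NS are names chosen for this illustration):

```python
import html5lib

XHTML_NS = "http://www.w3.org/1999/xhtml"
DOC = "<html><body><p>This is a doc</p><p>Second</p></body></html>"

# Namespaced (default): tags look like "{http://www.w3.org/1999/xhtml}p".
tree = html5lib.parse(DOC)
paragraphs = [p.text for p in tree.iter("{%s}p" % XHTML_NS)]

# Without namespacing: plain tag names can be used.
plain = html5lib.parse(DOC, namespaceHTMLElements=False)
plain_paragraphs = [p.text for p in plain.iter("p")]

print(paragraphs)
print(plain_paragraphs)
```
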
html5lib.html5parser.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)[source]

Parse an HTML fragment as a string or file-like object into a tree

Parameters:
  • doc – the fragment to parse as a string or file-like object
  • container – the container context to parse the fragment in
  • treebuilder – the treebuilder to use when parsing
  • namespaceHTMLElements – whether or not to namespace HTML elements
Returns:

parsed tree

Example:

>>> from html5lib.html5parser import parseFragment
>>> parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
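
The container argument changes how the same markup is interpreted. The sketch below parses a table cell once in the default div context and once in a tr context. child_tags is a hypothetical helper and CELL is sample input.

```python
from html5lib import parseFragment

CELL = "<td>cell</td>"


def child_tags(container):
    """Parse CELL inside the given container and list the child tag names."""
    fragment = parseFragment(CELL, container=container,
                             namespaceHTMLElements=False)
    return [child.tag for child in fragment]


in_div = child_tags("div")  # a <td> is not valid directly inside a div
in_row = child_tags("tr")   # inside a table row the cell is kept
print(in_div, in_row)
```

This mirrors what a browser does when innerHTML is assigned on different elements, which is why the container should match the element the fragment is destined for.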

serializer Module

exception html5lib.serializer.SerializeError[source]

Bases: Exception

Error in serialized tree

html5lib.serializer.serialize(input, tree='etree', encoding=None, **serializer_opts)[source]

Serializes the input token stream using the specified treewalker

Parameters:
  • input – the token stream to serialize
  • tree – the treewalker to use
  • encoding – the encoding to use
  • serializer_opts – any options to pass to the html5lib.serializer.HTMLSerializer that gets created
Returns:

the tree serialized as a string

Example:

>>> from html5lib.html5parser import parse
>>> from html5lib.serializer import serialize
>>> token_stream = parse('<html><body><p>Hi!</p></body></html>')
>>> serialize(token_stream, omit_optional_tags=False)
'<html><head></head><body><p>Hi!</p></body></html>'

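
Serializer options are forwarded as keyword arguments, so the effect of a single option can be compared directly. A minimal sketch using a parsed fragment as sample input:

```python
from html5lib import parseFragment, serialize

# The second <p> implicitly closes the first one during parsing.
fragment = parseFragment("<p>one<p>two")

# Keep every tag explicit in the output.
explicit = serialize(fragment, omit_optional_tags=False)

# Default options: optional tags may be dropped again.
compact = serialize(fragment)

print(explicit)
print(compact)
```
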
html5lib.serializer.xmlcharrefreplace_errors()

Implements the ‘xmlcharrefreplace’ error handling, which replaces an unencodable character with the appropriate XML character reference.
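
Python's standard codec machinery provides an error handler with the same name, and its behaviour can be shown without html5lib at all. The sketch below uses only the standard library; how the serializer applies this handler internally is not claimed here.

```python
# "é" and "☃" cannot be represented in ASCII.
text = "caf\u00e9 \u2603"

# Unencodable characters become numeric XML character references.
encoded = text.encode("ascii", errors="xmlcharrefreplace")
print(encoded)  # b'caf&#233; &#9731;'
```
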

class html5lib.serializer.HTMLSerializer(**kwargs)[source]

Bases: object

__init__(**kwargs)[source]

Initialize HTMLSerializer

Parameters:
  • inject_meta_charset

    Whether or not to inject the meta charset.

    Defaults to True.

  • quote_attr_values

    When to quote attribute values: when legacy browser behavior requires it ("legacy"), only when the standard requires it ("spec"), or always ("always").

    Defaults to "legacy".

  • quote_char

    Use given quote character for attribute quoting.

    Defaults to ", which uses double quotes unless the attribute value contains a double quote, in which case single quotes are used.

  • escape_lt_in_attrs

    Whether or not to escape < in attribute values.

    Defaults to False.

  • escape_rcdata

    Whether to escape, within rcdata elements such as style, the characters that need to be escaped within normal elements.

    Defaults to False.

  • resolve_entities

    Whether to resolve named character entities that appear in the source tree. The XML predefined entities &lt; &gt; &amp; &quot; &apos; are unaffected by this setting.

    Defaults to True.

  • strip_whitespace

    Whether to remove semantically meaningless whitespace. (This compresses all whitespace to a single space except within pre.)

    Defaults to False.

  • minimize_boolean_attributes

    Shortens boolean attributes to give just the attribute name, for example:

    <input disabled="disabled">
    

    becomes:

    <input disabled>
    

    Defaults to True.

  • use_trailing_solidus

    Includes a close-tag slash at the end of the start tag of void elements (empty elements whose end tag is forbidden). E.g. <hr/>.

    Defaults to False.

  • space_before_trailing_solidus

    Places a space immediately before the closing slash in a tag using a trailing solidus. E.g. <hr />. Requires use_trailing_solidus=True.

    Defaults to True.

  • sanitize

    Strip all unsafe or unknown constructs from output. See html5lib.filters.sanitizer.Filter.

    Defaults to False.

  • omit_optional_tags

    Omit start/end tags that are optional.

    Defaults to True.

  • alphabetical_attributes

    Reorder attributes to be in alphabetical order.

    Defaults to False.
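
Several of the options above can be combined on one serializer. The sketch below contrasts the defaults with a more XML-like configuration. The sample markup and the variable names are chosen for illustration only.

```python
from html5lib import getTreeWalker, parseFragment
from html5lib.serializer import HTMLSerializer

fragment = parseFragment('<input disabled="disabled"><hr>')
walker = getTreeWalker("etree")

# Default options: boolean attributes are minimized, no trailing solidus.
compact = HTMLSerializer()

# XML-like output: full attribute values, always quoted, "<hr />" style.
xml_like = HTMLSerializer(
    minimize_boolean_attributes=False,
    quote_attr_values="always",
    use_trailing_solidus=True,
)

compact_html = compact.render(walker(fragment))
xml_like_html = xml_like.render(walker(fragment))
print(compact_html)
print(xml_like_html)
```
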

render(treewalker, encoding=None)[source]

Serializes the stream from the treewalker into a string

Parameters:
  • treewalker – the treewalker to serialize
  • encoding – the string encoding to use
Returns:

the serialized tree

Example:

>>> from html5lib import parse, getTreeWalker
>>> from html5lib.serializer import HTMLSerializer
>>> token_stream = parse('<html><body>Hi!</body></html>')
>>> walker = getTreeWalker('etree')
>>> serializer = HTMLSerializer(omit_optional_tags=False)
>>> serializer.render(walker(token_stream))
'<html><head></head><body>Hi!</body></html>'
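
The encoding parameter determines the type of the result. The sketch below assumes that render returns text when no encoding is given and bytes when one is. With an encoding set and inject_meta_charset left at its default, the output may also gain a meta charset element.

```python
from html5lib import getTreeWalker, parse
from html5lib.serializer import HTMLSerializer

tree = parse("<html><body>Hi!</body></html>")
walker = getTreeWalker("etree")
serializer = HTMLSerializer(omit_optional_tags=False)

# No encoding: a text string is returned.
as_text = serializer.render(walker(tree))

# With an encoding: the result is encoded, suitable for writing to a file
# opened in binary mode.
as_bytes = serializer.render(walker(tree), encoding="utf-8")

print(type(as_text).__name__, type(as_bytes).__name__)
```
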