html5lib Package

HTML parsing library based on the WHATWG “HTML5” specification. The parser is designed to be compatible with existing HTML found in the wild and implements well-defined error recovery that is largely compatible with modern desktop web browsers.

Example usage:

    import html5lib
    with open("my_document.html", "rb") as f:
        tree = html5lib.parse(f)

class html5lib.__init__.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)

Bases: object

HTML parser. Generates a tree structure from a stream of (possibly malformed) HTML.

adjustForeignAttributes(token)
adjustMathMLAttributes(token)
adjustSVGAttributes(token)
documentEncoding

The name of the character encoding that was used to decode the input stream, or None if that is not determined yet.

isHTMLIntegrationPoint(element)
isMathMLTextIntegrationPoint(element)
mainLoop()
normalizeToken(token)

HTML5 specific normalizations to the token stream

normalizedTokens()
parse(stream, *args, **kwargs)

Parse an HTML document into a well-formed tree.

stream - a file-like object or string containing the HTML to be parsed

encoding - optional; a string naming the character encoding. If specified, that encoding is used regardless of any BOM or later declaration (such as in a meta element).

scripting - treat noscript elements as if JavaScript were turned on
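As a rough illustration of the parse method described above, the sketch below feeds a malformed string to a non-strict parser. The sample markup is invented, and the `errors` attribute used to inspect recovered problems is not described in this reference, so treat it as an assumption about the installed html5lib version.

```python
import html5lib

# Hypothetical input: no doctype, unclosed <p> and <b> elements.
malformed = "<p>Hello <b>world"

# strict=False (the default) lets the parser recover instead of raising.
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("etree"),
                             namespaceHTMLElements=False)
document = parser.parse(malformed)

# The result is a complete tree: html, head and body are created as needed.
print(document.tag)
print([child.tag for child in document])

# Assumption: recovered parse errors are collected on parser.errors.
print(len(parser.errors))
```

Passing namespaceHTMLElements=False here only keeps the tag names short; with the default, each tag is prefixed with the XHTML namespace.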

parseError(errorcode='XXX-undefined-error', datavars=None)
parseFragment(stream, *args, **kwargs)

Parse an HTML fragment into a well-formed tree fragment.

container - name of the element whose innerHTML property is being set; if None, defaults to ‘div’

stream - a file-like object or string containing the HTML to be parsed

encoding - optional; a string naming the character encoding. If specified, that encoding is used regardless of any BOM or later declaration (such as in a meta element).

scripting - treat noscript elements as if JavaScript were turned on
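The effect of the container argument is easiest to see with markup that is only valid in a particular context. The following sketch uses the module-level parseFragment function with an invented table-cell string; the behaviour noted for the default container is what the standard error-recovery rules suggest, not something this reference states.

```python
import html5lib

# Hypothetical fragment: a table cell on its own.
cell_markup = "<td>cell</td>"

# Parsed as if it were the innerHTML of a <tr>: the <td> element is kept.
in_row = html5lib.parseFragment(cell_markup, container="tr",
                                namespaceHTMLElements=False)
print([child.tag for child in in_row])

# Parsed with the default container ("div"): a <td> is misplaced there,
# so error recovery is expected to drop the tag and keep only the text.
in_div = html5lib.parseFragment(cell_markup, namespaceHTMLElements=False)
print([child.tag for child in in_div], in_div.text)
```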

parseRCDataRawtext(token, contentType)

Generic RCDATA/RAWTEXT parsing algorithm.

contentType - RCDATA or RAWTEXT

reparseTokenNormal(token)
reset()
resetInsertionMode()
html5lib.__init__.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse a string or file-like object into a tree

html5lib.__init__.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)
html5lib.__init__.getTreeBuilder(treeType, implementation=None, **kwargs)

Get a TreeBuilder class for various types of tree with built-in support

treeType - the name of the tree type required (case-insensitive). Supported values are:

  • “dom” - A generic builder for DOM implementations, defaulting to an xml.dom.minidom based implementation.

  • “etree” - A generic builder for tree implementations exposing an ElementTree-like interface, defaulting to xml.etree.cElementTree if available and xml.etree.ElementTree if not.

  • “lxml” - An etree-based builder for lxml.etree, handling limitations of lxml’s implementation.

implementation - A module implementing the tree type, e.g. xml.etree.ElementTree or xml.etree.cElementTree (currently applies to the “etree” and “dom” tree types).
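A minimal sketch of selecting a tree builder, assuming the standard library's xml.dom.minidom as the backing implementation. The input string and variable names are illustrative only.

```python
import xml.dom.minidom
import html5lib

# Ask for the DOM builder and pass the backing module explicitly;
# omitting `implementation` should fall back to the same minidom default.
DomBuilder = html5lib.getTreeBuilder("dom", implementation=xml.dom.minidom)

# getTreeBuilder returns a class, which HTMLParser accepts as `tree`.
parser = html5lib.HTMLParser(tree=DomBuilder)
dom_document = parser.parse("<p>Hi</p>")

# The result can be queried with the usual DOM methods.
paragraph = dom_document.getElementsByTagName("p")[0]
print(paragraph.firstChild.data)
```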
html5lib.__init__.getTreeWalker(treeType, implementation=None, **kwargs)

Get a TreeWalker class for various types of tree with built-in support

Args:
treeType (str): the name of the tree type required (case-insensitive). Supported values are:

  • “dom”: The xml.dom.minidom DOM implementation

  • “etree”: A generic walker for tree implementations exposing an ElementTree-like interface (known to work with ElementTree, cElementTree and lxml.etree).

  • “lxml”: Optimized walker for lxml.etree

  • “genshi”: A Genshi stream

implementation: A module implementing the tree type, e.g. xml.etree.ElementTree or cElementTree (currently applies to the “etree” tree type only).
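To show what a tree walker produces, the sketch below walks a small parsed document and collects the names of its start tags. It assumes each token is a dict with "type" and "name" keys, which this reference does not spell out.

```python
import html5lib

tree = html5lib.parse("<p>Hi</p>", namespaceHTMLElements=False)

# getTreeWalker returns a class; instantiate it with the tree to walk.
TreeWalker = html5lib.getTreeWalker("etree")

# Assumption: each token is a dict, and start tags carry a "name" key.
start_tags = [token["name"] for token in TreeWalker(tree)
              if token["type"] == "StartTag"]
print(start_tags)
```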
html5lib.__init__.serialize(input, tree='etree', encoding=None, **serializer_opts)
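The serialize function is listed without a description, so here is a hedged sketch of a parse-then-serialize round trip. The option omit_optional_tags is one of the HTMLSerializer options listed in the serializer module; the input string is only an example.

```python
import html5lib

tree = html5lib.parse("<html><body><p>Hi!</p></body></html>")

# Extra keyword arguments are passed on as serializer options. Turning
# omit_optional_tags off keeps tags such as <head> and </p> in the output.
markup = html5lib.serialize(tree, tree="etree", omit_optional_tags=False)
print(markup)
```

With the default omit_optional_tags=True, tags that the HTML syntax allows to be left out are dropped, so the output is shorter.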

constants Module

exception html5lib.constants.DataLossWarning

Bases: UserWarning

exception html5lib.constants.ReparseException

Bases: Exception

html5parser Module

class html5lib.html5parser.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)

Bases: object

HTML parser. Generates a tree structure from a stream of (possibly malformed) HTML.

adjustForeignAttributes(token)
adjustMathMLAttributes(token)
adjustSVGAttributes(token)
documentEncoding

The name of the character encoding that was used to decode the input stream, or None if that is not determined yet.

isHTMLIntegrationPoint(element)
isMathMLTextIntegrationPoint(element)
mainLoop()
normalizeToken(token)

HTML5 specific normalizations to the token stream

normalizedTokens()
parse(stream, *args, **kwargs)

Parse an HTML document into a well-formed tree.

stream - a file-like object or string containing the HTML to be parsed

encoding - optional; a string naming the character encoding. If specified, that encoding is used regardless of any BOM or later declaration (such as in a meta element).

scripting - treat noscript elements as if JavaScript were turned on

parseError(errorcode='XXX-undefined-error', datavars=None)
parseFragment(stream, *args, **kwargs)

Parse an HTML fragment into a well-formed tree fragment.

container - name of the element whose innerHTML property is being set; if None, defaults to ‘div’

stream - a file-like object or string containing the HTML to be parsed

encoding - optional; a string naming the character encoding. If specified, that encoding is used regardless of any BOM or later declaration (such as in a meta element).

scripting - treat noscript elements as if JavaScript were turned on

parseRCDataRawtext(token, contentType)

Generic RCDATA/RAWTEXT parsing algorithm.

contentType - RCDATA or RAWTEXT

reparseTokenNormal(token)
reset()
resetInsertionMode()
exception html5lib.html5parser.ParseError

Bases: Exception

Error in parsed document
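A short sketch of when ParseError might surface, assuming it is raised by a parser constructed with strict=True as soon as the first parse error is encountered. The input string is hypothetical.

```python
import html5lib
from html5lib.html5parser import ParseError

# strict=True turns recoverable parse errors into exceptions.
strict_parser = html5lib.HTMLParser(strict=True)

try:
    # Hypothetical input: the missing doctype alone is a parse error.
    strict_parser.parse("<p>Unclosed paragraph")
    raised = False
except ParseError as error:
    raised = True
    print("rejected:", error)
```

With the default strict=False, the same input would be repaired silently and a tree returned.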

html5lib.html5parser.adjust_attributes(token, replacements)
html5lib.html5parser.impliedTagToken(name, type='EndTag', attributes=None, selfClosing=False)
html5lib.html5parser.method_decorator_metaclass(function)
html5lib.html5parser.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse a string or file-like object into a tree

html5lib.html5parser.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)

serializer Module

class html5lib.serializer.HTMLSerializer(**kwargs)

Bases: object

alphabetical_attributes = False
encode(string)
encodeStrict(string)
escape_lt_in_attrs = False
escape_rcdata = False
inject_meta_charset = True
minimize_boolean_attributes = True
omit_optional_tags = True
options = ('quote_attr_values', 'quote_char', 'use_best_quote_char', 'omit_optional_tags', 'minimize_boolean_attributes', 'use_trailing_solidus', 'space_before_trailing_solidus', 'escape_lt_in_attrs', 'escape_rcdata', 'resolve_entities', 'alphabetical_attributes', 'inject_meta_charset', 'strip_whitespace', 'sanitize')
quote_attr_values = 'legacy'
quote_char = '"'
render(treewalker, encoding=None)
resolve_entities = True
sanitize = False
serialize(treewalker, encoding=None)
serializeError(data='XXX ERROR MESSAGE NEEDED')
space_before_trailing_solidus = True
strip_whitespace = False
use_best_quote_char = True
use_trailing_solidus = False
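The class attributes above are the default values of the serializer options, and each can be overridden by keyword when constructing HTMLSerializer. The sketch below overrides two of them; the fragment and the tree walker choice are assumptions made for the example.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

# Hypothetical fragment with an unquoted attribute value.
fragment = html5lib.parseFragment("<a href=x>link</a>")
walker = html5lib.getTreeWalker("etree")

# Override two of the class-level defaults listed above.
serializer = HTMLSerializer(quote_attr_values="always",
                            omit_optional_tags=False)

# render() joins the serialized chunks into one string.
rendered = serializer.render(walker(fragment))
print(rendered)
```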
exception html5lib.serializer.SerializeError

Bases: Exception

Error in serialized tree

html5lib.serializer.htmlentityreplace_errors(exc)
html5lib.serializer.serialize(input, tree='etree', encoding=None, **serializer_opts)
html5lib.serializer.xmlcharrefreplace_errors()

Implements the ‘xmlcharrefreplace’ error handling, which replaces an unencodable character with the appropriate XML character reference.