html5lib Package

HTML parsing library based on the WHATWG HTML specification. The parser is designed to be compatible with existing HTML found in the wild and implements well-defined error recovery that is largely compatible with modern desktop web browsers.

Example usage:

import html5lib
with open("my_document.html", "rb") as f:
    tree = html5lib.parse(f)

For convenience, this module re-exports the following names:

constants Module

exception html5lib.constants.DataLossWarning

Bases: UserWarning

exception html5lib.constants.ReparseException

Bases: Exception

html5parser Module

class html5lib.html5parser.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)

Bases: object

HTML parser. Generates a tree structure from a stream of (possibly malformed) HTML.

adjustForeignAttributes(token)
adjustMathMLAttributes(token)
adjustSVGAttributes(token)
documentEncoding

The name of the character encoding that was used to decode the input stream, or None if that is not determined yet.

isHTMLIntegrationPoint(element)
isMathMLTextIntegrationPoint(element)
mainLoop()
normalizeToken(token)

HTML5 specific normalizations to the token stream

normalizedTokens()
parse(stream, *args, **kwargs)

Parse an HTML document into a well-formed tree.

stream - a file-like object or string containing the HTML to be parsed

The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).

scripting - treat noscript elements as if JavaScript were enabled
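A minimal sketch of calling this method on a parser instance follows. It rests on assumptions that this reference does not state: that the default etree tree builder returns the root html element, that tag names carry the XHTML namespace when namespaceHTMLElements=True, and that recoverable problems are collected on a parser.errors list. Text input is used so that no encoding argument is needed.

```python
import html5lib

# XHTML namespace; assumed to prefix tag names when namespaceHTMLElements=True.
XHTML = "{http://www.w3.org/1999/xhtml}"

parser = html5lib.HTMLParser(strict=False)
# The input has no doctype and never closes <p>; the parser recovers anyway.
root = parser.parse("<title>Hi</title><p>unclosed paragraph")

# head and body are implied by the parsing algorithm even though absent above.
title = root.find(XHTML + "head/" + XHTML + "title")
paragraph = root.find(XHTML + "body/" + XHTML + "p")

print(title.text)
print(paragraph.text)
# parser.errors is assumed to hold the recoverable errors seen during parsing.
print(len(parser.errors), "recoverable parse error(s) recorded")
```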

parseError(errorcode='XXX-undefined-error', datavars=None)
parseFragment(stream, *args, **kwargs)

Parse an HTML fragment into a well-formed tree fragment.

container - name of the element whose innerHTML property is being set; if None, defaults to ‘div’

stream - a file-like object or string containing the HTML to be parsed

The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).

scripting - treat noscript elements as if JavaScript were enabled
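The effect of container is easiest to see with markup that is only valid in a particular context. The sketch below uses the module-level parseFragment helper listed later in this reference and compares a tr container with the default div. It assumes the etree tree builder returns a fragment element whose children are the parsed nodes, and that namespaceHTMLElements=False yields plain tag names.

```python
import html5lib

markup = "<td>cell</td>"

# Parsed as if assigned to the innerHTML of a <tr>: the <td> is kept.
row_fragment = html5lib.parseFragment(
    markup, container="tr", namespaceHTMLElements=False
)

# Parsed in the default <div> context: a stray <td> tag is not valid there,
# so the tag is expected to be dropped while its text content is retained.
div_fragment = html5lib.parseFragment(markup, namespaceHTMLElements=False)

row_tags = [child.tag for child in row_fragment]
div_tags = [child.tag for child in div_fragment]
div_text = "".join(div_fragment.itertext())

print(row_tags)
print(div_tags, repr(div_text))
```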

parseRCDataRawtext(token, contentType)

Generic RCDATA/RAWTEXT parsing algorithm

contentType - RCDATA or RAWTEXT

reparseTokenNormal(token)
reset()
resetInsertionMode()
exception html5lib.html5parser.ParseError

Bases: Exception

Error in parsed document
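As an illustration, the sketch below assumes that a parser constructed with strict=True raises ParseError at the first parse error instead of recovering; this reference lists the strict flag and the exception but does not spell out that connection. The input omits the doctype, which counts as a parse error.

```python
import html5lib
from html5lib.html5parser import ParseError

strict_parser = html5lib.HTMLParser(strict=True)

try:
    strict_parser.parse("<p>no doctype, so this is not a conforming document")
    raised = False
except ParseError as exc:
    # Assumed behaviour: strict mode turns the first parse error into an exception.
    raised = True
    print("rejected:", exc)
```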

html5lib.html5parser.adjust_attributes(token, replacements)
html5lib.html5parser.impliedTagToken(name, type='EndTag', attributes=None, selfClosing=False)
html5lib.html5parser.method_decorator_metaclass(function)
html5lib.html5parser.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse a string or file-like object into a tree
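The treebuilder argument selects the kind of tree that is returned. The following sketch parses the same markup twice, once with the default etree builder and once with "dom". The "dom" name and the shape of the returned objects (an ElementTree element and an xml.dom.minidom document) are assumptions not stated in this reference.

```python
import html5lib

markup = "<!DOCTYPE html><p>Hello"

# Default builder; namespaceHTMLElements=False keeps tag names unprefixed.
etree_root = html5lib.parse(markup, namespaceHTMLElements=False)

# Alternative builder, assumed to return a minidom Document.
dom_document = html5lib.parse(markup, treebuilder="dom")

etree_text = etree_root.find("body/p").text
dom_text = dom_document.getElementsByTagName("p")[0].firstChild.data

print(etree_text, dom_text)
```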

html5lib.html5parser.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)

serializer Module

class html5lib.serializer.HTMLSerializer(**kwargs)

Bases: object

alphabetical_attributes = False
encode(string)
encodeStrict(string)
escape_lt_in_attrs = False
escape_rcdata = False
inject_meta_charset = True
minimize_boolean_attributes = True
omit_optional_tags = True
options = ('quote_attr_values', 'quote_char', 'use_best_quote_char', 'omit_optional_tags', 'minimize_boolean_attributes', 'use_trailing_solidus', 'space_before_trailing_solidus', 'escape_lt_in_attrs', 'escape_rcdata', 'resolve_entities', 'alphabetical_attributes', 'inject_meta_charset', 'strip_whitespace', 'sanitize')
quote_attr_values = 'legacy'
quote_char = '"'
render(treewalker, encoding=None)
resolve_entities = True
sanitize = False
serialize(treewalker, encoding=None)
serializeError(data='XXX ERROR MESSAGE NEEDED')
space_before_trailing_solidus = True
strip_whitespace = False
use_best_quote_char = True
use_trailing_solidus = False
exception html5lib.serializer.SerializeError

Bases: Exception

Error in serialized tree

html5lib.serializer.htmlentityreplace_errors(exc)
html5lib.serializer.serialize(input, tree='etree', encoding=None, **serializer_opts)
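A rough sketch of a parse and serialize round trip with this function is shown below. The keyword arguments are taken from the options tuple of HTMLSerializer above; the value "always" for quote_attr_values is an assumption, since only the 'legacy' default appears in this reference.

```python
import html5lib
from html5lib.serializer import serialize

# Unquoted attribute value and a missing </p>, as often found in the wild.
fragment = html5lib.parseFragment("<p class=a>Hi<br>there")

html_text = serialize(
    fragment,
    tree="etree",                 # must match the tree builder used for parsing
    omit_optional_tags=False,     # keep end tags such as </p>
    quote_attr_values="always",   # assumed value: quote every attribute
)

# With encoding=None the result is expected to be text rather than bytes.
print(html_text)
```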
html5lib.serializer.xmlcharrefreplace_errors()

Implements the ‘xmlcharrefreplace’ error handling, which replaces an unencodable character with the appropriate XML character reference.
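For reference, the replacement behaviour described here can be seen with the error handler of the same name in Python's standard library. The sketch below uses only the built-in handler, not the html5lib function, so it illustrates the technique rather than this module's implementation.

```python
text = "café ☃"

# Characters that ASCII cannot encode become decimal character references.
encoded = text.encode("ascii", errors="xmlcharrefreplace")

print(encoded)  # b'caf&#233; &#9731;'
```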