html5lib Package
HTML parsing library based on the WHATWG HTML specification. The parser is designed to be compatible with existing HTML found in the wild and implements well-defined error recovery that is largely compatible with modern desktop web browsers.
Example usage:
import html5lib
with open("my_document.html", "rb") as f:
    tree = html5lib.parse(f)
For convenience, this module re-exports the following names: parse(), parseFragment(), HTMLParser, getTreeBuilder(), getTreeWalker(), and serialize().
-
html5lib.__version__ = '1.2-dev'
Distribution version number.
constants Module
-
exception html5lib.constants.DataLossWarning
Bases: UserWarning
Raised when the current tree is unable to represent the input data.
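Because DataLossWarning derives from UserWarning, the standard warnings module can promote it to an exception. The sketch below is only an illustration: it emits the warning by hand through a hypothetical helper, lossy_operation, whereas in real use the library emits the warning itself.

```python
import warnings

from html5lib.constants import DataLossWarning


def lossy_operation():
    # Hypothetical stand-in for library code that cannot represent its input.
    warnings.warn("input could not be represented exactly", DataLossWarning)


with warnings.catch_warnings():
    # Promote DataLossWarning to an exception inside this block only.
    warnings.simplefilter("error", DataLossWarning)
    try:
        lossy_operation()
        data_loss_detected = False
    except DataLossWarning:
        data_loss_detected = True
```

Scoping the filter with catch_warnings() leaves the process-wide warning configuration untouched.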
html5parser Module
-
class html5lib.html5parser.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)
Bases: object
HTML parser
Generates a tree structure from a stream of (possibly malformed) HTML.
-
__init__(tree=None, strict=False, namespaceHTMLElements=True, debug=False)
Parameters:
- tree – a treebuilder class controlling the type of tree that will be returned. Built-in treebuilders can be accessed through html5lib.treebuilders.getTreeBuilder(treeType)
- strict – raise an exception when a parse error is encountered
- namespaceHTMLElements – whether or not to namespace HTML elements
- debug – whether or not to enable debug mode, which logs parser activity
Example:
>>> from html5lib.html5parser import HTMLParser
>>> from html5lib.treebuilders import getTreeBuilder
>>> parser = HTMLParser()  # generates parser with etree builder
>>> parser = HTMLParser(getTreeBuilder('lxml'), strict=True)  # generates strict parser with lxml builder
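The effect of the strict flag can be sketched by parsing the same malformed input twice. This is a minimal illustration, not a definitive recipe: the ParseError exception class and the parser's errors list are not described in this section and are used here on the assumption that html5lib.html5parser provides them.

```python
from html5lib.html5parser import HTMLParser, ParseError

# Missing doctype, which the parser reports as a parse error.
MALFORMED = "<p>no doctype here"

# Default mode: the parser recovers and records the problems it found.
lenient = HTMLParser()
lenient.parse(MALFORMED)
error_count = len(lenient.errors)

# Strict mode: the first parse error is raised as an exception.
strict = HTMLParser(strict=True)
try:
    strict.parse(MALFORMED)
    strict_raised = False
except ParseError:
    strict_raised = True
```

Lenient parsing suits content from the wild; strict parsing is more useful for validating markup you control.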
-
documentEncoding
Name of the character encoding that was used to decode the input stream, or None if that is not determined yet.
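As a small sketch of how this attribute might be read, the snippet below parses a byte string that declares its encoding in a meta element and inspects documentEncoding afterwards. The sample markup is made up for illustration.

```python
from html5lib.html5parser import HTMLParser

# Bytes input, so the parser has to work out the encoding itself.
raw = b'<!DOCTYPE html><meta charset="windows-1252"><p>caf\xe9</p>'

parser = HTMLParser()
tree = parser.parse(raw)

# Read the attribute after parsing; here it reflects the meta declaration.
encoding_name = parser.documentEncoding
```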
-
parse(stream, *args, **kwargs)
Parse an HTML document into a well-formed tree
Parameters:
- stream – a file-like object or string containing the HTML to be parsed. The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
- scripting – treat noscript elements as if JavaScript was turned on
Returns: parsed tree
Example:
>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
-
parseFragment(stream, *args, **kwargs)
Parse an HTML fragment into a well-formed tree fragment
Parameters:
- container – name of the element whose innerHTML property is being set; if None, defaults to ‘div’
- stream – a file-like object or string containing the HTML to be parsed. The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
- scripting – treat noscript elements as if JavaScript was turned on
Returns: parsed tree
Example:
>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
-
-
html5lib.html5parser.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)
Parse an HTML document as a string or file-like object into a tree
Parameters:
- doc – the document to parse as a string or file-like object
- treebuilder – the treebuilder to use when parsing
- namespaceHTMLElements – whether or not to namespace HTML elements
Returns: parsed tree
Example:
>>> from html5lib.html5parser import parse
>>> parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
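The treebuilder argument changes the type of tree that comes back. As a hedged sketch, assuming the 'dom' treebuilder name is available and backed by xml.dom.minidom, the same call can return a DOM document instead of an etree element:

```python
from html5lib.html5parser import parse

# Ask for a DOM tree instead of the default etree.
document = parse("<html><body><p>Hi</p></body></html>", treebuilder="dom")

# Standard DOM methods then work on the result.
paragraphs = document.getElementsByTagName("p")
first_text = paragraphs[0].firstChild.data
```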
-
html5lib.html5parser.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)
Parse an HTML fragment as a string or file-like object into a tree
Parameters:
- doc – the fragment to parse as a string or file-like object
- container – the container context to parse the fragment in
- treebuilder – the treebuilder to use when parsing
- namespaceHTMLElements – whether or not to namespace HTML elements
Returns: parsed tree
Example:
>>> from html5lib.html5parser import parseFragment
>>> parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
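The container context matters for markup that is only valid inside certain parents. The following sketch, using a made-up table-cell fragment, parses the same string in the default div context and in a tr context; namespaceHTMLElements=False is passed only to keep the tag names short.

```python
from html5lib.html5parser import parseFragment

CELL = "<td>cell</td>"

# Default context is a div, where a td start tag is not allowed and is dropped.
in_div = parseFragment(CELL, namespaceHTMLElements=False)

# Inside a table row, the td element is kept.
in_row = parseFragment(CELL, container="tr", namespaceHTMLElements=False)

div_children = [child.tag for child in in_div]
row_children = [child.tag for child in in_row]
```

Choosing the container to match where the fragment will eventually be inserted gives the same result a browser would produce for innerHTML on that element.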
serializer Module
-
html5lib.serializer.serialize(input, tree='etree', encoding=None, **serializer_opts)
Serializes the input token stream using the specified treewalker
Parameters:
- input – the token stream to serialize
- tree – the treewalker to use
- encoding – the encoding to use
- serializer_opts – any options to pass to the html5lib.serializer.HTMLSerializer that gets created
Returns: the tree serialized as a string
Example:
>>> from html5lib.html5parser import parse
>>> from html5lib.serializer import serialize
>>> token_stream = parse('<html><body><p>Hi!</p></body></html>')
>>> serialize(token_stream, omit_optional_tags=False)
'<html><head></head><body><p>Hi!</p></body></html>'
-
html5lib.serializer.xmlcharrefreplace_errors()
Implements the ‘xmlcharrefreplace’ error handling, which replaces an unencodable character with the appropriate XML character reference.
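The same error-handling scheme is available by name in Python's standard codec machinery. The snippet below illustrates the behaviour with str.encode rather than calling the html5lib name directly:

```python
# "caf\u00e9" contains one character that ASCII cannot represent.
text = "caf\u00e9"

# The unencodable character is replaced by a numeric character reference.
encoded = text.encode("ascii", "xmlcharrefreplace")
```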
-
class html5lib.serializer.HTMLSerializer(**kwargs)
Bases: object
-
__init__(**kwargs)
Initialize HTMLSerializer
Parameters:
- inject_meta_charset – Whether or not to inject the meta charset. Defaults to True.
- quote_attr_values – When to quote attribute values: only when legacy browser behavior requires it ("legacy"), only when the standard requires it ("spec"), or always ("always"). Defaults to "legacy".
- quote_char – Use given quote character for attribute quoting. Defaults to ", which uses double quotes unless the attribute value contains a double quote, in which case single quotes are used.
- escape_lt_in_attrs – Whether or not to escape < in attribute values. Defaults to False.
- escape_rcdata – Whether to escape characters that need to be escaped within normal elements within rcdata elements such as style. Defaults to False.
- resolve_entities – Whether to resolve named character entities that appear in the source tree. The XML predefined entities &lt; &gt; &amp; &quot; &apos; are unaffected by this setting. Defaults to True.
- strip_whitespace – Whether to remove semantically meaningless whitespace. (This compresses all whitespace to a single space except within pre.) Defaults to False.
- minimize_boolean_attributes – Shortens boolean attributes to just the attribute name, for example:
<input disabled="disabled">
becomes:
<input disabled>
Defaults to True.
- use_trailing_solidus – Includes a close-tag slash at the end of the start tag of void elements (empty elements whose end tag is forbidden). E.g. <hr/>. Defaults to False.
- space_before_trailing_solidus – Places a space immediately before the closing slash in a tag using a trailing solidus. E.g. <hr />. Requires use_trailing_solidus=True. Defaults to True.
- sanitize – Strip all unsafe or unknown constructs from output. See html5lib.filters.sanitizer.Filter. Defaults to False.
- omit_optional_tags – Omit start/end tags that are optional. Defaults to True.
- alphabetical_attributes – Reorder attributes to be in alphabetical order. Defaults to False.
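A few of these options can be compared side by side by serializing one tree with different settings. This is a minimal sketch with made-up sample markup; the options are passed through the serialize() convenience function, which forwards them to HTMLSerializer.

```python
from html5lib.html5parser import parse
from html5lib.serializer import serialize

tree = parse("<p class=note>Hi!<br>Bye</p>")

# Defaults: attribute values are quoted only when legacy browsers need it.
default_html = serialize(tree)

# Always quote attribute values.
quoted_html = serialize(tree, quote_attr_values="always")

# Add a trailing solidus to void elements such as br.
solidus_html = serialize(tree, use_trailing_solidus=True)
```

Keeping the defaults yields the most compact output, while quote_attr_values="always" and use_trailing_solidus=True produce markup that is closer to XML conventions.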
-
render(treewalker, encoding=None)
Serializes the stream from the treewalker into a string
Parameters:
- treewalker – the treewalker to serialize
- encoding – the string encoding to use
Returns: the serialized tree
Example:
>>> from html5lib import parse, getTreeWalker
>>> from html5lib.serializer import HTMLSerializer
>>> token_stream = parse('<html><body>Hi!</body></html>')
>>> walker = getTreeWalker('etree')
>>> serializer = HTMLSerializer(omit_optional_tags=False)
>>> serializer.render(walker(token_stream))
'<html><head></head><body>Hi!</body></html>'
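The encoding argument affects the type of the result. The sketch below assumes, based on the current implementation rather than on anything stated above, that render() returns text when no encoding is given and bytes when one is supplied, with characters the target encoding cannot represent replaced by character references.

```python
from html5lib import parse, getTreeWalker
from html5lib.serializer import HTMLSerializer

tree = parse("<html><body>caf\u00e9</body></html>")
walker = getTreeWalker("etree")
serializer = HTMLSerializer(omit_optional_tags=False)

# No encoding: the result is a text string.
as_text = serializer.render(walker(tree))

# With an encoding: the result is a byte string in that encoding.
as_bytes = serializer.render(walker(tree), encoding="ascii")
```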
-