html5lib Package
HTML parsing library based on the WHATWG HTML specification. The parser is designed to be compatible with existing HTML found in the wild and implements well-defined error recovery that is largely compatible with modern desktop web browsers.
Example usage:
import html5lib
with open("my_document.html", "rb") as f:
    tree = html5lib.parse(f)
For convenience, this module re-exports the following names:
- html5lib.__version__ = '1.2-dev'
Distribution version number.
constants Module
- exception html5lib.constants.DataLossWarning
Bases: UserWarning
Raised when the current tree is unable to represent the input data.
html5parser Module
- class html5lib.html5parser.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)
Bases: object
HTML parser
Generates a tree structure from a stream of (possibly malformed) HTML.
- __init__(tree=None, strict=False, namespaceHTMLElements=True, debug=False)
Parameters:
- tree – a treebuilder class controlling the type of tree that will be returned. Built-in treebuilders can be accessed through html5lib.treebuilders.getTreeBuilder(treeType)
- strict – raise an exception when a parse error is encountered
- namespaceHTMLElements – whether or not to namespace HTML elements
- debug – whether or not to enable debug mode, which turns on logging
Example:
>>> from html5lib.html5parser import HTMLParser
>>> from html5lib.treebuilders import getTreeBuilder
>>> parser = HTMLParser()  # generates parser with etree builder
>>> parser = HTMLParser(getTreeBuilder('lxml'), strict=True)  # generates parser with lxml builder which is strict
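The effect of the strict flag can be illustrated with a short sketch. It assumes two names that this page does not document: the ParseError exception in html5lib.html5parser and the errors list that the parser fills in during a lenient parse. The sample markup is made up for illustration.

```python
from html5lib.html5parser import HTMLParser, ParseError

# Markup without a doctype; the parser treats this as a parse error.
markup = "<p>missing doctype"

# Lenient (default) mode: the parser recovers and records the errors
# on its `errors` attribute (assumed here, not listed on this page).
lenient = HTMLParser()
lenient.parse(markup)
error_count = len(lenient.errors)

# Strict mode: the first parse error is raised instead of recovered from.
strict = HTMLParser(strict=True)
try:
    strict.parse(markup)
    raised = False
except ParseError:
    raised = True

print(error_count > 0, raised)
```

Strict mode is mostly useful for validating markup you control; for HTML found in the wild the lenient default is usually the better fit.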
- documentEncoding
Name of the character encoding that was used to decode the input stream, or None if that has not been determined yet.
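As a minimal sketch of how this attribute behaves, the following reads it before and after parsing byte input that declares its encoding in a meta element. The sample document is an assumption made for illustration.

```python
from html5lib.html5parser import HTMLParser

parser = HTMLParser()

# Nothing has been parsed yet, so no encoding has been determined.
before = parser.documentEncoding

# Byte input: the parser has to work out the encoding itself, here
# from the meta charset declaration in the document.
parser.parse(b'<!DOCTYPE html><meta charset="utf-8"><p>caf\xc3\xa9</p>')
after = parser.documentEncoding

print(before, after)
```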
- parse(stream, *args, **kwargs)
Parse an HTML document into a well-formed tree
Parameters:
- stream – a file-like object or string containing the HTML to be parsed.
The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
- scripting – treat noscript elements as if JavaScript was turned on
Returns: parsed tree
Example:
>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
- parseFragment(stream, *args, **kwargs)
Parse an HTML fragment into a well-formed tree fragment
Parameters:
- container – name of the element whose innerHTML property is being set; if None, defaults to ‘div’
- stream – a file-like object or string containing the HTML to be parsed.
The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
- scripting – treat noscript elements as if JavaScript was turned on
Returns: parsed tree
Example:
>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
- html5lib.html5parser.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)
Parse an HTML document as a string or file-like object into a tree
Parameters:
- doc – the document to parse as a string or file-like object
- treebuilder – the treebuilder to use when parsing
- namespaceHTMLElements – whether or not to namespace HTML elements
Returns: parsed tree
Example:
>>> from html5lib.html5parser import parse
>>> parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
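The treebuilder and namespaceHTMLElements parameters change what comes back from parse. The sketch below compares three calls on the same made-up markup; it assumes the 'dom' treebuilder returns an xml.dom.minidom document, which this page does not state.

```python
from html5lib.html5parser import parse

markup = "<!DOCTYPE html><title>t</title><p>Hi"

# Default: etree builder, element tags carry the XHTML namespace.
ns_root = parse(markup)

# Same builder, but with plain (un-namespaced) tag names.
plain_root = parse(markup, namespaceHTMLElements=False)
paragraph = plain_root.find("body/p")

# 'dom' builder: assumed here to yield an xml.dom.minidom document.
dom_doc = parse(markup, treebuilder="dom")
dom_p = dom_doc.getElementsByTagName("p")[0]

print(ns_root.tag)
print(plain_root.tag, paragraph.text)
print(dom_p.firstChild.data)
```

Turning namespacing off keeps ElementTree paths such as "body/p" short, at the cost of no longer distinguishing HTML elements from embedded SVG or MathML by namespace.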
- html5lib.html5parser.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)
Parse an HTML fragment as a string or file-like object into a tree
Parameters:
- doc – the fragment to parse as a string or file-like object
- container – the container context to parse the fragment in
- treebuilder – the treebuilder to use when parsing
- namespaceHTMLElements – whether or not to namespace HTML elements
Returns: parsed tree
Example:
>>> from html5lib.html5parser import parseFragment
>>> parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
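The container argument matters because some elements are only valid in a particular context. As a rough sketch with a made-up table-cell fragment: parsed in the default div context the td tag is dropped and only its text survives, while parsed in a tr context the cell is kept.

```python
from html5lib.html5parser import parseFragment

cell = "<td>cell</td>"

# Default container ('div'): a td start tag is not allowed here,
# so the parser ignores the tag and keeps only the text.
in_div = parseFragment(cell, namespaceHTMLElements=False)

# Container 'tr': the same markup is parsed as a real table cell.
in_row = parseFragment(cell, container="tr", namespaceHTMLElements=False)

div_tags = [child.tag for child in in_div]
row_tags = [child.tag for child in in_row]

print(div_tags, in_div.text)
print(row_tags, in_row[0].text)
```

This mirrors how browsers treat innerHTML: the result depends on the element the fragment is being inserted into.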
serializer Module
- html5lib.serializer.serialize(input, tree='etree', encoding=None, **serializer_opts)
Serializes the input token stream using the specified treewalker
Parameters:
- input – the token stream to serialize
- tree – the treewalker to use
- encoding – the encoding to use
- serializer_opts – any options to pass to the html5lib.serializer.HTMLSerializer that gets created
Returns: the tree serialized as a string
Example:
>>> from html5lib.html5parser import parse
>>> from html5lib.serializer import serialize
>>> token_stream = parse('<html><body><p>Hi!</p></body></html>')
>>> serialize(token_stream, omit_optional_tags=False)
'<html><head></head><body><p>Hi!</p></body></html>'
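Because serializer_opts are forwarded to HTMLSerializer, the same tree can be written out in quite different styles. The following sketch, using made-up markup, contrasts the defaults with a more explicit configuration built from options documented for HTMLSerializer.

```python
from html5lib.html5parser import parse
from html5lib.serializer import serialize

tree = parse("<p id=b class=a>Hi!</p>")

# Defaults: optional tags are omitted and attribute values are only
# quoted where needed.
default_html = serialize(tree)

# Explicit style: keep every tag, always quote, sort attributes.
explicit_html = serialize(
    tree,
    omit_optional_tags=False,
    quote_attr_values="always",
    alphabetical_attributes=True,
)

print(default_html)
print(explicit_html)
```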
- html5lib.serializer.xmlcharrefreplace_errors()
Implements the ‘xmlcharrefreplace’ error handling, which replaces an unencodable character with the appropriate XML character reference.
- class html5lib.serializer.HTMLSerializer(**kwargs)
Bases: object
- __init__(**kwargs)
Initialize HTMLSerializer
Parameters:
- inject_meta_charset – Whether or not to inject the meta charset. Defaults to True.
- quote_attr_values – Whether to quote attribute values that don’t require quoting per legacy browser behavior ("legacy"), when required by the standard ("spec"), or always ("always"). Defaults to "legacy".
- quote_char – Use given quote character for attribute quoting. Defaults to " which will use double quotes unless attribute value contains a double quote, in which case single quotes are used.
- escape_lt_in_attrs – Whether or not to escape < in attribute values. Defaults to False.
- escape_rcdata – Whether to escape characters that need to be escaped within normal elements within rcdata elements such as style. Defaults to False.
- resolve_entities – Whether to resolve named character entities that appear in the source tree. The XML predefined entities &lt; &gt; &amp; &quot; &apos; are unaffected by this setting. Defaults to True.
- strip_whitespace – Whether to remove semantically meaningless whitespace. (This compresses all whitespace to a single space except within pre.) Defaults to False.
- minimize_boolean_attributes – Shortens boolean attributes to give just the attribute name, for example: <input disabled="disabled"> becomes: <input disabled>. Defaults to True.
- use_trailing_solidus – Includes a close-tag slash at the end of the start tag of void elements (empty elements whose end tag is forbidden). E.g. <hr/>. Defaults to False.
- space_before_trailing_solidus – Places a space immediately before the closing slash in a tag using a trailing solidus. E.g. <hr />. Requires use_trailing_solidus=True. Defaults to True.
- sanitize – Strip all unsafe or unknown constructs from output. See html5lib.filters.sanitizer.Filter. Defaults to False.
- omit_optional_tags – Omit start/end tags that are optional. Defaults to True.
- alphabetical_attributes – Reorder attributes to be in alphabetical order. Defaults to False.
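A few of these options can be seen side by side in a small sketch. The markup and variable names are made up; the option names are the ones listed above.

```python
from html5lib import getTreeWalker, parse
from html5lib.serializer import HTMLSerializer

tree = parse("<p>one<hr><input disabled>")
walker = getTreeWalker("etree")

# HTML-style output: void elements have no slash and boolean
# attributes are minimized (the defaults for both options).
html_style = HTMLSerializer(omit_optional_tags=False)

# XML-flavoured output: trailing solidus on void elements and
# boolean attributes written out with an explicit value.
xml_style = HTMLSerializer(
    omit_optional_tags=False,
    use_trailing_solidus=True,
    minimize_boolean_attributes=False,
)

html_out = html_style.render(walker(tree))
xml_out = xml_style.render(walker(tree))

print(html_out)
print(xml_out)
```

Since space_before_trailing_solidus defaults to True, the second serializer writes a space before each closing slash.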
- render(treewalker, encoding=None)
Serializes the stream from the treewalker into a string
Parameters:
- treewalker – the treewalker to serialize
- encoding – the string encoding to use
Returns: the serialized tree
Example:
>>> from html5lib import parse, getTreeWalker
>>> from html5lib.serializer import HTMLSerializer
>>> token_stream = parse('<html><body>Hi!</body></html>')
>>> walker = getTreeWalker('etree')
>>> serializer = HTMLSerializer(omit_optional_tags=False)
>>> serializer.render(walker(token_stream))
'<html><head></head><body>Hi!</body></html>'
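The encoding argument decides whether render produces text or bytes. A minimal sketch, reusing the setup from the example above: without an encoding the result is a text string, and with one it is a byte string in that encoding.

```python
from html5lib import getTreeWalker, parse
from html5lib.serializer import HTMLSerializer

tree = parse("<html><body>Hi!</body></html>")
walker = getTreeWalker("etree")
serializer = HTMLSerializer(omit_optional_tags=False)

# No encoding: the result is a text string.
as_text = serializer.render(walker(tree))

# With an encoding: the result is a byte string in that encoding.
as_bytes = serializer.render(walker(tree), encoding="utf-8")

print(type(as_text).__name__, type(as_bytes).__name__)
```

Because inject_meta_charset defaults to True, the encoded output may also carry a meta charset declaration that the text output lacks.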