html5lib Package

HTML parsing library based on the WHATWG HTML specification. The parser is designed to be compatible with existing HTML found in the wild and implements well-defined error recovery that is largely compatible with modern desktop web browsers.

Example usage:

import html5lib
with open("my_document.html", "rb") as f:
    tree = html5lib.parse(f)

For convenience, this module re-exports the following names:

html5lib.__version__ = '1.2-dev': Distribution version number.

`constants` Module

exception html5lib.constants.DataLossWarning

Bases: UserWarning

Warning issued when the current tree is unable to represent the input data.
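
Because DataLossWarning derives from UserWarning, it can be handled with the standard warnings module. The sketch below escalates it to an exception for the duration of one parse; the helper name `parse_strictly_lossless` is hypothetical and not part of html5lib.

```python
import warnings

import html5lib
from html5lib.constants import DataLossWarning


def parse_strictly_lossless(markup):
    """Hypothetical helper: parse markup, failing if a DataLossWarning is issued."""
    with warnings.catch_warnings():
        # Escalate only DataLossWarning; other warnings keep their usual handling.
        warnings.simplefilter("error", category=DataLossWarning)
        return html5lib.parse(markup)


tree = parse_strictly_lossless("<p>Hello</p>")
```

The filter is restored when the `with` block exits, so the stricter behavior does not leak into the rest of the program.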

`html5parser` Module

class html5lib.html5parser.HTMLParser(tree=None, strict=False, namespaceHTMLElements=True, debug=False)

Bases: object

HTML parser

Generates a tree structure from a stream of (possibly malformed) HTML.

__init__(tree=None, strict=False, namespaceHTMLElements=True, debug=False)

Parameters:

tree – a treebuilder class controlling the type of tree that will be returned. Built-in treebuilders can be accessed through html5lib.treebuilders.getTreeBuilder(treeType)
strict – raise an exception when a parse error is encountered
namespaceHTMLElements – whether or not to namespace HTML elements
debug – whether or not to enable debug mode which logs things

Example:

>>> from html5lib.html5parser import HTMLParser
>>> from html5lib.treebuilders import getTreeBuilder
>>> parser = HTMLParser()                                     # generates parser with etree builder
>>> parser = HTMLParser(getTreeBuilder('lxml'), strict=True)  # generates strict parser with lxml builder (requires lxml)
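
As a rough illustration of the strict flag, the sketch below parses the same malformed input twice. It assumes the parser keeps recovered errors in an `errors` attribute, which is not described in this reference and is taken from the html5lib source.

```python
from html5lib.html5parser import HTMLParser, ParseError

markup = "<p>No doctype here"  # a missing doctype counts as a parse error

# Default (lenient) mode: the parser recovers and keeps going.
lenient = HTMLParser()
lenient.parse(markup)
recorded_errors = lenient.errors  # assumed attribute, see the note above

# Strict mode: the first parse error is raised as ParseError.
strict_parser = HTMLParser(strict=True)
try:
    strict_parser.parse(markup)
    strict_failed = False
except ParseError:
    strict_failed = True
```

Strict mode is mostly useful for validating markup you control; input from the wild generally needs the lenient default.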

documentEncoding: Name of the character encoding that was used to decode the input stream, or None if that is not determined yet
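
A minimal sketch of reading documentEncoding before and after a parse. It assumes byte input whose encoding is declared in a meta element; the sample markup is made up for illustration.

```python
from html5lib.html5parser import HTMLParser

parser = HTMLParser()
encoding_before = parser.documentEncoding  # nothing has been parsed yet

# Byte input, so the parser has to work out the encoding itself;
# here the meta element declares it.
parser.parse(b'<!DOCTYPE html><meta charset="utf-8"><p>caf\xc3\xa9</p>')
encoding_after = parser.documentEncoding
```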

parse(stream, *args, **kwargs)

Parse an HTML document into a well-formed tree

Parameters:

stream –
a file-like object or string containing the HTML to be parsed

The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
scripting – treat noscript elements as if JavaScript was turned on
Returns:	parsed tree

Example:

>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
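
The effect of the scripting parameter can be sketched as follows. With scripting enabled, the content of a noscript element is expected to be kept as raw text rather than parsed into child elements. The helper `parse_noscript` and the sample markup are illustrative assumptions.

```python
from html5lib.html5parser import HTMLParser

markup = "<!DOCTYPE html><body><noscript><p>Enable JS</p></noscript></body>"


def parse_noscript(scripting):
    """Hypothetical helper: parse the sample and return its noscript element."""
    parser = HTMLParser(namespaceHTMLElements=False)
    root = parser.parse(markup, scripting=scripting)
    return root.find("body/noscript")


static_noscript = parse_noscript(scripting=False)   # children are parsed as elements
scripted_noscript = parse_noscript(scripting=True)  # content is kept as text
```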

parseFragment(stream, *args, **kwargs)

Parse an HTML fragment into a well-formed tree fragment

Parameters:

container – name of the element whose innerHTML property is being set; if set to None, defaults to ‘div’
stream –
a file-like object or string containing the HTML to be parsed

The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element)
scripting – treat noscript elements as if JavaScript was turned on

Returns:

parsed tree

Example:

>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>

exception html5lib.html5parser.ParseError

Bases: Exception

Error in parsed document

html5lib.html5parser.parse(doc, treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse an HTML document as a string or file-like object into a tree

Parameters:

doc – the document to parse as a string or file-like object
treebuilder – the treebuilder to use when parsing
namespaceHTMLElements – whether or not to namespace HTML elements
Returns:	parsed tree

Example:

>>> from html5lib.html5parser import parse
>>> parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
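
To show what namespaceHTMLElements changes in practice, the sketch below parses the same markup with and without namespacing and looks elements up with ElementTree paths. It assumes the default etree treebuilder; the sample markup is made up.

```python
import html5lib

XHTML = "{http://www.w3.org/1999/xhtml}"
markup = "<!DOCTYPE html><title>Demo</title><p>Hello <b>world</b>"

# Default: every HTML element tag carries the XHTML namespace.
namespaced = html5lib.parse(markup)
title = namespaced.find(f"{XHTML}head/{XHTML}title")

# Without namespacing, plain tag names work in paths.
plain = html5lib.parse(markup, namespaceHTMLElements=False)
bold = plain.find("body/p/b")
```

Turning namespacing off keeps lookups short, at the cost of losing the distinction between HTML elements and elements from other namespaces.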

html5lib.html5parser.parseFragment(doc, container='div', treebuilder='etree', namespaceHTMLElements=True, **kwargs)

Parse an HTML fragment as a string or file-like object into a tree

Parameters:

doc – the fragment to parse as a string or file-like object
container – the container context to parse the fragment in
treebuilder – the treebuilder to use when parsing
namespaceHTMLElements – whether or not to namespace HTML elements
Returns:	parsed tree

Example:

>>> from html5lib.html5parser import parseFragment
>>> parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
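
The container argument matters for markup that is only valid in certain contexts. As a sketch, a table cell parsed in the default div context should lose its td tag, while the same markup parsed in a tr context should keep it. The sample fragment is illustrative.

```python
from html5lib.html5parser import parseFragment

cell = "<td>cell</td>"

# In a div context the misplaced <td> tag is ignored; only its text survives.
in_div = parseFragment(cell, namespaceHTMLElements=False)

# In a tr context the same markup yields a td element.
in_row = parseFragment(cell, container="tr", namespaceHTMLElements=False)

div_children = [child.tag for child in in_div]
row_children = [child.tag for child in in_row]
```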

`serializer` Module

exception html5lib.serializer.SerializeError

Bases: Exception

Error in serialized tree

html5lib.serializer.serialize(input, tree='etree', encoding=None, **serializer_opts)

Serializes the input token stream using the specified treewalker

Parameters:

input – the token stream to serialize
tree – the treewalker to use
encoding – the encoding to use
serializer_opts – any options to pass to the `html5lib.serializer.HTMLSerializer` that gets created
Returns:	the tree serialized as a string

Example:

>>> from html5lib.html5parser import parse
>>> from html5lib.serializer import serialize
>>> token_stream = parse('<html><body><p>Hi!</p></body></html>')
>>> serialize(token_stream, omit_optional_tags=False)
'<html><head></head><body><p>Hi!</p></body></html>'
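
Serializer options are passed straight through as keyword arguments. The sketch below compares the default attribute quoting with quote_attr_values="always"; the input markup is an assumption chosen so that the attribute value needs no quotes.

```python
from html5lib.html5parser import parse
from html5lib.serializer import serialize

tree = parse("<p class=a>Hi!")

# Default ("legacy") quoting leaves simple values unquoted.
legacy = serialize(tree, omit_optional_tags=False)

# "always" quotes every attribute value.
always = serialize(tree, omit_optional_tags=False, quote_attr_values="always")
```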

html5lib.serializer.xmlcharrefreplace_errors(): Implements the ‘xmlcharrefreplace’ error handling, which replaces an unencodable character with the appropriate XML character reference.

class html5lib.serializer.HTMLSerializer(**kwargs)

Bases: object

__init__(**kwargs)

Initialize HTMLSerializer

Parameters:

inject_meta_charset –
Whether or not to inject the meta charset.

Defaults to True.
quote_attr_values –
Controls when attribute values are quoted: when legacy browser behavior requires it ("legacy"), when the standard requires it ("spec"), or always ("always").

Defaults to "legacy".
quote_char –
Use given quote character for attribute quoting.

Defaults to " which will use double quotes unless attribute value contains a double quote, in which case single quotes are used.
escape_lt_in_attrs –
Whether or not to escape < in attribute values.

Defaults to False.
escape_rcdata –
Whether to escape, within rcdata elements such as style, the characters that need to be escaped within normal elements.

Defaults to False.
resolve_entities –
Whether to resolve named character entities that appear in the source tree. The XML predefined entities &lt; &gt; &amp; &quot; &apos; are unaffected by this setting.

Defaults to True.
strip_whitespace –
Whether to remove semantically meaningless whitespace. (This compresses all whitespace to a single space except within pre.)

Defaults to False.
minimize_boolean_attributes –
Shortens boolean attributes to give just the attribute name, for example:
```
<input disabled="disabled">
```
becomes:
```
<input disabled>
```
Defaults to True.
use_trailing_solidus –
Includes a close-tag slash at the end of the start tag of void elements (empty elements whose end tag is forbidden). E.g. <hr/>.

Defaults to False.
space_before_trailing_solidus –
Places a space immediately before the closing slash in a tag using a trailing solidus. E.g. <hr />. Requires use_trailing_solidus=True.

Defaults to True.
sanitize –
Strip all unsafe or unknown constructs from output. See html5lib.filters.sanitizer.Filter.

Defaults to False.
omit_optional_tags –
Omit start/end tags that are optional.

Defaults to True.
alphabetical_attributes –
Reorder attributes to be in alphabetical order.

Defaults to False.
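
A few of the options above can be compared side by side. This is a minimal sketch: the `render` helper and the sample markup are hypothetical, and omit_optional_tags is turned off only to keep the output easy to read.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

tree = html5lib.parse("<p>One<hr><input disabled>")
walker = html5lib.getTreeWalker("etree")


def render(**options):
    """Hypothetical helper: serialize the sample tree with the given options."""
    serializer = HTMLSerializer(omit_optional_tags=False, **options)
    return serializer.render(walker(tree))


default_out = render()
# Trailing solidus on void elements, boolean attributes written in full.
xml_style = render(use_trailing_solidus=True, minimize_boolean_attributes=False)
# Trailing solidus without the preceding space.
compact_solidus = render(use_trailing_solidus=True,
                         space_before_trailing_solidus=False)
```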

render(treewalker, encoding=None)

Serializes the stream from the treewalker into a string

Parameters:

treewalker – the treewalker to serialize
encoding – the string encoding to use
Returns:	the serialized tree

Example:

>>> from html5lib import parse, getTreeWalker
>>> from html5lib.serializer import HTMLSerializer
>>> token_stream = parse('<html><body>Hi!</body></html>')
>>> walker = getTreeWalker('etree')
>>> serializer = HTMLSerializer(omit_optional_tags=False)
>>> serializer.render(walker(token_stream))
'<html><head></head><body>Hi!</body></html>'
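
When an encoding is given, render is expected to return bytes instead of text. The sketch below checks that under the assumption that inject_meta_charset is disabled, so both outputs describe the same markup.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

tree = html5lib.parse("<p>caf\u00e9</p>")
walker = html5lib.getTreeWalker("etree")

# inject_meta_charset is off so the encoded output matches the text output.
serializer = HTMLSerializer(omit_optional_tags=False, inject_meta_charset=False)

as_text = serializer.render(walker(tree))
as_bytes = serializer.render(walker(tree), encoding="utf-8")
```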
