treebuilders Package
A collection of modules for building different kinds of trees from HTML documents.
To create a treebuilder for a new type of tree, you need to implement several things:
- A set of classes for various types of elements: Document, Doctype, Comment, Element. These must implement the interface of treebuilders.base.Node (although comment nodes have a different signature for their constructor, see treebuilders.etree.Comment). Textual content may also be implemented as another node type, or not, as your tree implementation requires.
- A treebuilder object (called TreeBuilder by convention) that inherits from treebuilders.base.TreeBuilder. This has 4 required attributes:
  - documentClass - the class to use for the bottommost node of a document
  - elementClass - the class to use for HTML Elements
  - commentClass - the class to use for comments
  - doctypeClass - the class to use for doctypes
It also has one required method:
- getDocument - returns the root node of the complete document tree
If you wish to run the unit tests, you must also create a testSerializer method on your treebuilder which accepts a node and returns a string containing the node and its children, serialized according to the format used in the unit tests.
html5lib.treebuilders.getTreeBuilder(treeType, implementation=None, **kwargs)
Get a TreeBuilder class for various types of trees with built-in support.
Parameters:
- treeType – the name of the tree type required (case-insensitive). Supported values are:
  - "dom" - A generic builder for DOM implementations, defaulting to an xml.dom.minidom based implementation.
  - "etree" - A generic builder for tree implementations exposing an ElementTree-like interface, defaulting to xml.etree.cElementTree if available and xml.etree.ElementTree if not.
  - "lxml" - An etree-based builder for lxml.etree, handling limitations of lxml's implementation.
- implementation – (Currently applies to the "etree" and "dom" tree types). A module implementing the tree type e.g. xml.etree.ElementTree or xml.etree.cElementTree.
- kwargs – Any additional options to pass to the TreeBuilder when creating it.
Example:
>>> from html5lib.treebuilders import getTreeBuilder
>>> builder = getTreeBuilder('etree')
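As a slightly fuller illustration, the class returned by getTreeBuilder can be handed to a parser. The sketch below assumes html5lib is installed and uses html5lib.HTMLParser and its parse method, which this section does not describe. The input markup is made up for the example.

```python
import html5lib
from html5lib.treebuilders import getTreeBuilder

# Look up the built-in ElementTree-based TreeBuilder class.
builder = getTreeBuilder("etree")

# Pass the class to the parser. namespaceHTMLElements=False keeps tag names
# plain ("html") instead of namespaced ("{http://www.w3.org/1999/xhtml}html").
parser = html5lib.HTMLParser(tree=builder, namespaceHTMLElements=False)
document = parser.parse("<!DOCTYPE html><title>Hi</title><p>Hello</p>")

# With the etree builder, the parse result is the root <html> element.
paragraph_texts = [p.text for p in document.iter("p")]
print(document.tag, paragraph_texts)
```

Passing a different implementation module (for the "etree" and "dom" tree types) changes which library builds the tree, not how the parser is driven.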
base Module
class html5lib.treebuilders.base.Node(name)
Bases: object
Represents an item in the tree
appendChild(node)
Insert node as a child of the current node
Parameters: node – the node to insert
cloneNode()
Return a shallow copy of the current node i.e. a node with the same name and attributes but with no parent or child nodes
insertBefore(node, refNode)
Insert node as a child of the current node, before refNode in the list of child nodes. Raises ValueError if refNode is not a child of the current node
Parameters:
- node – the node to insert
- refNode – the child node to insert the node before
insertText(data, insertBefore=None)
Insert data as text in the current node, positioned before the start of node insertBefore or, if insertBefore is None, at the end of the node's text.
Parameters:
- data – the data to insert
- insertBefore – the child node before which the text should be inserted; if None (the default), the text is added at the end of the node's text
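To make the Node interface more concrete, here is a minimal standalone sketch of the four methods described above. SketchNode is a hypothetical class written for illustration only. It does not inherit from base.Node and is not how html5lib implements its nodes. Storing text as plain strings in the child list is likewise an assumption of this sketch.

```python
class SketchNode:
    """Hypothetical stand-in illustrating the Node interface (not html5lib code)."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.attributes = {}
        # Children are SketchNode objects or plain str chunks of text.
        self.childNodes = []

    def appendChild(self, node):
        node.parent = self
        self.childNodes.append(node)

    def insertBefore(self, node, refNode):
        # list.index raises ValueError when refNode is not a child.
        index = self.childNodes.index(refNode)
        node.parent = self
        self.childNodes.insert(index, node)

    def insertText(self, data, insertBefore=None):
        if insertBefore is None:
            # No reference node: the text goes at the end.
            self.childNodes.append(data)
        else:
            index = self.childNodes.index(insertBefore)
            self.childNodes.insert(index, data)

    def cloneNode(self):
        # Shallow copy: same name and attributes, no parent, no children.
        clone = SketchNode(self.name)
        clone.attributes = dict(self.attributes)
        return clone


body = SketchNode("body")
p = SketchNode("p")
body.appendChild(p)
body.insertText("intro ", insertBefore=p)
body.insertText(" outro")
h1 = SketchNode("h1")
body.insertBefore(h1, p)

layout = [c if isinstance(c, str) else c.name for c in body.childNodes]
print(layout)  # ['intro ', 'h1', 'p', ' outro']
```

A real tree implementation may store text differently (for example as text and tail attributes in ElementTree-style trees), which is why insertText takes a reference node instead of a list position.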
class html5lib.treebuilders.base.TreeBuilder(namespaceHTMLElements)
Bases: object
Base treebuilder implementation
- documentClass - the class to use for the bottommost node of a document
- elementClass - the class to use for HTML Elements
- commentClass - the class to use for comments
- doctypeClass - the class to use for doctypes
__init__(namespaceHTMLElements)
Create a TreeBuilder
Parameters: namespaceHTMLElements – whether or not to namespace HTML elements
elementInActiveFormattingElements(name)
Check if an element exists between the end of the active formatting elements and the last marker. If it does, return it, else return False
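The lookup described above can be sketched as a backwards scan that stops at the most recent marker. This is an illustrative standalone version, not the library's code. The MARKER sentinel, the Element class, and the free function name are all hypothetical.

```python
MARKER = object()  # hypothetical sentinel standing in for a marker entry


class Element:
    """Hypothetical element with just the attribute the scan needs."""

    def __init__(self, name):
        self.name = name


def element_in_active_formatting_elements(active, name):
    # Walk from the end of the list back towards the last marker.
    for item in reversed(active):
        if item is MARKER:
            break  # entries before the marker are out of scope
        if item.name == name:
            return item
    return False


outer_b = Element("b")
inner_i = Element("i")
active = [outer_b, MARKER, inner_i]

found_i = element_in_active_formatting_elements(active, "i")
found_b = element_in_active_formatting_elements(active, "b")
print(found_i is inner_i, found_b)  # True False
```

The "b" element is not found because it sits before the last marker, so only entries added after that marker are considered.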
dom Module
etree Module
etree_lxml Module
Module for supporting the lxml.etree library. The idea here is to use as much of the native library as possible, without using fragile hacks like custom element names that break between releases. The downside of this is that we cannot represent all possible trees; specifically the following are known to cause problems:
- Text or comments as siblings of the root element
- Doctypes with no name
When any of these things occur, we emit a DataLossWarning.
class html5lib.treebuilders.etree_lxml.TreeBuilder(namespaceHTMLElements, fullTree=False)