treebuilders Package

A collection of modules for building different kinds of trees from HTML documents.

To create a treebuilder for a new type of tree, you need to implement several things:

  1. A set of classes for various types of elements: Document, Doctype, Comment, Element. These must implement the interface of treebuilders.base.Node (although comment nodes have a different signature for their constructor; see treebuilders.etree.Comment). Textual content may also be implemented as another node type, or not, as your tree implementation requires.

  2. A treebuilder object (called TreeBuilder by convention) that inherits from treebuilders.base.TreeBuilder. This has 4 required attributes:

    • documentClass - the class to use for the root node of a document
    • elementClass - the class to use for HTML Elements
    • commentClass - the class to use for comments
    • doctypeClass - the class to use for doctypes

    It also has one required method:

    • getDocument - Returns the root node of the complete document tree

  3. If you wish to run the unit tests, you must also create a testSerializer method on your treebuilder which accepts a node and returns a string containing the node and its children serialized according to the format used in the unit tests.
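
The pieces listed above can be sketched as follows. This is a standalone illustration rather than a working html5lib builder: the classes are plain Python stand-ins (a real implementation would subclass treebuilders.base.Node and treebuilders.base.TreeBuilder), and the constructor arguments shown for Doctype and Element are assumptions made for the example.

```python
class Node:
    """Stand-in for the node interface; a real builder would subclass
    html5lib.treebuilders.base.Node instead."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.childNodes = []

    def appendChild(self, node):
        node.parent = self
        self.childNodes.append(node)


class Document(Node):
    def __init__(self):
        super().__init__(None)  # the document itself has no tag name


class Doctype(Node):
    # Assumed constructor arguments, for illustration only.
    def __init__(self, name, publicId=None, systemId=None):
        super().__init__(name)
        self.publicId = publicId
        self.systemId = systemId


class Comment(Node):
    # Comments take their text rather than a tag name.
    def __init__(self, data):
        super().__init__(None)
        self.data = data


class Element(Node):
    # Assumed constructor arguments, for illustration only.
    def __init__(self, name, namespace=None):
        super().__init__(name)
        self.namespace = namespace
        self.attributes = {}


class TreeBuilder:
    """Stand-in for a subclass of html5lib.treebuilders.base.TreeBuilder."""

    # The four required attributes.
    documentClass = Document
    elementClass = Element
    commentClass = Comment
    doctypeClass = Doctype

    def __init__(self):
        self.document = self.documentClass()

    # The one required method.
    def getDocument(self):
        return self.document


builder = TreeBuilder()
root = builder.getDocument()
root.appendChild(builder.doctypeClass("html"))
html = builder.elementClass("html")
root.appendChild(html)
html.appendChild(builder.commentClass("hello"))
print([child.name for child in root.childNodes])
```

Keeping the node classes as class attributes lets the builder create nodes without knowing which tree library sits behind them.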

html5lib.treebuilders.getTreeBuilder(treeType, implementation=None, **kwargs)

Get a TreeBuilder class for various types of trees with built-in support.

Parameters:
  • treeType

    the name of the tree type required (case-insensitive). Supported values are:

    • “dom” - A generic builder for DOM implementations, defaulting to an xml.dom.minidom based implementation.
    • “etree” - A generic builder for tree implementations exposing an ElementTree-like interface, defaulting to xml.etree.cElementTree if available and xml.etree.ElementTree if not.
    • “lxml” - An etree-based builder for lxml.etree, handling limitations of lxml’s implementation.
  • implementation – (Currently applies to the “etree” and “dom” tree types). A module implementing the tree type e.g. xml.etree.ElementTree or xml.etree.cElementTree.
  • kwargs – Any additional options to pass to the TreeBuilder when creating it.

Example:

>>> from html5lib.treebuilders import getTreeBuilder
>>> builder = getTreeBuilder('etree')
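
The default selection described for the “etree” tree type can be pictured as a guarded import. This is a sketch of the selection rule only, not html5lib's own code, and the name default_etree is made up for the example.

```python
# Prefer xml.etree.cElementTree where it still exists, otherwise fall back
# to xml.etree.ElementTree (cElementTree was removed in Python 3.9).
try:
    import xml.etree.cElementTree as default_etree
except ImportError:
    import xml.etree.ElementTree as default_etree

# Either module exposes the same ElementTree-like interface.
paragraph = default_etree.Element("p")
paragraph.text = "Hello"
print(default_etree.tostring(paragraph))
```

A module chosen this way is the kind of object the implementation parameter expects.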

base Module

class html5lib.treebuilders.base.ActiveFormattingElements

Bases: list

append(node)

Append object to the end of the list.

class html5lib.treebuilders.base.Node(name)

Bases: object

Represents an item in the tree

__init__(name)

Creates a Node

Parameters: name – The tag name associated with the node

appendChild(node)

Insert node as a child of the current node

Parameters: node – the node to insert

cloneNode()

Return a shallow copy of the current node, i.e. a node with the same name and attributes but with no parent or child nodes.

hasContent()

Return true if the node has children or text, false otherwise

insertBefore(node, refNode)

Insert node as a child of the current node, before refNode in the list of child nodes. Raises ValueError if refNode is not a child of the current node.

Parameters:
  • node – the node to insert
  • refNode – the child node to insert the node before

insertText(data, insertBefore=None)

Insert data as text in the current node, positioned before the start of the node insertBefore or, if insertBefore is None, at the end of the node’s text.

Parameters:
  • data – the data to insert
  • insertBefore – the child node before which to insert the text, or None to add the text at the end of the node’s text
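
For a tree that does not store text as nodes, the positioning rule can be pictured with the text and tail attributes of ElementTree elements. The helper below is a hypothetical standalone function, not html5lib's implementation; it assumes the third argument is either None or a direct child of the parent element.

```python
import xml.etree.ElementTree as ET


def insert_text(parent, data, insert_before=None):
    """Hypothetical helper: add text to an ElementTree element."""
    children = list(parent)
    if insert_before is None:
        index = len(children)  # position after the last child
    else:
        index = children.index(insert_before)  # ValueError if not a child
    if index == 0:
        # No preceding sibling: the text belongs to the parent itself.
        parent.text = (parent.text or "") + data
    else:
        # Otherwise the text goes in the tail of the preceding sibling.
        previous = children[index - 1]
        previous.tail = (previous.tail or "") + data


p = ET.Element("p")
b = ET.SubElement(p, "b")
b.text = "bold"
insert_text(p, "before ", insert_before=b)
insert_text(p, " after")
print(ET.tostring(p, encoding="unicode"))
```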

removeChild(node)

Remove node from the children of the current node

Parameters: node – the child node to remove

reparentChildren(newParent)

Move all the children of the current node to newParent. This is needed so that trees that don’t store text as nodes move the text in the correct way.

Parameters: newParent – the node to move all this node’s children to
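
The child-handling methods described for Node can be sketched with a node that keeps its children in a plain list. SketchNode is a hypothetical standalone class, not a subclass of the real base class, and it assumes any text would be stored as a child node, so hasContent only needs to look at the child list.

```python
class SketchNode:
    """Hypothetical node that keeps its children in a plain list."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.attributes = {}
        self.childNodes = []

    def appendChild(self, node):
        node.parent = self
        self.childNodes.append(node)

    def insertBefore(self, node, refNode):
        # list.index raises ValueError when refNode is not a child.
        index = self.childNodes.index(refNode)
        node.parent = self
        self.childNodes.insert(index, node)

    def removeChild(self, node):
        self.childNodes.remove(node)
        node.parent = None

    def cloneNode(self):
        # Shallow copy: same name and attributes, no parent or children.
        clone = type(self)(self.name)
        clone.attributes = dict(self.attributes)
        return clone

    def hasContent(self):
        return bool(self.childNodes)

    def reparentChildren(self, newParent):
        # Iterate over a copy because removeChild mutates the list.
        for child in list(self.childNodes):
            self.removeChild(child)
            newParent.appendChild(child)


div, span, em, section = (SketchNode(n) for n in ("div", "span", "em", "section"))
div.appendChild(span)
div.insertBefore(em, span)     # div's children are now em, span
div.reparentChildren(section)  # section takes over em and span, in order
print([child.name for child in section.childNodes], div.hasContent())
```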

class html5lib.treebuilders.base.TreeBuilder(namespaceHTMLElements)

Bases: object

Base treebuilder implementation

  • documentClass - the class to use for the root node of a document
  • elementClass - the class to use for HTML Elements
  • commentClass - the class to use for comments
  • doctypeClass - the class to use for doctypes

__init__(namespaceHTMLElements)

Create a TreeBuilder

Parameters: namespaceHTMLElements – whether or not to namespace HTML elements

createElement(token)

Create an element but don’t insert it anywhere

elementInActiveFormattingElements(name)

Check if an element exists between the end of the active formatting elements and the last marker. If it does, return it; otherwise return False.
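
The search described here can be sketched as a backwards scan that stops at the most recent marker. The MARKER sentinel, the Element type, and the function below are hypothetical stand-ins for illustration, not the objects html5lib uses.

```python
from collections import namedtuple

# Hypothetical stand-ins: a sentinel for the marker and a tiny element type.
MARKER = object()
Element = namedtuple("Element", ["name"])


def element_in_active_formatting_elements(active_formatting_elements, name):
    """Search backwards from the end of the list, stopping at the last marker."""
    for item in reversed(active_formatting_elements):
        if item is MARKER:
            break
        if item.name == name:
            return item
    return False


entries = [Element("b"), MARKER, Element("i"), Element("a")]
found = element_in_active_formatting_elements(entries, "a")
hidden = element_in_active_formatting_elements(entries, "b")
print(found, hidden)
```

The "b" entry is not found because it sits before the last marker, outside the range that is searched.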

getDocument()

Return the final tree

getFragment()

Return the final fragment

getTableMisnestedNodePosition()

Get the foster parent element, and sibling to insert before (or None) when inserting a misnested table node.
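
One way to picture the returned pair is the foster-parenting rule of the HTML parsing algorithm: look for the most recently opened table element and insert just before it in its parent. The sketch below is an assumption about the behaviour based on that rule, written as a standalone function; the Node class, the function name, and the open_elements argument are hypothetical.

```python
class Node:
    """Hypothetical minimal node with a name and an optional parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def table_misnested_node_position(open_elements):
    """Return (foster_parent, insert_before) for a misnested table node."""
    last_table = None
    for element in reversed(open_elements):
        if element.name == "table":
            last_table = element
            break
    if last_table is None:
        # No table is open: fall back to the first open element.
        return open_elements[0], None
    if last_table.parent is not None:
        # Insert into the table's parent, just before the table.
        return last_table.parent, last_table
    # The table has no parent: use the element opened just before it.
    return open_elements[open_elements.index(last_table) - 1], None


html = Node("html")
body = Node("body", parent=html)
table = Node("table", parent=body)
foster_parent, insert_before = table_misnested_node_position([html, body, table])
print(foster_parent.name, insert_before.name)
```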

insertElementTable(token)

Create an element and insert it into the tree

insertText(data, parent=None)

Insert text data.

testSerializer(node)

Serialize the subtree of node in the format required by unit tests.

Parameters: node – the node from which to start serializing

dom Module

etree Module

etree_lxml Module

Module for supporting the lxml.etree library. The idea here is to use as much of the native library as possible, without using fragile hacks like custom element names that break between releases. The downside of this is that we cannot represent all possible trees; specifically the following are known to cause problems:

  • Text or comments as siblings of the root element
  • Doctypes with no name

When any of these things occur, we emit a DataLossWarning.

class html5lib.treebuilders.etree_lxml.TreeBuilder(namespaceHTMLElements, fullTree=False)

Bases: html5lib.treebuilders.base.TreeBuilder

__init__(namespaceHTMLElements, fullTree=False)

Create a TreeBuilder

Parameters: namespaceHTMLElements – whether or not to namespace HTML elements

getDocument()

Return the final tree

getFragment()

Return the final fragment

testSerializer(element)

Serialize the subtree of element in the format required by unit tests.

Parameters: element – the node from which to start serializing

html5lib.treebuilders.etree_lxml.tostring(element)

Serialize an element and its child nodes to a string.