treewalkers Package

A collection of modules for iterating through different kinds of tree, generating tokens identical to those produced by the tokenizer module.

To create a tree walker for a new type of tree, implement a tree walker class (called TreeWalker by convention) with a ‘serialize’ method that takes a tree as its sole argument and returns an iterator that generates tokens.
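As a rough illustration of that contract, the sketch below walks a hypothetical tree made of nested `(name, children)` tuples and plain strings. The tree shape, the class internals, and every token field other than `type` are assumptions made for this example; they are not taken from the library.

```python
# Hypothetical toy tree: an element is (name, [children]); a str is a text node.
class TreeWalker(object):
    """Illustrative walker for nested-tuple trees (not part of html5lib)."""

    def serialize(self, tree):
        # Return an iterator that generates tokens for the given tree.
        return self._walk(tree)

    def _walk(self, node):
        if isinstance(node, str):
            # Text node: a single Characters token.
            yield {"type": "Characters", "data": node}
            return
        name, children = node
        yield {"type": "StartTag", "name": name, "data": {}}
        for child in children:
            yield from self._walk(child)
        yield {"type": "EndTag", "name": name}


tree = ("p", ["Hello ", ("b", ["world"])])
tokens = list(TreeWalker().serialize(tree))
for token in tokens:
    print(token["type"], token.get("name", token.get("data")))
```

Every token is a dict with a `type` key, so a consumer can dispatch on `token["type"]` without knowing what kind of tree produced it.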

html5lib.treewalkers.getTreeWalker(treeType, implementation=None, **kwargs)

Gets a TreeWalker class for one of the tree types with built-in support.

Parameters:
  • treeType (str) – the name of the tree type required (case-insensitive). Supported values are:

    • “dom”: The xml.dom.minidom DOM implementation
    • “etree”: A generic walker for tree implementations exposing an elementtree-like interface (known to work with ElementTree, cElementTree and lxml.etree).
    • “lxml”: Optimized walker for lxml.etree
    • “genshi”: a Genshi stream
  • implementation – A module implementing the tree type, e.g. xml.etree.ElementTree or cElementTree (currently applies to the “etree” tree type only).
  • kwargs – keyword arguments passed to the etree walker; for other walkers, this has no effect
Returns:

a TreeWalker class
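
A short usage sketch, assuming html5lib is installed and that the document was built with the matching “etree” tree builder. The helper name `walk_tokens` is made up for illustration.

```python
def walk_tokens(markup):
    """Parse markup into an etree tree and return the tokens from walking it."""
    # Imported lazily so the helper can be defined even without html5lib.
    import html5lib
    from html5lib.treewalkers import getTreeWalker

    # Build the tree with the same tree type that the walker expects.
    document = html5lib.parse(markup, treebuilder="etree")

    # getTreeWalker returns a class; instantiate it with the tree to walk.
    TreeWalker = getTreeWalker("etree")
    return list(TreeWalker(document))
```

With html5lib available, `walk_tokens("<p>Hi</p>")` should include a StartTag token named `p` followed by a Characters token carrying `Hi`, surrounded by the tokens for the implied `html`, `head`, and `body` elements.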

html5lib.treewalkers.pprint(walker)

Pretty printer for tree walkers

Takes a TreeWalker instance and pretty prints the output of walking the tree.

Parameters: walker – a TreeWalker instance
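
A hedged sketch of calling it, assuming html5lib is installed. The helper name `dump_tree` is hypothetical, and the sketch assumes that `pprint` hands back the formatted text as a string rather than writing it to standard output.

```python
def dump_tree(markup):
    """Return an indented, human-readable dump of the parsed markup."""
    # Imported lazily so the helper can be defined even without html5lib.
    import html5lib
    from html5lib.treewalkers import getTreeWalker, pprint

    # Pair the "dom" tree builder with the "dom" walker.
    document = html5lib.parse(markup, treebuilder="dom")

    # pprint takes a TreeWalker instance, not the class itself.
    walker = getTreeWalker("dom")(document)
    return pprint(walker)
```

Printing the result of `dump_tree("<p>Hi</p>")` is a quick way to inspect what a walker emits while debugging.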

base Module

class html5lib.treewalkers.base.TreeWalker(tree)

Bases: object

Walks a tree yielding tokens

Tokens are dicts that all have a type field specifying the type of the token.

__init__(tree)

Creates a TreeWalker

Parameters: tree – the tree to walk

comment(data)

Generates a Comment token

Parameters: data – the comment
Returns: Comment token

doctype(name, publicId=None, systemId=None)

Generates a Doctype token

Parameters:
  • name – the name of the doctype
  • publicId – the public identifier, if any
  • systemId – the system identifier, if any
Returns:

the Doctype token

emptyTag(namespace, name, attrs, hasChildren=False)

Generates an EmptyTag token

Parameters:
  • namespace – the namespace of the token; can be None
  • name – the name of the element
  • attrs – the attributes of the element as a dict
  • hasChildren – whether to also yield a SerializeError token because this tag shouldn’t have children
Returns:

EmptyTag token
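
The effect of hasChildren can be pictured with a standalone generator. This is a sketch of the behavior described above, not the library source: the function name, the token fields other than `type`, and the error message text are assumptions.

```python
def empty_tag_tokens(namespace, name, attrs, has_children=False):
    """Yield an EmptyTag token, plus an error token if children were found."""
    # The EmptyTag token is always emitted first.
    yield {"type": "EmptyTag", "name": name,
           "namespace": namespace, "data": attrs}
    # A void element such as <br> should not contain children, so flag it.
    if has_children:
        yield {"type": "SerializeError", "data": "Void element has children"}


tokens = list(empty_tag_tokens(None, "br", {}, has_children=True))
print([t["type"] for t in tokens])  # prints ['EmptyTag', 'SerializeError']
```

Because the method may produce more than one token, a caller would iterate over its result instead of treating it as a single dict.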

endTag(namespace, name)

Generates an EndTag token

Parameters:
  • namespace – the namespace of the token; can be None
  • name – the name of the element
Returns:

EndTag token

entity(name)

Generates an Entity token

Parameters: name – the entity name
Returns: an Entity token

error(msg)

Generates an error token with the given message

Parameters: msg – the error message
Returns: SerializeError token

startTag(namespace, name, attrs)

Generates a StartTag token

Parameters:
  • namespace – the namespace of the token; can be None
  • name – the name of the element
  • attrs – the attributes of the element as a dict
Returns:

StartTag token

text(data)

Generates SpaceCharacters and Characters tokens

Depending on what’s in the data, this generates one or more SpaceCharacters and Characters tokens.

For example:

>>> from html5lib.treewalkers.base import TreeWalker
>>> # Give it an empty tree just so it instantiates
>>> walker = TreeWalker([])
>>> list(walker.text(''))
[]
>>> list(walker.text('  '))
[{u'data': '  ', u'type': u'SpaceCharacters'}]
>>> list(walker.text(' abc '))  # doctest: +NORMALIZE_WHITESPACE
[{u'data': ' ', u'type': u'SpaceCharacters'},
{u'data': u'abc', u'type': u'Characters'},
{u'data': u' ', u'type': u'SpaceCharacters'}]

Parameters: data – the text data
Returns: one or more SpaceCharacters and Characters tokens
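
The splitting shown in the doctest can be approximated with a standalone generator. This is a minimal sketch under stated assumptions, not the library source: the function name is hypothetical, and the set of characters treated as whitespace (tab, line feed, form feed, space, carriage return) is assumed.

```python
# Assumed set of HTML space characters: tab, LF, form feed, space, CR.
SPACE_CHARACTERS = "\t\n\x0c \r"


def split_text(data):
    """Yield SpaceCharacters and Characters tokens for one run of text."""
    # Peel off leading whitespace.
    middle = data.lstrip(SPACE_CHARACTERS)
    left = data[:len(data) - len(middle)]
    if left:
        yield {"type": "SpaceCharacters", "data": left}

    # Peel off trailing whitespace from what remains.
    core = middle.rstrip(SPACE_CHARACTERS)
    right = middle[len(core):]
    if core:
        yield {"type": "Characters", "data": core}
    if right:
        yield {"type": "SpaceCharacters", "data": right}
```

Only leading and trailing whitespace is split off in this sketch; whitespace between words stays inside the Characters token, so `'a b'` yields a single token.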

unknown(nodeType)

Handles unknown node types

class html5lib.treewalkers.base.NonRecursiveTreeWalker(tree)

Bases: html5lib.treewalkers.base.TreeWalker

dom Module

class html5lib.treewalkers.dom.TreeWalker(tree)

Bases: html5lib.treewalkers.base.NonRecursiveTreeWalker

etree Module

etree_lxml Module

class html5lib.treewalkers.etree_lxml.TreeWalker(tree)

Bases: html5lib.treewalkers.base.NonRecursiveTreeWalker

__init__(tree)

Creates a TreeWalker

Parameters: tree – the tree to walk

genshi Module

class html5lib.treewalkers.genshi.TreeWalker(tree)

Bases: html5lib.treewalkers.base.TreeWalker