treewalkers Package
A collection of modules for iterating through different kinds of tree, generating tokens identical to those produced by the tokenizer module.
To create a tree walker for a new type of tree, implement a tree walker object (called TreeWalker by convention) with a ‘serialize’ method that takes a tree as its sole argument and returns an iterator that generates tokens.
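As a rough illustration of that contract, the sketch below walks a made-up tree format (nested tuples) and yields token dicts. The tuple layout and every key other than `type` are assumptions chosen for the example; this class is a standalone toy, not html5lib's own TreeWalker.

```python
# Hypothetical tree format, for illustration only: a node is either a str
# (a text node) or a tuple of (tag_name, attrs_dict, children_list).


class TreeWalker:
    """Toy walker for the tuple tree described above (not html5lib's class)."""

    def serialize(self, tree):
        """Return an iterator of token dicts for ``tree``."""
        if isinstance(tree, str):
            # Text node: emit a single Characters token.
            yield {"type": "Characters", "data": tree}
            return
        name, attrs, children = tree
        yield {"type": "StartTag", "name": name, "data": dict(attrs)}
        for child in children:
            # Recurse into each child and forward its tokens in order.
            yield from self.serialize(child)
        yield {"type": "EndTag", "name": name}


tree = ("p", {"class": "intro"}, ["Hello ", ("b", {}, ["world"])])
tokens = list(TreeWalker().serialize(tree))
for token in tokens:
    print(token)
```

Because `serialize` is a generator, tokens are produced lazily in document order, which is the shape a downstream consumer of the token stream would typically expect.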
html5lib.treewalkers.getTreeWalker(treeType, implementation=None, **kwargs)
Get a TreeWalker class for various types of tree with built-in support
Parameters:
- treeType (str) – the name of the tree type required (case-insensitive). Supported values are:
  - "dom": The xml.dom.minidom DOM implementation
  - "etree": A generic walker for tree implementations exposing an elementtree-like interface (known to work with ElementTree, cElementTree and lxml.etree).
  - "lxml": Optimized walker for lxml.etree
  - "genshi": a Genshi stream
- implementation – A module implementing the tree type, e.g. xml.etree.ElementTree or cElementTree (currently applies to the "etree" tree type only).
- kwargs – keyword arguments passed to the etree walker; for other walkers, this has no effect
Returns: a TreeWalker class
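A short usage sketch may help here. It assumes the third-party html5lib package is installed and that `html5lib.parse` is used to build the tree; the helper name `walk_html` is hypothetical and exists only for this example.

```python
def walk_html(markup):
    """Parse ``markup`` and return the tokens its "etree" walker yields."""
    # html5lib is a third-party package; it is imported inside the function
    # so this sketch can still be loaded where html5lib is not installed.
    import html5lib
    from html5lib.treewalkers import getTreeWalker

    tree = html5lib.parse(markup, treebuilder="etree")
    # getTreeWalker returns a class, not an instance; instantiate it with
    # the tree to obtain an iterable of tokens.
    TreeWalker = getTreeWalker("etree")
    return list(TreeWalker(tree))
```

Calling `walk_html("<p>Hello</p>")` should give a list of token dicts, each with a `type` field, for the whole parsed document, including the elements the parser adds around the paragraph.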
html5lib.treewalkers.pprint(walker)
Pretty printer for tree walkers
Takes a TreeWalker instance and pretty prints the output of walking the tree.
Parameters: walker – a TreeWalker instance
base Module
class html5lib.treewalkers.base.TreeWalker(tree)
Bases: object
Walks a tree yielding tokens
Tokens are dicts that all have a type field specifying the type of the token.
comment(data)
Generates a Comment token
Parameters: data – the comment
Returns: Comment token
doctype(name, publicId=None, systemId=None)
Generates a Doctype token
Parameters:
- name
- publicId
- systemId
Returns: the Doctype token
emptyTag(namespace, name, attrs, hasChildren=False)
Generates an EmptyTag token
Parameters:
- namespace – the namespace of the token (can be None)
- name – the name of the element
- attrs – the attributes of the element as a dict
- hasChildren – whether or not to also yield a SerializeError token, because this tag shouldn’t have children
Returns: EmptyTag token
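The role of hasChildren can be sketched as a small generator. This is a standalone illustration, not the library's code: the function name `empty_tag`, the dict keys other than `type`, and the error message text are all assumptions.

```python
def empty_tag(namespace, name, attrs, has_children=False):
    """Yield an EmptyTag token, then an error token if children were found."""
    yield {"type": "EmptyTag", "name": name, "namespace": namespace, "data": attrs}
    if has_children:
        # A void element such as <br> is not supposed to contain child nodes,
        # so the problem is reported as an extra token in the stream.
        yield {"type": "SerializeError", "data": "Void element has children"}


ok_tokens = list(empty_tag(None, "br", {}))
bad_tokens = list(empty_tag(None, "br", {}, has_children=True))
print([t["type"] for t in ok_tokens])
print([t["type"] for t in bad_tokens])
```

Reporting the problem as a token rather than raising keeps the walk going, and leaves it to the consumer to decide how strict to be.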
endTag(namespace, name)
Generates an EndTag token
Parameters:
- namespace – the namespace of the token (can be None)
- name – the name of the element
Returns: EndTag token
entity(name)
Generates an Entity token
Parameters: name – the entity name
Returns: an Entity token
error(msg)
Generates an error token with the given message
Parameters: msg – the error message
Returns: SerializeError token
startTag(namespace, name, attrs)
Generates a StartTag token
Parameters:
- namespace – the namespace of the token (can be None)
- name – the name of the element
- attrs – the attributes of the element as a dict
Returns: StartTag token
text(data)
Generates SpaceCharacters and Characters tokens
Depending on what’s in the data, this generates one or more SpaceCharacters and Characters tokens.
For example:
>>> from html5lib.treewalkers.base import TreeWalker
>>> # Give it an empty tree just so it instantiates
>>> walker = TreeWalker([])
>>> list(walker.text(''))
[]
>>> list(walker.text(' '))
[{u'data': ' ', u'type': u'SpaceCharacters'}]
>>> list(walker.text(' abc '))  # doctest: +NORMALIZE_WHITESPACE
[{u'data': ' ', u'type': u'SpaceCharacters'},
{u'data': u'abc', u'type': u'Characters'},
{u'data': u' ', u'type': u'SpaceCharacters'}]
Parameters: data – the text data
Returns: one or more SpaceCharacters and Characters tokens
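The splitting behaviour shown in the example can be approximated in a few lines. This is a minimal sketch under stated assumptions: the function name `split_text` is hypothetical, and the whitespace set (tab, line feed, form feed, space, carriage return) is assumed rather than taken from this page.

```python
SPACE_CHARACTERS = "\t\n\x0c \r"  # assumed set of whitespace characters


def split_text(data):
    """Yield leading-space, text and trailing-space tokens for ``data``."""
    middle = data.lstrip(SPACE_CHARACTERS)
    leading = data[:len(data) - len(middle)]
    if leading:
        yield {"type": "SpaceCharacters", "data": leading}
    core = middle.rstrip(SPACE_CHARACTERS)
    trailing = middle[len(core):]
    if core:
        # Whitespace inside the text stays part of the Characters token.
        yield {"type": "Characters", "data": core}
    if trailing:
        yield {"type": "SpaceCharacters", "data": trailing}


print(list(split_text(" abc ")))
```

Only leading and trailing whitespace is split off, so an input such as `"a b"` produces a single Characters token in this sketch.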
-