A stream is the common representation of markup as a stream of events.
A stream can be obtained in a number of ways, for example by parsing XML or HTML text, or by selecting a subset of another stream using XPath.
For example, the functions XML() and HTML() can be used to convert literal XML or HTML text to a markup stream:
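The stream is the result of parsing the text into events. Each event is a tuple of the form (kind, data, pos), where kind identifies the type of event, data is the payload associated with the event (its shape depends on the kind), and pos is the position in the source at which the event occurred.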
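To make the shape of these tuples concrete, the following minimal sketch builds (kind, data, pos) tuples with the standard library's expat parser. It is a stand-in, not Genshi's own parser: the `parse_events` helper and the plain-string kinds are assumptions made for illustration, and real Genshi events carry QName and Attrs objects rather than bare strings and tuples.

```python
from xml.parsers import expat


def parse_events(text, filename=None):
    """Parse XML text into a list of (kind, data, pos) tuples (illustrative only)."""
    events = []
    parser = expat.ParserCreate()
    parser.buffer_text = True  # deliver each run of text in a single callback

    def pos():
        # (filename, line, column) of the parser at the time the callback fires
        return (filename, parser.CurrentLineNumber, parser.CurrentColumnNumber)

    def on_start(name, attrs):
        events.append(("START", (name, tuple(attrs.items())), pos()))

    def on_end(name):
        events.append(("END", name, pos()))

    def on_text(data):
        events.append(("TEXT", data, pos()))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(text, True)
    return events


events = parse_events('<p class="intro">Some text</p>')
for kind, data, pos in events:
    print(kind, data, pos)
```

Each printed line corresponds to one event; the data item is a (name, attributes) pair for START, a name for END, and a string for TEXT.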
One important feature of markup streams is that you can apply filters to the stream, either filters that come with Genshi, or your own custom filters.
A filter is simply a callable that accepts the stream as parameter, and returns the filtered stream:
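As a minimal sketch of that contract, the generator below accepts any iterable of (kind, data, pos) tuples and yields a filtered version. The `upper_text` name and the hand-written event list are hypothetical; the kinds are written as plain strings to keep the example free of imports.

```python
def upper_text(stream):
    """Filter: pass every event through, upper-casing the data of TEXT events."""
    for kind, data, pos in stream:
        if kind == "TEXT":
            data = data.upper()
        yield kind, data, pos


# Hand-written events standing in for a parsed <p>Some text</p>
events = [
    ("START", ("p", ()), (None, 1, 0)),
    ("TEXT", "Some text", (None, 1, 3)),
    ("END", "p", (None, 1, 12)),
]

filtered = list(upper_text(events))
```

Because the filter is a generator, nothing is processed until the result is iterated, which is what allows filters to be stacked cheaply.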
Filters can be applied in a number of ways. The simplest is to just call the filter directly:
The Stream class also provides a filter() method, which takes an arbitrary number of filter callables and applies them all:
Finally, filters can also be applied using the bitwise or operator (|), which allows a syntax similar to pipes on Unix shells:
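The three ways of applying filters can be modelled in a few lines of plain Python. `MiniStream` below is a hypothetical stand-in written for this illustration, not Genshi's Stream class; it only shows how a `filter()` method and the `|` operator might both reduce to calling the filter on the stream.

```python
from functools import reduce
import operator


class MiniStream:
    """Tiny stand-in for a stream class (illustrative, not Genshi's Stream)."""

    def __init__(self, events):
        self.events = events

    def __iter__(self):
        return iter(self.events)

    def __or__(self, function):
        # stream | f  is the same as  MiniStream(f(stream))
        return MiniStream(function(self))

    def filter(self, *filters):
        # stream.filter(f, g)  is the same as  stream | f | g
        return reduce(operator.or_, (self,) + filters)


def strip_text(stream):
    """Strip surrounding whitespace from TEXT events."""
    for kind, data, pos in stream:
        if kind == "TEXT":
            data = data.strip()
        yield kind, data, pos


def drop_empty_text(stream):
    """Drop TEXT events whose data is an empty string."""
    return (event for event in stream if event[0] != "TEXT" or event[1])


events = [
    ("START", ("p", ()), None),
    ("TEXT", "  hi  ", None),
    ("TEXT", "   ", None),
    ("END", "p", None),
]

direct = list(drop_empty_text(strip_text(events)))
chained = list(MiniStream(events).filter(strip_text, drop_empty_text))
piped = list(MiniStream(events) | strip_text | drop_empty_text)
```

All three spellings produce the same events; the pipe form simply reads left to right in the order the filters run.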
One example of a filter included with Genshi is the HTMLSanitizer in genshi.filters. It processes a stream of HTML markup and strips out any potentially dangerous constructs, such as JavaScript event handlers. HTMLSanitizer is not a function, but rather a class that implements __call__, which means instances of the class are callable:
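The callable-class pattern is useful whenever a filter needs configuration. The sketch below is a hypothetical `DropEventHandlers` filter, far less thorough than HTMLSanitizer and not suitable for real sanitizing; it only illustrates how an instance configured in `__init__` becomes a filter through `__call__`.

```python
class DropEventHandlers:
    """Callable filter removing attributes whose names start with a prefix."""

    def __init__(self, prefix="on"):
        self.prefix = prefix  # configuration lives on the instance

    def __call__(self, stream):
        for kind, data, pos in stream:
            if kind == "START":
                tag, attrs = data
                # Rebuild the attribute tuple without the unwanted names
                attrs = tuple(
                    (name, value)
                    for name, value in attrs
                    if not name.startswith(self.prefix)
                )
                data = (tag, attrs)
            yield kind, data, pos


events = [
    ("START", ("a", (("href", "/"), ("onclick", "evil()"))), None),
    ("TEXT", "home", None),
    ("END", "a", None),
]

cleaned = list(DropEventHandlers()(events))
```

The instance is created once and can then be used anywhere a plain filter function is accepted.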
Both the filter() method and the pipe operator allow easy chaining of filters:
That is equivalent to:
For more information about the built-in filters, see Stream Filters.
Serialization means producing some kind of textual output from a stream of events, which you'll need when you want to transmit or store the results of generating or otherwise processing markup.
The Stream class provides two methods for serialization: serialize() and render(). The former is a generator that yields chunks of Markup objects (which are basically unicode strings that are considered safe for output on the web). The latter returns a single string, by default UTF-8 encoded.
Here's the output from serialize():
And here's the output from render():
Both methods can be passed a method parameter that determines how exactly the events are serialized to text. This parameter can be either a string or a custom serializer class:
Note how the <br> element isn't closed, which is the right thing to do for HTML. See serialization methods for more details.
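The effect of the method parameter can be modelled with two toy serializers. Everything in this sketch is an assumption made for illustration: the `to_xml` and `to_html` functions, the `VOID_ELEMENTS` subset, and the two-item (kind, data) events with positions left out. Text is emitted without escaping, which a real serializer must not do.

```python
VOID_ELEMENTS = {"br", "hr", "img", "input"}  # assumed subset of HTML void elements


def to_xml(events):
    """Serialize to XML, collapsing an empty element into <name/>."""
    events = list(events)
    skip = False
    for index, (kind, data) in enumerate(events):
        if skip:  # this END was already covered by a collapsed <name/>
            skip = False
        elif kind == "START":
            if events[index + 1:index + 2] == [("END", data)]:
                yield "<%s/>" % data
                skip = True
            else:
                yield "<%s>" % data
        elif kind == "END":
            yield "</%s>" % data
        else:
            yield data


def to_html(events):
    """Serialize to HTML, omitting the end tag of void elements."""
    for kind, data in events:
        if kind == "START":
            yield "<%s>" % data
        elif kind == "END":
            if data not in VOID_ELEMENTS:
                yield "</%s>" % data
        else:
            yield data


SERIALIZERS = {"xml": to_xml, "html": to_html}


def render(events, method="xml"):
    """Accept either a method name or a serializer callable."""
    serializer = SERIALIZERS[method] if isinstance(method, str) else method
    return "".join(serializer(events))


events = [
    ("START", "p"), ("TEXT", "Hello"), ("START", "br"),
    ("END", "br"), ("TEXT", "world"), ("END", "p"),
]
print(render(events, "xml"))
print(render(events, "html"))
```

The same events produce `<br/>` under the XML method and an unclosed `<br>` under the HTML method.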
In addition, the render() method takes an encoding parameter, which defaults to “UTF-8”. If set to None, the result will be a unicode string.
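The relationship between the two methods can be sketched as follows: a generator produces text chunks, and a rendering function joins and optionally encodes them. The function names mirror the methods described above, but the bodies are a simplified model written for this example, not Genshi's code.

```python
def serialize(events):
    """Generator yielding one text chunk per event (no escaping, for brevity)."""
    for kind, data in events:
        if kind == "START":
            yield "<%s>" % data
        elif kind == "END":
            yield "</%s>" % data
        else:
            yield data


def render(events, encoding="utf-8"):
    """Join the chunks; return bytes, or a text string if encoding is None."""
    text = "".join(serialize(events))
    return text if encoding is None else text.encode(encoding)


events = [("START", "p"), ("TEXT", "caf\u00e9"), ("END", "p")]

chunks = list(serialize(events))
as_bytes = render(events)
as_text = render(events, encoding=None)
```

Working with the chunk generator avoids building the whole document in memory, while the joined form is convenient when the output is small.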
The different serializer classes in genshi.output can also be used directly:
The pipe operator allows a nicer syntax:
Genshi supports several serialization methods for creating a text representation of a markup stream.
The XHTMLSerializer is a specialization of the generic XMLSerializer that understands the peculiarities of producing XML-compliant output that can also be parsed without problems by the HTML parsers found in modern web browsers. Thus, the output of this serializer should be usable whether sent as "text/html" or "application/xhtml+xml" (although there are a lot of subtle issues to pay attention to when switching between the two, in particular with respect to differences in the DOM and CSS).
For example, instead of rendering a script tag as <script/> (which confuses the HTML parser in many browsers), it will produce <script></script>. It will also normalize any boolean attribute values that are minimized in HTML, so that, for example, <hr noshade="1"/> becomes <hr noshade="noshade" />.
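These two rules can be expressed as a small function. The sketch below is a toy model, not the XHTMLSerializer itself: the `xhtml_empty_element` helper and both name sets are assumptions chosen for illustration, and attribute values are not escaped.

```python
EMPTY_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}  # assumed subset
BOOLEAN_ATTRS = {"checked", "disabled", "noshade", "selected"}  # assumed subset


def xhtml_empty_element(name, attrs=()):
    """Render an element that has no content, following the two rules above."""
    parts = [name]
    for key, value in attrs:
        if key in BOOLEAN_ATTRS:
            value = key  # normalize to the attribute="attribute" form
        parts.append('%s="%s"' % (key, value))
    tag = " ".join(parts)
    if name in EMPTY_ELEMENTS:
        return "<%s />" % tag  # safe to self-close in HTML parsers
    return "<%s></%s>" % (tag, name)  # never self-close, e.g. <script>


print(xhtml_empty_element("script"))
print(xhtml_empty_element("hr", [("noshade", "1")]))
```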
This serializer supports the use of namespaces for compound documents, for example to use inline SVG inside an XHTML document.
Both serialize() and render() support additional keyword arguments that are passed through to the initializer of the serializer class. The following options are supported by the built-in serializers:
The strip_whitespace option controls whether the serializer should remove trailing spaces and empty lines. Defaults to True.
(This option is not available for serialization to plain text.)
The doctype option takes a (name, pubid, sysid) tuple defining the name, public identifier, and system identifier of a DOCTYPE declaration to prepend to the generated output. If provided, this declaration will override any DOCTYPE declaration in the stream.
The parameter can also be specified as a string to refer to commonly used doctypes:
| Shorthand | DOCTYPE | 
|---|---|
| html or html-strict | HTML 4.01 Strict | 
| html-transitional | HTML 4.01 Transitional | 
| html-frameset | HTML 4.01 Frameset | 
| html5 | DOCTYPE proposed for the work-in-progress HTML5 standard | 
| xhtml or xhtml-strict | XHTML 1.0 Strict | 
| xhtml-transitional | XHTML 1.0 Transitional | 
| xhtml-frameset | XHTML 1.0 Frameset | 
| xhtml11 | XHTML 1.1 | 
| svg or svg-full | SVG 1.1 | 
| svg-basic | SVG 1.1 Basic | 
| svg-tiny | SVG 1.1 Tiny | 
(This option is not available for serialization to plain text.)
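To show how such a tuple maps to the text of a declaration, here is a minimal sketch. The `doctype_declaration` helper and the `SHORTHANDS` table are hypothetical and cover only three of the shorthands listed above; the public and system identifiers are the standard W3C ones.

```python
SHORTHANDS = {
    "html": ("html", "-//W3C//DTD HTML 4.01//EN",
             "http://www.w3.org/TR/html4/strict.dtd"),
    "html5": ("html", None, None),
    "xhtml": ("html", "-//W3C//DTD XHTML 1.0 Strict//EN",
              "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"),
}


def doctype_declaration(doctype):
    """Build a DOCTYPE declaration from a shorthand or a (name, pubid, sysid) tuple."""
    if isinstance(doctype, str):
        doctype = SHORTHANDS[doctype]
    name, pubid, sysid = doctype
    parts = ["<!DOCTYPE " + name]
    if pubid:
        parts.append('PUBLIC "%s"' % pubid)
    elif sysid:
        parts.append("SYSTEM")  # a system identifier without a public one
    if sysid:
        parts.append('"%s"' % sysid)
    return " ".join(parts) + ">"


print(doctype_declaration("html5"))
print(doctype_declaration("html"))
print(doctype_declaration(("note", None, "note.dtd")))
```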
The namespace_prefixes option specifies the namespace prefixes to use for namespaces that are not bound to a prefix in the stream itself.
(This option is not available for serialization to HTML or plain text.)
The drop_xml_decl option controls whether to remove the XML declaration (the <?xml ?> part at the beginning of a document) when serializing. This defaults to True, as an XML declaration throws some older browsers into "Quirks" rendering mode.
(This option is only available for serialization to XHTML.)
The strip_markup option controls whether the text serializer should detect and remove any tags or entity-encoded characters in the text.
(This option is only available for serialization to plain text.)
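A rough idea of what stripping markup from text involves is sketched below with the standard library. The `strip_markup` function and its regular expression are assumptions for illustration, not Genshi's implementation; in particular, this sketch decodes entity references to the characters they stand for rather than deleting them.

```python
import html
import re

# Matches comments and anything that looks like a tag
_TAG_OR_COMMENT = re.compile(r"<!--.*?-->|<[^>]*>", re.DOTALL)


def strip_markup(text):
    """Remove tags and comments, then decode entity references."""
    return html.unescape(_TAG_OR_COMMENT.sub("", text))


plain = strip_markup("<b>Fish &amp; chips</b><!-- daily special -->")
print(plain)
```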
XPath can be used to extract a specific subset of the stream via the select() method:
Often, streams cannot be reused: in the above example, the sub-stream is based on a generator. Once it has been serialized, it will have been fully consumed, and cannot be rendered again. To work around this, you can wrap such a stream in a list:
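The underlying behaviour is that of any Python generator, as this small sketch shows. The `select_text` helper is hypothetical and merely stands in for a selection that returns a generator-based sub-stream.

```python
def select_text(events):
    """Return a generator over the TEXT events only."""
    return (event for event in events if event[0] == "TEXT")


events = [
    ("START", "p", None),
    ("TEXT", "Some text", None),
    ("END", "p", None),
]

substream = select_text(events)
first_pass = list(substream)   # consumes the generator
second_pass = list(substream)  # nothing left to yield

reusable = list(select_text(events))  # materialize once, iterate many times
```

The list costs memory proportional to the number of events, so this is best reserved for sub-streams that really are iterated more than once.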
See Using XPath in Genshi for more information about the XPath support in Genshi.
Every event in a stream is of one of several kinds, which also determines what the data item of the event tuple looks like. The different kinds of events are documented below.
Note
The data item is generally immutable. If the data is to be modified when processing a stream, it must be replaced by a new tuple. Effectively, this means the entire event tuple is immutable.
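In practice this means a changed event is built rather than edited in place. The short sketch below uses a plain tuple as a stand-in for a stream event.

```python
event = ("TEXT", "Some text", (None, 1, 3))

# Item assignment on a tuple fails, so in-place modification is not possible
try:
    event[1] = "Other text"
except TypeError:
    mutated = False
else:
    mutated = True

# Instead, unpack the event and assemble a replacement
kind, data, pos = event
new_event = (kind, data.replace("Some", "Other"), pos)
```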
The opening tag of an element.
For this kind of event, the data item is a tuple of the form (tagname, attrs), where tagname is a QName instance describing the qualified name of the tag, and attrs is an Attrs instance containing the attribute names and values associated with the tag (excluding namespace declarations):
The closing tag of an element.
The data item of end events consists of just a QName instance describing the qualified name of the tag:
Character data outside of elements and comments.
For text events, the data item should be a unicode object:
The start of a namespace mapping, binding a namespace prefix to a URI.
The data item of this kind of event is a tuple of the form (prefix, uri), where prefix is the namespace prefix and uri is the full URI to which the prefix is bound. Both should be unicode objects. If the namespace is not bound to any prefix, the prefix item is an empty string:
The end of a namespace mapping.
The data item of such events consists of only the namespace prefix (a unicode object):
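The ordering of namespace events relative to element events can be observed with the standard library's expat parser, which reports namespace declarations through separate callbacks. This is a sketch under stated assumptions: the `namespace_events` helper is hypothetical, positions are left out, and plain strings in `{uri}local` form stand in for QName instances.

```python
from xml.parsers import expat


def namespace_events(text):
    """Collect (kind, data) pairs for namespace and element events."""
    events = []
    parser = expat.ParserCreate(None, "}")  # a separator enables namespace processing

    def qname(name):
        # expat reports "uri}local"; prepend "{" to get "{uri}local"
        return "{" + name if "}" in name else name

    def on_start_ns(prefix, uri):
        # expat passes None for the default namespace; use an empty string
        events.append(("START_NS", (prefix or "", uri)))

    def on_end_ns(prefix):
        events.append(("END_NS", prefix or ""))

    def on_start(name, attrs):
        events.append(("START", (qname(name), tuple(attrs.items()))))

    def on_end(name):
        events.append(("END", qname(name)))

    parser.StartNamespaceDeclHandler = on_start_ns
    parser.EndNamespaceDeclHandler = on_end_ns
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(text, True)
    return events


events = namespace_events(
    '<doc xmlns="http://example.org/ns">'
    '<x:item xmlns:x="http://example.org/x"/>'
    '</doc>'
)
for event in events:
    print(event)
```

Each namespace mapping opens just before the element that declares it and closes just after that element ends, and the declarations themselves do not appear among the element's attributes.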
A document type declaration.
For this type of event, the data item is a tuple of the form (name, pubid, sysid), where name is the name of the root element, pubid is the public identifier of the DTD (or None), and sysid is the system identifier of the DTD (or None):
A comment.
For such events, the data item is a unicode object containing all character data between the comment delimiters:
A processing instruction.
The data item is a tuple of the form (target, data) for processing instructions, where target is the target of the PI (used to identify the application by which the instruction should be processed), and data is text following the target (excluding the terminating question mark):
Marks the beginning of a CDATA section.
The data item for such events is always None:
Marks the end of a CDATA section.
The data item for such events is always None:
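The data shapes of the less common event kinds can be illustrated with one more expat-based sketch. As before, the `special_events` helper is hypothetical, positions are omitted, and element events are ignored so that only DOCTYPE, PI, COMMENT, CDATA, and TEXT events are collected.

```python
from xml.parsers import expat


def special_events(text):
    """Collect (kind, data) pairs for the non-element event kinds."""
    events = []
    parser = expat.ParserCreate()
    parser.buffer_text = True  # deliver each run of text in a single callback

    def on_doctype(name, sysid, pubid, has_internal_subset):
        # expat passes the system identifier first; reorder to (name, pubid, sysid)
        events.append(("DOCTYPE", (name, pubid, sysid)))

    parser.StartDoctypeDeclHandler = on_doctype
    parser.ProcessingInstructionHandler = (
        lambda target, data: events.append(("PI", (target, data))))
    parser.CommentHandler = lambda data: events.append(("COMMENT", data))
    parser.StartCdataSectionHandler = lambda: events.append(("START_CDATA", None))
    parser.EndCdataSectionHandler = lambda: events.append(("END_CDATA", None))
    parser.CharacterDataHandler = lambda data: events.append(("TEXT", data))
    parser.Parse(text, True)
    return events


source = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<?python x = 2?>'
    '<!--a note-->'
    '<html><![CDATA[x < y]]></html>'
)
events = special_events(source)
for event in events:
    print(event)
```

The CDATA markers carry no data of their own; the character data they enclose arrives as an ordinary text event between them.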