Charlie Halpern-Hamu, Ph.D.
Structured-Text Consultant Incremental Development, Inc.
charlie@IncrementalDevelopment.com
Charlie Halpern-Hamu received his doctorate in Computer Science from the University of Toronto. He is completing an MBA through Heriot-Watt University. He has published papers in the areas of denotational semantics, programming-language design tools and graphical control of robots by the disabled. He has been a structured-text consultant for eight years. His company, Incremental Development, Inc., helps organizations structure both text and surrounding business processes with emphasis on simplicity and skill-transfer.
This paper presents several Stupid XSL Tricks. A Stupid XSL Trick is a use of XSL for something unusual or amusing for which it wasn't necessarily designed. A better name for this paper would be Stupid XSLT Tricks, as all the examples in this paper use the transformation half of XSL, rather than the formatting-object half.
This paper is intended for an audience that, like the author, is learning XSLT and wishes to do so by poking around in various less-explored corners. This is not a scientific paper that expands the boundaries of human knowledge; it is more of a tutorial that might expand the boundaries of your knowledge.
This paper presents several Stupid XSL[xsl] Tricks. A Stupid XSL Trick is a use of XSL for something unusual or amusing for which it wasn't necessarily designed. A better name for this paper would be Stupid XSLT[xslt] Tricks, as all the examples in this paper use the transformation half of XSL, rather than the formatting-object half.
Here are today's entertainments:
Given an XML[xml] schema, produce a sample instance that conforms to that schema.
A schema expresses, as an XML document, the possible relations between elements, attributes and data values in a class of XML documents.
Here is a simplified schema:
<?xml version='1.0'?>
<schema>
<element name='doc'>
<archtype>
<element ref='head'/>
<element ref='body'/>
</archtype>
</element>
<element name='head'>
<archtype>
<element ref='title'/>
<element ref='date'/>
</archtype>
</element>
<element name='body'>
<archtype>
<element ref='para'/>
</archtype>
</element>
<element name='title'>
<archtype content='mixed'/>
</element>
<element name='date'>
<archtype>
<element name='year' type='four-digit-year'/>
<element name='month'>
<archtype content='mixed'/>
</element>
<element name='day' type='integer'/>
</archtype>
</element>
<element name='para'>
<archtype content='mixed'>
<element ref='bold'/>
<element ref='italic'/>
</archtype>
</element>
<element name='bold'>
<archtype content='mixed'/>
</element>
<element name='italic'>
<archtype content='mixed'/>
</element>
</schema>
I mean for this schema to correspond to the following DTD:
<!ELEMENT doc (head, body)>
<!ELEMENT head (title, date)>
<!ELEMENT body (para)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT date (year, month, day)>
<!ELEMENT year (#PCDATA)> <!--Should be a four-digit-year.-->
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)> <!--Should be an integer.-->
<!ELEMENT para (#PCDATA | bold | italic)*>
<!ELEMENT bold (#PCDATA)>
<!ELEMENT italic (#PCDATA)>
Note that I haven't included any occurrence
information: perhaps date should be
optional; certainly para should be allowed
more than once. This is left as an exercise for the
reader, or perhaps an exercise for the author if he can
find some more time.
Note also that this schema language allows elements to be
declared locally (as year,
month and day are in this example)
or globally (as all the other elements are). Locally
declared elements can be referenced only in the same
archtype element as their declaration. Locally
declared elements are an interesting feature of the
current
XML Schema working draft, because they allow
the same element name to have alternative content
restrictions depending on context.
Local element declarations are less challenging from the point of view of this XSLT exercise because the nested declarations map easily to a nested sample instance. For me, the interesting aspect of this exercise is following a chain of references to produce a consolidated, nested structure. References to local element declarations do add an interesting wrinkle, in that both the current context and the global context need to be searched for the appropriate declaration.
My goal is to write a transformation that produces the following output:
<?xml version='1.0'?>
<doc>
<head>
<title>title</title>
<date>
<year>four-digit-year [unknown type]</year>
<month>month</month>
<day>123</day>
</date>
</head>
<body>
<para>para and <bold>bold</bold> and
<italic>italic</italic>
</para>
</body>
</doc>
This example output, like all the others in this paper,
is in fact actual output generated using the
appropriate stylesheet and XT (James Clark's
implementation of
XSLT)[xt]. At
the time this paper was written, XT did not fully
implement all the attributes of
xsl:output. So XML declarations and indentation
have been added by hand.
The following XSLT stylesheet accomplishes the desired transformation:
<?xml version='1.0'?>
<xsl:stylesheet
version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output
method='xml'
indent='yes'/>
<xsl:template match='/schema'>
<xsl:apply-templates select='element[1]'/>
</xsl:template>
<xsl:template match='element[@name and archtype]'>
<xsl:element name='{@name}'>
<xsl:apply-templates select='archtype'/>
</xsl:element>
</xsl:template>
<xsl:template match='element[@name and @type]'>
<xsl:element name='{@name}'>
<xsl:choose>
<xsl:when test='@type="integer"'>
<xsl:text>123</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select='@type'/>
<xsl:text> [unknown type]</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:element>
</xsl:template>
<xsl:template match='archtype[@content="mixed"]'>
<xsl:value-of select='../@name'/>
<xsl:for-each select='*'>
<xsl:text> and </xsl:text>
<xsl:apply-templates select='.'/>
</xsl:for-each>
</xsl:template>
<xsl:template match='archtype'>
<xsl:apply-templates select='*'/>
</xsl:template>
<xsl:template match='element[@ref]'>
<xsl:apply-templates
select='../element[@name=current()/@ref]
| /schema/element
[@name=current()/@ref]'/>
</xsl:template>
</xsl:stylesheet>
Here's how it works.
The first xsl:template matches the
schema element just under the root of the tree.
It assumes that the first element child of
the schema is to be the root element of
the resulting sample instance. Therefore, attention is
directed, using apply-templates, to this
first element child.
The second xsl:template matches an
element declaration in which an element is
identified by the name attribute and is
defined using an archtype child. In this
case, a sample element is generated using
xsl:element. The contents of this generated
element are determined by directing attention to the
archtype that is the child of the
element.
The third xsl:template handles the other
way to declare an element. Rather than having a child
archtype that restricts the possible
subelements of the element being declared, an
element can have a type attribute
that determines the lexical structure of the textual
content of the element being declared. In this simple
example, we only handle one predefined datatype
(integer) and throw our hands up at any
other value.
Just as in the previous xsl:template, we
first output a sample element using
xsl:element before filling in its contents. The
second and third templates could have been combined
into a single template that matched based on the
existence of the name attribute. The
combined template could then unconditionally output a
sample element before turning its attention to how to
fill in the contents. At that point, it could
distinguish between the two ways of defining content:
the archtype child or the
type attribute.
The next two xsl:templates match the two
kinds of archtypes in our simplified
schema language: content='mixed' and the
default order='seq'. These two
xsl:templates are described in the following
paragraphs.
A natural next step would be to add
order='choice'. Choice could be handled very
much like sequence, except instead of recursing into
each child, only one would be chosen. (The first
child would be easiest choice.) As it is, the
stylesheet assumes that if it isnt mixed content, it
is a sequence. This assumption luckily works when
order='choice' and
maxOccurrences='*'. So this schema-to-instance
stylesheet can in fact be used with the
instance-to-schema stylesheet below to effect a round
trip. A round trip from schema to instance and back
has the predictable effect of loosening up the
schema. A round trip from instance to schema and back
can have the effect of 'beefing up' the sample.
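The choice rule suggested above is small enough to sketch outside XSLT. The following Python fragment is only an illustration of the idea, not part of the stylesheet: the function name children_to_sample is hypothetical, and it assumes the order attribute is spelled exactly as in the schemas shown in this paper.

```python
import xml.etree.ElementTree as ET


def children_to_sample(archtype):
    """Return the child declarations a sample instance should include."""
    children = list(archtype)
    if archtype.get('order') == 'choice':
        # A choice needs only one alternative; the first is the easiest pick.
        return children[:1]
    # Default sequence and mixed content: visit every child in order.
    return children


seq = ET.fromstring(
    "<archtype><element ref='head'/><element ref='body'/></archtype>")
choice = ET.fromstring(
    "<archtype order='choice'><element ref='note'/><element ref='para'/></archtype>")

print([c.get('ref') for c in children_to_sample(seq)])     # ['head', 'body']
print([c.get('ref') for c in children_to_sample(choice)])  # ['note']
```

Picking only the first alternative gives the smallest valid sample; visiting every alternative, as the stylesheet currently does, gives a richer sample that is still valid whenever the choice may repeat.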
The fourth xsl:template is the first of
the two xsl:templates to match the
archtype element. This template matches
mixed content. A mixed model is
transformed into a sample instance fragment by
generating some sample text (chosen here to be the same
as the name of the element) and iterating over each of
its children, if any.
Iteration using xsl:for-each is used
instead of the typical recursion using
xsl:apply-templates because we want to have the
opportunity to highlight the mixed content by adding
the word "and" before each subelement. If we used
xsl:apply-templates, the templates that
matched each child of the archtype would have to check
if they were in mixed content to add the "and"
themselves. The xsl:for-each allows this
logic to be consolidated in one place.
The fifth xsl:template is the second of
the two to match the archtype element. It
will only be triggered if the more specific
match pattern of the preceding
xsl:template fails. It is extremely easy to
handle. No text is allowed in an element-content
sequence, so we only need to assemble the ordered
results of evaluating the archtype's
children. XSLT is purposely open about the
order in which the attention will be paid to each of
the children, but guarantees that the output will be
strung together in the same order as the corresponding
input.
The last xsl:template matches the third
kind of element, a reference to an element
declared elsewhere in the schema. The
distinguishing characteristic of this kind of
element is that it has a ref
attribute. We want to behave just as if that element
declaration were defined right here where we are. So we
need to redirect our attention, using
xsl:apply-templates, to an element
whose name attribute matches the
current element's ref
attribute ([@name=current()/@ref]). The
element we seek may be in one of two places. It may be
local, in which case it is a sibling, another child of
our parent (../element). Or it may be
global, in which case it will be a top-level element, a
child of schema (/schema/element).
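For readers who want to check the logic outside an XSLT processor, the walk-through above can be restated as a rough Python sketch using the standard-library xml.etree.ElementTree module. It is a sketch under stated assumptions, not a replacement for the stylesheet: the function names find_declaration and build are hypothetical, the schema is a cut-down version of the one in this paper, and a reference resolves to the first matching declaration (local siblings first, then global) rather than to the union that the XPath expression selects.

```python
import xml.etree.ElementTree as ET

# Cut-down schema for illustration (assumption: same vocabulary as the paper's).
SCHEMA = """<schema>
  <element name='doc'>
    <archtype><element ref='date'/><element ref='para'/></archtype>
  </element>
  <element name='date'>
    <archtype>
      <element name='year' type='four-digit-year'/>
      <element name='day' type='integer'/>
    </archtype>
  </element>
  <element name='para'>
    <archtype content='mixed'><element ref='bold'/></archtype>
  </element>
  <element name='bold'><archtype content='mixed'/></element>
</schema>"""


def find_declaration(ref, siblings, schema):
    """Resolve a ref: search local siblings first, then top-level declarations."""
    for candidate in list(siblings) + list(schema):
        if candidate.tag == 'element' and candidate.get('name') == ref:
            return candidate
    raise LookupError('no declaration for ' + ref)


def build(decl, siblings, schema, out_parent):
    """Append sample content for one schema <element> to out_parent."""
    if decl.get('ref') is not None:
        # Behave as if the referenced declaration appeared right here.
        target = find_declaration(decl.get('ref'), siblings, schema)
        build(target, siblings, schema, out_parent)
        return
    sample = ET.SubElement(out_parent, decl.get('name'))
    archtype = decl.find('archtype')
    if archtype is None:
        # Declared by a type attribute: only 'integer' is understood.
        kind = decl.get('type')
        sample.text = '123' if kind == 'integer' else f'{kind} [unknown type]'
        return
    mixed = archtype.get('content') == 'mixed'
    if mixed:
        sample.text = decl.get('name')  # sample text is the element's own name
    for child in archtype:
        if mixed:
            # Put " and " before each subelement of mixed content.
            if len(sample) == 0:
                sample.text += ' and '
            else:
                sample[-1].tail = (sample[-1].tail or '') + ' and '
        build(child, archtype, schema, sample)


schema = ET.fromstring(SCHEMA)
holder = ET.Element('holder')
build(schema[0], schema, schema, holder)  # first declaration becomes the root
result = holder[0]
print(ET.tostring(result, encoding='unicode'))
```

Like the stylesheet, this sketch recurses without any depth limit, so a schema in which two elements may contain each other would never terminate.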
Given a sample instance, produce an XML schema that allows that document and some similar documents without allowing everything.
This trick was done for SGML instances and DTDs by Fred [fred] and more recently for XML DTDs by SAXON DTDGenerator [dtdgenerator].
My goal is to start with a sample instance like this:
<?xml version='1.0'?>
<document>
<head>
<title>Title</title>
<abstract>
<para>A <bold>bold</bold> bit.</para>
</abstract>
</head>
<body>
<para>First paragraph.</para>
<note>
<para>An <italic>italic</italic> bit.</para>
</note>
<para>Last paragraph.</para>
</body>
</document>
Note that one para contains
bold and another contains italic.
I'd like to collect these two clues into a single
declaration that allows both subelements.
Also note that we, as humans, can guess that
head must always appear before body
and at the same time guess that any number of
paras and notes can intermix in
body. I'm not going to have the stylesheet
make this guess; instead it will assume any number of
subelements in any order (based on the parent-child
relationships that it actually sees).
From the sample instance above, I would like to produce a schema like this:
<?xml version='1.0'?>
<schema xmlns="http://www.w3.org/1999/09/23-xmlschema/">
<element name="document">
<archtype order="choice" maxOccurrence="*">
<element ref="head"/>
<element ref="body"/>
</archtype>
</element>
<element name="head">
<archtype order="choice" maxOccurrence="*">
<element ref="title"/>
<element ref="abstract"/>
</archtype>
</element>
<element name="title">
<archtype content="mixed"/>
</element>
<element name="abstract">
<archtype order="choice" maxOccurrence="*">
<element ref="para"/>
</archtype>
</element>
<element name="bold">
<archtype content="mixed"/>
</element>
<element name="body">
<archtype order="choice" maxOccurrence="*">
<element ref="note"/>
<element ref="para"/>
</archtype>
</element>
<element name="note">
<archtype order="choice" maxOccurrence="*">
<element ref="para"/>
</archtype>
</element>
<element name="italic">
<archtype content="mixed"/>
</element>
<element name="para">
<archtype content="mixed">
<element ref="bold"/>
<element ref="italic"/>
</archtype>
</element>
</schema>
I mean for this schema to correspond to the following DTD:
<!ELEMENT document (head | body)*>
<!ELEMENT head (title | abstract)*>
<!ELEMENT title (#PCDATA)>
<!ELEMENT abstract (para)*>
<!ELEMENT bold (#PCDATA)>
<!ELEMENT body (note | para)*>
<!ELEMENT note (para)*>
<!ELEMENT italic (#PCDATA)>
<!ELEMENT para (#PCDATA | bold | italic)*>
This transformation is accomplished by the following XSLT stylesheet:
<?xml version='1.0'?>
<xsl:stylesheet
version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<xsl:output
method='xml'
indent='yes'/>
<xsl:template match='/'>
<xsl:element name='schema'>
<xsl:apply-templates select='//*'/>
</xsl:element>
</xsl:template>
<xsl:template match='*'>
<xsl:variable name='parent' select='name()'/>
<xsl:if
test='0 = count(
following::*[name()=$parent])'>
<xsl:element name='element'>
<xsl:attribute name='name'>
<xsl:value-of select='$parent'/>
</xsl:attribute>
<xsl:element name='archtype'>
<xsl:choose>
<xsl:when
test='0 != count(
//*[name()=$parent]/text())'>
<xsl:attribute name='content'>mixed</xsl:attribute>
</xsl:when>
<xsl:otherwise>
<xsl:attribute name='order'>choice</xsl:attribute>
<xsl:attribute name='maxOccurrence'>*</xsl:attribute>
</xsl:otherwise>
</xsl:choose>
<xsl:call-template name='find-kids'>
<xsl:with-param name='parent'
select='$parent'/>
</xsl:call-template>
</xsl:element>
</xsl:element>
</xsl:if>
</xsl:template>
<xsl:template name='find-kids'>
<xsl:param name='parent'/>
<xsl:for-each select='//*[name()=$parent]/*'>
<xsl:variable name='child'>
<xsl:value-of select='name()'/>
</xsl:variable>
<xsl:if
test='0 = count(
following::*[name()=$child]/
parent::*[name()=$parent])'>
<xsl:element name='element'>
<xsl:attribute name='ref'>
<xsl:value-of select='$child'/>
</xsl:attribute>
</xsl:element>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Here's how it works.
The first xsl:template matches the root of
the tree (/): one level above the root
element (document in our example input).
It writes out a schema element. The
"usual" next step would be to
apply-templates to the children of the root (the
root element). The template that matched this element
would do some processing and then move on to each child
of this new element, etc. Rather than taking the usual
approach, the select expression used here simply reviews
all the elements in the sample instance in one go
(//*).
The second xsl:template reviews an
element, potentially writing out an
element declaration for it. Because each element
may appear many times in the sample, but only one
element declaration should be created,
only the last example of each element type triggers the
creation of an element declaration. Each
input element is considered as a parent: all its
children (throughout the sample) are found and listed
in the element declaration created.
First, for convenience, set the variable
parent to the name of the current input element.
(We call the current element "parent"
because we are interested in all the children it has.)
Then, test if this is the last example of
the parent. The test is: "is
the count of all following elements whose
name is the same as parent equal to
0?". We only process the element if it is the last
example of its type. (The reason we choose to process
the last example instead of the more intuitive first
example is that XT does not yet implement the
preceding axis.) If there are more of the same
element to come, nothing is done now.
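The "last example wins" rule can be sketched in a few lines of Python. This is an illustration only: the function name last_occurrences is hypothetical, and it implements plain "last in document order", whereas the XPath following axis skips descendants, so the two rules differ when an element nests inside another element of the same name.

```python
import xml.etree.ElementTree as ET

SAMPLE = ET.fromstring(
    "<body><para>First.</para>"
    "<note><para>An <italic>italic</italic> bit.</para></note>"
    "<para>Last.</para></body>")


def last_occurrences(root):
    """Return each element name once, in the order of its last appearance."""
    elements = list(root.iter())  # document order, like //*
    # Later entries overwrite earlier ones, leaving the last index per name.
    last_index = {el.tag: i for i, el in enumerate(elements)}
    return [el.tag for i, el in enumerate(elements) if last_index[el.tag] == i]


print(last_occurrences(SAMPLE))  # ['body', 'note', 'italic', 'para']
```

This ordering explains why para is declared last in the generated schema even though it appears early in the sample.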
If this is the last example of the
parent in the sample, create an
element declaration. Set the name
attribute of this element declaration to
be parent, the name of the current input
element. Then check if the parent ever
contains text directly. Do this by
counting the number of times, in the whole sample, the
pattern "parent is parent and child is
text" is found. If parent
ever contains text, define its contents to
be #PCDATA mixed with all the
children found inside the parent.
Otherwise, define its contents to be a repeating choice
of all these same children, but without the
#PCDATA. So either set the content
attribute of the element declaration or
set both the order and
maxOccurrence attributes. Either way, the
contents of this element declaration
should be references to all the children that are ever
found inside the element being declared.
The third xsl:template does the actual
work of tracing down all the children of the
parent, wherever they appear in sample input. It
is not invoked by a match on the input,
but rather by xsl:call-template calling it
by its name. Because it is only called
once, one might reasonably choose to place this logic
inline in the previous xsl:template.
It iterates for-each element whose parent
is parent, anywhere in the document. The
name of each such element is stored in the
child variable. Again, to avoid repetition, only
the last example of each
parent/child
relationship triggers an element
reference. The ref attribute of the
element reference is set to child:
the name of the child element.
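The inference described above (collect every child seen under each element name, and treat any element that ever contains text as mixed) can be checked with a rough Python sketch. The function name infer_models and the shortened sample are assumptions made for illustration; unlike the stylesheet, the sketch lists names in first-seen rather than last-seen order, and it ignores whitespace-only text in the same spirit as xsl:strip-space.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<document>
  <head><title>Title</title></head>
  <body>
    <para>A <bold>bold</bold> bit.</para>
    <para>An <italic>italic</italic> bit.</para>
  </body>
</document>"""


def infer_models(root):
    """Map each element name to (is_mixed, child names in first-seen order)."""
    models = {}
    for el in root.iter():
        mixed, kids = models.get(el.tag, (False, []))
        # Text directly inside el is its own .text plus each child's .tail.
        pieces = [el.text] + [child.tail for child in el]
        if any(p and p.strip() for p in pieces):
            mixed = True
        for child in el:
            if child.tag not in kids:
                kids.append(child.tag)
        models[el.tag] = (mixed, kids)
    return models


models = infer_models(ET.fromstring(SAMPLE))
for name, (mixed, kids) in models.items():
    print(name, 'mixed' if mixed else 'choice*', kids)
```

Here the two para elements each contribute one clue, and the combined entry for para allows both bold and italic, which is the consolidation the stylesheet aims for.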
It's been a while since I've done this myself, so I'm not sure I'm using the right vocabulary to describe the class of functions I'm trying to differentiate. But an example will illustrate. Sample input and output:
f(x) = x^3 + 2x^2 + 3x + 4
f'(x) = 3x^2 + 4x + 3
To make this easy on myself, I'll start with a very structured and restrictive DTD that insists on input and output like this:
f(x) = 1x^3 + 2x^2 + 3x^1 + 4x^0
f'(x) = 3x^2 + 4x^1 + 3x^0 + 0x^-1
Here is that input DTD:
<!ELEMENT function-of-x (term+)>
<!ELEMENT term (coeff, x, power)>
<!ELEMENT coeff (#PCDATA)>
<!ELEMENT x EMPTY>
<!ELEMENT power (#PCDATA)>
This is the sample input shown above as expressed in this DTD:
<?xml version='1.0'?>
<function-of-x>
<term><coeff>1</coeff><x/><power>3</power></term>
<term><coeff>2</coeff><x/><power>2</power></term>
<term><coeff>3</coeff><x/><power>1</power></term>
<term><coeff>4</coeff><x/><power>0</power></term>
</function-of-x>
This is the sample output as expressed in the same DTD:
<?xml version='1.0'?>
<function-of-x>
<term><coeff>3</coeff><x/><power>2</power></term>
<term><coeff>4</coeff><x/><power>1</power></term>
<term><coeff>3</coeff><x/><power>0</power></term>
<term><coeff>0</coeff><x/><power>-1</power></term>
</function-of-x>
MathML[mathml] this ain't. For the curious, this is MathML:
<?xml version='1.0'?>
<reln><eq/>
<apply><diff/>
<bvar><ci>x</ci><degree><cn>1</cn></degree></bvar>
<apply><fn><ci>f</ci></fn><ci>x</ci></apply>
</apply>
<apply><plus/>
<apply><times/><cn>3</cn>
<apply><power/><ci>x</ci><cn>2</cn></apply></apply>
<apply><times/><cn>4</cn>
<apply><power/><ci>x</ci><cn>1</cn></apply></apply>
<apply><times/><cn>3</cn>
<apply><power/><ci>x</ci><cn>0</cn></apply></apply>
<apply><times/><cn>0</cn>
<apply><power/><ci>x</ci><cn>-1</cn></apply></apply>
</apply>
</reln>
The desired transformation is accomplished with this stylesheet:
<?xml version='1.0'?>
<xsl:stylesheet
version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<xsl:output
method='xml'
indent='yes'/>
<xsl:template match='/function-of-x'>
<xsl:element name='function-of-x'>
<xsl:apply-templates select='term'/>
</xsl:element>
</xsl:template>
<xsl:template match='term'>
<term>
<coeff>
<xsl:value-of select='coeff * power'/>
</coeff>
<x/>
<power>
<xsl:value-of select='power - 1'/>
</power>
</term>
</xsl:template>
</xsl:stylesheet>
Here's how it works.
The first xsl:template matches the root
element, echoes the same function-of-x
element to the output, and then turns its attention to
each child term.
The second xsl:template matches the
term element. It writes out a new
term, in which the coeff has the
value of the old coeff times the old
power, and the power has the
value of the old power minus
1.
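The same term-by-term rule can be restated as a rough Python sketch for readers who want to check the arithmetic. The function name differentiate is hypothetical, and the sketch assumes integer coefficients and powers, whereas XSLT numbers are floating point.

```python
import xml.etree.ElementTree as ET

INPUT = """<function-of-x>
<term><coeff>1</coeff><x/><power>3</power></term>
<term><coeff>2</coeff><x/><power>2</power></term>
<term><coeff>3</coeff><x/><power>1</power></term>
<term><coeff>4</coeff><x/><power>0</power></term>
</function-of-x>"""


def differentiate(root):
    """Apply the power rule to every term: c*x^n becomes (c*n)*x^(n-1)."""
    out = ET.Element('function-of-x')
    for term in root.findall('term'):
        coeff = int(term.findtext('coeff'))
        power = int(term.findtext('power'))
        new_term = ET.SubElement(out, 'term')
        ET.SubElement(new_term, 'coeff').text = str(coeff * power)
        ET.SubElement(new_term, 'x')
        ET.SubElement(new_term, 'power').text = str(power - 1)
    return out


result = differentiate(ET.fromstring(INPUT))
print([(t.findtext('coeff'), t.findtext('power')) for t in result])
# [('3', '2'), ('4', '1'), ('3', '0'), ('0', '-1')]
```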
Though the output is correct, it would be prettier if the stylesheet noted the power 1 and the coefficient 0 and simplified appropriately.
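One possible shape for that simplification, sketched in Python rather than XSLT: drop terms whose coefficient is 0, drop the x from terms of power 0, and drop the exponent from terms of power 1. The function names format_term and format_polynomial are hypothetical, and the sketch assumes non-negative integer coefficients.

```python
def format_term(coeff, power):
    """Render one term in simplified form; return None if the term vanishes."""
    if coeff == 0:
        return None
    if power == 0:
        return str(coeff)  # x^0 is 1, so only the coefficient remains
    base = 'x' if power == 1 else 'x^' + str(power)
    return base if coeff == 1 else str(coeff) + base


def format_polynomial(terms):
    """Join the surviving terms with ' + ', or return '0' if none survive."""
    rendered = (format_term(c, n) for c, n in terms)
    parts = [p for p in rendered if p is not None]
    return ' + '.join(parts) if parts else '0'


print(format_polynomial([(3, 2), (4, 1), (3, 0), (0, -1)]))  # 3x^2 + 4x + 3
```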
For me, the exciting thing about this example is the ease with which the coefficient and power of a term can be used in the expressions calculating the new values. Unlike the previous stylesheets, this one reads very much like English.
Some more tricks to try:
I expect that the principal stumbling block with each of these tricks will be the inability to do more than one iteration or to go more than a set number of levels deep.
Lessons learned:
XSLT effectively prevents multiple
iterations in which each iteration uses the results
of the previous iteration. Is there any sneaky way
around this for any/all of the tricks listed above?
current(). When I revised the paper in
response to the October draft, current()
alleviated the need for a variable.
XSLT's limits
on looping are quite intentional and so my
difficulties in looping do not reflect an accidental
omission in the language.
bold to nest in italic and
vice-versa.)
coeff * power above.
The key feature could have been used in
the schema-to-instance exercise if all
element declarations were global. It wouldn't
have been particularly easier on the
stylesheet-writer, but I assume that it should be
more efficient.
My original inspiration came after reading Rick Jelliffe's thoughts about using XSL as a schema-validation language [xslv] but before reading Francis Norton's follow-on suggestions about using XSLT to build such XSL schema-validators [xsltv]. Reading both spurred me on to actually experiment myself, showing that such tricks are possible. And amusing.