Read XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition Online
Authors: Michael Kay
Another attempt appears in the specification of Canonical XML (http://www.w3.org/TR/xml-c14n). This specification approaches the question by defining a transformation that can be applied to any XML document to turn it into canonical form; if two documents have the same canonical form, they are considered equivalent.
The process of turning a document into canonical form is summarized as follows:
1. The document is encoded in UTF-8.
2. Line breaks are normalized to x0A.
3. Attribute values are normalized, depending on the attribute type.
4. Character references and parsed entity references are expanded.
5. CDATA sections are replaced with their character content.
6. The XML declaration and document type declaration (DTD) are removed.
7. Empty element tags are converted to tag pairs.
8. Whitespace outside the document element and within tags is normalized.
9. Attribute value delimiters are set to double quotes.
10. Special characters in attribute values and character content are replaced by character references.
11. Redundant namespace declarations are removed.
12. Default attribute values defined in the DTD are added to each element.
13. Attributes and namespace declarations are sorted into alphabetical order.
Canonical form discards some of the original content that the InfoSet retains, for example CDATA sections. However, this specification has a gray area too: canonical form may or may not retain comments from the original document.
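The effect of several of these steps can be observed with a short Python sketch. It assumes Python 3.8 or later, whose standard library function xml.etree.ElementTree.canonicalize implements the later Canonical XML 2.0 specification rather than the version cited above; for the features shown here (attribute order, quote style, CDATA sections, empty element tags, the XML declaration, and optional comments) the two versions are assumed to behave alike. The sample documents are invented for illustration.

```python
from xml.etree.ElementTree import canonicalize

# Two hypothetical documents that differ only in lexical details:
# XML declaration, attribute order and quoting, empty-element tag, CDATA.
doc_a = """<?xml version="1.0"?><doc b='2' a="1"><empty/><![CDATA[x < y]]></doc>"""
doc_b = '<doc a="1" b="2"><empty></empty>x &lt; y</doc>'

canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

# Same canonical form, so the two documents count as equivalent.
print(canon_a == canon_b)

# Comments are the gray area: this function drops them unless asked.
doc_c = "<doc><!-- note -->text</doc>"
without_comments = canonicalize(doc_c)
with_comments = canonicalize(doc_c, with_comments=True)
print(without_comments)
print(with_comments)
```

Comparing canonical forms as strings is only as good as the canonicalizer in use; a document that relies on DTD-supplied default attributes, for example, would need a parser that actually reads the DTD.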
XDM leans more towards the minimalist view of the Canonical XML specification. Figure 2-8 illustrates the resulting classification: the central core is information that is retained in canonical form (comments being on the boundary, since the spec leaves the question open); the “peripheral” ring is information that is present in the InfoSet but not in canonical XML, while the outer ring represents features of an XML document that are also excluded from the InfoSet. XDM sticks essentially to the Core features (including comments), with a couple of minor additions: XSLT also recognizes unparsed entities, and also makes available the base URI (which is a rather peculiar property, since it can't actually be determined from the content of the XML document, only from its location).
From Textual XML to a Data Model
I've explained the data model so far in this chapter by relating the constructs in XDM (such as element nodes and text nodes) to constructs in a textual XML document. This isn't actually how the W3C specs define it. There are two important differences:
Examples of the variations that have arisen in this area between different 1.0 processors are:
XDM leaves additional scope for variations between processors. Because the model is designed to support XQuery as well as XSLT, the range of possible usage scenarios is greatly increased. Many XQuery vendors aim to offer implementations capable of searching databases containing hundreds of gigabytes of data, and in such environments performance optimization becomes a paramount requirement. In fact, database products have traditionally treated performance as a more important quality than standards conformance, and there are indications that this culture is present among some of the XQuery vendors. Examples of the kind of variations that may be encountered include the following:
It remains to be seen how most vendors will handle these problems. Hopefully, vendors will offer any optimizations as an option that the user can choose, rather than as the default way that source XML is processed when loading the data.
Controlling Serialization
The transformation processor, which generates the result tree, generally gives the user control only over the core information items and properties (including comments) in the output. The output processor or serializer gives a little bit of extra control over how the result tree is converted into a serial XML document. Specifically, it allows control over the following:
Although you get some control over these features during serialization, one thing you can't do is copy them from the source document unchanged through to the result. The fact that text was in a CDATA section in the input document has no bearing on whether it will be represented as a CDATA section in the output document. The tree model does not provide any way for this extra information to be retained.
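The same loss can be seen in any tree model that stores only character content. The Python sketch below uses the standard library's xml.etree.ElementTree as a stand-in; it is not XDM and not an XSLT processor, and the function name round_trip and the sample input are hypothetical. It parses a document containing a CDATA section and serializes the resulting tree again.

```python
import xml.etree.ElementTree as ET

def round_trip(source):
    """Parse source into a tree and return (text of root, reserialized XML)."""
    elem = ET.fromstring(source)
    # The tree holds only the characters; nothing records that they
    # arrived inside a CDATA section.
    return elem.text, ET.tostring(elem, encoding="unicode")

text, output = round_trip("<p><![CDATA[a < b]]></p>")
print(text)
print(output)
```

The serializer here escapes the special character instead of emitting a CDATA section, because the tree gives it no reason to do otherwise.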
The Transformation Process
I've described the essential process performed by XSLT, transformation of a source tree to a result tree under the control of a stylesheet, and looked at the structure of these trees. Now it's time to look at how the transformation process actually works, which means taking a look inside the stylesheet.
Invoking a Transformation
The actual interface for firing off a transformation is outside the scope of the XSLT specification, and it's done differently by different products. There are also different styles of interface: possibilities include an API that can be invoked by applications, a GUI interface within a development environment, a command line interface, an interface from a pipeline processor such as XProc or a build tool such as ant, as well as the use of an <?xml-stylesheet?> processing instruction within a source document, which is described in Chapter 3 (see page 99). There's a common API for Java processors that was initially called TrAX, then became part of JAXP, and since JDK 1.4 has been part of the standard Java class library. For browsers, Microsoft and Firefox each have their own API, but there is at least one project (Sarissa, see http://sarissa.sourceforge.net/) that provides a common API that can be used on both these browsers as well as Opera, Safari, and Konqueror.
What the XSLT 2.0 specification does do is to describe in abstract terms what information can be passed across this interface when the transformation is started. This includes the following:
Template Rules
As we saw in Chapter 1, most stylesheets will contain a number of template rules. Each template rule is expressed in the stylesheet as an <xsl:template> element with a match attribute. The value of the match attribute is a pattern. The pattern determines which nodes in the source tree the template rule matches.
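As a rough illustration of what it means for a pattern to match a node, the following Python sketch tests elements against patterns. It is not XSLT and not how a processor is implemented: the function names matches_pattern and matching_patterns and the sample document are hypothetical, and only two shapes of pattern are handled, a bare element name such as title and a parent/name pair such as chapter/title (taken to match a title element whose parent is a chapter element).

```python
import xml.etree.ElementTree as ET

def matches_pattern(pattern, node, parent):
    """Toy check for two pattern shapes only: "name" and "parent/name"."""
    steps = pattern.split("/")
    if node.tag != steps[-1]:
        return False
    if len(steps) == 1:
        return True
    # "parent/name": the node's parent must carry the first name.
    return parent is not None and parent.tag == steps[0]

def matching_patterns(xml_text, patterns):
    """Return (text, [patterns that match]) for every title element."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        (node.text,
         [p for p in patterns if matches_pattern(p, node, parent_of.get(node))])
        for node in root.iter("title")
    ]

sample = "<book><title>Main</title><chapter><title>One</title></chapter></book>"
for text, hits in matching_patterns(sample, ["title", "chapter/title"]):
    print(text, hits)
```

Note that matching is a question asked about a given node ("does this pattern match it?"), which is why the sketch tests each node against each pattern rather than searching the tree from the top.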
For example, the pattern / matches the document node, the pattern title matches a <title> element, and the pattern chapter/title
matches a