Read XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition Online
Authors: Michael Kay
A bold and
An example of a well-formed document that is not a well-formed external general parsed entity (because it contains a
standalone
attribute) is:
A
The rules for document entities and external general parsed entities overlap, as shown in
Figure 15-1
.
Essentially, an XSLT stylesheet can output anything that fits in either of the two shaded circles, which means anything that is a well-formed XML document entity, a well-formed external general parsed entity, or both.
Well, almost anything:
In the XML standard, the rules for an external parsed entity are given as:
extParsedEnt ⇒ TextDecl? content
where
content
is a sequence of components including child elements, character data, entity references, CDATA sections, processing instructions, and comments, each of which may appear any number of times and in any order.
The corresponding rule for a document entity is effectively:
document ⇒ XMLDecl? Misc* doctypedecl? Misc* element Misc*
where
Misc
permits whitespace, comments, and processing instructions.
So the principal differences between the two cases are:
The
TextDecl
(text declaration) looks at first sight very much like an XML declaration; for example,
could be used either as an XML declaration or as a text declaration. There are differences, however:
So the following are all examples of well-formed external general parsed entities:
1: Hello!
2: Hello!
Goodbye!
3: Hello!
4: Hello!
The following is a well-formed XML document, but it is
not
a well-formed external general parsed entity, because of both the
standalone
attribute and the document type declaration. This is also legitimate output:
Hello!
The following is neither a well-formed XML document nor a well-formed external general parsed entity.
Hello!
Goodbye!
It cannot be an XML document because it has more than one top-level element, and it cannot be an external general parsed entity because it has a
declaration. A stylesheet attempting to produce this output is in error. The XSLT specification also places two other constraints on the form of the output, although these are rules for the implementor to follow rather than rules that directly affect the stylesheet author. These rules are:
If the output is an XML document, the meaning of this is clear enough, but if it is merely an external entity, some further explanation is needed. The standard provides this by saying that when the entity is pulled into a document by adding an element tag around its content, the resulting document must conform with the XML Namespaces rules.
The rule is expressed by describing what may change when the data model is serialized to XML and then parsed again to create a new data model. Things that may change include the order of attributes and namespace nodes and the base URI. If the parsing stage does DTD or schema validation, then this may cause new attribute or element values to appear, as specified by the DTD or schema. Perhaps the most significant change is that type annotations on element and attribute nodes will not be preserved. Any type annotations in the data model after such a round trip will be based on revalidation of the textual XML document; the type annotations in the original result tree are lost during serialization.
Serialization never adds attributes such as
xml:base
or
xsi:type
. If you want these present in the output, you must put them in the result tree, just like any other attribute.
Between XSLT 1.0 and XSLT 2.0, there is a change in the way the rules concerning namespace declarations are described. In XSLT 1.0, it was the job of the serializer to generate namespace declarations, not only for namespace nodes explicitly present on the result tree, but also for any namespaces used in the result document, but not represented by namespace nodes. This situation can happen because there is nothing in the rules for
Although the output is required to be well-formed XML, it is not the job of the serializer to ensure that the XML is valid against either a DTD or a schema. Just because you generate a document type declaration that refers to a specific DTD, or a reference to a schema, don't expect the XSLT processor to check that the output document actually conforms to that DTD or schema. Instead, XSLT 2.0 provides facilities to validate the result tree against a schema before it is serialized, by using the
validation
or
type
attribute on
With the
xml
output method, the other attributes of
The W3C serialization specification speaks of
doctype-system
,
doctype-public
and so on as serialization parameters. This is because the specification is designed to be used independently of XSLT, and therefore abstracts away from the actual XSLT syntax used. In this book we're concerned with how to control serialization from XSLT, so we'll refer to these things as attributes, which is how they appear in XSLT's concrete syntax.
Attribute | Interpretation |
cdata-section-elements | This is a list of element names, each expressed as a lexical QName, separated by whitespace. Any prefix in a QName is treated as a reference to the corresponding namespace URI in the normal way, using the namespace declarations in effect on the actual cdata-section-elements attribute appears. Because these are element names, the default namespace is assumed where the name has no prefix. When a text node is output, if the parent element of the text node is identified by a name in this list, then the text node is output as a CDATA section. For example, the text value James is output as , and the text value AT&T is output as . Otherwise, this value would probably be output as AT&T . The XSLT processor is free to choose other equivalent representations if it wishes, for example a character reference, but the standard says that it should not use CDATA unless it is explicitly requested. The CDATA section may be split into parts if necessary, perhaps when the terminator sequence ]]> appears in the data, or when there is a character that can only be output using a character reference because it is not supported directly in the chosen encoding. |
doctype-system | If this attribute is specified, the output file will include a document type declaration (that is, ) after the XML declaration and before the first element start tag. The name of the document type will be the same as the name of the first element. The value of this attribute will be used as the system identifier in the document type declaration. This attribute should not be used unless the output is a well-formed XML document. |
doctype-public | This attribute is ignored unless the doctype-system attribute is also specified. It defines the value of the public identifier to go in the document type declaration. If no public identifier is specified, none is included in the document type declaration. |
encoding | This specifies the preferred character encoding for the output document. All XSLT processors are required to support the values UTF-8 and UTF-16 (which are also the only values that XML parsers are required to support). This encoding name will be used in the encoding attribute of the XML or Text declaration at the start of the output file, and all characters in the file will be encoded using these conventions. The standard encoding names are not case-sensitive. If the encoding is one that does not allow all XML characters to be represented directly, for example iso-8859-1 , then characters outside this subset will be represented where possible using XML character references (such as ₤ ). It is an error if such characters appear in contexts where character references are not recognized (for example within a processing instruction or comment, or in an element or attribute name). If the result tree is serialized to a destination that expects a stream of Unicode characters rather than a stream of bytes, then the encoding attribute is ignored. This happens, for example, if you send the output to a Java Writer . It often causes confusion when you use the transformNode() method in Microsoft's MSXML API, which returns the serialized result of the transformation as a string: This is a value of type BSTR , which is encoded in UTF-16 regardless of the encoding you requested in the stylesheet. |