XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition
Authors: Michael Kay
XSLT 2.0 and XPath 2.0 share similar mechanisms for dealing with collations, because they are needed not only in sorting but also in defining what operators such as eq mean, and in functions such as distinct-values(). The assumption behind the design is that many computing environments (for example the Windows operating system, the Java virtual machine, or the Oracle database platform) already include extensive mechanisms for defining and customizing collations, and that XSLT processors will be written to take advantage of these. As a result, sorting order will not be identical between different implementations.
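To illustrate why a collation affects equality tests and distinct-values() as well as sorting, here is a minimal Java sketch. It is not how any XSLT processor is implemented; the class and method names are invented for this example, and java.text.Collator merely stands in for a platform collation.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class CollationDistinct {

    /**
     * Keeps the first occurrence of each value, treating two strings as
     * equal when the supplied comparator (the "collation") returns 0.
     */
    public static List<String> distinctValues(List<String> values,
                                              Comparator<? super String> collation) {
        List<String> result = new ArrayList<>();
        for (String candidate : values) {
            boolean alreadySeen = false;
            for (String kept : result) {
                if (collation.compare(candidate, kept) == 0) {
                    alreadySeen = true;
                    break;
                }
            }
            if (!alreadySeen) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("apple", "Apple", "pear");

        // Plain String ordering: "apple" and "Apple" are different values.
        System.out.println(distinctValues(input, Comparator.<String>naturalOrder()));

        // A collation that ignores case and accents: they become the same value.
        Collator caseBlind = Collator.getInstance(Locale.ENGLISH);
        caseBlind.setStrength(Collator.PRIMARY);
        System.out.println(distinctValues(input, caseBlind));
    }
}
```

The same input yields three distinct values under one comparator and two under the other, which is why the choice of collation has to be defined for equality as well as for ordering.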
The basic model is that a collation (a set of rules for determining string ordering) is identified by a URI. Like a namespace URI, this is an abstract identifier, not necessarily the location of a document somewhere on the Web. The form of the URI, and its meaning, is entirely up to the implementation. There is a proposal (RFC 4790) for IANA (the Internet Assigned Numbers Authority) to set up a register of collation names, but even if this comes to fruition, it will still be up to the implementation to decide whether to support these registered collations or not. Until such time, the best you can do to achieve interoperability is to pass the collation URI to the stylesheet as a parameter; the API can then sort out the logic for choosing different collations according to which processor you are using.
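As a rough sketch of that approach, the calling Java code might pick a collation URI according to the processor in use and hand it to the stylesheet through the standard JAXP setParameter call. The parameter name collation-uri and the class below are invented for illustration, and the Saxon-style URI is an assumption that should be checked against the processor's own documentation.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;

public class CollationParam {

    /** The one collation URI that every implementation must support. */
    public static final String CODEPOINT =
            "http://www.w3.org/2005/xpath-functions/collation/codepoint";

    /**
     * Chooses a collation URI from the TransformerFactory class name.
     * The mapping is hypothetical: the URI forms are implementation-defined.
     */
    public static String chooseCollationUri(String factoryClassName) {
        if (factoryClassName.startsWith("net.sf.saxon.")) {
            // Assumed Saxon-style parameterized URI for French ordering.
            return "http://saxon.sf.net/collation?lang=fr";
        }
        // Fall back to the collation that is always available.
        return CODEPOINT;
    }

    public static void main(String[] args) throws TransformerConfigurationException {
        TransformerFactory factory = TransformerFactory.newInstance();
        // An identity transformer keeps the sketch self-contained; real code
        // would pass a stylesheet Source to newTransformer().
        Transformer transformer = factory.newTransformer();

        String uri = chooseCollationUri(factory.getClass().getName());
        transformer.setParameter("collation-uri", uri);
        System.out.println(transformer.getParameter("collation-uri"));
    }
}
```

The stylesheet would then declare a matching parameter and refer to it wherever a collation is needed, so that the stylesheet itself contains no processor-specific URI.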
The Unicode consortium has published an algorithm for collating strings called the Unicode Collation Algorithm (see http://www.unicode.org/unicode/reports/tr10/index.html). Although the XSLT specification refers to this document, it doesn't say that implementations have to support it. In practice, many of the facilities available in platforms such as Windows and Java are closely based on this algorithm. The Unicode Collation Algorithm is not itself a collation, because it can be parameterized. Rather, it is a framework for defining a collation with the particular properties that you are looking for.
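The idea of a parameterized framework can be seen in Java's java.text.Collator, used here purely as an illustration: the same locale produces different collations depending on the strength chosen. This is a sketch of the concept, not a statement about how any XSLT processor maps collation URIs onto Java collators; the class name is invented.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrength {

    /** Builds a collation from two parameters: a locale and a comparison strength. */
    public static Collator collatorFor(Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores both accent and case differences.
        Collator primary = collatorFor(Locale.ENGLISH, Collator.PRIMARY);
        // TERTIARY strength treats accent and case differences as significant.
        Collator tertiary = collatorFor(Locale.ENGLISH, Collator.TERTIARY);

        String plain = "cote";
        String accented = "Cot\u00E9"; // "Coté"

        System.out.println(primary.compare(plain, accented) == 0);
        System.out.println(tertiary.compare(plain, accented) == 0);
    }
}
```

Each combination of parameters is, in effect, a separate collation, which is why a collation URI has to identify the parameters as well as the underlying algorithm.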
You can specify the URI of the collation to be used in the collation attribute of the <xsl:sort> element.
There is one collation URI that every implementation is required to support, called the Unicode codepoint collation (not to be confused with the Unicode Collation Algorithm mentioned earlier). This is selected using the URI http://www.w3.org/2005/xpath-functions/collation/codepoint
Under the codepoint collation, strings are simply compared using the numeric code values of the characters in the string: if two characters have the same Unicode codepoint they are equal, and if one has a numerically lower Unicode codepoint, then it comes first. This isn't a very sophisticated or user-friendly algorithm, but it has the advantage of being fast. If you are sorting strings that use a limited alphabet, for example part numbers, then it is probably perfectly adequate.
Codepoint collation is subtly different from string comparison in languages such as Java. Java represents Greek Zero Sign (x1018A) as a surrogate pair (xD800, xDD8A) and compares strings one 16-bit code unit at a time, so it sorts this character before Wavy Overline (xFE4B). In XSLT, Wavy Overline comes first because its codepoint is lower.
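The difference can be reproduced with a short Java sketch. The compareByCodepoint method is written for this example (it is not a library call) and compares whole codepoints in the way the codepoint collation is described above, while String.compareTo compares UTF-16 code units.

```java
public class CodepointOrder {

    /** Compares two strings codepoint by codepoint rather than by UTF-16 code unit. */
    public static int compareByCodepoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Advance by one or two chars, depending on whether a surrogate pair was read.
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String greekZeroSign = new String(Character.toChars(0x1018A)); // stored as xD800, xDD8A
        String wavyOverline = "\uFE4B";

        // Java's built-in comparison looks at xD800 first, so Greek Zero Sign sorts first.
        System.out.println(greekZeroSign.compareTo(wavyOverline) < 0);

        // Codepoint comparison sees x1018A > xFE4B, so Wavy Overline sorts first.
        System.out.println(compareByCodepoint(greekZeroSign, wavyOverline) > 0);
    }
}
```

Both lines print true, showing that the two orderings disagree only for characters outside the Basic Multilingual Plane.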
If you specify a collation that the implementation doesn't recognize, then it raises an error. However, the word “recognize” is deliberately vague. An implementation could choose to recognize every possible collation URI that you might throw at it, and never raise this error at all. More probably, an implementation might decide to use parameterized URIs (for example, allowing a component such as language=fr to select the target language), and it's then an implementation decision whether to “recognize” a URI that contains invalid or missing parameters.
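The following Java sketch shows one way an implementation might make that decision. Everything in it is hypothetical: the base URI http://example.com/collation, the semicolon-separated parameter syntax, and the strict and lenient policies are all invented to illustrate the choice, and no real processor is claimed to work this way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CollationUriRecognizer {

    public static final String CODEPOINT =
            "http://www.w3.org/2005/xpath-functions/collation/codepoint";

    /** Hypothetical base URI for this imaginary processor's parameterized collations. */
    public static final String BASE = "http://example.com/collation";

    /** Splits "base?name=value;name=value" into its parameters (assumed syntax). */
    public static Map<String, String> parseParameters(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int queryStart = uri.indexOf('?');
        if (queryStart < 0 || queryStart == uri.length() - 1) {
            return params;
        }
        for (String part : uri.substring(queryStart + 1).split(";")) {
            int equals = part.indexOf('=');
            if (equals > 0) {
                params.put(part.substring(0, equals), part.substring(equals + 1));
            }
        }
        return params;
    }

    /**
     * Decides whether a collation URI is "recognized".
     * A lenient processor accepts any URI under its base; a strict one
     * also requires a plausible language parameter.
     */
    public static boolean isRecognized(String uri, boolean lenient) {
        if (uri.equals(CODEPOINT)) {
            return true; // always supported
        }
        if (!uri.equals(BASE) && !uri.startsWith(BASE + "?")) {
            return false; // some other vendor's URI
        }
        if (lenient) {
            return true; // ignore missing or invalid parameters
        }
        String language = parseParameters(uri).get("language");
        return language != null && language.matches("[A-Za-z]{2,3}");
    }

    public static void main(String[] args) {
        String missingLanguage = BASE + "?strength=primary";
        System.out.println(isRecognized(missingLanguage, true));
        System.out.println(isRecognized(missingLanguage, false));
    }
}
```

Under the lenient policy the URI with a missing language parameter is accepted, and under the strict policy it is rejected, so the same stylesheet could run on one processor and fail on another.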
If you don't specify the collation attribute on <xsl:sort>, you can instead use the lang and/or case-order attributes. These are retained from XSLT 1.0, which didn't support explicit collation URIs, but they are still available for use in 2.0.