XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition
Authors: Michael Kay
Effect
This pair of functions is analogous to the doc()/doc-available() pair described on page 750, except that the file referenced by the URI is treated as text rather than as XML. The file is located and its textual content is returned as the result of the unparsed-text() function, in the form of a string.
The value of the href argument is a URI. It mustn't contain a fragment identifier (the part marked with a # sign). It may be an absolute URI or a relative URI reference; if it is relative, it is resolved against the base URI of the stylesheet. This is true even if the relative reference is contained in a node in a source document. In this situation it is a good idea to resolve the URI yourself before calling this function. For example, if the reference is in an attribute called src, then the call might be unparsed-text(resolve-uri(@src, base-uri(@src))). The resolve-uri() function is described on page 867; its first argument is the relative reference, and the second argument is the base URI used to resolve it.
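The advice about resolving the URI yourself can be sketched as a template rule. This is a minimal illustration, not code from the book: the source element name include and the output element name text-content are hypothetical, chosen only for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed source markup: <include src="notes.txt"/> -->
  <xsl:template match="include">
    <!-- Resolve @src against the base URI of the source node,
         rather than the base URI of the stylesheet -->
    <xsl:variable name="uri"
        select="resolve-uri(@src, base-uri(@src))"/>
    <!-- text-content is a hypothetical wrapper element -->
    <text-content>
      <xsl:value-of select="unparsed-text($uri)"/>
    </text-content>
  </xsl:template>

</xsl:stylesheet>
```

Without the call to resolve-uri(), a relative value such as notes.txt would be looked up relative to the stylesheet, which is rarely what the author of the source document intended.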
The optional encoding argument specifies the character encoding of the file. This can be any character encoding supported by the implementation; the only encodings that an implementation must support are UTF-8 and UTF-16. The system will not necessarily use this encoding: the rules for deciding the encoding are based on those given in the XInclude recommendation.
Why would you use this function, rather than doc(), to access an XML document? The thinking is that it is quite common for one XML document to act as an envelope for another XML document that is carried transparently in a CDATA section, and if you want to create such a composite document, you will want to read the payload document without parsing it.
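The envelope case might look something like the following sketch. The element names envelope and payload and the parameter name payload-uri are assumptions made for this illustration; the technique relies only on unparsed-text() and the standard cdata-section-elements attribute of xsl:output.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical parameter: URI of the XML document to be carried as text -->
  <xsl:param name="payload-uri" as="xs:string"/>

  <!-- Ask the serializer to write the text content of <payload> as CDATA -->
  <xsl:output method="xml" cdata-section-elements="payload"/>

  <xsl:template name="main">
    <envelope>
      <payload>
        <!-- The payload is read as a string, so its markup is not parsed -->
        <xsl:value-of select="unparsed-text($payload-uri)"/>
      </payload>
    </envelope>
  </xsl:template>

</xsl:stylesheet>
```

Because the payload arrives as a string, any angle brackets in it are ordinary characters as far as the stylesheet is concerned; the CDATA section is simply the serializer's way of writing them out without escaping.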
Various errors can occur in this process, and they are generally fatal. This is why the auxiliary function unparsed-text-available() is provided: if any failure is going to occur when reading the file, then calling unparsed-text-available() will in effect catch the error before it occurs.
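A typical way to use the pair together is to test availability first and fall back if the test fails. The sketch below assumes a stylesheet parameter named input-uri and uses invented output elements (content and content-unavailable); the same encoding is passed to both calls so that they test and read the file in the same way.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:param name="input-uri" as="xs:string"/>

  <xsl:template name="main">
    <xsl:choose>
      <!-- Check first, so that a missing or unreadable file is not fatal -->
      <xsl:when test="unparsed-text-available($input-uri, 'utf-8')">
        <content>
          <xsl:value-of select="unparsed-text($input-uri, 'utf-8')"/>
        </content>
      </xsl:when>
      <xsl:otherwise>
        <xsl:message>Cannot read <xsl:value-of select="$input-uri"/></xsl:message>
        <!-- Hypothetical placeholder element for the failure case -->
        <content-unavailable/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```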
One limitation is that it is not possible to process a file if it contains characters that are invalid in XML (this applies to most control characters in the range x00 to x1F under XML 1.0, but only to the null character x00 under XML 1.1).
Usage and Examples
There are a number of ways this function can be used, and I will show three. These are as follows:
Up-Conversion
Up-conversion is the name often given to the process of analyzing input data for structure that is implicit in the textual content, and producing as output an XML document in which this structure is revealed by explicit markup. I have used this process, for example, to analyze HTML pages containing census data, in order to clean the data to make it suitable for adding to a structured genealogy database. It can also be used to process data that arrives in non-XML formats such as comma-separated values or EDI syntax.
The unparsed-text() function is not the only way of supplying non-XML data as input to a stylesheet; it can also be done simply by passing a string as the value of a stylesheet parameter. But the unparsed-text() function is particularly useful because the data is referenced by URI, and accessed under the control of the stylesheet.
XSLT 2.0 is much more suitable for use in up-conversion applications than XSLT 1.0. The most important tools are the
Here is an example of a stylesheet that reads a comma-separated-values file and turns it into structured markup.
Example: Processing a Comma-Separated-Values File
This example is a stylesheet that reads a comma-separated-values file, given the URI of the file as a stylesheet parameter. It outputs an XML representation of this file, placing the rows in a
Input
This stylesheet does not use any source XML document. Instead, it expects the URI of an ordinary text file to be supplied as a parameter to the stylesheet.
This is what the input file names.csv looks like.
123,"Mary Jones","IBM","USA",1997-05-14
423,"Barbara Smith","General Motors","USA",1996-03-12
6721,"Martin McDougall","British Airways","UK",2001-01-15
830,"Jonathan Perkins","Springer Verlag","Germany",2000-11-17
Stylesheet
This stylesheet, analyze-names.xsl, uses a named template main as its entry point: a new feature in XSLT 2.0.
The command for running the stylesheet under Saxon 9.0 looks like this.
java -jar saxon9.jar -it:main -xsl:analyze-names.xsl input-uri=names.csv
The -it option here indicates that processing should start without an XML source document, at the named template main.
The stylesheet first reads the input file using the unparsed-text() function, and then uses two levels of processing using xsl:analyze-string. The first level (using the regex \n) splits the input into lines. The second level is explained more fully under the description of the regex-group() function on page 860: it extracts either the contents of a quoted string, or any value terminated by a comma, and copies this to an element in the result.
xmlns:xs="http://www.w3.org/2001/XMLSchema"
version="2.0">
select="unparsed-text($input-uri, 'iso-8859-1')"/>
"]*?)")|([^,]+?),'>
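The processing described above can be sketched as a complete stylesheet. This is a reconstruction under stated assumptions rather than the book's own listing: the output element names table, row, and cell are invented for the illustration, and the call to concat() that appends a comma to each line is an assumption added so that an unquoted final field (such as the date) is also terminated by a comma and therefore matched.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="2.0">

  <xsl:param name="input-uri" as="xs:string"/>
  <xsl:output indent="yes"/>

  <xsl:template name="main">
    <!-- Read the whole file as a single string -->
    <xsl:variable name="in"
        select="unparsed-text($input-uri, 'iso-8859-1')"/>
    <!-- table, row and cell are assumed names, not taken from the source -->
    <table>
      <!-- Level 1: newlines are the separators; each non-matching
           substring is one line of the file -->
      <xsl:analyze-string select="$in" regex="\n">
        <xsl:non-matching-substring>
          <row>
            <!-- Level 2: match either a quoted string (group 2) or an
                 unquoted value followed by a comma (group 3).
                 Assumption: a comma is appended so the last field matches -->
            <xsl:analyze-string select="concat(., ',')"
                regex='("([^"]*?)")|([^,]+?),'>
              <xsl:matching-substring>
                <cell>
                  <!-- Only one of the two groups is non-empty for any match -->
                  <xsl:value-of select="regex-group(2)"/>
                  <xsl:value-of select="regex-group(3)"/>
                </cell>
              </xsl:matching-substring>
            </xsl:analyze-string>
          </row>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

The commas that separate quoted fields are not matched by either branch of the regular expression, so they fall through as non-matching substrings and are discarded because the inner instruction has no xsl:non-matching-substring child.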
Output
The output is as follows: