XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition (635 page)


Effect

This pair of functions is analogous to the doc()/doc-available() pair described on page 750, except that the file referenced by the URI is treated as text rather than as XML. The file is located and its textual content is returned as the result of the unparsed-text() function, in the form of a string.

The value of the href argument is a URI. It must not contain a fragment identifier (the part marked with a # sign). It may be an absolute URI or a relative URI reference; if it is relative, it is resolved against the base URI of the stylesheet. This is true even if the relative reference is contained in a node in a source document. In this situation it is a good idea to resolve the URI yourself before calling this function. For example, if the reference is in an attribute called src then the call might be unparsed-text(resolve-uri(@src, base-uri(@src))). The resolve-uri() function is described on page 867; its first argument is the relative reference, and the second argument is the base URI used to resolve it.
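As a minimal sketch of this advice, the following template resolves a relative reference against the base URI of the attribute that holds it before reading the file. The include element and the text-content wrapper are hypothetical names chosen for illustration; only the src attribute comes from the text above.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <!-- Hypothetical source markup: <include src="notes/intro.txt"/> -->
  <xsl:template match="include[@src]">
    <!-- Resolve against the base URI of the source node,
         not the base URI of the stylesheet -->
    <xsl:variable name="uri" select="resolve-uri(@src, base-uri(@src))"/>
    <!-- text-content is an illustrative wrapper element, not part of any vocabulary -->
    <text-content uri="{$uri}">
      <xsl:value-of select="unparsed-text($uri)"/>
    </text-content>
  </xsl:template>

</xsl:stylesheet>
```

Binding the resolved URI to a variable first makes it easy to report the absolute location in the output or in a diagnostic message.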

The optional encoding argument specifies the character encoding of the file. This can be any character encoding supported by the implementation; the only encodings that an implementation must support are UTF-8 and UTF-16. The system will not necessarily use this encoding: the rules for deciding the encoding are as follows, and are based on the rules given in the XInclude recommendation:

  • First, the processor looks for so-called external encoding information. This typically means information supplied in an HTTP header, but the term is general and could apply to any metadata associated with the file, for example WebDAV properties.
  • Next, it looks at the media type (MIME type), and if this identifies the file as XML, then it determines the encoding using the same rules as an XML parser (for example, it looks for an XML declaration, and if there is none, it looks for a byte order mark). Why would you use this function, rather than doc(), to access an XML document? The thinking is that it is quite common for one XML document to act as an envelope for another XML document that is carried transparently in a CDATA section, and if you want to create such a composite document, you will want to read the payload document without parsing it.
  • Next, it uses the encoding argument if this has been supplied.
  • If there is no encoding argument, the XSLT processor can use “implementation-defined heuristics” to guess the encoding; if that fails, then it tries to use UTF-8 encoding. The term “implementation-defined heuristics” could cover a wide range of strategies, such as recognizing known document types like HTML from the first few bytes of the file, or treating the file as Windows codepage 1252 if it cannot be decoded as UTF-8.

Various errors can occur in this process, and they are generally fatal. This is why the auxiliary function unparsed-text-available() is provided: if any failure is going to occur when reading the file, then calling unparsed-text-available() will in effect catch the error before it occurs.
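A minimal sketch of this guard follows, assuming the URI and encoding arrive as stylesheet parameters. The parameter names, the content wrapper element, and the wording of the message are illustrative assumptions. The same arguments are passed to both functions so that the test reflects the call that would actually be made.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="2.0">

  <!-- Illustrative parameters: the file to read and the encoding to request -->
  <xsl:param name="input-uri" as="xs:string" required="yes"/>
  <xsl:param name="encoding" as="xs:string" select="'utf-8'"/>

  <xsl:template name="main">
    <xsl:choose>
      <!-- Test first, so that a missing or undecodable file is not a fatal error -->
      <xsl:when test="unparsed-text-available($input-uri, $encoding)">
        <content>
          <xsl:value-of select="unparsed-text($input-uri, $encoding)"/>
        </content>
      </xsl:when>
      <xsl:otherwise>
        <!-- Report the problem in a controlled way instead -->
        <xsl:message terminate="yes">
          <xsl:text>Cannot read </xsl:text>
          <xsl:value-of select="$input-uri"/>
          <xsl:text> using encoding </xsl:text>
          <xsl:value-of select="$encoding"/>
        </xsl:message>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Whether to terminate or to substitute fallback content in the otherwise branch is a design choice that depends on the application.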

One limitation is that it is not possible to process a file if it contains characters that are invalid in XML (this applies to most control characters in the range x00 to x1F under XML 1.0, but only to the null character x00 under XML 1.1).

Usage and Examples

There are a number of ways this function can be used, and I will show three. These are as follows:

  • Up-conversion: that is, loading text that lacks markup in order to generate the XML markup
  • XML envelope/payload applications
  • HTML boilerplate generation

Up-Conversion

Up-conversion is the name often given to the process of analyzing input data for structure that is implicit in the textual content, and producing as output an XML document in which this structure is revealed by explicit markup. I have used this process, for example, to analyze HTML pages containing census data, in order to clean the data to make it suitable for adding to a structured genealogy database. It can also be used to process data that arrives in non-XML formats such as comma-separated values or EDI syntax.

The unparsed-text() function is not the only way of supplying non-XML data as input to a stylesheet; it can also be done simply by passing a string as the value of a stylesheet parameter. But the unparsed-text() function is particularly useful because the data is referenced by URI, and accessed under the control of the stylesheet.

XSLT 2.0 is much more suitable for use in up-conversion applications than XSLT 1.0. The most important tools are the <xsl:analyze-string> instruction, which enables the stylesheet to make use of structure that is implicit in the text, and the <xsl:for-each-group> instruction, which makes it much easier to analyze poorly structured markup. These can often be used in tandem: in the first stage in processing, <xsl:analyze-string> is used to recognize patterns in the text and mark these patterns using elements in a temporary tree, and in the second stage, <xsl:for-each-group> is used to turn flat markup structures into hierarchic structures that reflect the true data model.
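The two-stage approach can be sketched as follows. This is an illustration under stated assumptions, not an example from the book: the sample text, the "# " heading convention, and the heading, item, section, and sections element names are all invented for the purpose. The text is held inline so that the sketch needs no external file.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <xsl:output indent="yes"/>

  <!-- Invented sample input: lines starting with "# " are headings -->
  <xsl:variable name="text"
      select="'# Fruit&#10;apple&#10;pear&#10;# Vegetables&#10;carrot'"/>

  <xsl:template name="main">
    <!-- Stage 1: mark each line as a heading or an item in a flat temporary tree -->
    <xsl:variable name="flat">
      <xsl:analyze-string select="$text" regex="^(# )?(.+)$" flags="m">
        <xsl:matching-substring>
          <xsl:choose>
            <xsl:when test="regex-group(1)">
              <heading><xsl:value-of select="regex-group(2)"/></heading>
            </xsl:when>
            <xsl:otherwise>
              <item><xsl:value-of select="regex-group(2)"/></item>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>

    <!-- Stage 2: group the flat elements into one section per heading -->
    <sections>
      <xsl:for-each-group select="$flat/*" group-starting-with="heading">
        <section title="{.}">
          <xsl:copy-of select="current-group()[self::item]"/>
        </section>
      </xsl:for-each-group>
    </sections>
  </xsl:template>

</xsl:stylesheet>
```

Started at the named template main, this should produce a sections element holding one section for Fruit (with two items) and one for Vegetables (with one item). Keeping the stages separate means the regular expression only has to classify lines, while the grouping logic only has to deal with elements.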

Here is an example of a stylesheet that reads a comma-separated-values file and turns it into structured markup.

Example: Processing a Comma-Separated-Values File

This example is a stylesheet that reads a comma-separated-values file, given the URI of the file as a stylesheet parameter. It outputs an XML representation of this file, placing the rows in a <row> element and each value in a <cell> element. It does not attempt to process a header row containing field names, but this would be a simple extension.

Input

This stylesheet does not use any source XML document. Instead, it expects the URI of an ordinary text file to be supplied as a parameter to the stylesheet.

This is what the input file names.csv looks like.

123,"Mary Jones","IBM","USA",1997-05-14
423,"Barbara Smith","General Motors","USA",1996-03-12
6721,"Martin McDougall","British Airways","UK",2001-01-15
830,"Jonathan Perkins","Springer Verlag","Germany",2000-11-17

Stylesheet

This stylesheet analyze-names.xsl uses a named template main as its entry point: a new feature in XSLT 2.0.

The command for running the stylesheet under Saxon 9.0 looks like this.

java -jar saxon9.jar -it:main -xsl:analyze-names.xsl input-uri=names.csv

The -it option here indicates that processing should start without an XML source document, at the named template main.

The stylesheet first reads the input file using the unparsed-text() function, and then uses two levels of processing using <xsl:analyze-string> to identify the structure. The first level (using the regex \n) splits the input into lines. The second level is explained more fully under the description of the regex-group() function on page 860: it extracts either the contents of a quoted string, or any value terminated by a comma, and copies this to a <cell> element.


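<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="2.0">

<!-- URI of the comma-separated-values file, supplied on the command line -->
<xsl:param name="input-uri" as="xs:string"/>
<xsl:output indent="yes"/>

<xsl:template name="main">
  <xsl:variable name="input-text" as="xs:string"
                select="unparsed-text($input-uri, 'iso-8859-1')"/>
  <table>
  <!-- First level: split the text into lines -->
  <xsl:analyze-string select="$input-text" regex="\n">
    <xsl:non-matching-substring>
      <row>
      <!-- Second level: a quoted string, or a value terminated by a comma -->
      <xsl:analyze-string select="." regex='("([^"]*?)")|([^,]+?),'>
        <xsl:matching-substring>
          <cell>
             <xsl:value-of select="regex-group(2)"/>
             <xsl:value-of select="regex-group(3)"/>
          </cell>
        </xsl:matching-substring>
      </xsl:analyze-string>
      </row>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
  </table>
</xsl:template>

</xsl:stylesheet>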

Output

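The output is as follows. Note that the final date field of each line does not appear, because it is neither quoted nor terminated by a comma and so is not matched by the second-level regex.

<table>
   <row>
      <cell>123</cell>
      <cell>Mary Jones</cell>
      <cell>IBM</cell>
      <cell>USA</cell>
   </row>
   <row>
      <cell>423</cell>
      <cell>Barbara Smith</cell>
      <cell>General Motors</cell>
      <cell>USA</cell>
   </row>
   <row>
      <cell>6721</cell>
      <cell>Martin McDougall</cell>
      <cell>British Airways</cell>
      <cell>UK</cell>
   </row>
   <row>
      <cell>830</cell>
      <cell>Jonathan Perkins</cell>
      <cell>Springer Verlag</cell>
      <cell>Germany</cell>
   </row>
</table>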
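The example notes that handling a header row would be a simple extension. One possible sketch is shown below; it is a variation under stated assumptions, not the stylesheet above. It assumes that the first line of the file holds field names that are valid XML element names, and it introduces a hypothetical helper function f:fields in an invented namespace. It also appends a comma to each line before matching, so that the final field is captured as well.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:csv"
    exclude-result-prefixes="xs f"
    version="2.0">

  <xsl:param name="input-uri" as="xs:string" required="yes"/>
  <xsl:output indent="yes"/>

  <!-- Hypothetical helper: split one line into its field values.
       A comma is appended so that the last field is terminated too. -->
  <xsl:function name="f:fields" as="xs:string*">
    <xsl:param name="line" as="xs:string"/>
    <xsl:analyze-string select="concat($line, ',')"
                        regex='"([^"]*)",|([^,]*),'>
      <xsl:matching-substring>
        <!-- Exactly one of the two groups is non-empty for any match -->
        <xsl:sequence select="concat(regex-group(1), regex-group(2))"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <xsl:template name="main">
    <!-- Split into lines and discard blank ones -->
    <xsl:variable name="lines" as="xs:string*"
        select="tokenize(unparsed-text($input-uri, 'iso-8859-1'), '\r?\n')
                [normalize-space()]"/>
    <!-- Assumption: line 1 is the header row, with names usable as element names -->
    <xsl:variable name="names" as="xs:string*" select="f:fields($lines[1])"/>
    <table>
      <xsl:for-each select="subsequence($lines, 2)">
        <row>
          <xsl:for-each select="f:fields(.)">
            <!-- Capture the position before it changes inside the predicate -->
            <xsl:variable name="pos" select="position()"/>
            <xsl:element name="{$names[$pos]}">
              <xsl:value-of select="."/>
            </xsl:element>
          </xsl:for-each>
        </row>
      </xsl:for-each>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

With a header line such as id,name,company,country,joined (an invented example) placed at the top of names.csv, each row would contain elements named after those fields instead of generic cell elements. A row with more values than the header has names would fail, so a production version would need to validate the field count.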

   

