Read XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition Online
Authors: Michael Kay
Places
Place names have an internal structure, but the structure is highly variable. In many cases, components of the place name may be missing, and the part that is missing may be the major part rather than the minor part. For example, you might know that someone was born in Wolverton, England, without knowing which of the three towns of that name it refers to. The GEDCOM schema allows the place name to be entered as unstructured text but also allows individual components of the name to be marked up using an element with two attributes: Type, which can take values such as Country, City, or Parish to indicate what kind of place this is, and Level, which is a number that represents the relationship of this part of the place name to the other parts.
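As a rough illustration of this kind of markup, the fragment below marks up the Wolverton example. Only the Type and Level attributes come from the text. The element names Place and PlacePart, and the convention that a lower Level denotes a larger area, are assumptions made for this sketch and should be checked against the GEDCOM 6.0 schema.

```xml
<!-- Sketch only: element names and the Level numbering are assumed,
     not taken from the schema. -->
<Place>
  <!-- The county, which would tell the three Wolvertons apart, is the
       missing component, so no part is written for it. -->
  <PlacePart Type="City" Level="2">Wolverton</PlacePart>
  <PlacePart Type="Country" Level="1">England</PlacePart>
</Place>
```

The point of the sketch is that a missing component is simply omitted, while the parts that are known still say what kind of place they name.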
Personal Names
As with place names, personal names have a highly variable internal structure. The name can be written simply as a character string, but its individual parts can also be marked up: the Type attribute can be used to identify the name part as, for example, a surname or generation suffix, and the Level attribute can be used to indicate its relative importance, for example when used as a key for sorting and indexing.
Creating a Data File
Our next task is to create an XML file containing the Kennedy family tree in the appropriate format. I started by entering the data in a genealogy package, taking the information from public sources such as the Web site of the Kennedy museum. The package I use is called The Master Genealogist, and like all such software it is capable of outputting the data in GEDCOM 5.5 format. This is a file containing records that look something like this (it's included in the downloads for this chapter as kennedy.ged):
0 @I1@ INDI
1 NAME John Fitzgerald/Kennedy/
1 SEX M
1 BIRT
2 DATE 29 MAY 1917
2 PLAC Brookline, MA, USA
1 DEAT
2 DATE 22 NOV 1963
2 PLAC Dallas, TX, USA
2 NOTE Assassinated by Lee Harvey Oswald.
1 NOTE Educated at Harvard University.
2 CONT Elected Congressman in 1945
2 CONT aged 29; served three terms in the House of Representatives.
2 CONT Elected Senator in 1952. Elected President in 1960, the
2 CONT youngest ever President of the United States.
1 FAMS @F1@
1 FAMC @F2@
This isn't XML, of course, but it is a hierarchic data file containing tagged data, so it is a good candidate for converting into XML that looks like the document below. This doesn't conform to the GEDCOM 6.0 data model or schema, but it's a useful starting point.
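A mechanical conversion of the record above could produce something along the following lines. This is a sketch rather than the original listing: it assumes that each GEDCOM tag becomes an element of the same name, that nesting follows the level numbers, and that identifiers and cross-references are carried in ID and REF attributes.

```xml
<!-- Sketch only: element and attribute names are assumed from the GEDCOM tags. -->
<INDI ID="I1">
  <NAME>John Fitzgerald/Kennedy/</NAME>
  <SEX>M</SEX>
  <BIRT>
    <DATE>29 MAY 1917</DATE>
    <PLAC>Brookline, MA, USA</PLAC>
  </BIRT>
  <DEAT>
    <DATE>22 NOV 1963</DATE>
    <PLAC>Dallas, TX, USA</PLAC>
    <NOTE>Assassinated by Lee Harvey Oswald.</NOTE>
  </DEAT>
  <NOTE>Educated at Harvard University.
    <CONT>Elected Congressman in 1945</CONT>
    <CONT>aged 29; served three terms in the House of Representatives.</CONT>
    <CONT>Elected Senator in 1952. Elected President in 1960, the</CONT>
    <CONT>youngest ever President of the United States.</CONT>
  </NOTE>
  <FAMS REF="F1"/>
  <FAMC REF="F2"/>
</INDI>
```

Whether continuation lines stay as separate CONT elements or are merged into the text of the note is a design choice; the sketch keeps them separate because that is what a purely mechanical mapping would give.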
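Each record in a GEDCOM file has a unique identifier (in this case I1 – that's letter I, digit one), which is used to construct cross-references between records. Most of the information in this record is self-explanatory, except the FAMS and FAMC fields: FAMS is a reference to a family in which this person is a spouse or parent, and FAMC is a reference to a family in which this person is a child.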
The first stage in processing data is to do this conversion into XML, a process which we will examine in the next section.
Converting GEDCOM Files to XML
The main purpose of XSLT is to convert one XML document into another. But that's not all it can do; it can also generate structured text as the output, and in XSLT 2.0, there are new facilities to accept structured text files as the input. That's exactly what we need to do here.
We'll do this in two stages (splitting a complex transformation into a series of simpler transformations arranged in a pipeline is always a good idea). Since GEDCOM 5.5 is a hierarchic format that uses level numbers to represent the nesting, we will start by converting this mechanically to an XML representation. Then in the second phase, we will convert this first cut XML into XML that conforms to the GEDCOM 6.0 specification.
The source document is thus a text file containing records like this:
0 @I1@ INDI
1 NAME John Fitzgerald/Kennedy/
1 SEX M
1 BIRT
2 DATE 29 MAY 1917
2 PLAC Brookline, MA, USA
which needs to be converted into XML like this:
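For these six lines, the target could look roughly as follows. The mapping is an assumption made for illustration: the tag becomes the element name, the identifier between @ signs becomes an ID attribute, and a line at level n becomes a child of the nearest preceding line at level n-1. Each element is annotated with the source line it came from.

```xml
<!-- Sketch only: one element per GEDCOM line, nested by level number. -->
<INDI ID="I1">                         <!-- 0 @I1@ INDI -->
  <NAME>John Fitzgerald/Kennedy/</NAME> <!-- 1 NAME John Fitzgerald/Kennedy/ -->
  <SEX>M</SEX>                         <!-- 1 SEX M -->
  <BIRT>                               <!-- 1 BIRT -->
    <DATE>29 MAY 1917</DATE>           <!-- 2 DATE 29 MAY 1917 -->
    <PLAC>Brookline, MA, USA</PLAC>    <!-- 2 PLAC Brookline, MA, USA -->
  </BIRT>
</INDI>
```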
The stylesheet that does this (parse-gedcom.xsl) is in fact a micropipeline in its own right, written as a series of variable declarations, each one computing a new value from the value of the previous variable. It starts the usual way, and declares a parameter to accept the name of the input text document:
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">
<xsl:param name="input" as="xs:string" required="yes"/>
The file identified by this parameter is then read using the XSLT 2.0 unparsed-text() function:
<xsl:variable name="input-text" as="xs:string"
  select="unparsed-text($input, 'iso-8859-1')"/>
I've actually cheated here. GEDCOM requires files to be encoded in a character set called ANSEL, otherwise known as ANSI Z39.47-1985, which is used for almost no other purpose. If ANSEL were a mainstream character encoding, it could be specified in the second argument of the unparsed-text() function call. In practice, however, it is rather unlikely that any XSLT 2.0 processor would support this encoding natively. Therefore, the conversion from ANSEL to a mainstream character encoding needs some extra logic. If you use Saxon, you can write a custom UnparsedTextResolver in Java to take care of low-level interfacing issues like this. This class can invoke a custom character-code converter in the form of a Java Reader; an example called AnselInputReader is supplied in the downloads for this chapter. (For detailed instructions, see the Saxon documentation.)
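Because reading fails if the file is missing or cannot be decoded in the requested encoding, it can be worth guarding the call. The following is a minimal sketch, not part of parse-gedcom.xsl: it uses the standard XSLT 2.0 function unparsed-text-available() to test the same URI and encoding first. The template name main and the report element are illustrative assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the parameter name "input" follows the text; the template
     name "main" and the <report> element are assumptions. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:param name="input" as="xs:string" required="yes"/>

  <xsl:template name="main">
    <xsl:choose>
      <!-- True only if the resource can be read and decoded as iso-8859-1 -->
      <xsl:when test="unparsed-text-available($input, 'iso-8859-1')">
        <xsl:variable name="input-text" as="xs:string"
            select="unparsed-text($input, 'iso-8859-1')"/>
        <!-- Report the size of what was read, as a simple sanity check -->
        <report chars="{string-length($input-text)}"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:message terminate="yes">Cannot read <xsl:value-of
            select="$input"/> as iso-8859-1</xsl:message>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The sketch is meant to be run with main as the initial template. A relative URI passed in the input parameter is resolved against the base URI of the stylesheet, not the current directory.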
We can now split the input file into lines by using the XPath 2.0 tokenize() function. We use a separator that matches both Unix and Windows line endings:
<xsl:variable name="lines" as="xs:string*"
  select="tokenize($input-text, '\r?\n')"/>
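One detail of tokenize() is easy to overlook: if the text ends with a line ending, the last token is a zero-length string. The sketch below, which is not part of the original stylesheet, applies the same separator to a literal string with mixed line endings and filters out empty tokens. The variable names follow the text, while the template name main and the output elements are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: a literal string stands in for the result of unparsed-text(). -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:template name="main">
    <!-- &#13;&#10; is a Windows line ending, &#10; a Unix one -->
    <xsl:variable name="input-text" as="xs:string"
        select="'0 @I1@ INDI&#13;&#10;1 SEX M&#10;1 BIRT&#10;'"/>
    <!-- The predicate drops the empty token produced by the final line ending -->
    <xsl:variable name="lines" as="xs:string*"
        select="tokenize($input-text, '\r?\n')[. ne '']"/>
    <lines count="{count($lines)}">
      <xsl:for-each select="$lines">
        <l><xsl:value-of select="."/></l>
      </xsl:for-each>
    </lines>
  </xsl:template>

</xsl:stylesheet>
```

With this input the count attribute is 3. Without the predicate it would be 4, the last item being the empty string. In the stylesheet described here the filter is not strictly needed, since an empty line never matches the line pattern, but it makes the behavior explicit.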
The result is a sequence of strings (one for each line), and the next stage is to parse the individual lines. Each line in a GEDCOM file has up to five fields: a level number, an identifier, a tag, a cross-reference, and a value. We will create an XML element for each line, with an attribute for each of these fields:
<xsl:analyze-string select="." flags="x"
  regex="^([0-9]+)\s*
         (@([A-Za-z0-9]+)@)?\s*
         ([A-Za-z]*)?\s*
         (@([A-Za-z0-9]+)@)?
         (.*)$">
  <xsl:matching-substring>
    <line level="{regex-group(1)}"
          ID="{regex-group(3)}"
          tag="{regex-group(4)}"
          REF="{regex-group(6)}"
          text="{regex-group(7)}"/>
  </xsl:matching-substring>
</xsl:analyze-string>
This code creates one element for each line of the input, with attributes holding the contents of the fields. The attribute flags="x" means that whitespace in the pattern is ignored, which allows the regex to be split into multiple lines for readability.
The attribute values are obtained using the regex-group() function, which returns the part of the matching substring that matched the nth parenthesized subexpression within the regex. If the relevant part of the regex wasn't matched (for example, if the optional identifier was absent), then this returns a zero-length string, and our XSLT code then creates a zero-length attribute.
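Putting the steps together, the micropipeline can be sketched as a self-contained stylesheet. This is an illustration under stated assumptions, not the text of parse-gedcom.xsl: a literal string stands in for the file contents, the names lines, parsed-lines, line, and main are assumed, and the xsl:non-matching-substring branch that reports unparseable lines is an addition.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: variable, element, and template names are assumptions. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" indent="yes"/>

  <!-- Stage 1: the raw text (three GEDCOM lines and one malformed line) -->
  <xsl:variable name="input-text" as="xs:string"
      select="'0 @I1@ INDI&#10;1 NAME John Fitzgerald/Kennedy/&#10;1 FAMS @F1@&#10;not a gedcom line'"/>

  <!-- Stage 2: split into lines -->
  <xsl:variable name="lines" as="xs:string*"
      select="tokenize($input-text, '\r?\n')"/>

  <!-- Stage 3: parse each line into a <line> element with five attributes -->
  <xsl:variable name="parsed-lines" as="element(line)*">
    <xsl:for-each select="$lines">
      <xsl:analyze-string select="." flags="x"
          regex="^([0-9]+)\s*
                 (@([A-Za-z0-9]+)@)?\s*
                 ([A-Za-z]*)?\s*
                 (@([A-Za-z0-9]+)@)?
                 (.*)$">
        <xsl:matching-substring>
          <line level="{regex-group(1)}"
                ID="{regex-group(3)}"
                tag="{regex-group(4)}"
                REF="{regex-group(6)}"
                text="{regex-group(7)}"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- A line that does not start with a level number ends up here -->
          <xsl:message>Skipping unparseable line: <xsl:value-of
              select="."/></xsl:message>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:for-each>
  </xsl:variable>

  <!-- Entry point: dump the intermediate result for inspection -->
  <xsl:template name="main">
    <parsed>
      <xsl:sequence select="$parsed-lines"/>
    </parsed>
  </xsl:template>

</xsl:stylesheet>
```

Run with main as the initial template, this should produce three line elements: one with level 0, ID I1, and tag INDI; one with tag NAME and the name as its text; and one with tag FAMS and REF F1. Fields that are absent from a line appear as zero-length attributes, and the malformed fourth line is reported through xsl:message and produces no element.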
So we now have a sequence of XML elements each representing one line of the GEDCOM file, each containing attributes to represent the contents of the five fields in the input. It's useful when debugging to display the content of this intermediate variable, and I added a debugging template to the stylesheet
(