XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition
Authors: Michael Kay
My first step was to load the DTD into Stylus Studio and convert it to a schema. You could equally well do this using other tools such as XML Spy or oXygen. In fact Stylus Studio offers a choice of two converters, one of which is native to Stylus, the other a packaging of James Clark's trang program. I found that the native tool, with all options defaulted, did a very satisfactory job: the output is in the download directory as rawschema-stylus.xsd.
I then refined this schema by hand. The changes fell into the following categories:
An interesting feature of this data is that the schema is very permissive. For example, it specifies a default format for dates in the form DD MMM YYYY (such as 18 APR 1924), which has long been the convention used by genealogists. However, it doesn't insist that the date of an event takes this form. It's quite OK, for example, to replace the last digit of the year by a question mark, perhaps to reflect the fact that the digit is difficult to decipher on an original manuscript. There are certain approved conventions such as preceding the date with ABT to indicate that the date is approximate, or EST to say that it is estimated, but there are no absolute rules. The golden rule in genealogy is that when you find information in a source document, you should be able to transcribe it as faithfully to the original as you possibly can, and a schema that imposes restrictions on your ability to do this is considered a bad thing. If you find an old church register in which a date of baptism is recorded as Septuagesima 1582, then you should be able to enter that in your database. I'll come back to the modeling of dates in the schema on page 1057.
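The permissive treatment of dates can be illustrated with a small sketch. The Python below is not part of the GEDCOM specification or of the schema discussed in this chapter; the names parse_date, ParsedDate, and DATE_RE are hypothetical, and the regular expression is only one plausible reading of the DD MMM YYYY convention with an optional ABT or EST prefix. The point it tries to show is that a value that does not fit the standard form is kept verbatim rather than rejected.

```python
import re
from typing import NamedTuple, Optional

# Three-letter month abbreviations used in the DD MMM YYYY convention.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

# Hypothetical pattern: optional ABT/EST qualifier, then day, month, year.
DATE_RE = re.compile(r"^(?:(ABT|EST)\s+)?(\d{1,2})\s+([A-Z]{3})\s+(\d{4})$")


class ParsedDate(NamedTuple):
    raw: str                   # the text exactly as transcribed
    qualifier: Optional[str]   # "ABT", "EST", or None
    day: Optional[int]
    month: Optional[int]       # 1..12, or None if not recognized
    year: Optional[int]


def parse_date(text: str) -> ParsedDate:
    """Recognize the standard form if possible; never reject the input."""
    m = DATE_RE.match(text.strip())
    if m and m.group(3) in MONTHS:
        return ParsedDate(
            raw=text,
            qualifier=m.group(1),
            day=int(m.group(2)),
            month=MONTHS.index(m.group(3)) + 1,
            year=int(m.group(4)),
        )
    # Anything else (for example "18 APR 192?" or "Septuagesima 1582")
    # is preserved as free text with no structured components.
    return ParsedDate(raw=text, qualifier=None, day=None, month=None, year=None)
```

Under these assumptions, parse_date("18 APR 1924") yields structured components, while parse_date("Septuagesima 1582") returns only the raw text, so the transcription is never lost.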
In GEDCOM, there is no formal way of linking one file to another. XML, of course, creates wonderful opportunities to define how your family tree links to someone else's. But the linking isn't as easy as it sounds (nothing is, in genealogy) because of the problems of maintaining version integrity between two datasets that are changing independently. So I'll avoid getting into that area and stick to the model that has the whole family tree in one XML document.
The GEDCOM 6.0 Schema
Let's now take a quick look at some aspects of the XML Schema which I created for GEDCOM 6.0. In principle, because it's converted from the DTD, it covers all aspects of the specification; however, in improving the schema to describe the specification more precisely and more usefully, I concentrated on the parts that we are actually using in the application in this chapter: in particular, the three main object types individual, event, and family, and the three main properties, namely date, place, and personal name.
Individuals
The element declaration for an IndividualRec contains the following child elements. IndivName gives the name of the individual. Gender has the obvious meaning; DeathStatus is for recording information such as “died in infancy” when no specific death event is known. PersInfo allows recording of arbitrary personal information such as occupation and religion. AssocIndiv is for links to related individuals where the relationships cannot be expressed directly through Family objects (for example, links to godparents). DupIndiv is interesting: it allows an assertion that this IndividualRec refers to the same individual as another IndividualRec. This is very useful when combining data sets compiled by different genealogists; merging the two records into one can be very difficult if there are inconsistencies in the data, and it can prove very difficult to unmerge the data later if they are found to be different individuals after all. Within the CommonFields group, which is also present in other top-level elements, ExternalID is for reference numbers that identify the individual in external databases; Submitter is the person who created this record; Note is for arbitrary comments; Evidence says where the information came from; Enrichment is for inline documentation such as photographs or transcripts of original documents, and Changed is for a change history of this record.
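To make the division between record-specific children and the CommonFields group concrete, here is a minimal Python sketch. It is illustrative only: the sample record, its Id attribute, and its text content are invented for the example, and the helper split_children is hypothetical. Only the element names are taken from the description above; the real schema may nest further structure inside each of them.

```python
import xml.etree.ElementTree as ET

# Element names that the text above places in the CommonFields group.
COMMON_FIELDS = ("ExternalID", "Submitter", "Note",
                 "Evidence", "Enrichment", "Changed")

# Invented sample record; the Id attribute and all content are assumptions.
SAMPLE = """
<IndividualRec Id="I1">
  <IndivName>Mary Example</IndivName>
  <Gender>F</Gender>
  <DeathStatus>died in infancy</DeathStatus>
  <Note>Illustrative comment.</Note>
</IndividualRec>
"""


def split_children(rec):
    """Return (record-specific tags, CommonFields tags) in document order."""
    own, common = [], []
    for child in rec:
        if child.tag in COMMON_FIELDS:
            common.append(child.tag)
        else:
            own.append(child.tag)
    return own, common


record = ET.fromstring(SAMPLE.strip())
own_tags, common_tags = split_children(record)
```

For the sample above, own_tags holds IndivName, Gender, and DeathStatus, and common_tags holds Note, mirroring the two groups described in the text.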