Read XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition Online
Authors: Michael Kay
Another situation that can cause two different substrings to match at the same position is where the regex contains two alternatives that both match. For example, when the regex #|## is applied to a string that contains two consecutive # characters, both branches will match. The rule here is that the first (leftmost) alternative wins, so each # is matched as a separate separator. This is almost certainly not what was intended: rewrite the expression as ##|# or as ##? instead.
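As a rough illustration of the leftmost-alternative rule, the stylesheet below tokenizes a string containing two consecutive # characters with each of the three regexes. The input string 'a##b' and the initial template name main are assumptions made for this sketch. It should run under an XSLT 2.0 processor that can start at a named template.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical entry point; the name "main" is an assumption -->
  <xsl:template name="main">
    <!-- '#' is the first alternative, so each '#' is its own separator:
         three tokens, the middle one zero-length. Prints a||b -->
    <xsl:value-of select="tokenize('a##b', '#|##')" separator="|"/>
    <xsl:text>&#10;</xsl:text>

    <!-- The longer alternative comes first, so the pair is consumed as a
         single separator. Prints a|b -->
    <xsl:value-of select="tokenize('a##b', '##|#')" separator="|"/>
    <xsl:text>&#10;</xsl:text>

    <!-- The greedy optional second '#' has the same effect. Prints a|b -->
    <xsl:value-of select="tokenize('a##b', '##?')" separator="|"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The doubled separator in the first line of output marks the zero-length token that appears when the single-character branch wins.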
If the input string starts with a separator, then the output sequence will start with a zero-length string representing what was found before the first separator. If the input string ends with a separator, there will similarly be a zero-length string at the end of the sequence. If there are two adjacent separators in the middle of the string, you will get a zero-length string in the middle of the result sequence. In all cases the number of items in the result sequence is the number of separators in the input string plus one.
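The counting rule can be checked with a small sketch like the one below. The input ',a,,b,' is an invented example with a leading, a trailing, and two adjacent separators, and the template name main is again an assumption. The final expression shows one common way to discard the zero-length tokens if they are not wanted.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical entry point; the name "main" is an assumption -->
  <xsl:template name="main">
    <!-- Four commas in the input, so five tokens are expected -->
    <xsl:variable name="tokens" select="tokenize(',a,,b,', ',')"/>

    <!-- Prints 5 -->
    <xsl:value-of select="count($tokens)"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Bracket each token so the zero-length ones are visible.
         Prints [] [a] [] [b] [] -->
    <xsl:value-of
        select="for $t in $tokens return concat('[', $t, ']')"
        separator=" "/>
    <xsl:text>&#10;</xsl:text>

    <!-- Filter out zero-length tokens with a predicate. Prints a b -->
    <xsl:value-of select="$tokens[. ne '']" separator=" "/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```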
If the regex does not match the input string, the tokenize() function will return the input string unchanged, as a singleton sequence. If this is not the effect you are looking for, use the matches() function first to see if there is a match.
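One possible way to apply that advice is sketched below. The input string and the choice to return an empty sequence when no separator is present are assumptions for illustration; a real stylesheet might prefer some other fallback.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical entry point; the name "main" is an assumption -->
  <xsl:template name="main">
    <!-- Invented input that contains no comma -->
    <xsl:variable name="input" select="'no commas here'"/>

    <!-- Without a guard, the whole input comes back as a single token.
         Prints 1 -->
    <xsl:value-of select="count(tokenize($input, ','))"/>
    <xsl:text>&#10;</xsl:text>

    <!-- With a matches() guard, fall back to the empty sequence instead.
         Prints 0 -->
    <xsl:value-of
        select="count(if (matches($input, ','))
                      then tokenize($input, ',')
                      else ())"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```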
If the regex is one that matches a zero-length string, that is, if matches("", $regex) is true, the system reports an error. An example of such a regex is \s*. Although various interpretations of such a construct are possible, the Working Group decided that the results were too confusing and chose not to allow it.
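When the regex is supplied at run time, the same test can be used as a guard before calling tokenize(). The following is only a sketch: the variable name regex, the sample input, and the wording of the message are all hypothetical. With \s+ the guard is false and the string is split normally; substituting \s* would take the first branch instead.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical entry point; the name "main" is an assumption -->
  <xsl:template name="main">
    <!-- \s+ requires at least one whitespace character, so it cannot
         match a zero-length string -->
    <xsl:variable name="regex" select="'\s+'"/>

    <xsl:choose>
      <!-- True for regexes such as \s* that match the empty string -->
      <xsl:when test="matches('', $regex)">
        <xsl:text>regex matches a zero-length string; not tokenizing</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <!-- Prints a|b|c -->
        <xsl:value-of select="tokenize('a  b c', $regex)" separator="|"/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```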
Examples
Expression | Result |
tokenize("Go home, Jack!", "\W+") | ("Go", "home", "Jack", "") |
tokenize("abc[NL]def[XY]", "\[.*?\]") | ("abc", "def", "") |
Usage
A limitation of this function is that it is not possible to do anything with the separator substrings. This means, for example, that you can't treat a number differently depending on whether it was separated from the next number by a comma or a semicolon. One solution to this problem is to process the string in two passes: first, do a replace() call in which the separators "," and ";" are replaced by (say) ",#" and ";#"; then use tokenize() to split the string at the "#" characters, and the original "," or ";"