There are four schema languages for expressing data constraints in XML documents: Schematron, Relax NG, XML Schema, and DTD.
Each schema language has strengths and weaknesses. Used alone, each leaves gaps in data checking. Used together, they can express most, if not all, data constraints. The purpose of this document is to characterize each schema language, its functionality, and its role in data checking, and to recommend as best practice the use of multiple schema languages.
A Flight XML document will be used as a reference example. It has data constraints typical of many documents. It contains data about an in-flight Aircraft and the Vertical Obstructions on its flight path. The document is valid only if it meets the data checks listed in the table below.
Additionally, the final three rows of the table are desired properties rather than data checks:
Constraint | Schematron | Relax NG | XML Schemas | DTD |
---|---|---|---|---|
Check that the Aircraft Altitude is at least 500 feet higher than the top of all VerticalObstructions | Yes | No | No | No |
Check that the units used for measuring the Aircraft Altitude are the same as those used for measuring the VerticalObstructions | Yes | No | No | No |
Check that the VerticalObstructions are of a known variety: mountain, tower, or building | Yes | Yes | Yes | Yes |
Check that the Altitude is expressed in feet | Yes | Yes | Yes | Yes |
Check that the Altitude reference is expressed as MSL (Mean Sea Level) | Yes | Yes | Yes | Yes |
Check that the measures used in the VerticalObstructions are expressed in units of feet | Yes | Yes | Yes | Yes |
Check that the value of each Latitude element is a decimal, in the range -90 to +90, with three digits to the right of the decimal point | Tedious | Yes | Yes | No |
Check that the value of each Longitude element is a decimal, in the range -180 to +180, with three digits to the right of the decimal point | Tedious | Yes | Yes | No |
Check that the Flight element contains one child Aircraft element | Yes | Yes | Yes | Yes |
Check that the Aircraft element contains one child Altitude element and one child Location element | Yes | Yes | Yes | Yes |
Check that the Altitude element has a units attribute and a reference attribute | Yes | Yes | Yes | Yes |
Check that each VerticalObstruction contains an Elevation and a Location | Yes | Yes | Yes | Yes |
Check that each Location element contains a Latitude and Longitude | Yes | Yes | Yes | Yes |
Check that the value of the Aircraft's Altitude is an integer | Tedious | Yes | Yes | No |
Check that the value of each VerticalObstruction Elevation element is an integer | Tedious | Yes | Yes | No |
Check that the value of each VerticalObstruction Height element is an integer | Tedious | Yes | Yes | No |
The Aircraft element and the VerticalObstruction elements can be arranged in any order | Yes | Yes | No | No |
The document is "open". That is, other information can be added. As long as the document meets the above constraints, the XML document is acceptable, regardless of any additional items inserted into it | Yes | Tedious | Tedious | No |
There should be meaningful (operational) diagnostic messages for each constraint listed above | Yes | No | No | No |
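For concreteness, an instance document that satisfies the constraints above might look like the following. This is only a minimal sketch. The element names and the `units` and `reference` attributes on Altitude come from the constraints themselves; the `type` attribute on VerticalObstruction, the `units` attributes on Elevation and Height, the nesting shown, and all data values are assumptions made for illustration.

```xml
<Flight>
  <!-- Element order is deliberately mixed: the constraints allow any order. -->
  <VerticalObstruction type="mountain">   <!-- "type" is an assumed attribute -->
    <Elevation units="feet">6288</Elevation>
    <Location>
      <Latitude>44.270</Latitude>
      <Longitude>-71.303</Longitude>
    </Location>
  </VerticalObstruction>
  <Aircraft>
    <Altitude units="feet" reference="MSL">12000</Altitude>
    <Location>
      <Latitude>42.371</Latitude>
      <Longitude>-71.000</Longitude>
    </Location>
  </Aircraft>
  <VerticalObstruction type="tower">
    <Elevation units="feet">150</Elevation>
    <Height units="feet">1200</Height>
    <Location>
      <Latitude>42.303</Latitude>
      <Longitude>-71.218</Longitude>
    </Location>
  </VerticalObstruction>
</Flight>
```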
The following sections characterize the schema languages.
Relax NG, XML Schema, and DTD are grammar-based schema languages.
Relax NG and XML Schema use XML syntax. DTD uses a non-XML syntax.
Schematron is an assertion-based schema language.
Schematron uses XPath to express assertions.
The bottom line for evaluating any schema language is:
The table above shows how the schema languages performed in expressing the data checks in the Flight case study.
Providing meaningful, user-defined diagnostic messages is a valuable attribute of a schema language. Schematron is the only schema language with support for connecting a data check to a diagnostic.
With Schematron, the schema developer writes a diagnostic message in tandem with a constraint check. For example, the constraint "Check that the Aircraft Altitude is at least 500 feet higher than the top of all VerticalObstructions" is expressed in Schematron as follows:
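The original listing is not reproduced here, so the following is a hedged sketch of what such a rule could look like in ISO Schematron. It assumes that Aircraft and VerticalObstruction are children of Flight, that the top of an obstruction is its Elevation plus its Height (when a Height is present), and that all values are in the same units. The diagnostic id `altitude-clearance` is hypothetical.

```xml
<!-- Fragment: this pattern would sit inside an sch:schema element. -->
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="Flight/Aircraft">
    <!-- Capture the aircraft altitude so it can be used inside the predicate. -->
    <sch:let name="altitude" value="number(Altitude)"/>
    <!-- Assumption: obstruction top = Elevation + Height; sum() tolerates a missing Height. -->
    <sch:assert
        test="not(../VerticalObstruction[sum(Elevation | Height) + 500 &gt; $altitude])"
        diagnostics="altitude-clearance">
      The Aircraft Altitude must be at least 500 feet higher than the top of every VerticalObstruction.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

The assertion fails as soon as one obstruction comes within 500 feet of the aircraft altitude, which is how a single element is compared against many others.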
The "diagnostics" attribute is a reference to a diagnostic that the schema developer writes:
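As a sketch of what such a diagnostic might contain, the fragment below declares one message. The id `altitude-clearance` is hypothetical and must match the value of the `diagnostics` attribute on the assertion; the wording of the message and the assumption that the rule context is the Aircraft element are illustrative only.

```xml
<!-- Fragment: sch:diagnostics would sit inside an sch:schema element, after the patterns. -->
<sch:diagnostics xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:diagnostic id="altitude-clearance">
    The aircraft is at <sch:value-of select="Altitude"/> feet, which is less than
    500 feet above at least one vertical obstruction on the flight path.
  </sch:diagnostic>
</sch:diagnostics>
```

Because the message is written by the schema developer, it can use the vocabulary of the people operating the system rather than the vocabulary of the parser.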
The diagnostics provide operational (user) error messages.
Relax NG, XML Schemas, and DTDs have no mechanism for correlating constraints with diagnostic messages. Consequently, the diagnostics are generic, parser-generated error messages that lack operational (user) terminology.
Schematron supports co-constraints.
The first data constraint listed above in the Flight case study is a co-constraint, "Check that the Aircraft Altitude is at least 500 feet higher than the top of all VerticalObstructions." There the constraint is between one element and multiple other elements.
Relax NG, XML Schemas, and DTD do not support co-constraints.
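The second constraint in the case study, that the Aircraft Altitude and the VerticalObstructions use the same units, is also a co-constraint. A hedged Schematron sketch follows. It assumes the units are carried in `units` attributes on Altitude and on the measures inside each VerticalObstruction, and that Aircraft and VerticalObstruction are siblings under Flight; neither detail is confirmed by the text above.

```xml
<!-- Fragment: this pattern would sit inside an sch:schema element. -->
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="Flight/Aircraft/Altitude">
    <!-- In XPath 1.0, "!=" between node-sets is true if ANY pair of values differs,
         so negating it requires every obstruction unit to equal the Altitude unit. -->
    <sch:assert test="not(../../VerticalObstruction//@units != @units)">
      The units of the Aircraft Altitude must match the units used in every VerticalObstruction.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```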
Schematron supports open content.
Supporting open content is common in many markup languages. For example, HTML, XHTML, and XSLT support open content. XSLT allows new elements within each XSLT template. HTML and XHTML allow new elements in their body. Open content enables flexible document content. Without open content many XML technologies would not be feasible, e.g., XSLT would not be possible.
With Schematron, if you do nothing then you enable open content. With Relax NG and XML Schemas you must explicitly insert elements where openness is desired. If you desire openness everywhere then you will have to insert many elements. DTDs similarly require inserting openness indicators.
A "closed" XML instance document is one that is only allowed to have specified elements and no others. In many security environments closed content is important because the goal is to control exactly what is allowed in XML instance documents.
The "default" operating mode of Relax NG, XML Schemas, and DTDs is closed content.
The level of openness that Schematron provides by default is not practical with Relax NG, XML Schemas, or DTDs. Too many interleave/any/anyAttribute elements would need to be inserted.
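To illustrate the effort involved, here is a minimal XML Schema sketch that opens up just one element, Location. The choice of `##other` and `lax` processing is an assumption made to keep the example simple; a real schema would need a similar wildcard in every content model where openness is wanted.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Location">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Latitude" type="xs:decimal"/>
        <xs:element name="Longitude" type="xs:decimal"/>
        <!-- Openness must be requested explicitly: any number of extra
             elements from other namespaces may follow Longitude. -->
        <xs:any namespace="##other" processContents="lax"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <!-- Likewise for attributes. -->
      <xs:anyAttribute namespace="##other" processContents="lax"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Even here the openness is partial: the extra elements are only accepted after Longitude and only from other namespaces, whereas Schematron accepts additional content anywhere unless an assertion forbids it.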
It is possible to design a Schematron document to be closed. For example, if it is desired that the Flight element consist of one Aircraft element, an unbounded number of VerticalObstruction elements, and nothing else, this can be expressed (using XPath) as follows:
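The original listing is not available here; the following is a sketch of one way to write such assertions in ISO Schematron. Treating attributes and non-whitespace text as part of "nothing else" is an assumption.

```xml
<!-- Fragment: this pattern would sit inside an sch:schema element. -->
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="Flight">
    <sch:assert test="count(Aircraft) = 1">
      Flight must contain exactly one Aircraft element.
    </sch:assert>
    <!-- Reject any child element that is neither Aircraft nor VerticalObstruction. -->
    <sch:assert test="not(*[not(self::Aircraft or self::VerticalObstruction)])">
      Flight may contain only Aircraft and VerticalObstruction elements.
    </sch:assert>
    <!-- Assumption: "nothing else" also rules out attributes and stray text. -->
    <sch:assert test="not(@*) and not(text()[normalize-space()])">
      Flight may not have attributes or text content.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

Every element that should be closed needs its own set of assertions like these.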
This is tedious at best, error-prone at worst.
Often the order of elements in an XML document is irrelevant. For the Flight example, it makes no difference whether an Aircraft element is first, or a VerticalObstruction element is first.
With Schematron the "default" operating mode is to allow elements to occur in any order. With XML Schemas there is very restricted support for unordered content using the "all" element. Relax NG provides support for unordered content via the interleave element. Unordered content is not practical in DTDs beyond a handful of elements.
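As a sketch of the Relax NG approach, the grammar below lets one Aircraft and any number of VerticalObstruction elements appear in any order inside Flight. The content models are simplified assumptions (for example, Height and the attributes of VerticalObstruction are omitted), and the definition names `aircraft`, `obstruction`, and `location` are hypothetical.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="Flight">
      <!-- interleave: the two patterns may be mixed in any order -->
      <interleave>
        <ref name="aircraft"/>
        <zeroOrMore>
          <ref name="obstruction"/>
        </zeroOrMore>
      </interleave>
    </element>
  </start>
  <define name="aircraft">
    <element name="Aircraft">
      <element name="Altitude">
        <attribute name="units"><value>feet</value></attribute>
        <attribute name="reference"><value>MSL</value></attribute>
        <data type="integer"/>
      </element>
      <ref name="location"/>
    </element>
  </define>
  <define name="obstruction">
    <element name="VerticalObstruction">
      <element name="Elevation"><data type="integer"/></element>
      <ref name="location"/>
    </element>
  </define>
  <define name="location">
    <element name="Location">
      <element name="Latitude"><data type="decimal"/></element>
      <element name="Longitude"><data type="decimal"/></element>
    </element>
  </define>
</grammar>
```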
Sometimes it is necessary to require that the elements occur in a specific sequence. Specifying sequences is straightforward with Relax NG, XML Schemas, and DTDs. It is not straightforward using Schematron (although not impossible).
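To show the difference in effort, a DTD states that Latitude precedes Longitude with the single content model `<!ELEMENT Location (Latitude, Longitude)>`. In Schematron the order has to be asserted with XPath. The sketch below is one possible formulation, not the only one, and it leaves Location open to other content.

```xml
<!-- Fragment: this pattern would sit inside an sch:schema element. -->
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="Location">
    <!-- Fails if any Latitude appears after a Longitude. -->
    <sch:assert test="not(Longitude/following-sibling::Latitude)">
      Within Location, Latitude must come before Longitude.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

One such assertion is needed for each pair of elements whose order matters, which is why sequences quickly become awkward in Schematron.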
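Relax NG and XML Schemas have a rich set of datatypes, comparable to what you find in a high-level programming language or in databases. Relax NG obtains them by referencing an external datatype library, typically the XML Schema datatypes.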
With Schematron datatype constraints are expressed using XPath 2.0 regular expressions. For example, to express the constraint, "Check that the value of the Aircraft's Altitude is an integer", this regular expression is used:
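The original expression is not reproduced here. A plausible sketch, assuming the XSLT 2.0 query binding and that an optional sign is acceptable, is:

```xml
<!-- queryBinding="xslt2" is needed so that the XPath 2.0 matches() function is available. -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <sch:rule context="Flight/Aircraft/Altitude">
      <!-- Optional sign followed by one or more digits, ignoring surrounding whitespace. -->
      <sch:assert test="matches(normalize-space(.), '^[+-]?[0-9]+$')">
        The Aircraft Altitude must be an integer.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```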
It is much easier to express this constraint using either Relax NG or XML Schemas -- simply reference the built-in integer datatype.
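For comparison, here is a minimal XML Schema sketch of the same check, together with the Latitude constraint from the case study. Using `fixed` values for the two attributes and the type name `LatitudeType` are assumptions made for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Altitude: integer content plus the two required attributes. -->
  <xs:element name="Altitude">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:integer">
          <xs:attribute name="units" type="xs:string" use="required" fixed="feet"/>
          <xs:attribute name="reference" type="xs:string" use="required" fixed="MSL"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <!-- Latitude: decimal in [-90, +90] with exactly three fraction digits. -->
  <xs:simpleType name="LatitudeType">
    <xs:restriction base="xs:decimal">
      <xs:minInclusive value="-90"/>
      <xs:maxInclusive value="90"/>
      <xs:pattern value="[\-+]?[0-9]+\.[0-9]{3}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

The range is checked by the built-in facets; only the "exactly three digits" requirement needs a pattern, because `xs:fractionDigits` sets a maximum rather than an exact count.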
DTDs have a limited set of datatypes. Nearly all data is treated as a string.
An XML Schema validator, as it validates an XML instance document, simultaneously creates metadata for each component, e.g., "The content of element xyz is annotated as a boolean datatype." The instance document augmented with this metadata is called the Post-Schema-Validation Infoset (PSVI). Some XML technologies take advantage of the PSVI. For example, XSLT 2.0 uses the PSVI to perform schema checking within a stylesheet. Validators for Relax NG, Schematron, and DTDs do not generate a PSVI.
Use multiple schema languages. Use an assertion-based schema language plus a grammar-based schema language. That is, use Schematron plus one of {Relax NG, XML Schema, DTD}. If enabling unordered content is important to your situation, use Relax NG. If the PSVI is important, use XML Schema. If simply defining an XML vocabulary is all you require, use DTDs.