Best Practice: Use Multiple Schema Languages

Roger L. Costello, November 1, 2006

Introduction

There are four schema languages for expressing data constraints in XML documents:

  1. Schematron
  2. Relax NG
  3. XML Schema
  4. DTD

Each schema language has strengths and weaknesses. Used alone, each leaves gaps in data checking. Used together, they can express most, if not all, data constraints.

The purpose of this document is to characterize each schema language, its functionality, and its role in data checking, and to recommend, as a best practice, the use of multiple schema languages.

Case Study

The flight XML document below will be used as a reference example. It has data constraints typical of many documents. It contains data about an in-flight Aircraft and Vertical Obstructions on the flight path. The document is valid only if these data checks are met:

  1. Check that the Aircraft Altitude is at least 500 feet higher than the top of all VerticalObstructions
  2. Check that the units used for measuring the Aircraft Altitude are the same as those used for measuring the VerticalObstructions
  3. Check that the VerticalObstructions are of a known variety: mountain, tower, or building
  4. Check that the Altitude is expressed in feet
  5. Check that the Altitude reference is expressed as MSL (Mean Sea Level)
  6. Check that the measures used in the VerticalObstructions are expressed in units of feet
  7. Check that the value of each Latitude element is a decimal, in the range -90 to +90, with three digits to the right of the decimal point
  8. Check that the value of each Longitude element is a decimal, in the range -180 to +180, with three digits to the right of the decimal point
  9. Check that the Flight element contains one child Aircraft element
  10. Check that the Aircraft element contains one child Altitude element and one child Location element
  11. Check that the Altitude element has a units attribute and a reference attribute
  12. Check that each VerticalObstruction contains an Elevation and a Location
  13. Check that each Location element contains a Latitude and Longitude
  14. Check that the value of the Aircraft's Altitude is an integer
  15. Check that the value of each VerticalObstruction Elevation element is an integer
  16. Check that the value of each VerticalObstruction Height element is an integer

Additionally, these are desired properties:

  1. The Aircraft element and the VerticalObstruction elements can be arranged in any order
  2. The document is "open". That is, other information can be added. As long as the document meets the above constraints then the XML document is acceptable, regardless of any new items inserted into the document
  3. There should be meaningful (operational) diagnostic messages for each constraint listed above

Schema Language Support for Implementing the Data Checks

A Comparison of Schematron, Relax NG, XML Schema, and DTDs with Respect to their Ability to Support the Flight Data Checks

Legend: Yes = the schema language supports this constraint; Tedious = possible, but laborious to express; No = not supported. Rows are numbered as in the Case Study lists above.

Constraint                                                        Schematron  Relax NG  XML Schema  DTD
Data checks
 1. Aircraft Altitude at least 500 feet above all obstructions    Yes         No        No          No
 2. Altitude units same as VerticalObstruction units              Yes         No        No          No
 3. VerticalObstruction type is mountain, tower, or building      Yes         Yes       Yes         Yes
 4. Altitude expressed in feet                                    Yes         Yes       Yes         Yes
 5. Altitude reference expressed as MSL (Mean Sea Level)          Yes         Yes       Yes         Yes
 6. VerticalObstruction measures expressed in feet                Yes         Yes       Yes         Yes
 7. Latitude is a decimal, -90 to +90, three decimal places       Tedious     Yes       Yes         No
 8. Longitude is a decimal, -180 to +180, three decimal places    Tedious     Yes       Yes         No
 9. Flight contains one child Aircraft element                    Yes         Yes       Yes         Yes
10. Aircraft contains one Altitude and one Location               Yes         Yes       Yes         Yes
11. Altitude has units and reference attributes                   Yes         Yes       Yes         Yes
12. Each VerticalObstruction contains Elevation and Location      Yes         Yes       Yes         Yes
13. Each Location contains Latitude and Longitude                 Yes         Yes       Yes         Yes
14. Aircraft Altitude is an integer                               Tedious     Yes       Yes         No
15. Each VerticalObstruction Elevation is an integer              Tedious     Yes       Yes         No
16. Each VerticalObstruction Height is an integer                 Tedious     Yes       Yes         No
Desired properties
 1. Aircraft and VerticalObstruction elements in any order        Yes         Yes       No          No
 2. The document is "open" (other information can be added)       Yes         Tedious   Tedious     No
 3. Meaningful (operational) diagnostic messages                  Yes         No        No          No

Characterizing the Schema Languages

The following sections characterize the schema languages.

Grammar versus Assertion Based Schema Languages

Relax NG, XML Schema, and DTD are grammar-based schema languages.

Grammar-based schema language
A grammar-based schema language specifies what elements may be used in an XML instance document, the order of the elements, the number of occurrences of each element, and the content/datatype of each element. That is, it specifies what components are allowed and the rules for using the components.

Relax NG and XML Schema use XML syntax. DTD uses a non-XML syntax.
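To illustrate the grammar-based style, the Location element from the case study might be declared in XML Schema roughly as follows. This is only a sketch: the type name LocationType is a hypothetical choice, Location is declared globally for brevity, and the range and precision checks on Latitude and Longitude are left out.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.aviation.org"
           targetNamespace="http://www.aviation.org"
           elementFormDefault="qualified">

  <!-- LocationType is a hypothetical name chosen for this sketch -->
  <xs:complexType name="LocationType">
    <!-- xs:sequence fixes both the allowed children and their order;
         each child occurs exactly once by default -->
    <xs:sequence>
      <xs:element name="Latitude" type="xs:decimal"/>
      <xs:element name="Longitude" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="Location" type="LocationType"/>
</xs:schema>
```

The grammar names the components (Latitude, Longitude), their order, their number of occurrences, and their datatype, which is exactly the kind of rule described in the definition above.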

Schematron is an assertion-based schema language.

Assertion-based schema language
An assertion-based schema language specifies the relationships between the elements and attributes in an XML instance document. It is also called a rule-based schema language.

Schematron uses XPath to express assertions.

Expressing Constraints

The bottom line for evaluating any schema language is how well it can express the data constraints that a document must satisfy.

The comparison table above shows how the schema languages performed in expressing the data checks in the Flight case study.

No schema language is able to express all of the Flight data checks. Multiple schema languages are needed.

Meaningful Diagnostics

Providing meaningful, user-defined diagnostic messages is a valuable attribute of a schema language. Schematron is the only schema language with support for connecting a data check to a diagnostic.

With Schematron, the schema developer writes a diagnostic message in tandem with a constraint check. For example, the constraint "Check that the Aircraft Altitude is at least 500 feet higher than the top of all VerticalObstructions" is expressed in Schematron as follows:

  <sch:pattern name="Altitude Check">
    <sch:rule context="Flight">
      <sch:assert test="xPath expression" diagnostics="Aircraft-Too-Low">
        The Aircraft Altitude must be at least 500 feet higher than the top of each VerticalObstruction.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

The "diagnostics" attribute is a reference to a diagnostic that the schema developer writes:

  <sch:diagnostic id="Aircraft-Too-Low">
    Warning! The Aircraft Altitude is too low!
    Current Aircraft Altitude = <sch:value-of select="Aircraft/Altitude"/>
    Maintain an altitude of at least 500 feet above each VerticalObstruction
  </sch:diagnostic>

The diagnostics provide operational (user) error messages.
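The XPath expression for the altitude check is not spelled out in this document. One possible way to write it is sketched below. The sketch makes several assumptions: it uses the ISO Schematron namespace, it binds a hypothetical prefix f to the namespace used in Flight.xml, and it takes the top of an obstruction to be its Elevation plus its Height when a Height is present, as the comments in Flight.xml suggest.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- "f" is a hypothetical prefix for the namespace used in Flight.xml -->
  <sch:ns prefix="f" uri="http://www.aviation.org"/>

  <sch:pattern>
    <sch:rule context="f:Flight">
      <!-- Passes when no obstruction top (Elevation plus optional Height)
           comes within 500 feet of the Aircraft Altitude -->
      <sch:assert test="not(f:VerticalObstruction[sum(f:Elevation | f:Height) + 500 &gt; ../f:Aircraft/f:Altitude])"
                  diagnostics="Aircraft-Too-Low">
        The Aircraft Altitude must be at least 500 feet higher than the top of each VerticalObstruction.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

  <sch:diagnostics>
    <sch:diagnostic id="Aircraft-Too-Low">
      Warning! The Aircraft Altitude is too low!
      Current Aircraft Altitude = <sch:value-of select="f:Aircraft/f:Altitude"/>
    </sch:diagnostic>
  </sch:diagnostics>
</sch:schema>
```

With the values in Flight.xml, the tallest obstruction tops out at 2600 feet, so an Altitude of 3300 feet satisfies the assertion and the diagnostic is not reported.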

Relax NG, XML Schemas, and DTDs have no mechanism for correlating constraints with diagnostic messages. Consequently, their diagnostics are generic, parser-generated error messages that lack operational (user) terminology.

Co-Constraints

Schematron supports co-constraints.

Co-constraint
A co-constraint is a constraint between two or more values. A co-constraint may exist between elements, between elements and attributes, or between attributes. A co-constraint may exist within a single XML document, or across multiple XML documents.

The first data constraint listed above in the Flight case study is a co-constraint, "Check that the Aircraft Altitude is at least 500 feet higher than the top of all VerticalObstructions." There the constraint is between one element and multiple other elements.

Relax NG, XML Schemas, and DTDs do not support co-constraints.
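The second data check in the case study (matching units) is also a co-constraint, this time between attributes. A minimal Schematron sketch of it might look like the following; the prefix f is a hypothetical binding for the Flight.xml namespace, and the message wording is illustrative.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- "f" is a hypothetical prefix for the namespace used in Flight.xml -->
  <sch:ns prefix="f" uri="http://www.aviation.org"/>

  <sch:pattern>
    <!-- Fires once for every Elevation and Height inside a VerticalObstruction -->
    <sch:rule context="f:VerticalObstruction/f:Elevation | f:VerticalObstruction/f:Height">
      <!-- Compares an attribute here with an attribute elsewhere in the document -->
      <sch:assert test="@units = /f:Flight/f:Aircraft/f:Altitude/@units">
        The units of <sch:name/> must match the units of the Aircraft Altitude.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```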

Open versus Closed Content

Schematron supports open content.

Open Content
Open content refers to the ability of an XML instance document to contain new elements, above and beyond that described by its schema. There is a spectrum of "openness", from allowing new elements anywhere in the document, to restricting new elements to just certain parts of the document.

Supporting open content is common in many markup languages. For example, HTML, XHTML, and XSLT support open content. XSLT allows new elements within each XSLT template. HTML and XHTML allow new elements in their body. Open content enables flexible document content. Without open content many XML technologies would not be feasible, e.g., XSLT would not be possible.

With Schematron, if you do nothing then you enable open content. With Relax NG and XML Schemas you must explicitly insert wildcard elements (for example, any and anyAttribute in XML Schemas) wherever openness is desired. If you desire openness everywhere then you will have to insert many such elements. DTDs similarly require inserting openness indicators.

A "closed" XML instance document is one that is only allowed to have specified elements and no others. In many security environments closed content is important: the desire is to control exactly what is allowed in XML instance documents.

The "default" operating mode of Relax NG, XML Schemas, and DTDs is closed content.

The level of openness that Schematron provides by default is not practical with Relax NG, XML Schemas, or DTDs. Too many interleave/any/anyAttribute elements would need to be inserted.
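To give a sense of the effort involved, here is a sketch of how a single element, Location, might be opened up in XML Schema. The type name OpenLocationType is hypothetical, and the choice of namespace="##other" with processContents="lax" is just one reasonable setting among several.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.aviation.org"
           targetNamespace="http://www.aviation.org"
           elementFormDefault="qualified">

  <!-- OpenLocationType is a hypothetical name chosen for this sketch -->
  <xs:complexType name="OpenLocationType">
    <xs:sequence>
      <xs:element name="Latitude" type="xs:decimal"/>
      <xs:element name="Longitude" type="xs:decimal"/>
      <!-- Wildcard: any number of extra elements from other namespaces,
           but only at this position (after Longitude) -->
      <xs:any namespace="##other" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <!-- Wildcard for extra attributes from other namespaces -->
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:element name="Location" type="OpenLocationType"/>
</xs:schema>
```

Even this opens only one position in one element. Repeating the wildcards in every content model is what makes schema-wide openness impractical.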

It is possible to design a Schematron document to be closed. For example, if it is desired that the Flight element consist of one Aircraft element, an unbounded number of VerticalObstruction elements, and nothing else, this can be expressed (using XPath 2.0) as follows:

  (count(Aircraft) eq 1) and (empty(*[not(self::Aircraft) and not(self::VerticalObstruction)]))

This is tedious at best, error-prone at worst.
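For context, the expression above could be placed in a Schematron rule along the following lines. This is a sketch under a few assumptions: the XPath 2.0 query binding (queryBinding="xslt2") is available in the Schematron implementation, and the hypothetical prefix f is bound to the Flight.xml namespace.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- "f" is a hypothetical prefix for the namespace used in Flight.xml -->
  <sch:ns prefix="f" uri="http://www.aviation.org"/>

  <sch:pattern>
    <sch:rule context="f:Flight">
      <!-- Exactly one Aircraft, and no child element other than
           Aircraft or VerticalObstruction -->
      <sch:assert test="(count(f:Aircraft) eq 1) and
                        empty(*[not(self::f:Aircraft) and not(self::f:VerticalObstruction)])">
        A Flight must contain exactly one Aircraft, any number of VerticalObstructions, and nothing else.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

A similar rule would be needed for every element whose content is to be closed, which is where the tedium comes from.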

Order of Data

Often the order of elements in an XML document is irrelevant. For the Flight example, it makes no difference whether an Aircraft element is first, or a VerticalObstruction element is first.

With Schematron the "default" operating mode is to allow elements to occur in any order. With XML Schemas there is very restricted support for unordered content using the "all" element. Relax NG provides support for unordered content via the interleave element. Unordered content is not practical in DTDs beyond a handful of elements, because every permitted ordering has to be listed explicitly.

Sometimes it is necessary to require that the elements occur in a specific sequence. Specifying sequences is straightforward with Relax NG, XML Schemas, and DTDs. It is not straightforward using Schematron (although not impossible).
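As an illustration of interleave, a Relax NG grammar for the Flight document might be sketched as follows. The grammar is an assumption-laden reading of the case study: the define names are hypothetical, Height is treated as optional because the mountain in Flight.xml has none, and the range checks on Latitude and Longitude are omitted.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.aviation.org"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="Flight">
      <!-- interleave lets Aircraft and the VerticalObstructions
           appear in any relative order -->
      <interleave>
        <ref name="Aircraft"/>
        <zeroOrMore>
          <ref name="VerticalObstruction"/>
        </zeroOrMore>
      </interleave>
    </element>
  </start>

  <define name="Aircraft">
    <element name="Aircraft">
      <attribute name="type"/>
      <element name="Altitude">
        <attribute name="units"><value>feet</value></attribute>
        <attribute name="reference"><value>MSL</value></attribute>
        <data type="integer"/>
      </element>
      <ref name="Location"/>
    </element>
  </define>

  <define name="VerticalObstruction">
    <element name="VerticalObstruction">
      <attribute name="type">
        <choice>
          <value>mountain</value>
          <value>tower</value>
          <value>building</value>
        </choice>
      </attribute>
      <element name="Elevation">
        <attribute name="units"><value>feet</value></attribute>
        <data type="integer"/>
      </element>
      <!-- Assumption: Height is optional -->
      <optional>
        <element name="Height">
          <attribute name="units"><value>feet</value></attribute>
          <data type="integer"/>
        </element>
      </optional>
      <ref name="Location"/>
    </element>
  </define>

  <define name="Location">
    <element name="Location">
      <element name="Latitude"><data type="decimal"/></element>
      <element name="Longitude"><data type="decimal"/></element>
    </element>
  </define>
</grammar>
```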

Datatype Constraints

Relax NG and XML Schemas have a rich set of datatypes, compatible with what you find in a high level programming language or in databases.

With Schematron, datatype constraints are expressed using XPath 2.0 regular expressions. For example, to express the constraint "Check that the value of the Aircraft's Altitude is an integer", this regular expression is used:

  [\+\-]?\d+
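A sketch of how this regular expression might be used inside a Schematron assertion is shown below. It assumes the XPath 2.0 query binding, which provides the matches() function, and a hypothetical prefix f bound to the Flight.xml namespace. The ^ and $ anchors are an addition: without them, matches() succeeds as soon as any part of the value matches.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- "f" is a hypothetical prefix for the namespace used in Flight.xml -->
  <sch:ns prefix="f" uri="http://www.aviation.org"/>

  <sch:pattern>
    <sch:rule context="f:Aircraft/f:Altitude">
      <!-- normalize-space() drops surrounding whitespace;
           ^ and $ force the whole value to match the pattern -->
      <sch:assert test="matches(normalize-space(.), '^[\+\-]?\d+$')">
        The Aircraft Altitude must be an integer.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```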

It is much easier to express this constraint using either Relax NG or XML Schemas -- simply reference the built-in integer datatype.
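For comparison, an XML Schema sketch of the Altitude and Latitude datatype checks might look like this. The type name LatitudeType is hypothetical. A pattern facet is used for the "three digits to the right of the decimal point" rule because the fractionDigits facet only sets an upper bound on the number of fraction digits.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.aviation.org"
           elementFormDefault="qualified">

  <!-- Altitude: integer content plus two required attributes
       whose values are fixed to "feet" and "MSL" -->
  <xs:element name="Altitude">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:integer">
          <xs:attribute name="units" type="xs:string" use="required" fixed="feet"/>
          <xs:attribute name="reference" type="xs:string" use="required" fixed="MSL"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

  <!-- LatitudeType is a hypothetical name: a decimal from -90 to +90
       written with exactly three digits after the decimal point -->
  <xs:simpleType name="LatitudeType">
    <xs:restriction base="xs:decimal">
      <xs:minInclusive value="-90"/>
      <xs:maxInclusive value="90"/>
      <xs:pattern value="[+\-]?\d{1,2}\.\d{3}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```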

DTDs have a limited set of datatypes. Almost all data is treated as a string.
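The limitation can be seen in a small sketch that declares Altitude in a DTD, written here as an internal subset so that the example is a single document. The namespace declaration from Flight.xml is omitted to keep the sketch short.

```xml
<?xml version="1.0"?>
<!DOCTYPE Altitude [
  <!-- Element content can only be declared as character data -->
  <!ELEMENT Altitude (#PCDATA)>
  <!-- Attributes can be restricted to an enumerated list of values -->
  <!ATTLIST Altitude
            units     (feet) #REQUIRED
            reference (MSL)  #REQUIRED>
]>
<Altitude units="feet" reference="MSL">3300</Altitude>
```

The attribute values are checked, but the content is only known to be text, so a non-numeric Altitude would also be accepted by a DTD validator.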

PSVI - Post Schema Validation Infoset

An XML Schema validator, as it validates an XML instance document, simultaneously creates metadata for each component, e.g., "The content of element xyz is annotated as a boolean datatype." There are XML technologies which take advantage of this metadata information. For example, XSLT 2.0 uses the PSVI metadata to perform schema checking within a stylesheet. Validators of Relax NG, Schematron, and DTDs do not generate PSVI metadata.

Recommendations

Use multiple schema languages. Use an assertion-based schema language plus a grammar-based schema language. That is, use Schematron plus one of {Relax NG, XML Schema, DTD}. If enabling unordered content is important to your situation, use Relax NG. If the PSVI is important, use XML Schema. If simply defining an XML vocabulary is all you require, use DTDs.

Flight.xml

  <?xml version="1.0"?>
  <Flight xmlns="http://www.aviation.org">
    <Aircraft type="Boeing 747">
      <Altitude units="feet" reference="MSL">3300</Altitude>
      <Location>
        <Latitude>42.371</Latitude>
        <Longitude>-71.000</Longitude>
      </Location>
    </Aircraft>
    <VerticalObstruction type="tower">
      <!-- The top of the tower is 1500 feet -->
      <Elevation units="feet">1000</Elevation>
      <Height units="feet">500</Height>
      <Location>
        <Latitude>42.371</Latitude>
        <Longitude>-71.025</Longitude>
      </Location>
    </VerticalObstruction>
    <VerticalObstruction type="mountain">
      <!-- The top of the mountain is 2600 feet -->
      <Elevation units="feet">2600</Elevation>
      <Location>
        <Latitude>42.371</Latitude>
        <Longitude>-71.155</Longitude>
      </Location>
    </VerticalObstruction>
    <VerticalObstruction type="building">
      <!-- The top of the building is 700 feet -->
      <Elevation units="feet">500</Elevation>
      <Height units="feet">200</Height>
      <Location>
        <Latitude>42.371</Latitude>
        <Longitude>-71.299</Longitude>
      </Location>
    </VerticalObstruction>
  </Flight>