Selecting Language(s) to Implement a System's Data Validation Requirements

Issue

You are tasked with implementing a system's XML data validation requirements. For some data and deployment requirements, there is only one XML validation language that has the needed capability, so the selection of language is clear. For other requirements, however, there is a choice; the requirement could be implemented by several XML validation languages. How do you decide which language to use? What factors should go into making the decision? Should multiple languages be used, or is it better to implement all the data requirements in one language?

Example

Suppose this XML instance document is representative of the type of data that the system exchanges:

        <?xml version="1.0"?>
        <Document classification="secret">
              <Para classification="unclassified">
                   One if by land; two if by sea.
              </Para>
        </Document>

And suppose the system's data requirements are:

The <Para> classification value cannot be more sensitive than the <Document> classification value (top-secret is more sensitive than secret, which is more sensitive than confidential, which is more sensitive than unclassified).
The <Document> element must have a classification attribute, whose value is either top-secret, secret, confidential, or unclassified.
The <Para> element must have a classification attribute, whose value is either top-secret, secret, confidential, or unclassified.
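To make the first requirement concrete, here is a hypothetical instance (invented for illustration, not taken from real system data) that violates it. The <Para> is labeled secret, which is more sensitive than the <Document>'s confidential label. Both attribute values are legal on their own, so the second and third requirements are still satisfied.

```xml
<?xml version="1.0"?>
<!-- Hypothetical instance for illustration only -->
<Document classification="confidential">
      <!-- Violates requirement 1: secret is more sensitive than confidential -->
      <Para classification="secret">
           One if by land; two if by sea.
      </Para>
</Document>
```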

The first requirement is a co-constraint and cannot currently be expressed using a grammar-based language. It must be implemented using Schematron. At the bottom of this document is a Schematron implementation of this Security Classification co-constraint.

For the next two requirements, however, there are alternative XML validation languages that could be used. Here's how the requirements could be implemented using W3C XML Schema:

        <attribute name="classification" use="required">
            <simpleType>
                <restriction base="string">
                    <enumeration value="top-secret" />
                    <enumeration value="secret" />
                    <enumeration value="confidential" />
                    <enumeration value="unclassified" />
                </restriction>
            </simpleType>
        </attribute>
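For context, the attribute declaration could sit inside a complete schema along the following lines. This is only a sketch under stated assumptions: the xs prefix, the type name ClassificationType, and the content model (a Document holding one or more Para elements with text content) are inferred from the sample instance and are not given by the requirements.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

   <!-- Assumed type name; shared by Document and Para -->
   <xs:simpleType name="ClassificationType">
      <xs:restriction base="xs:string">
         <xs:enumeration value="top-secret"/>
         <xs:enumeration value="secret"/>
         <xs:enumeration value="confidential"/>
         <xs:enumeration value="unclassified"/>
      </xs:restriction>
   </xs:simpleType>

   <!-- Assumed content model: one or more Para children -->
   <xs:element name="Document">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="Para" maxOccurs="unbounded">
               <xs:complexType>
                  <xs:simpleContent>
                     <xs:extension base="xs:string">
                        <xs:attribute name="classification"
                                      type="ClassificationType"
                                      use="required"/>
                     </xs:extension>
                  </xs:simpleContent>
               </xs:complexType>
            </xs:element>
         </xs:sequence>
         <xs:attribute name="classification"
                       type="ClassificationType"
                       use="required"/>
      </xs:complexType>
   </xs:element>

</xs:schema>
```

Declaring the enumeration once as a named type means a change to the list of classification values is made in a single place.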

Here's how the requirements could be implemented using Relax NG:

        <attribute name="classification">
            <choice>
                <value>top-secret</value>
                <value>secret</value>
                <value>confidential</value>
                <value>unclassified</value>
            </choice>
        </attribute>
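As with the previous fragment, the attribute pattern needs a surrounding grammar before it can be used. A minimal sketch, assuming the same content model as the sample instance (a Document holding one or more Para elements with text content); the define name "classification" is chosen here for illustration:

```xml
<?xml version="1.0"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">

   <start>
      <element name="Document">
         <ref name="classification"/>
         <!-- Assumed content model: one or more Para children -->
         <oneOrMore>
            <element name="Para">
               <ref name="classification"/>
               <text/>
            </element>
         </oneOrMore>
      </element>
   </start>

   <!-- Attributes are required unless wrapped in <optional> -->
   <define name="classification">
      <attribute name="classification">
         <choice>
            <value>top-secret</value>
            <value>secret</value>
            <value>confidential</value>
            <value>unclassified</value>
         </choice>
      </attribute>
   </define>

</grammar>
```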

Here's how the requirements could be implemented using Schematron:

        <sch:pattern name="Classification Values"> 

           <sch:rule context="*[@classification]">

              <sch:assert test="@classification='top-secret' or
                                @classification='secret' or
                                @classification='confidential' or
                                @classification='unclassified'">
                  The value of a classification must be one of top-secret,
                  secret, confidential, or unclassified.
              </sch:assert>

           </sch:rule>

         </sch:pattern>
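Note that the rule above only visits elements that already carry a classification attribute, so by itself it does not report a missing attribute. One way to cover that part of the requirements is a second pattern such as the sketch below. It assumes the same sch prefix as above, and the pattern name is made up for illustration.

```xml
<sch:pattern name="Classification Presence">

   <!-- Visit every Document and Para, with or without the attribute -->
   <sch:rule context="Document | Para">

      <sch:assert test="@classification">
          A <sch:name/> element must have a classification attribute.
      </sch:assert>

   </sch:rule>

</sch:pattern>
```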

All the implementations seem equally plausible. So how does one decide which language to use? What factors should enter into the decision?

Factors to take into Consideration

Which language makes the constraint easiest to implement? Which makes it easiest to maintain and extend?
Traceability of an implementation to its requirement is important. Schematron provides a "see" attribute on each assertion that can be used to connect the Schematron implementation directly to the requirement it implements.
Let's say you decide to implement some constraints in a grammar-based language and some in Schematron. A downstream tool may want to reason about the constraints on, say, the Para element. The tool will need to examine both the Schematron rules and the grammars together (rather than examining just a single document and a single language). That's not necessarily a bad thing, but it is something to be aware of.
Schematron is implicitly focused on situations in which a human user (or maybe a text log) will be the recipient of a report on how the XML instance fared with respect to the schema. The grammar-based languages are implicitly aimed at scenarios in which the validation will be embedded in some processing context, perhaps a database system, which will get its validation reports through some API.
Consider whether the implementation is geared towards a technical user or a business user.
Performance of validator tools is a factor to consider.
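The traceability point can be sketched as follows. This example assumes the ISO Schematron namespace, where assert accepts a "see" attribute; the listings in this document use the older Schematron namespace, so check what your processor accepts. The requirement URL and the pattern id are hypothetical placeholders.

```xml
<?xml version="1.0"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">

   <sch:pattern id="classification-values">

      <sch:rule context="*[@classification]">

         <!-- "see" points at the requirement this assertion implements
              (hypothetical URL) -->
         <sch:assert test="@classification='top-secret' or
                           @classification='secret' or
                           @classification='confidential' or
                           @classification='unclassified'"
                     see="http://example.org/requirements#classification-values">
             The value of a classification must be one of top-secret,
             secret, confidential, or unclassified.
         </sch:assert>

      </sch:rule>

   </sch:pattern>

</sch:schema>
```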

Recommendation

For the example that we have been considering, Schematron must be used at least minimally to implement the Security Classification co-constraint. Suppose that after considering the various factors you decide to implement the second and third data requirements using a grammar-based language. That is, the implementation of the system's data requirements will be divided up across multiple languages.

The Schematron schema could be written to "assume" that all classification values are legal. However, to be safe, it is good practice to provide a "catch-all" rule to catch any errors (not just illegal classification values). Here's how to implement the Security Classification co-constraint with a catch-all rule (see the last rule in the pattern):

Security Classification Implementation with Catch-All Rule

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

   <sch:pattern name="Security Classification Policy">

      <sch:p>A Para's classification value cannot be more sensitive
             than the Document's classification value.</sch:p>

      <sch:rule context="Para[@classification='top-secret']">

         <sch:assert test="/Document/@classification='top-secret'">
             If there is a Para labeled "top-secret" then the Document
             must be labeled top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='secret']">

         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret')">
             If there is a Para labeled "secret" then the Document
             must be labeled either secret or top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='confidential']">

         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret') or
                           (/Document/@classification='confidential')">
             If there is a Para labeled "confidential" then the Document 
             must be labeled either confidential, secret or top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='unclassified']">

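         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret') or
                           (/Document/@classification='confidential') or
                           (/Document/@classification='unclassified')">
             If there is a Para labeled "unclassified" then the Document
             must be labeled either unclassified, confidential, secret or top-secret
         </sch:assert>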

      </sch:rule>

      <sch:p>Catch all rule: a valid Para element should fire on one
         of the above rules. If for whatever reason none of the above
         rules fire then drop into this "catch all" rule.
         This rule will be fired if a Para doesn't have a classification
         attribute or if it has an illegal classification value.</sch:p>

      <sch:rule context="Para">

         <sch:assert test="false()">
             If there is a Para without a classification or with a classification 
             label other than top-secret, secret, confidential, or unclassified 
             then the document is in error
         </sch:assert>

      </sch:rule>


   </sch:pattern>

</sch:schema>
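As an illustration, neither Para in the hypothetical instance below matches any of the four classification-specific rules, so both should fall through to the catch-all rule. This relies on Schematron's behavior that, within a pattern, a node is handled only by the first rule whose context it matches. The value "restricted" is invented for the example.

```xml
<?xml version="1.0"?>
<!-- Hypothetical instance for illustration only -->
<Document classification="secret">
      <!-- Illegal value: not one of the four permitted labels -->
      <Para classification="restricted">
           One if by land; two if by sea.
      </Para>
      <!-- Missing classification attribute -->
      <Para>
           One if by land; two if by sea.
      </Para>
</Document>
```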

Acknowledgements

The following people contributed to the creation of this document:

Len Bullard
Dave Carver
Roger Costello
Mark Delaney
Rick Jelliffe
Noah Mendelsohn
Bryan Rasmussen
Rob Simmons
Tobias Trapp

Schematron Implementation of the Security Classification Co-Constraint

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

   <sch:pattern name="Security Classification Policy">

      <sch:p>A Para's classification value cannot be more sensitive
             than the Document's classification value.</sch:p>

      <sch:rule context="Para[@classification='top-secret']">

         <sch:assert test="/Document/@classification='top-secret'">
             If there is a Para labeled "top-secret" then the Document
             must be labeled top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='secret']">

         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret')">
             If there is a Para labeled "secret" then the Document
             must be labeled either secret or top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='confidential']">

         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret') or
                           (/Document/@classification='confidential')">
             If there is a Para labeled "confidential" then the Document 
             must be labeled either confidential, secret or top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='unclassified']">

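         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret') or
                           (/Document/@classification='confidential') or
                           (/Document/@classification='unclassified')">
             If there is a Para labeled "unclassified" then the Document
             must be labeled either unclassified, confidential, secret or top-secret
         </sch:assert>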

      </sch:rule>

   </sch:pattern>

</sch:schema>