Embedded versus Separate Schemas

Issue

You have a set of data validation requirements for your system. You have decided to implement the requirements using a combination of a grammar-based schema language (e.g. Relax NG or XML Schema) plus Schematron. Which approach should you take:

  1. Should you keep the Schematron schema and grammar schema separate, or
  2. Should you embed the Schematron schema within the grammar schema, or
  3. Should you embed the grammar schema within the Schematron schema, or
  4. Should you divide up the data validation requirements by topic and have a schema document for each topic (so that each document may contain a grammar schema, a Schematron schema, or both)?

Example

Suppose this XML instance document is representative of the type of data that your system exchanges:

        <?xml version="1.0"?>
        <Document classification="secret">
              <Para classification="unclassified">
                   One if by land; two if by sea.
              </Para>
        </Document>
    

And suppose the system's data requirements are:

  1. The <Para> classification value cannot be more sensitive than the <Document> classification value (top-secret is more sensitive than secret, which is more sensitive than confidential, which is more sensitive than unclassified).
  2. The <Document> element must have a classification attribute, whose value is either top-secret, secret, confidential, or unclassified.
  3. The <Para> element must have a classification attribute, whose value is either top-secret, secret, confidential, or unclassified.

The first requirement will be implemented using Schematron. The next two requirements will be implemented using a grammar-based schema language, say, XML Schema.

There are four alternatives (the actual implementations are shown at the bottom of this document):

  1. Create two documents: one document for the Schematron implementation, and a second document for the XML Schema implementation.
  2. Create one document: the Schematron patterns, rules, and assertions are embedded within <appinfo> elements in the XML Schema.
  3. Create one document: the XML Schema declarations and definitions are embedded within the Schematron patterns, rules, and assertions.
  4. Create a document per topic: in the example the validation requirements may be categorized along two topics: (a) Security Classification Data Requirement, and (b) Classification Values Data Requirement. Thus, create two documents. Each document contains the XML Schema declarations/definitions and Schematron patterns/rules/assertions that are needed to implement a topic.

The following sections discuss the advantages and disadvantages of each alternative.

  1. Advantages/Disadvantages of Separate Schematron and Grammar Schemas

    Advantages

    A. The particular grammar language currently being used can be easily replaced. Thus, if XML Schema is currently being used, at a later date you can easily replace it with Relax NG without impacting the Schematron schema.

    B. Data validation can be done in stages, in a pipeline fashion. It might be desirable for your system to implement the data validation requirements in stages, e.g. first do grammar checking, then do co-constraint checking (using Schematron), then do data cardinality checking (using Schematron), then do algorithmic checking (using Schematron).

    Example: Here is a real-world case where staged validation is needed. Suppose an XML instance document accidentally specifies the wrong namespace. A lax XML Schema validator, finding no declarations for the unexpected namespace, may report the document as valid (a false positive) even though it may be incorrect. To avoid this, check the namespace in a separate validation step before performing XML Schema validation.

    C. There may be a performance improvement, e.g. suppose grammar validation is done first and fails (i.e. outputs errors); it may not be necessary to execute the Schematron validation; thus there is a time savings.
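
    The namespace check described in the example under advantage B could itself be written as a small Schematron schema that runs as the first stage. The following is only a sketch: the pattern name is made up for illustration, and it assumes (as in the sample instance document) that the root element is Document in no namespace.

    ```xml
    <?xml version="1.0"?>
    <sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

       <!-- Hypothetical first-stage check: run before grammar validation -->
       <sch:pattern name="Namespace Check">

          <!-- The context is the root element, whatever its name is -->
          <sch:rule context="/*">

             <sch:assert test="local-name()='Document' and namespace-uri()=''">
                 The root element must be Document, in no namespace
             </sch:assert>

          </sch:rule>

       </sch:pattern>

    </sch:schema>
    ```

    If this stage reports an error, the later grammar and co-constraint stages need not be run at all.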

    Disadvantages

    A. There may be a performance degradation. Running several validations rather than a single validation may be more expensive.

  2. Advantages/Disadvantages of Schematron Embedded within a Grammar Schema

    Advantages

    A. There may be a performance improvement. Running one validation rather than several validations may yield a savings in performance.

    Disadvantages

    A. Swapping out the particular grammar language that is currently being used and replacing it with a different grammar language may be difficult since the two are tightly intertwined. [This disadvantage may be negated if the schemas are being generated from, say, a UML model.]

    B. Data validation is a big-bang event. All data validation requirements -- grammar, co-constraints, cardinality, algorithmic -- are performed at once.

    C. There may be a performance degradation, e.g. it is not possible to take advantage of omitting Schematron validation when grammar validation fails.

  3. Advantages/Disadvantages of a Grammar Schema Embedded within Schematron

    Advantages

    A. There may be a performance improvement. Running one validation rather than several may yield a savings in performance.

    Disadvantages

    A. Swapping out the particular grammar language that is currently being used and replacing it with a different grammar language may be difficult since the two are tightly intertwined.

    B. Data validation is a big-bang event. All data validation requirements -- grammar, co-constraints, cardinality, algorithmic -- are performed at once.

    C. There may be a performance degradation, e.g. it is not possible to take advantage of omitting Schematron validation when grammar validation fails.

    D. There are no tools currently available to process a document comprised of an XML Schema embedded in Schematron.

  4. Advantages/Disadvantages of Implementing Data Validation Requirements on a Per-Topic Basis

    Advantages

    A. Dividing the data validation requirements along topical lines gives a logical separation of concerns, which can fit better with how the data requirements are organized and can reduce interdependencies between tasks.

    B. Constraint checking can be done in stages, in a pipeline fashion, along topic lines.

    C. There may be a performance improvement. If validation on one topic fails then succeeding validations can be aborted.

    Disadvantages

    A. If a topic is implemented using a combination of a grammar language and Schematron within one document then swapping out the particular grammar language that is currently being used and replacing it with a different grammar language may be difficult since it may be tightly intertwined with Schematron.

Recommendation

A combination of approach #1 (keep grammar and Schematron schemas separate) and approach #4 (organize data validation requirements based on topic) yields the greatest advantages. Here's how to implement your data validation requirements to get the best of both approaches:

Divide up your data validation requirements by topic. Implement each topic using whatever collection of grammar and Schematron schemas is needed. Keep the grammar and Schematron schemas in separate documents. Create a pipeline of schemas, arranged by topic.

For example, in the pipeline of schemas the first three schemas implement topic #1, the next two schemas implement topic #2, and so forth.
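
One possible way to express such a pipeline is with XProc, a W3C pipeline language that postdates the original discussion. The sketch below is an illustration only, not a prescribed implementation. The three file names are hypothetical, and the Schematron schemas are assumed to be written in a Schematron dialect that the chosen XProc processor accepts (processors commonly expect the ISO Schematron namespace).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

    <!-- Stage 1 (hypothetical file): check the root element and namespace -->
    <p:validate-with-schematron assert-valid="true">
        <p:input port="schema">
            <p:document href="namespace-check.sch"/>
        </p:input>
    </p:validate-with-schematron>

    <!-- Stage 2 (hypothetical file): topic "Classification Values", grammar schema -->
    <p:validate-with-xml-schema assert-valid="true">
        <p:input port="schema">
            <p:document href="classification-values.xsd"/>
        </p:input>
    </p:validate-with-xml-schema>

    <!-- Stage 3 (hypothetical file): topic "Security Classification", Schematron schema -->
    <p:validate-with-schematron assert-valid="true">
        <p:input port="schema">
            <p:document href="security-classification.sch"/>
        </p:input>
    </p:validate-with-schematron>

</p:pipeline>
```

With assert-valid set to true, a stage that finds the document invalid raises an error, so the later stages are not run. Each schema stays in its own document, and the stages are ordered by topic.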

This approach gives you all the benefits of keeping grammar and Schematron schemas separate, plus the benefits of organizing your data validation requirements by topic.

Acknowledgements

The following people contributed to the creation of this document:


1. Separate Schematron and Grammar Documents

Grammar Document (XML Schema)

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    
    <xsd:element name="Document">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="Para" maxOccurs="unbounded"/>
            </xsd:sequence>
            <xsd:attribute name="classification" type="classification-type" use="required"/>
        </xsd:complexType>
    </xsd:element>
    
    <xsd:element name="Para">
        <xsd:complexType>
            <xsd:simpleContent>
                <xsd:extension base="xsd:string">
                    <xsd:attribute name="classification" type="classification-type" use="required"/>
                </xsd:extension>
            </xsd:simpleContent>
        </xsd:complexType>      
    </xsd:element>
    
    <xsd:simpleType name="classification-type">
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="unclassified"/>
            <xsd:enumeration value="confidential"/>
            <xsd:enumeration value="secret"/>
            <xsd:enumeration value="top-secret"/>
        </xsd:restriction>
    </xsd:simpleType>
    
</xsd:schema>
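
Because the grammar is kept in its own document, it can be replaced with a grammar written in another schema language without touching the Schematron document. As a rough sketch (not part of the original implementations), a Relax NG grammar intended to cover requirements 2 and 3 might look like this. The define names are chosen here for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">

    <start>
        <ref name="Document"/>
    </start>

    <!-- A Document has a classification attribute and one or more Para elements -->
    <define name="Document">
        <element name="Document">
            <ref name="classification-attribute"/>
            <oneOrMore>
                <ref name="Para"/>
            </oneOrMore>
        </element>
    </define>

    <!-- A Para has a classification attribute and text content -->
    <define name="Para">
        <element name="Para">
            <ref name="classification-attribute"/>
            <text/>
        </element>
    </define>

    <!-- An attribute pattern is required unless it is wrapped in optional -->
    <define name="classification-attribute">
        <attribute name="classification">
            <choice>
                <value>unclassified</value>
                <value>confidential</value>
                <value>secret</value>
                <value>top-secret</value>
            </choice>
        </attribute>
    </define>

</grammar>
```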

Schematron Document

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

   <sch:pattern name="Security Classification Policy">

      <sch:p>A Para's classification value cannot be more sensitive 
             than the Document's classification value.</sch:p> 

      <sch:rule context="Para[@classification='top-secret']">

         <sch:assert test="/Document/@classification='top-secret'">
             If there is a Para labeled "top-secret" then the Document  
             must be labeled top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='secret']">

         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret')">
             If there is a Para labeled "secret" then the Document  
             must be labeled either secret or top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='confidential']">

         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret') or 
                           (/Document/@classification='confidential')">
             If there is a Para labeled "confidential" then the Document  
             must be labeled either confidential, secret or top-secret
         </sch:assert>

      </sch:rule>

   </sch:pattern>

</sch:schema>

2. Schematron Embedded within a Grammar Document

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    
    <xsd:element name="Document">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="Para" maxOccurs="unbounded"/>
            </xsd:sequence>
            <xsd:attribute name="classification" type="classification-type" use="required"/>
        </xsd:complexType>
    </xsd:element>
    
    <xsd:element name="Para">
        <xsd:annotation>
            <xsd:appinfo>
                <sch:pattern name="Security Classification Policy">
                   <sch:rule context="Para[@classification='top-secret']">
                      <sch:assert test="/Document/@classification='top-secret'">
                          If there is a Para labeled "top-secret" then the Document  
                          must be labeled top-secret
                      </sch:assert>
                   </sch:rule>
                   <sch:rule context="Para[@classification='secret']">
                      <sch:assert test="(/Document/@classification='top-secret') or
                                        (/Document/@classification='secret')">
                          If there is a Para labeled "secret" then the Document  
                          must be labeled either secret or top-secret
                      </sch:assert>
                   </sch:rule>
                   <sch:rule context="Para[@classification='confidential']">
                      <sch:assert test="(/Document/@classification='top-secret') or
                                        (/Document/@classification='secret') or 
                                        (/Document/@classification='confidential')">
                          If there is a Para labeled "confidential" then the Document  
                          must be labeled either confidential, secret or top-secret
                      </sch:assert>
                   </sch:rule>
                </sch:pattern>
            </xsd:appinfo>
        </xsd:annotation>
        <xsd:complexType>
            <xsd:simpleContent>
                <xsd:extension base="xsd:string">
                    <xsd:attribute name="classification" type="classification-type" use="required"/>
                </xsd:extension>
            </xsd:simpleContent>
        </xsd:complexType>      
    </xsd:element>
    
    <xsd:simpleType name="classification-type">
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="unclassified"/>
            <xsd:enumeration value="confidential"/>
            <xsd:enumeration value="secret"/>
            <xsd:enumeration value="top-secret"/>
        </xsd:restriction>
    </xsd:simpleType>
    
</xsd:schema>
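
An XML Schema validator ignores the content of <appinfo>, so the embedded Schematron has to be handed to a Schematron processor in some way. One common approach is to extract it first. The XSLT 1.0 stylesheet below is a minimal sketch of such an extraction step, not a tool from the original discussion. It assumes the embedded patterns use the same Schematron namespace that the schema above declares for the sch prefix.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                xmlns:sch="http://purl.oclc.org/dsdl/schematron">

    <xsl:output method="xml" indent="yes"/>

    <!-- Wrap every embedded pattern in a standalone Schematron schema -->
    <xsl:template match="/">
        <sch:schema>
            <xsl:copy-of select="//xsd:appinfo/sch:pattern"/>
        </sch:schema>
    </xsl:template>

</xsl:stylesheet>
```

Running this stylesheet over the combined schema yields a Schematron document that can be processed like the separate Schematron document in alternative 1.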

3. Grammar Embedded within a Schematron Document

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema">

   <sch:pattern name="Security Classification Policy">

      <sch:p>A Para's classification value cannot be more sensitive 
             than the Document's classification value.</sch:p>     

        <xsd:element name="Para">
            <xsd:complexType>
                <xsd:simpleContent>
                    <xsd:extension base="xsd:string">
                        <xsd:attribute name="classification" type="classification-type" use="required"/>
                    </xsd:extension>
                </xsd:simpleContent>
            </xsd:complexType>      
        </xsd:element>
    
        <xsd:simpleType name="classification-type">
            <xsd:restriction base="xsd:string">
                <xsd:enumeration value="unclassified"/>
                <xsd:enumeration value="confidential"/>
                <xsd:enumeration value="secret"/>
                <xsd:enumeration value="top-secret"/>
            </xsd:restriction>
        </xsd:simpleType>

      <sch:rule context="Para[@classification='top-secret']">

         <sch:assert test="/Document/@classification='top-secret'">
             If there is a Para labeled "top-secret" then the Document  
             must be labeled top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='secret']">

         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret')">
             If there is a Para labeled "secret" then the Document  
             must be labeled either secret or top-secret
         </sch:assert>

      </sch:rule>

      <sch:rule context="Para[@classification='confidential']">

         <sch:assert test="(/Document/@classification='top-secret') or
                           (/Document/@classification='secret') or 
                           (/Document/@classification='confidential')">
             If there is a Para labeled "confidential" then the Document  
             must be labeled either confidential, secret or top-secret
         </sch:assert>

      </sch:rule>

   </sch:pattern>

   <xsd:element name="Document">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="Para" maxOccurs="unbounded"/>
            </xsd:sequence>
            <xsd:attribute name="classification" type="classification-type" use="required"/>
        </xsd:complexType>
   </xsd:element>

</sch:schema>

4. Documents Organized by Topic

--- TBD ---

Last Updated: July 30, 2007