You have a set of data validation requirements for your system, and you have decided to implement them using a combination of a grammar-based schema language (e.g. Relax NG or XML Schema) and Schematron. Which approach should you take?
Suppose this XML instance document is representative of the type of data that your system exchanges:
<?xml version="1.0"?>
<Document classification="secret">
    <Para classification="unclassified">
        One if by land; two if by sea.
    </Para>
</Document>
And suppose the system's data requirements are:
The first requirement will be implemented using Schematron. The next two requirements will be implemented using a grammar-based schema language, say, XML Schema.
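There are four alternatives (the actual implementations are shown at the bottom of this document):
1. Keep the grammar schema and the Schematron schema in separate documents.
2. Embed the Schematron rules within the grammar schema (e.g. in XML Schema annotations).
3. Embed the grammar schema within the Schematron schema.
4. Organize the data validation requirements by topic, implementing each topic with whatever grammar and Schematron schemas it needs.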
The following sections discuss the advantages and disadvantages of each alternative.
Advantages of alternative 1 (grammar and Schematron schemas in separate documents):
A. The particular grammar language currently being used can be easily replaced. Thus, if XML Schema is currently being used, you can later replace it with Relax NG without impacting the Schematron schema.
B. Data validation can be done in stages, in a pipeline fashion. It might be desirable for your system to implement the data validation requirements in stages, e.g. first do grammar checking, then do co-constraint checking (using Schematron), then do data cardinality checking (using Schematron), then do algorithmic checking (using Schematron).
Example: A real-world case that calls for staged validation: consider XML Schema validation of an XML instance document that accidentally specifies the wrong namespace. A lax validator may report the document as valid (a false positive) even though it may be incorrect. To avoid this, check the namespace in a separate validation step before performing XML Schema validation.
C. There may be a performance improvement. For example, if grammar validation is done first and fails (i.e. outputs errors), it may not be necessary to run the Schematron validation, which saves time.
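The staged validation described above can be sketched as a small pipeline that stops at the first failing stage. This is only an illustration using Python's standard library: the stage names, `run_pipeline`, and the `check_*` functions are hypothetical, and `check_grammar` and `check_classification` are hand-written stand-ins for a real XML Schema validator and a real Schematron processor. The sensitivity ranking is assumed from the order of the classification values in this document's XML Schema.

```python
import xml.etree.ElementTree as ET

# Assumed sensitivity order, taken from the order of the classification
# values enumerated in this document's XML Schema.
RANK = {"unclassified": 0, "confidential": 1, "secret": 2, "top-secret": 3}

EXPECTED_NS = ""  # the sample Document is in no namespace


def check_namespace(root):
    """Pre-check: the root element must be in the expected namespace."""
    # ElementTree writes namespaced tags as '{uri}local'.
    uri = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
    if uri != EXPECTED_NS:
        return [f"root is in namespace '{uri}', expected '{EXPECTED_NS}'"]
    return []


def check_grammar(root):
    """Stand-in for grammar validation (a real system would run XSD here)."""
    errors = []
    if root.tag != "Document":
        errors.append(f"root element must be Document, found {root.tag}")
    if len(root) == 0:
        errors.append("Document must contain at least one Para")
    for child in root:
        if child.tag != "Para":
            errors.append(f"unexpected child element {child.tag}")
    return errors


def check_classification(root):
    """Stand-in for the Schematron co-constraint rules."""
    doc_rank = RANK.get(root.get("classification"), 0)
    errors = []
    for para in root.iter("Para"):
        if RANK.get(para.get("classification"), 0) > doc_rank:
            errors.append("Para is more sensitive than its Document")
    return errors


STAGES = [
    ("namespace", check_namespace),
    ("grammar", check_grammar),
    ("co-constraints", check_classification),
]


def run_pipeline(xml_text, stages=STAGES):
    """Run the stages in order; return (failed stage name, errors) or (None, [])."""
    root = ET.fromstring(xml_text)
    for name, stage in stages:
        errors = stage(root)
        if errors:
            return name, errors  # later stages are skipped
    return None, []
```

Because each stage is a separate entry in `STAGES`, the grammar stand-in could be replaced by a different validator without touching the co-constraint stage, which mirrors advantage A.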
Disadvantages of alternative 1:
A. There may be a performance degradation. Running several validations rather than a single validation may be more expensive.
Advantages of alternative 2 (Schematron embedded in the grammar schema):
A. There may be a performance improvement. Running one validation rather than several may be faster.
Disadvantages of alternative 2:
A. Replacing the grammar language currently in use with a different grammar language may be difficult, since the grammar and the Schematron rules are tightly intertwined. [This disadvantage may be negated if the schemas are being generated from, say, a UML model.]
B. Data validation is a big-bang event. All data validation requirements (grammar, co-constraints, cardinality, algorithmic) are checked at once.
C. There may be a performance degradation, e.g. it is not possible to skip Schematron validation when grammar validation fails.
Advantages of alternative 3 (grammar schema embedded in Schematron):
A. There may be a performance improvement. Running one validation rather than several may be faster.
Disadvantages of alternative 3:
A. Replacing the grammar language currently in use with a different grammar language may be difficult, since the grammar and the Schematron rules are tightly intertwined.
B. Data validation is a big-bang event. All data validation requirements (grammar, co-constraints, cardinality, algorithmic) are checked at once.
C. There may be a performance degradation, e.g. it is not possible to skip Schematron validation when grammar validation fails.
D. There are no tools currently available to process a document that consists of an XML Schema embedded in Schematron.
Advantages of alternative 4 (validation requirements organized by topic):
A. Dividing the data validation requirements along topical lines gives a logical separation of concerns, which can fit better with the way the data requirements are organized and can reduce task interdependencies.
B. Constraint checking can be done in stages, in a pipeline fashion, along topic lines.
C. There may be a performance improvement. If validation on one topic fails, then the validations that follow it can be skipped.
Disadvantages of alternative 4:
A. If a topic is implemented using a combination of a grammar language and Schematron within one document, then replacing the grammar language with a different one may be difficult, since it may be tightly intertwined with the Schematron rules.
A combination of approach #1 (keep grammar and Schematron schemas separate) and approach #4 (organize data validation requirements based on topic) yields the greatest advantages. Here's how to implement your data validation requirements to get the best of both approaches:
Divide up your data validation requirements by topic. Implement each topic using whatever collection of grammar and Schematron schemas is needed. Keep the grammar and Schematron schemas in separate documents. Create a pipeline of schemas, arranged by topic.
For example, in the pipeline of schemas the first three schemas implement topic #1, the next two schemas implement topic #2, and so forth.
This approach gives you all the benefits of keeping grammar and Schematron schemas separate, plus the benefits of organizing your data validation requirements by topic.
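One way to picture the recommended arrangement is as an ordered list of stages, each tagged with its topic and its kind (grammar or Schematron). The sketch below is illustrative only: the `Stage` class, the topic names, the schema file names, and the `accept` validator are all hypothetical, and a real system would plug actual grammar and Schematron validators into the `validate` slot.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple

Validator = Callable[[str], List[str]]  # takes XML text, returns error messages


@dataclass(frozen=True)
class Stage:
    topic: str       # which requirement topic this schema belongs to
    kind: str        # "grammar" or "schematron"
    schema: str      # hypothetical schema file name
    validate: Validator


def accept(xml_text: str) -> List[str]:
    """Placeholder validator that reports no errors."""
    return []


# The first three schemas implement topic #1, the next two implement topic #2.
PIPELINE = [
    Stage("topic-1", "grammar", "topic1.xsd", accept),
    Stage("topic-1", "schematron", "topic1-co-constraints.sch", accept),
    Stage("topic-1", "schematron", "topic1-cardinality.sch", accept),
    Stage("topic-2", "grammar", "topic2.xsd", accept),
    Stage("topic-2", "schematron", "topic2-algorithmic.sch", accept),
]


def run_pipeline(pipeline: List[Stage], xml_text: str) -> Optional[Tuple[Stage, List[str]]]:
    """Run stages in order; return the first failing stage and its errors, or None."""
    for stage in pipeline:
        errors = stage.validate(xml_text)
        if errors:
            return stage, errors  # remaining stages are not run
    return None


def swap_grammar(pipeline: List[Stage], topic: str, schema: str,
                 validate: Validator) -> List[Stage]:
    """Replace the grammar stage of one topic, leaving Schematron stages untouched."""
    return [
        replace(s, schema=schema, validate=validate)
        if s.topic == topic and s.kind == "grammar" else s
        for s in pipeline
    ]
```

Because the grammar and Schematron schemas are separate stages, `swap_grammar(PIPELINE, "topic-1", "topic1.rng", accept)` changes only the grammar entry for that topic; the Schematron stages are reused as they are.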
The following people contributed to the creation of this document:
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="Document">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="Para" maxOccurs="unbounded"/>
            </xsd:sequence>
            <xsd:attribute name="classification" type="classification-type"/>
        </xsd:complexType>
    </xsd:element>
    <xsd:element name="Para">
        <xsd:complexType>
            <xsd:simpleContent>
                <xsd:extension base="xsd:string">
                    <xsd:attribute name="classification" type="classification-type"/>
                </xsd:extension>
            </xsd:simpleContent>
        </xsd:complexType>
    </xsd:element>
    <xsd:simpleType name="classification-type">
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="unclassified"/>
            <xsd:enumeration value="confidential"/>
            <xsd:enumeration value="secret"/>
            <xsd:enumeration value="top-secret"/>
        </xsd:restriction>
    </xsd:simpleType>
</xsd:schema>
<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
    <sch:pattern name="Security Classification Policy">
        <sch:p>A Para's classification value cannot be more sensitive than the Document's classification value.</sch:p>
        <sch:rule context="Para[@classification='top-secret']">
            <sch:assert test="/Document/@classification='top-secret'">
                If there is a Para labeled "top-secret" then the Document must be labeled top-secret
            </sch:assert>
        </sch:rule>
        <sch:rule context="Para[@classification='secret']">
            <sch:assert test="(/Document/@classification='top-secret') or (/Document/@classification='secret')">
                If there is a Para labeled "secret" then the Document must be labeled either secret or top-secret
            </sch:assert>
        </sch:rule>
        <sch:rule context="Para[@classification='confidential']">
            <sch:assert test="(/Document/@classification='top-secret') or (/Document/@classification='secret') or (/Document/@classification='confidential')">
                If there is a Para labeled "confidential" then the Document must be labeled either confidential, secret or top-secret
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <xsd:element name="Document">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="Para" maxOccurs="unbounded"/>
            </xsd:sequence>
            <xsd:attribute name="classification" type="classification-type"/>
        </xsd:complexType>
    </xsd:element>
    <xsd:element name="Para">
        <xsd:annotation>
            <xsd:appinfo>
                <sch:pattern name="Security Classification Policy">
                    <sch:rule context="Para[@classification='top-secret']">
                        <sch:assert test="/Document/@classification='top-secret'">
                            If there is a Para labeled "top-secret" then the Document must be labeled top-secret
                        </sch:assert>
                    </sch:rule>
                    <sch:rule context="Para[@classification='secret']">
                        <sch:assert test="(/Document/@classification='top-secret') or (/Document/@classification='secret')">
                            If there is a Para labeled "secret" then the Document must be labeled either secret or top-secret
                        </sch:assert>
                    </sch:rule>
                    <sch:rule context="Para[@classification='confidential']">
                        <sch:assert test="(/Document/@classification='top-secret') or (/Document/@classification='secret') or (/Document/@classification='confidential')">
                            If there is a Para labeled "confidential" then the Document must be labeled either confidential, secret or top-secret
                        </sch:assert>
                    </sch:rule>
                </sch:pattern>
            </xsd:appinfo>
        </xsd:annotation>
        <xsd:complexType>
            <xsd:simpleContent>
                <xsd:extension base="xsd:string">
                    <xsd:attribute name="classification" type="classification-type"/>
                </xsd:extension>
            </xsd:simpleContent>
        </xsd:complexType>
    </xsd:element>
    <xsd:simpleType name="classification-type">
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="unclassified"/>
            <xsd:enumeration value="confidential"/>
            <xsd:enumeration value="secret"/>
            <xsd:enumeration value="top-secret"/>
        </xsd:restriction>
    </xsd:simpleType>
</xsd:schema>
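When Schematron rules are embedded in an XML Schema like this, one way to run them is to first pull the `sch:pattern` elements out of `xsd:appinfo` into a standalone Schematron schema. The following is a minimal sketch of that extraction step using Python's standard library. The function name `extract_schematron` and the abbreviated `SAMPLE_XSD` are illustrative assumptions and are not part of any Schematron tool.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
SCH_NS = "http://purl.oclc.org/dsdl/schematron"

# Abbreviated, hypothetical schema: one element carrying one embedded pattern.
SAMPLE_XSD = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <xsd:element name="Para">
    <xsd:annotation>
      <xsd:appinfo>
        <sch:pattern name="Security Classification Policy">
          <sch:rule context="Para[@classification='top-secret']">
            <sch:assert test="/Document/@classification='top-secret'">
              The Document must be labeled top-secret
            </sch:assert>
          </sch:rule>
        </sch:pattern>
      </xsd:appinfo>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
"""


def extract_schematron(xsd_text):
    """Collect every sch:pattern found directly under xsd:appinfo into one sch:schema."""
    xsd_root = ET.fromstring(xsd_text)
    sch_schema = ET.Element(f"{{{SCH_NS}}}schema")
    for appinfo in xsd_root.iter(f"{{{XSD_NS}}}appinfo"):
        for pattern in appinfo.findall(f"{{{SCH_NS}}}pattern"):
            sch_schema.append(pattern)
    return sch_schema


if __name__ == "__main__":
    # Serialize with the conventional 'sch' prefix for readability.
    ET.register_namespace("sch", SCH_NS)
    print(ET.tostring(extract_schematron(SAMPLE_XSD), encoding="unicode"))
```

The extracted schema can then be handed to a Schematron processor as a separate validation step, while the XML Schema itself is validated as usual.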
<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <sch:pattern name="Security Classification Policy">
        <sch:p>A Para's classification value cannot be more sensitive than the Document's classification value.</sch:p>
        <xsd:element name="Para">
            <xsd:complexType>
                <xsd:simpleContent>
                    <xsd:extension base="xsd:string">
                        <xsd:attribute name="classification" type="classification-type"/>
                    </xsd:extension>
                </xsd:simpleContent>
            </xsd:complexType>
        </xsd:element>
        <xsd:simpleType name="classification-type">
            <xsd:restriction base="xsd:string">
                <xsd:enumeration value="unclassified"/>
                <xsd:enumeration value="confidential"/>
                <xsd:enumeration value="secret"/>
                <xsd:enumeration value="top-secret"/>
            </xsd:restriction>
        </xsd:simpleType>
        <sch:rule context="Para[@classification='top-secret']">
            <sch:assert test="/Document/@classification='top-secret'">
                If there is a Para labeled "top-secret" then the Document must be labeled top-secret
            </sch:assert>
        </sch:rule>
        <sch:rule context="Para[@classification='secret']">
            <sch:assert test="(/Document/@classification='top-secret') or (/Document/@classification='secret')">
                If there is a Para labeled "secret" then the Document must be labeled either secret or top-secret
            </sch:assert>
        </sch:rule>
        <sch:rule context="Para[@classification='confidential']">
            <sch:assert test="(/Document/@classification='top-secret') or (/Document/@classification='secret') or (/Document/@classification='confidential')">
                If there is a Para labeled "confidential" then the Document must be labeled either confidential, secret or top-secret
            </sch:assert>
        </sch:rule>
    </sch:pattern>
    <xsd:element name="Document">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="Para" maxOccurs="unbounded"/>
            </xsd:sequence>
            <xsd:attribute name="classification" type="classification-type"/>
        </xsd:complexType>
    </xsd:element>
</sch:schema>
--- TBD ---
Last Updated: July 30, 2007