Dangers of Copying Text into XML
Issue
What problems may arise when text is copied from another document and pasted into an XML document?
Example
Consider this XML document:
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<Para id="...">...</Para>
</Document>
Suppose that text is copied from a document and pasted into the XML document, either as the
content of the <Para> element or as the value of the id attribute. What problems may be introduced?
Problems
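- Reserved XML Characters: The text may contain these reserved characters: {<, >, ', ", &}.
An unescaped < or & is a syntax error in element content and in attribute values, and a quote character
ends an attribute value that is delimited by the same quote character. Such characters need to be
escaped (for example as &lt;, &amp;, &quot;).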
- Encoding Mismatch: The editor used to create the text may use a different encoding than the XML document's encoding.
Two types of problems may result when there is a mismatch of encodings:
- A byte sequence that represents a character in one encoding may represent a different character in
another encoding. Consequently, if the text was created in an editor that uses a different encoding
than the XML document then the characters that result from pasting the text into the XML document
may not be the same.
- A byte sequence that represents a character in one encoding may not represent any character in
another encoding. Consequently, if the text was created in an editor that uses a different encoding
than the XML document then pasting the text into the XML document may result in introducing illegal
characters.
- Example: Text from Microsoft Word may be encoded as Windows-1252, in which the left curly
(a.k.a. smart) double quote is the single byte x93. In UTF-8 the same character is the three-byte
sequence xE2 x80 x9C, and the byte x93 on its own is not a valid UTF-8 sequence.
Copying a left curly quote from a Word document and pasting it into a UTF-8 XML document
may result in the XML document receiving a byte sequence that cannot be decoded as UTF-8.
- Non-XML Characters: The text might contain characters that XML does not allow, such as most control characters (e.g. x00 or x0B in XML 1.0). Pasting them into the XML document produces a document that is not well-formed.
- Entity References: The text might contain entity references that are not defined in the XML document (for example, &nbsp; in text copied from HTML source).
- Namespace Mismatch: The text may contain references to namespaces that are not defined
in the XML document, or it may overwrite a namespace that already exists in the XML document, or it may contain markup that is not allowed in the
XML document's namespace.
- Partial Markup: The text may contain an incomplete unit of markup, such as the start
symbols of a CDATA section but not its end symbols, a start tag but not its end tag, or the first delimiter of an attribute
value but not its end delimiter.
- Uniqueness Collision: If the text is pasted into an attribute that is of datatype ID then the attribute's value may collide with an ID value
elsewhere in the document. Similarly, if the text is pasted into an element that has a uniqueness constraint then the element's
value may no longer be unique.
- Invalid Datatype: The text may not be of the appropriate datatype for the element or attribute.
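Several of the problems listed above can be demonstrated in a few lines of code. The following is a minimal Python sketch, not a complete sanitizer. It assumes the pasted text is available as a Python string; the helper names (find_non_xml_chars, as_element_content, as_attribute) and the sample text are hypothetical and chosen only for illustration. It shows escaping of reserved characters with the standard library, the Windows-1252 versus UTF-8 mismatch for the left curly quote, and a check for characters outside the XML 1.0 character range.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Anything outside the XML 1.0 "Char" production: most C0 control
# characters, surrogates, and U+FFFE / U+FFFF.
_NON_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_non_xml_chars(text):
    """Return the characters in text that XML 1.0 does not allow."""
    return _NON_XML_CHARS.findall(text)


def as_element_content(text):
    """Escape &, < and > so the text can be used as element content."""
    return escape(text)


def as_attribute(text):
    """Escape the text and wrap it in suitable attribute quotes."""
    return quoteattr(text)


# Reserved characters: hypothetical pasted text.
pasted = 'Tom & Jerry say "x < y"'
para = as_element_content(pasted)
attr = as_attribute(pasted)

# Encoding mismatch: the left curly quote (U+201C) in two encodings.
word_bytes = "\u201C".encode("cp1252")   # one byte, x93
utf8_bytes = "\u201C".encode("utf-8")    # three bytes, xE2 x80 x9C

# The Windows-1252 byte is not a valid UTF-8 sequence on its own.
try:
    word_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# The reverse mismatch decodes without error but yields different characters.
mojibake = utf8_bytes.decode("cp1252")

print(para)
print(attr)
print(word_bytes, utf8_bytes, decodes_as_utf8, mojibake)
print(find_non_xml_chars("ok\x0bvalue\x00"))
```

Note that escaping handles only the reserved characters. Characters reported by find_non_xml_chars cannot be made legal in XML 1.0 by escaping, so they have to be removed or replaced before the text is pasted.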
Minimizing Problems
Some of the above problems may go unnoticed when you copy and paste using a plain-text editor that doesn't
understand XML or character encodings (e.g. Notepad, vi). In particular, the problems go
unnoticed during editing, only to surface later when the document is parsed by an XML parser.
There are significantly fewer dangers when you use an editor that is aware of both character encodings
and XML, because such an editor can detect the problems earlier, during editing rather than
during XML parsing.
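When only a plain-text editor is available, a similar early check can be approximated by running the document through a parser right after pasting. The sketch below uses Python's standard xml.etree.ElementTree module, which is a non-validating parser: it reports well-formedness errors (unescaped reserved characters, undefined entities, partial markup, bytes that are not valid in the declared encoding) but does not check datatypes or ID uniqueness. The duplicate_ids helper is a hypothetical stand-in for the latter and assumes, as in the example document, that the attribute named id is meant to be unique. All sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_well_formed(xml_bytes):
    """Return None if xml_bytes parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        return str(err)
    return None


def duplicate_ids(xml_bytes, attr="id"):
    """Return the values of attr that occur on more than one element."""
    root = ET.fromstring(xml_bytes)
    counts = Counter(
        el.get(attr) for el in root.iter() if el.get(attr) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


HEADER = b'<?xml version="1.0" encoding="UTF-8"?>'


def doc(para_content):
    """Wrap pasted bytes in the example document's structure."""
    return (HEADER + b'<Document><Para id="p1">' + para_content
            + b'</Para></Document>')


samples = {
    "clean": doc(b"Tom &amp; Jerry"),
    "raw ampersand": doc(b"Tom & Jerry"),
    "undefined entity": doc(b"a&nbsp;b"),
    "partial CDATA": doc(b"<![CDATA[ no end"),
    "Windows-1252 byte": doc(b"\x93quoted"),
}

results = {name: check_well_formed(data) for name, data in samples.items()}
for name, error in results.items():
    print(name, "->", "ok" if error is None else error)

# Well-formed, but two elements share the same id value.
dup = (HEADER + b'<Document><Para id="p1">a</Para>'
       b'<Para id="p1">b</Para></Document>')
print(check_well_formed(dup), duplicate_ids(dup))
```

A check like this surfaces problems immediately after the paste instead of at some later processing step. Datatype and uniqueness constraints declared in a DTD or schema still require a validating parser.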
Acknowledgements
The following people contributed to the creation of this document:
- Greg Alvord
- Frank Benson
- Len Bullard
- David Carlisle
- Anthony Coates
- Roger Costello
- Brook Heaton
- Ken Holman
- Pete Kirkham
- Kit Leuder
- Noah Mendelsohn
- Dave Pawson
- Bryan Rasmussen
- Jonathan Robie
- Tom Rudick
- Rob Simmons
- Richard Tobin
Last Updated: September 12, 2012