It is considered best practice to embed within your XML and HTML documents an indication of the encoding used to create the documents.
For example, in XML documents encoding information is put in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
In HTML documents encoding information is put in the meta tag:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Why? Shouldn't encoding metadata be external to a document? The next two sections explain why it may be preferable to put encoding information inside your documents.
XML and HTML documents are exchanged on the Web using the HTTP protocol. The HTTP header has a property that may be used to indicate the charset (encoding) of its payload (i.e. the XML document or the HTML document), e.g.
Content-Type: application/xml; charset=UTF-8
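To make the header concrete, here is one way a consumer might pull the charset parameter out of a Content-Type value. This sketch uses Python's standard-library email API purely for illustration; real servers and parsers each have their own header machinery.

```python
from email.message import Message

# Build a message carrying the Content-Type header from the example above,
# then read back its charset parameter.
msg = Message()
msg["Content-Type"] = "application/xml; charset=UTF-8"

# get_content_charset() returns the charset coerced to lowercase,
# or None if the header carries no charset parameter.
print(msg.get_content_charset())
```

Note that the parameter name and value are case-insensitive, which is why the API normalizes the result to lowercase.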
Isn't the (external) HTTP header sufficient to specify a document's encoding? Not always; here's why:
Suppose you have a big web server with lots of sites and hundreds of pages, contributed by lots of people in lots of different languages. The web server wouldn't know the encoding of each document.
Note: if an HTTP header specifies an encoding then any encoding within the payload document will be ignored.
XML and HTML documents reside in many places, not just the Web. They may be stored in repositories or on hard disks, for instance, where they may not be accompanied by an external encoding label.
Thus, it is considered best practice to specify the encoding within the document.
Suppose there is no external information to indicate the encoding of a document. Then an intriguing problem arises: in order to read the document you need to know its encoding, but to know its encoding you must read the document!
Stated differently, for an XML parser to know how to interpret the bit strings in a document it must know the encoding, but to know the encoding it must read the document!
We have a chicken-and-egg problem. How is this resolved? That's described next.
Consider the problem of an XML parser trying to determine the encoding of an XML document. Suppose the XML document has this XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
Note that the characters in an XML declaration are restricted to characters from the ASCII repertoire (however encoded).
One thing an XML parser must do is determine whether each character uses 1, 2, or 4 bytes. That is, the XML parser must determine whether the encoding is 1-byte-oriented (such as UTF-8), 2-byte-oriented (such as UTF-16), or 4-byte-oriented.
With multi-byte encodings there are two approaches to arranging the bytes: store the most significant byte first or store the most significant byte last. The two storage approaches are called big-endian and little-endian. Thus, an XML parser must determine whether the characters in the XML declaration are stored as big-endian or little-endian.
An XML document may or may not have a Byte-Order Mark (BOM) in its first four bytes. The BOM indicates the ordering of the bytes, i.e. the endianness. So the first thing an XML parser will do is check for the presence of a BOM. If a BOM is present then the endianness is determined. The BOM also reveals whether the document's encoding uses 1, 2, or 4 bytes per character. The following table shows how an XML parser analyzes the BOM to determine the endianness and byte size:
Hex Values | Endianness | Encoding Byte Size |
---|---|---|
00 00 FE FF | big-endian | 4-byte |
FF FE 00 00 | little-endian | 4-byte |
FE FF ## ## | big-endian | 2-byte |
FF FE ## ## | little-endian | 2-byte |
EF BB BF | n/a | 1-byte |
The notation ## is used to denote any byte value, except that two consecutive ##s cannot both be 00.
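The BOM table above can be sketched as a small classifier. This is an illustrative sketch, not a complete detector; the function and return shape are choices made here, not part of any parser's API. Checking the longer patterns first handles the "two consecutive ##s cannot both be 00" constraint automatically.

```python
def sniff_bom(data: bytes):
    """Classify the leading bytes of a document per the BOM table above.
    Returns (endianness, bytes_per_char), or None when no BOM is present."""
    # 4-byte patterns must be tested before their 2-byte prefixes,
    # otherwise FF FE 00 00 would be misread as a 2-byte little-endian BOM.
    if data.startswith(b"\x00\x00\xfe\xff"):
        return ("big-endian", 4)
    if data.startswith(b"\xff\xfe\x00\x00"):
        return ("little-endian", 4)
    if data.startswith(b"\xfe\xff"):
        return ("big-endian", 2)
    if data.startswith(b"\xff\xfe"):
        return ("little-endian", 2)
    if data.startswith(b"\xef\xbb\xbf"):
        return (None, 1)  # UTF-8 BOM: byte order is irrelevant
    return None
```

For example, `sniff_bom(b"\xef\xbb\xbf<?xml")` classifies the document as 1-byte-oriented.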
If the document does not have a BOM then an XML parser will look for the initial characters of the XML declaration, "<?xm". The next table shows how the endianness and byte size may be determined by examining the initial byte sequence in the document. Note that 3C encodes "<", 3F encodes "?", 78 encodes "x", and 6D encodes "m".
Hex Values | Character(s) | Endianness | Encoding Byte Size |
---|---|---|---|
00 00 00 3C | < | big-endian | 4-byte |
3C 00 00 00 | < | little-endian | 4-byte |
00 3C 00 3F | <? | big-endian | 2-byte |
3C 00 3F 00 | <? | little-endian | 2-byte |
3C 3F 78 6D | <?xm | n/a | 1-byte |
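The no-BOM table above amounts to a four-byte lookup. The sketch below is a simplified illustration under the assumption that the document begins with an XML declaration; the function name and return shape are choices made here, and real parsers handle additional encodings, as noted later.

```python
def sniff_declaration(data: bytes):
    """Map the first four bytes of a BOM-less document to
    (endianness, bytes_per_char), following the table above.
    Returns None if no pattern matches."""
    patterns = {
        b"\x00\x00\x00\x3c": ("big-endian", 4),     # "<" in a 4-byte encoding
        b"\x3c\x00\x00\x00": ("little-endian", 4),  # "<" in a 4-byte encoding
        b"\x00\x3c\x00\x3f": ("big-endian", 2),     # "<?" in a 2-byte encoding
        b"\x3c\x00\x3f\x00": ("little-endian", 2),  # "<?" in a 2-byte encoding
        b"\x3c\x3f\x78\x6d": (None, 1),             # "<?xm" in a 1-byte encoding
    }
    return patterns.get(data[:4])
```

For instance, a plain ASCII or UTF-8 document beginning with `<?xml` matches the last entry.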
By this point an XML parser will know the endianness and byte-size being used by the document's encoding. So it can parse the XML declaration!
It parses the XML declaration until it arrives at the encoding. With the encoding information the XML parser now knows exactly how to parse the remainder of the document!
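Once the bytes of the declaration can be provisionally decoded, pulling out the encoding name is a simple matter of pattern matching. The sketch below uses a regular expression for illustration; the function name is a choice made here, and a real parser would use its tokenizer rather than a regex.

```python
import re

def read_declared_encoding(decl: str):
    """Extract the encoding name from an XML declaration that has already
    been provisionally decoded to text. Returns None if none is declared."""
    m = re.match(
        r"<\?xml[^>]*encoding=[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']",
        decl,
    )
    return m.group(1) if m else None
```

For example, applied to `<?xml version="1.0" encoding="UTF-8"?>` this yields `UTF-8`, which tells the parser exactly how to decode the rest of the document.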
We do not discuss here how an HTML parser determines the document's encoding.
In an XML document the XML declaration is optional. How does an XML parser determine the encoding when there is no XML declaration? The meta tag is optional in HTML, so how is its encoding determined?
First, a parser looks for external information, at the operating-system or transport-protocol level, to determine the encoding.
If the external information is unreliable or unavailable, the parser examines the first four bytes of the document for a BOM. If a BOM is present and indicates a 2-byte-per-character encoding, the parser defaults to UTF-16; if it indicates a 1-byte-per-character encoding, the parser defaults to UTF-8; if it indicates a 4-byte-per-character encoding, the parser defaults to UTF-32.
If a parser can't determine the encoding from external information and there is no BOM then it defaults to UTF-8.
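The fallback order just described can be summarized in a few lines. This is a self-contained sketch, with the function name and argument chosen here for illustration; it maps each BOM pattern directly to the default named above.

```python
def choose_default_encoding(data: bytes, external=None):
    """Pick an encoding per the fallback order described above:
    external information first, then the BOM, then UTF-8."""
    if external:
        return external          # e.g. a charset from an HTTP header
    # 4-byte BOM patterns must be checked before their 2-byte prefixes.
    if data.startswith((b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00")):
        return "UTF-32"
    if data.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "UTF-16"
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    return "UTF-8"               # no external info, no BOM
```

So a document arriving with no header and no BOM is simply read as UTF-8.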
The above discussion simplifies some things. For example, recall the discussion of determining the encoding where the XML document does not contain a BOM. We said that 3C encodes "<", 3F encodes "?", 78 encodes "x", and 6D encodes "m". That is true in the vast majority of encodings. However, there are some exceptions. For example, the EBCDIC encoding uses 4C for "<", 6F for "?", A7 for "x", and 94 for "m". So an XML parser must deal with that situation. Since each implementation is assumed to support only a finite set of character encodings, the problem is tractable.
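The EBCDIC case fits the same pattern-lookup scheme: it is just one more four-byte prefix to recognize. The sketch below uses the byte values quoted above; the constant and function names are illustrative, and a real parser would go on to decode the declaration as EBCDIC to read the exact encoding name.

```python
# "<?xm" in EBCDIC, per the byte values quoted above:
# '<' = 4C, '?' = 6F, 'x' = A7, 'm' = 94.
EBCDIC_XML_DECL = bytes([0x4C, 0x6F, 0xA7, 0x94])

def starts_with_ebcdic_declaration(data: bytes) -> bool:
    """True if the document's first four bytes spell "<?xm" in EBCDIC.
    A parser seeing this would provisionally decode the XML declaration
    as EBCDIC, then read the precise encoding name from it."""
    return data[:4] == EBCDIC_XML_DECL
```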
Further information about how the encoding of a document is determined may be found in the XML 1.0 Recommendation, Appendix F: Autodetection of Character Encodings, http://www.w3.org/TR/REC-xml/#sec-guessing.
This is a good place to get information about encodings in HTML and XHTML documents: Tutorial: Character sets & encodings in XHTML, HTML and CSS http://www.w3.org/International/tutorials/tutorial-char-enc/en/all.html
This is another nice web site for character encoding, from the Web Standards Project: Specifying Character Encoding, http://www.webstandards.org/learn/articles/askw3c/dec2002/
The following people contributed to the creation of this document:
Last Updated: September 21, 2007