Tuesday, September 7, 2010

How to fix "weird" characters in XML

I had been aggravated by an extraneous a-circumflex character (Â, Windows keymap Alt-0194, Unicode '\u00C2', HTML character entity &Acirc;) preceding every degree symbol in the XML output of C++ code that uses MSXML. For historical reasons, the degree symbol is defined in code as a literal const std::string = "°", which is later added to an XML document tree and then written out through a fout() function.
Here's an abridged sample:

<?xml version="1.0"?>

ISO-8859-1 (Latin-1) is a single-byte encoding: each of its 256 characters occupies exactly one byte. It appears that MSXML stores the degree character in the XML document as a two-byte encoding. UTF-8 is a variable-width encoding, in which the ASCII character codes 0 through 127 map directly to the same single-byte values, e.g. the capital letter X is 0x58 in both ASCII and UTF-8.

The hex value, C2B0 (1100 0010 1011 0000), begins with the lead byte 0xC2, which signals the start of a two-byte UTF-8 sequence. After some Googling, I found a very clear explanation from Andy Hassall of a similar problem:
But your PHP code may be trying to treat UTF-8 as single-byte ISO-8859-1.

A British pound symbol is two bytes in UTF-8 - it's U+00A3 which is 0xC2 0xA3
in UTF-8.


If you tried to display this as ISO-8859-1 you'd get:

0xC2 = Latin capital A with circumflex
0xA3 = British pound symbol

Mike Brown explains it this way:
Take, for example, the non-breaking space character, which in HTML we often write as "&nbsp;", a predefined (in HTML, not XML) entity reference defined as equivalent to "&#160;", which in turn is interpreted as the single non-breaking-space character. Different encoding schemes will represent this character as different bit sequences. 
For example, in the "iso-8859-1" encoding, the non-breaking space character maps to the bit sequence 10100000, an 8-bit byte representing a value that we can also easily express as decimal 160 or hex A0. But in "utf-8", the non-breaking space maps to the bit sequence 11000010 10100000. If we interpret this as a pair of 8-bit bytes, we could say they represent the values hex C2 followed by A0 (194 and 160). 
Now imagine you are the web browser, receiving an HTTP message containing an HTML document. All you see in the message is a stream of bits. How do you know what 11000010 10100000 means?
If you think the document is encoded using utf-8, you'll correctly interpret this sequence as one single NO-BREAK SPACE character (that's its Unicode name).
If you think the document is encoded using iso-8859-1, you will incorrectly interpret it as *two* characters: (0xC2) LATIN CAPITAL LETTER A WITH CIRCUMFLEX followed by (0xA0) NO-BREAK SPACE.

This was pretty much the same thing I was seeing. The XML declaration does not specify an encoding, so MSXML serializes the document as UTF-8 (the XML default), writing the degree character as the two bytes 0xC2 0xB0: the UTF-8 lead byte followed by 0xB0, the value of the degree symbol. When the XML file is opened with a text editor (jEdit) that falls back to single-byte ISO-8859-1, the first byte is displayed as the a-circumflex, which happens to be character 0xC2 in ISO-8859-1.

I fixed the problem by specifying the encoding in the XML processing instruction for MSXML:

<?xml version="1.0" encoding="UTF-8"?>

It works equally well if the encoding is specified as ISO-8859-1. Now the XML declaration explicitly names the encoding scheme: with ISO-8859-1, MSXML renders the degree character as a single byte, 0xB0, and with UTF-8, the two-byte sequence remains but the editor decodes it correctly.
