Archive for May, 2008

Are you using a real XML parser?

Monday, May 19th, 2008

Recently there have been a number of announcements of new XML parsers that claim to be very fast, very small, or both. I also see a lot of people getting very enthusiastic about using them in their applications. Just the other day I got an email from a user asking if it was possible to use CodeSynthesis XSD with a light-weight XML parser that he had found instead of Xerces-C++. Out of curiosity I checked the parser’s description and there I saw a number of traits common to most new, fast, and small XML parsers these days: no support for the DTD (internal subset) or CDATA sections, and limited support for character and entity references.

Once I tell people about these problems with their choice of XML parser, some say they don’t care since nobody uses these features anyway. It is true that most of these features are seldom used. You can get away with using a non-conforming XML parser if you control the production of the XML you are planning to parse and can thus restrict the set of XML constructs that appear in your documents. This, however, is not a very common situation. If you control both the production and consumption of the data, then you might as well choose an exchange format that is more natural (for your application and environment) and more efficient (which is, after all, why you chose this new parser) than XML. A more common scenario is that you have to parse XML supplied by various third parties, and once you say your format is XML, all bets are off; it is only a matter of time before someone sends you a perfectly valid XML document that your application won’t be able to handle.

The W3C XML specification defines two types of conforming XML parsers: validating and non-validating (see Section 5.1, “Validating and Non-Validating Processors”). Any parser that wants to be called capable of parsing XML 1.0 documents must at least satisfy the non-validating parser requirements. Besides the expected things like being able to parse elements and attributes and making sure they are well-formed, a conforming non-validating XML parser should also support the following:

  • At least UTF-8 and UTF-16 character encodings
  • CDATA sections (<![CDATA[<greeting>Hello, world!</greeting>]]>)
  • Character references (&#x20;)
  • Entity references including predefined (&amp;) and user-defined in the internal DTD subset
  • Parsing and well-formedness checking of the internal DTD subset
  • Attribute value normalization and default attribute values according to the internal DTD subset

The internal DTD subset consists of the DTD declarations that are found in the XML document itself, as opposed to the external subset, which consists of the declarations placed into separate DTD files and referenced from the XML documents. Here is a sample document that uses most of the above features (download it to test your favorite parser: test.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE hello [
<!ENTITY greeting-text "hello">
<!ATTLIST greeting lang NMTOKEN "en">
<!ATTLIST name lang NMTOKEN "en">
]>
<hello>
  <greeting>&greeting-text;</greeting>
  <name lang="  fr  ">tout&#x20;le&#x20;<![CDATA[monde]]></name>
</hello>

Parsing this document with a conforming non-validating XML parser should be equivalent to parsing the following document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<hello>
  <greeting lang="en">hello</greeting>
  <name lang="fr">tout le monde</name>
</hello>
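One way to check this equivalence is sketched below. It is only an illustration under my own assumptions: it uses Python's standard xml.etree.ElementTree module, which is built on top of Expat, and the helper name summarize is hypothetical. It parses both documents and compares the resulting element names, attributes, and text.

```python
import xml.etree.ElementTree as ET

# The sample document with the internal DTD subset, CDATA section,
# character references, and an entity reference.
FULL = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE hello [
<!ENTITY greeting-text "hello">
<!ATTLIST greeting lang NMTOKEN "en">
<!ATTLIST name lang NMTOKEN "en">
]>
<hello>
  <greeting>&greeting-text;</greeting>
  <name lang="  fr  ">tout&#x20;le&#x20;<![CDATA[monde]]></name>
</hello>"""

# The document it should be equivalent to after parsing.
PLAIN = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<hello>
  <greeting lang="en">hello</greeting>
  <name lang="fr">tout le monde</name>
</hello>"""


def summarize(elem):
    """Reduce an element tree to plain tuples so two trees can be compared."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        elem.text or "",
        [summarize(child) for child in elem],
        elem.tail or "",
    )


full = ET.fromstring(FULL)
plain = ET.fromstring(PLAIN)

print(summarize(full) == summarize(plain))
print(full.find("greeting").attrib, repr(full.find("greeting").text))
print(full.find("name").attrib, repr(full.find("name").text))
```

A parser that skips the internal subset would typically fail on the entity reference, miss the default lang attribute on greeting, or leave the spaces around fr in place, so the comparison gives a quick pass/fail signal. The same idea carries over to any other parser: feed it both documents and compare what it reports.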

Here is the list of C and C++ XML parsers that are either conforming or strive to be conforming:

  • Xerces-C++: C++, DOM, SAX2, validating (DTD and XML Schema)
  • Libxml2: C, DOM, SAX2, Pull, validating (DTD)
  • Expat: C, SAX2, non-validating, small & fast
  • Faxpp: C, Pull, non-validating (no default/normalized attributes), small & fast

And here is a list of parsers that, while calling themselves XML parsers, are actually parsers for markup languages that are subsets of XML (based on their descriptions at the time of writing):

  • VTD-XML
  • Rapidxml
  • TinyXML
  • XML parser provided by applied-mathematics.net

Intuitive explanation of the Monty Hall problem

Sunday, May 11th, 2008

Yesterday I went to see 21, where one of the scenes brings up the Monty Hall problem: there are three doors, behind which are a car and two goats. You choose a door with the goal of winning the car. Then the host opens one of the remaining doors, revealing a goat. The question is whether it is to your advantage to switch your choice to the other door. The answer is yes (and yes, while watching the movie I thought it did not matter). When asked to explain why it is a good idea to change the door, the main character utters some gibberish about all the variables being changed, etc. Afterwards I looked this problem up on Wikipedia, which gave a few strict proofs making it clear that changing the door indeed increases your chances of winning the car from 1/3 to 2/3. While the formal proofs do their job just fine, I always prefer to have an intuitive feeling for why a seemingly counter-intuitive answer is actually correct. In this case I wanted to understand what changes once the host opens one of the doors, that is, what extra information is added that makes the difference.

The extra information comes from the rule that the host has to reveal a goat. It matters in the case when you initially selected a door with a goat behind it. In this situation the host is forced to eliminate the other goat: he cannot open the door you have selected and he cannot reveal where the car is. In other words, we have two possible outcomes:

  • if you selected a goat then the remaining door hides the car
  • if you selected the car then the remaining door hides a goat

The probability of initially selecting a goat is 2/3 (two doors out of three hide goats) and the probability of initially selecting the car is 1/3. Thus it is more likely that you will first select a goat rather than the car. And in this more likely case the host is forced to single out the door which hides the car. Thus changing your selection gives you a better chance of winning the car.

Note also that the probability of your initial choice being the car remains 1/3 even after the host has opened one of the doors. It is the probability of the other remaining door hiding the car that has changed (from 1/3 to 2/3), because the rules of the game force the host not to reveal the car.
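For those who like to see the numbers, the argument above can be checked by enumerating every case exactly. The following is a minimal sketch in Python, not part of any of the proofs mentioned; the function name win_probabilities is hypothetical. It also assumes that when you have picked the car, the host chooses between the two goat doors with equal probability.

```python
from fractions import Fraction
from itertools import product

DOORS = (0, 1, 2)


def win_probabilities():
    """Return (P(win if you stay), P(win if you switch)) as exact fractions."""
    stay = Fraction(0)
    switch = Fraction(0)
    # The car position and your first pick are independent and uniform,
    # so each of the 9 (car, pick) combinations has probability 1/9.
    for car, pick in product(DOORS, DOORS):
        weight = Fraction(1, 9)
        # The host may only open a door that is neither your pick nor the car.
        host_options = [d for d in DOORS if d != pick and d != car]
        for opened in host_options:
            w = weight / len(host_options)
            # The door you would switch to is the one left unopened.
            other = next(d for d in DOORS if d not in (pick, opened))
            if pick == car:
                stay += w
            if other == car:
                switch += w
    return stay, switch


stay, switch = win_probabilities()
print("stay:", stay, "switch:", switch)
```

Whenever your first pick is a goat, host_options contains exactly one door, which is the "forced" move described above. Those six cases, each with probability 1/9, are the ones where switching wins.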