Archive for the ‘Development’ Category

Are you using a real XML parser?

Monday, May 19th, 2008

Recently there’s been a bunch of announcements of new XML parsers that claim to be very fast, very small or both. I also see a lot of people get very enthusiastic about using them in their applications. Just the other day I got an email from a user asking if it was possible to use CodeSynthesis XSD with a light-weight XML parser that he found instead of Xerces-C++. Out of curiosity I checked the parser’s description and there I saw a number of common traits of most new, fast, and small XML parsers these days: no support for DTD (internal subset) and CDATA sections, limited support for character and entity references.

Once I tell people about these problems with their choice of XML parser, some say they don’t care since nobody uses these features anyway. It is true that most of these features are seldom used. You can get away with using a non-conforming XML parser if you control the production of the XML you are planning to parse and can thus restrict the set of XML constructs that appear in your documents. This, however, is not a very common situation. If you control both the production and consumption of the data, then you might as well choose an exchange format that is more natural (for your application and environment) and more efficient (efficiency is, after all, why you are considering this new parser) than XML. A more common scenario is having to parse XML supplied by various third parties, and once you say your format is XML, all bets are off; it is only a matter of time before someone sends you a perfectly valid XML document that your application won’t be able to handle.

The W3C XML specification defines two types of conforming XML parsers: validating and non-validating (see Section 5.1, “Validating and Non-Validating Processors”). Any parser that wants to be called capable of parsing XML 1.0 documents must at least satisfy the non-validating parser requirements. Besides the expected things, such as being able to parse elements and attributes and making sure the document is well-formed, a conforming non-validating XML parser should also support the following:

  • At least UTF-8 and UTF-16 character encodings
  • CDATA sections (<![CDATA[<greeting>Hello, world!</greeting>]]>)
  • Character references (&#x20;)
  • Entity references, both predefined (&amp;) and user-defined in the internal DTD subset
  • Parsing and well-formedness checking of the internal DTD subset
  • Attribute value normalization and supply of default attribute values according to the internal DTD subset

The internal DTD subset consists of the DTD declarations that appear in the XML document itself, as opposed to the external subset, which consists of declarations placed into separate DTD files and referenced from the XML document. Here is a sample document that uses most of the above features (download it to test your favorite parser: test.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE hello [
<!ENTITY greeting-text "hello">
<!ATTLIST greeting lang NMTOKEN "en">
<!ATTLIST name lang NMTOKEN "en">
]>
<hello>
  <greeting>&greeting-text;</greeting>
  <name lang="  fr  ">tout&#x20;le&#x20;<![CDATA[monde]]></name>
</hello>

Parsing this document with a conforming non-validating XML parser should be equivalent to parsing the following document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<hello>
  <greeting lang="en">hello</greeting>
  <name lang="fr">tout le monde</name>
</hello>

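If you would like to see what your parser actually reports for test.xml, below is a minimal SAX2 “echo” program, assuming the Xerces-C++ 3.x interfaces (the handler and helper names are arbitrary, and any conforming SAX2 implementation could be substituted). A conforming parser should print something close to the expanded document above: the entity reference replaced, the character references and the CDATA section resolved, and the lang attributes defaulted and normalized.

// echo.cxx: print what a SAX2 parser reports for test.xml. Compile with
// something like: g++ echo.cxx -lxerces-c
//
#include <iostream>
#include <string>
#include <vector>

#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>
#include <xercesc/sax2/Attributes.hpp>

using namespace xercesc;

// Convert an XMLCh (UTF-16) string to a narrow string for printing.
//
static std::string
narrow (const XMLCh* s, XMLSize_t n)
{
  std::vector<XMLCh> tmp (s, s + n);
  tmp.push_back (0);
  char* c = XMLString::transcode (&tmp[0]);
  std::string r (c);
  XMLString::release (&c);
  return r;
}

static std::string
narrow (const XMLCh* s)
{
  return narrow (s, XMLString::stringLen (s));
}

class echo_handler: public DefaultHandler
{
public:
  virtual void
  startElement (const XMLCh* const,
                const XMLCh* const lname,
                const XMLCh* const,
                const Attributes& attrs)
  {
    // Attribute values should arrive normalized and should include the
    // defaults declared in the internal DTD subset (lang="en").
    //
    std::cout << '<' << narrow (lname);

    for (XMLSize_t i (0); i < attrs.getLength (); ++i)
      std::cout << ' ' << narrow (attrs.getLocalName (i)) << "=\""
                << narrow (attrs.getValue (i)) << '"';

    std::cout << '>';
  }

  virtual void
  endElement (const XMLCh* const, const XMLCh* const lname, const XMLCh* const)
  {
    std::cout << "</" << narrow (lname) << '>';
  }

  virtual void
  characters (const XMLCh* const chars, const XMLSize_t length)
  {
    // Entity and character references should arrive expanded and the
    // CDATA section should arrive as plain character data.
    //
    std::cout << narrow (chars, length);
  }
};

int
main ()
{
  XMLPlatformUtils::Initialize ();
  {
    SAX2XMLReader* parser (XMLReaderFactory::createXMLReader ());
    echo_handler h;
    parser->setContentHandler (&h);
    parser->setErrorHandler (&h);
    parser->parse ("test.xml");
    std::cout << std::endl;
    delete parser;
  }
  XMLPlatformUtils::Terminate ();
}
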
Here is a list of C and C++ XML parsers that are conforming or at least strive to be:

  • Xerces-C++: C++, DOM, SAX2, validating (DTD and XML Schema)
  • Libxml2: C, DOM, SAX2, Pull, validating (DTD)
  • Expat: C, SAX2, non-validating, small & fast
  • Faxpp: C, Pull, non-validating (no default/normalized attributes), small & fast

And here is a list of parsers that, while calling themselves XML parsers, are actually parsers for markup languages that are subsets of XML (based on their descriptions at the time of writing):

  • VTD-XML
  • Rapidxml
  • TinyXML
  • XML parser provided by applied-mathematics.net

End user or development-oriented build system?

Monday, March 24th, 2008

I spent the past three weeks working on Xerces-C++ 3.0.0, which uses an automake-based build system. Our own projects here at Code Synthesis all use the build system called build. The work on Xerces-C++ made me realize just how awkward automake-based build systems are to develop with. It also made me realize that most build systems can be placed into one of two categories: those optimized for the end user and those optimized for development (the Boost build system is a notable exception since it is a pain to use both for end users and, I suspect, for the Boost developers).

The primary goal of an end user-oriented build system is to make once-off builds from scratch as straightforward as possible. Because the user can choose to build the software on any platform and installation of additional tools is an inconvenience, the following requirements are imposed on user-oriented build systems:

  • Support for a wide range of platforms
  • Least common denominator in the tools and features used

On the other hand, the primary goal of a development-oriented build system is to make the common development tasks as easy and fast as possible. This translates to the following requirements:

  • Ease of adding/removing files from the build
  • Complete dependency tracking for fast incremental builds

To realize how big a difference a development-oriented build system can make, let’s examine a fairly common development task: implementing a new feature in a library and adding a test for it. Assuming we have already made the changes in the library source code and added the directory with the new test, here is the list of steps required in an automake-based project (the first two steps are sketched right after the list):

  1. Add the new test directory into higher-level Makefile.am
  2. Add the new test Makefile.am to configure.ac
  3. Run the bootstrapping script to generate configure, Makefile.in, etc.
  4. Run configure
  5. Run make in the library source directory to update the library
  6. Run make in the test directory

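For concreteness, here is roughly what the first two steps amount to. The directory and test names (tests/, existing-test, new-feature) are made up for this illustration; the exact lists of course depend on the project layout:

# In the higher-level tests/Makefile.am: add the new test to the list
# of subdirectories that make recurses into.
SUBDIRS = existing-test new-feature

# In configure.ac: add the new test's Makefile to the list of files
# that configure generates.
AC_CONFIG_FILES([Makefile
                 src/Makefile
                 tests/Makefile
                 tests/existing-test/Makefile
                 tests/new-feature/Makefile])

Only after re-running the bootstrapping script and configure (steps 3 and 4) will make pick up the new directory.
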
Instead of the last two automake steps (5 and 6), one can run make in the top-level directory, which will update the library, update (or at least relink) all the tests and examples, and finally run all the tests. In my experience, some people prefer this method because, while taking longer, it requires less manual work and ensures that everything the test may depend on is up to date. In contrast, here is the list of steps required in a build-based project:

  1. Add the new test directory into higher-level Makefile
  2. Run make in the test directory

The last step automatically updates the library as well as any other parts of the project on which this test depends and which are out of date.

The steps in the build-based project take barely a tenth of the time required by the automake-based project. Some may say that adding a new test is not a very frequent task in most projects. Let’s then consider another common task: making a change in the library source code and running a specific test. For automake the list is as follows:

  1. Run make in the library source directory to update the library
  2. Run make in the test directory

As in the previous example, instead of these two steps some people prefer to just run make check from the top-level directory. The equivalent one step for the build-based project is:

  1. Run make in the test directory

The automake steps take at least several times longer to complete, and much longer than that if make is run from the top-level directory. In my experience these delays result in far fewer development iterations on a project as well as a reluctance to make changes that are not absolutely necessary (e.g., code quality improvements).

It is clear that the constraints imposed by the two orientations are often incompatible: the development-oriented build system requires powerful tools while the user-oriented one requires us not to depend on anything but the bare minimum.

It is hard to say which build system a project should prefer if the goal is to be successful. On one hand, if development is slowed to a crawl by the build system, then you are unlikely to produce something worth using in a reasonable time. On the other hand, if potential users are bogged down by the numerous build-time dependencies that your project imposes, then they are less likely to try it.

Another alternative, which we are using in some of our projects, is to provide two parallel build systems. The obvious drawback of this approach is the need to maintain the two systems. In our case the second build system is only provided for a small subset of the project (the examples), which helps minimize the negative impact of this approach.

A natural improvement on the two-build-systems idea is a development-oriented build system that can automatically generate makefiles for the end-user build system. Note that this is not the same as the solution offered by build system generators (for example, CMake and MPC), since the overhead of running the generator every time a file is added to or removed from the project makes them much less suitable as development-oriented build systems.