Xerces-C++ 3.1.1 released

April 27th, 2010

Xerces-C++ 3.1.1 was released today. This version is a bug-fix release and is binary-compatible with Xerces-C++ 3.1.0. For the complete list of changes refer to the official announcement on the project’s mailing list.

One notable new feature in this version is the addition of the Visual Studio 2010 (10.0) project and solution files for the library, examples, and tests. As anyone who tried to convert a fairly large project from previous versions of Visual Studio can attest, this is no small matter. We have also successfully tested the library built with Visual Studio 2010 against various test suites, including the XML Schema Test Suite (XSTS).

As with previous versions, this release has been tested on all major platforms and comes with precompiled libraries (total 18) for various CPU architectures, operating systems, and C++ compilers. For most platforms 32-bit and 64-bit variants are provided.

Validating against external schemas in Xerces-C++

March 15th, 2010

One of the recurring questions on the Xerces-C++ mailing lists is this: given an XML document and a schema as two files, how to validate one against the other. The examples that come with Xerces-C++ are of no help. They only support validation if the schema file is specified in the XML document with the xsi:schemaLocation or xsi:noNamespaceSchemaLocation attributes.

If you think about it, the schemaLocation attributes are quite useless as the location specification mechanism. The schemas that the application is going to use in 99.9% of cases are bundled with the application itself. Otherwise what use is it to know that the XML the application is about to process is valid with regards to some schema? We need to make sure it is valid against the schema that the application was built to handle. So in essence we are asking the producers of XML documents to embed location information that points to schemas that are installed somewhere with our application. Doesn’t make much sense, does it?

Most production applications would use the following strategy instead: (1) pre-load the schemas that are supplied with the application, and (2) disable loading of schemas specified using any other methods (e.g., the schemaLocation attributes). The second part in this strategy is there to remove security concerns that arise when an application tries to load something specified by an untrusted source.

So how does one achieve this with Xerces-C++? All the parsers that come with Xerces-C++ (DOM, SAX, SAX2) include the API that allows you to load the schemas from the filesystem and then use them during parsing for validation. There is also a way to disable loading of any external schemas. To demonstrate all this, I have created two simple command line programs that allow one to load a bunch of schemas and then validate a bunch of XML files against them. The SAX2 version is load-grammar-sax.cxx and the DOM one is load-grammar-dom.cxx. To build, compile and link with the Xerces-C++ library.

Two typical multi-threaded programming mistakes

February 7th, 2010

A couple of days ago I had to fix two threading-related mistakes that I thought were very typical of people who are new to the “threaded way of thinking”. Here is the simplified version of that code. Don’t read too much into whether this is the best way to implement a string pool:

class string_pool
{
public:
  typedef std::map<std::string, std::size_t> map;
 
  string_pool (const map& ro_map)
    : ro_map_ (ro_map)
  {
  }
 
  std::size_t
  find_or_add (const std::string& s)
  {
    // First check the read-only map.
    //
    map::const_iterator i = ro_map_.find (s);
 
    if (i != ro_map_.end ())
      return i->second_;
 
    // Check the writable map.
    //
    i = map_.find (s);
 
    if (i != map_.end ())
      return i->second;
    else
    {
      // We are about to modify the map so we have to
      // synchronize this part.
      //
      auto_lock l (mutex_);
      std::size_t id = ro_map_.size () + map_.size () + 1;
      map_.insert (std::make_pair (s, id));
    }
  }
 
private:
  map map_;
  const map& ro_map_;
  mutex mutex_;
};

Can you spot the two problems with the above code? The first one is related to the misconception that the sole purpose of mutexes and similar primitives is to make sure that no two threads try to modify the same data simultaneously. This leads to the incorrect idea that we only need to synchronize the parts of code that modify the data while the parts that only read it can do so freely. On modern CPU architectures mutexes play another important role: they make sure that the changes made by one thread are actually visible to others and visible in the order they were made.

In the above code the problematic place is the unsynchronized check for the existence of the string in the writable map. What are some undesirable things that can happen here? For example, one thread could add a string while others won’t “see” this change and will all try to re-add the same string. Or, worse yet, a thread could only see part of the change. For example, the map_.insert() implementation could allocate a new, uninitialized map entry, update all the house-keeping information, and then initialize it with the passed pair. Imagine what will happen if the reading threads only see the result of the first part of this operation.

The other problem is more subtle. Compare these two code fragments:

  std::size_t id = ro_map_.size () + map_.size () + 1;
  std::size_t n = ro_map_.size ();
  std::size_t id = n + map_.size () + 1;

There is little semantic difference between the two. Now consider what happens when we add synchronization:

  auto_lock l (mutex_);
  std::size_t id = ro_map_.size () + map_.size () + 1;
  std::size_t n = ro_map_.size ();
  auto_lock l (mutex_);
  std::size_t id = n + map_.size () + 1;

Do you see the difference now? In the first fragment we call ro_map_.size(), which doesn’t need any synchronization, while having the mutex locked. This can result in higher contention between threads since we are holding the mutex for longer. While the call to size() here is probably negligible (the actual code I had to fix was calling virtual functions), it underlines a general principle: you should try to execute all the code that doesn’t need synchronization before acquiring the mutex.

Here is how we can fix these two problems in the above code:

  std::size_t
  find_or_add (const std::string& s)
  {
    // First check the read-only map.
    //
    map::const_iterator i = ro_map_.find (s);
 
    if (i != ro_map_.end ())
      return i->second_;
 
    // Perform operations that can be done in
    // parallel with other threads.
    //
    std::size_t n = ro_map_.size ();
    map::value_type pair (s, n + 1);
 
    // We are about to access the writable map so
    // synchronize this part.
    //
    auto_lock l (mutex_);
    i = map_.find (s);
 
    if (i != map_.end ())
      return i->second;
    else
    {
      pair.second += map_.size ();
      map_.insert (pair);
    }
  }