Archive for January, 2010

Multi-threaded XML parsing with Xerces-C++

Monday, January 25th, 2010

The Xerces-C++ examples only show how to parse and validate XML documents in a single-threaded manner. This is mainly due to the lack of a portable and clean way to work with threads across all the supported platforms rather than a lack of support for multi-threaded applications. The straightforward way to parse and validate from multiple threads is to have each thread create its own parser, load the schemas, and parse one or more documents. However, this approach is not the most efficient because every thread parses the same set of schemas and keeps its own copy of the resulting schema grammar in memory. So if you have 100 threads, your application will parse the same schemas 100 times and hold 100 copies of the same schema grammar. This is wasteful since, once the schemas are loaded, the schema grammar is effectively read-only and can be shared by multiple threads.

There is a little-known way to do this more efficiently in Xerces-C++. It involves creating an XMLGrammarPool object that can then be passed to the parsers in order to first load it with schemas and then use the resulting grammar for validation. The code below uses Xerces-C++ 3-series and shows how to parse XML to DOM. The SAX setup will be similar. Since we will be creating quite a few parsers, it makes sense to factor this operation out into a separate function:

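#include <iostream>
 
#include <xercesc/dom/DOM.hpp>
#include <xercesc/util/XMLUni.hpp>
#include <xercesc/util/XMLUniDefs.hpp> // chLatin_*, chNull
#include <xercesc/util/XMLString.hpp>
#include <xercesc/util/PlatformUtils.hpp>
 
#include <xercesc/framework/MemBufInputSource.hpp>
#include <xercesc/framework/XMLGrammarPoolImpl.hpp>
#include <xercesc/framework/Wrapper4InputSource.hpp>
 
#include <xercesc/validators/common/Grammar.hpp>
 
// The code below uses unqualified names from these namespaces.
//
using namespace std;
using namespace xercesc;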
 
DOMLSParser*
create_parser (XMLGrammarPool* pool)
{
  const XMLCh ls_id [] = {chLatin_L, chLatin_S, chNull};
 
  DOMImplementation* impl (
    DOMImplementationRegistry::getDOMImplementation (ls_id));
 
  DOMLSParser* parser (
    impl->createLSParser (
      DOMImplementationLS::MODE_SYNCHRONOUS,
      0,
      XMLPlatformUtils::fgMemoryManager,
      pool));
 
  DOMConfiguration* conf (parser->getDomConfig ());
 
  // Commonly useful configuration.
  //
  conf->setParameter (XMLUni::fgDOMComments, false);
  conf->setParameter (XMLUni::fgDOMDatatypeNormalization, true);
  conf->setParameter (XMLUni::fgDOMEntities, false);
  conf->setParameter (XMLUni::fgDOMNamespaces, true);
  conf->setParameter (XMLUni::fgDOMElementContentWhitespace, false);
 
  // Enable validation.
  //
  conf->setParameter (XMLUni::fgDOMValidate, true);
  conf->setParameter (XMLUni::fgXercesSchema, true);
  conf->setParameter (XMLUni::fgXercesSchemaFullChecking, false);
 
  // Xerces-C++ 3.1.0 is the first version with working multi
  // import support.
  //
#if _XERCES_VERSION >= 30100
  conf->setParameter (XMLUni::fgXercesHandleMultipleImports, true);
#endif
 
  // Use the loaded grammar during parsing.
  //
  conf->setParameter (XMLUni::fgXercesUseCachedGrammarInParse, true);
 
  // Disable loading schemas via other means (e.g., schemaLocation).
  //
  conf->setParameter (XMLUni::fgXercesLoadSchema, false);
 
  // We will release the DOM document ourselves.
  //
  conf->setParameter (XMLUni::fgXercesUserAdoptsDOMDocument, true);
 
  return parser;
}

Now let’s look at the thread function. Each thread is passed a pointer to the shared grammar pool object which can then be used to create the parser and parse some XML documents:

void*
thread_func (void* arg)
{
  XMLGrammarPool* pool (static_cast<XMLGrammarPool*> (arg));
  DOMLSParser* parser (create_parser (pool));
 
  // Your implementation of DOMErrorHandler.
  //
  error_handler eh;
  parser->getDomConfig ()->setParameter (
    XMLUni::fgDOMErrorHandler, &eh);
 
  //
  // Parse some documents.
  //
 
  parser->release ();
  return 0;
}

The only part left is to create the grammar pool, load the schemas into it, and then start the threads. In Xerces-C++ it is not possible to load the schemas directly into XMLGrammarPool. Instead, we need to create a parser as we did in the thread function and use this parser’s loadGrammar() function to populate the grammar pool. In effect, this parser is created only for this purpose and is not used to parse any XML documents:

int
main (int argc, char* argv[])
{
  XMLPlatformUtils::Initialize ();
 
  XMLGrammarPool* pool (
    new XMLGrammarPoolImpl (
      XMLPlatformUtils::fgMemoryManager));
 
  // Load the schemas into the grammar pool.
  //
  DOMLSParser* parser (create_parser (pool));
 
  // Your implementation of DOMErrorHandler.
  //
  error_handler eh;
  parser->getDomConfig ()->setParameter (
    XMLUni::fgDOMErrorHandler, &eh);
 
  int i (1);
 
  for (; i < argc; ++i)
  {
    const char* s (argv[i]);
    cerr << "loading " << s << endl;
 
    if (!parser->loadGrammar (
           s,
           Grammar::SchemaGrammarType,
           true) || eh.failed ())
    {
      cerr << s << ": error: unable to load" << endl;
      break;
    }
  }
 
  parser->release ();
 
  // If all the schemas loaded successfully, lock the
  // pool and start the threads.
  //
  if (i == argc)
  {
    pool->lockPool ();
 
    // Start the threads passing pool as the argument
    // and wait for them to finish.
 
    pool->unlockPool ();
  }
 
  delete pool;
  XMLPlatformUtils::Terminate ();
}

Note that before using the grammar pool from multiple threads we need to lock it by calling the lockPool() function. This will disallow any modifications to the pool, such as an attempt by one of the threads to cache additional schemas.
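The "start the threads" placeholder in main() could be filled in along the following lines. This is only a sketch: it assumes a C++11 compiler and std::thread (the post does not prescribe a threading API), and run_threads() is a hypothetical helper. The thread_func() body here is a stand-in that merely counts invocations so that the fragment builds without Xerces-C++; the real program would use the thread function shown earlier.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Counts how many threads ran to completion (illustration only).
std::atomic<std::size_t> finished (0);

// Stand-in with the same signature as the thread function in this post.
// The real one casts arg to XMLGrammarPool* and parses documents.
void*
thread_func (void* arg)
{
  (void) arg;
  ++finished;
  return 0;
}

// Hypothetical helper: start n threads that all receive the same pool
// pointer and wait until every one of them has finished.
std::size_t
run_threads (void* pool, std::size_t n)
{
  std::vector<std::thread> threads;

  for (std::size_t i (0); i < n; ++i)
    threads.emplace_back (thread_func, pool);

  for (std::thread& t: threads)
    t.join ();

  return finished;
}
```

In main() the call would sit between pool->lockPool () and pool->unlockPool (), for example run_threads (pool, 4).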

Microsoft DLL export and C++ templates

Monday, January 18th, 2010

The other day I stumbled upon a really dark corner of the Microsoft dllexport/dllimport machinery. I can vividly see Windows toolchain engineers waking up in the middle of the night from a nightmare where they had to patch yet another crack in this DLL symbol export mess. This one has to do with the interaction of dllexport and C++ templates.

It all started with a user reporting duplicate symbol errors when he tried to split the XSD-generated code into two DLLs. The duplicate symbols were reported when linking the second DLL, which depends on the “base” DLL, and pointed to the destructor and assignment operator of a template instantiation, let’s say std::vector<int>. There were two additional strange things about this case: the errors only occurred in the debug build, and a number of other users had done a similar thing but never got any errors. The fact that the errors only appeared in the debug build got me thinking that in the release build these functions were inlined. The second strange aspect was harder to figure out: there was something special about this particular codebase that caused the error. After some investigation the following code fragment in the first DLL turned out to make the difference (BASE_EXPORT expands to either __declspec(dllexport) or __declspec(dllimport)):

class BASE_EXPORT ints: public std::vector<int>
{
  ...
};
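For reference, an export macro such as BASE_EXPORT is usually defined in a small header along the following lines. This is a sketch rather than the user's actual header: BASE_BUILD is a hypothetical macro that the build of the “base” DLL would define, the sum() member is made up to give the class a body, and the branch that expands to nothing is an addition so that the fragment also compiles with toolchains other than VC++.

```cpp
#include <vector>

// BASE_BUILD is a hypothetical macro assumed to be defined only while
// compiling the "base" DLL itself.
#if defined(_MSC_VER)
#  if defined(BASE_BUILD)
#    define BASE_EXPORT __declspec(dllexport)
#  else
#    define BASE_EXPORT __declspec(dllimport)
#  endif
#else
#  define BASE_EXPORT // No DLL export machinery on other toolchains.
#endif

// With VC++ this exports ints and, implicitly, std::vector<int>.
class BASE_EXPORT ints: public std::vector<int>
{
public:
  // Made-up member that adds up all the elements.
  int
  sum () const
  {
    int r (0);

    for (const_iterator i (begin ()); i != end (); ++i)
      r += *i;

    return r;
  }
};
```

Code inside the “base” DLL sees the dllexport variant while its clients see dllimport, which is why the same class declaration can be shared by both sides.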

As it turns out (see the end of the General Rules and Limitations article in MSDN), if an exported class inherits from a template instantiation that is not explicitly exported (yes, you can export certain instantiations of a template, see below), then the compiler implicitly applies dllexport to this template instantiation. So the above code fragment exports both the ints class and the std::vector<int> instantiation. On the surface this automatic exporting looks like a good idea. After all, if you export the derived class you will also need to export all its public bases since they are part of the interface. In the case of non-template bases you need to use the export mechanism explicitly, which makes sense. In the case of templates, you don’t want to have to explicitly export every instantiation. Plus, as pointed out in the MSDN article above, it is not always possible.

But here is the other half of the picture: in the second DLL there is a source code file that doesn’t know anything about the ints class (that is, it doesn’t include the ints declaration). It also happens to use std::vector<int> in a fairly common way:

void f ()
{
  std::vector<int> v;
 
  ...
}

When the second DLL is linked, we end up with two sets of symbols for std::vector<int>: the first is exported from the “base” DLL and the second set is the result of the template instantiation in the above source code file. Duplicate symbol errors ensue.

At first it might seem puzzling that the same doesn’t happen with ordinary classes that contain inline functions. What if a class is exported from one DLL and then we use it in another? This doesn’t lead to errors even when inline functions are not inlined because in order to use the class we need to include its declaration. Once we do that all of its functions become imported from the first DLL and instead of “instantiating” an inline function the compiler simply uses the imported version from the first DLL. We get errors in the above scenario because when VC++ compiles the source file in the second DLL it has no knowledge of the fact that the functions it is about to instantiate were exported from the “base” DLL which this DLL happens to link to.

In standard C++ the toolchain is required to weed out the duplicate symbols that result from instantiations of the same template. When DLLs are involved, VC++ is unable to meet this requirement.

There is no clean way to work around this. In the scenario described above we can add an explicit import declaration for the std::vector<int> instantiation:

template class __declspec(dllimport) std::vector<int>;
 
void f ()
{
  std::vector<int> v;
 
  ...
}

Normally one would collect such manual imports in one header file and then include this file into every source file in the DLL.

The major issue with this approach, apart from having to manually track imports, is that if you have two independent DLLs that each happen to auto-export std::vector<int> and you need to link to both of them, there is nothing you can do without changing at least one of those DLLs.

It also appears that Microsoft itself has suffered from this pitfall, as is evident from the Exporting String Classes Using CStringT article in MSDN. The solution it describes seems to be specific to that particular case, not that I could understand it fully.