Multi-threaded XML parsing with Xerces-C++

The Xerces-C++ examples only show how to parse and validate XML documents in a single-threaded manner. This is mainly due to the lack of a portable and clean way to work with threads across all the supported platforms rather than a lack of support for multi-threaded applications. The straightforward way to parse and validate from multiple threads is to have each thread create a parser, load the schemas, and parse one or more documents. However, this approach is not the most efficient because every thread parses the same set of schemas and keeps its own copy of the resulting schema grammar in memory. So if you have 100 threads, your application will parse the same schemas 100 times and hold 100 copies of the same schema grammar. This is wasteful since, once the schemas are loaded, the schema grammar is effectively read-only and can be shared by multiple threads.

There is a little-known way to do this more efficiently in Xerces-C++. It involves creating an XMLGrammarPool object and passing it to the parsers: one parser first loads the schemas into the pool, and the other parsers then use the resulting grammar for validation. The code below targets the Xerces-C++ 3 series and shows how to parse XML to DOM. The SAX setup is similar. Since we will be creating quite a few parsers, it makes sense to factor this operation out into a separate function:

#include <iostream> // std::cerr, std::endl
 
#include <xercesc/dom/DOM.hpp>
#include <xercesc/util/XMLUni.hpp>
#include <xercesc/util/XMLUniDefs.hpp> // chLatin_*, chNull
#include <xercesc/util/XMLString.hpp>
#include <xercesc/util/PlatformUtils.hpp>
 
#include <xercesc/framework/MemBufInputSource.hpp>
#include <xercesc/framework/XMLGrammarPoolImpl.hpp>
#include <xercesc/framework/Wrapper4InputSource.hpp>
 
#include <xercesc/validators/common/Grammar.hpp>
 
using namespace std;
using namespace xercesc;
 
DOMLSParser*
create_parser (XMLGrammarPool* pool)
{
  const XMLCh ls_id [] = {chLatin_L, chLatin_S, chNull};
 
  DOMImplementation* impl (
    DOMImplementationRegistry::getDOMImplementation (ls_id));
 
  DOMLSParser* parser (
    impl->createLSParser (
      DOMImplementationLS::MODE_SYNCHRONOUS,
      0,
      XMLPlatformUtils::fgMemoryManager,
      pool));
 
  DOMConfiguration* conf (parser->getDomConfig ());
 
  // Commonly useful configuration.
  //
  conf->setParameter (XMLUni::fgDOMComments, false);
  conf->setParameter (XMLUni::fgDOMDatatypeNormalization, true);
  conf->setParameter (XMLUni::fgDOMEntities, false);
  conf->setParameter (XMLUni::fgDOMNamespaces, true);
  conf->setParameter (XMLUni::fgDOMElementContentWhitespace, false);
 
  // Enable validation.
  //
  conf->setParameter (XMLUni::fgDOMValidate, true);
  conf->setParameter (XMLUni::fgXercesSchema, true);
  conf->setParameter (XMLUni::fgXercesSchemaFullChecking, false);
 
  // Xerces-C++ 3.1.0 is the first version with working multi
  // import support.
  //
#if _XERCES_VERSION >= 30100
  conf->setParameter (XMLUni::fgXercesHandleMultipleImports, true);
#endif
 
  // Use the loaded grammar during parsing.
  //
  conf->setParameter (XMLUni::fgXercesUseCachedGrammarInParse, true);
 
  // Disable loading schemas via other means (e.g., schemaLocation).
  //
  conf->setParameter (XMLUni::fgXercesLoadSchema, false);
 
  // We will release the DOM document ourselves.
  //
  conf->setParameter (XMLUni::fgXercesUserAdoptsDOMDocument, true);
 
  return parser;
}

Now let’s look at the thread function. Each thread is passed a pointer to the shared grammar pool object, which it then uses to create its own parser and parse some XML documents:

void*
thread_func (void* arg)
{
  XMLGrammarPool* pool (static_cast<XMLGrammarPool*> (arg));
  DOMLSParser* parser (create_parser (pool));
 
  // Your implementation of DOMErrorHandler.
  //
  error_handler eh;
  parser->getDomConfig ()->setParameter (
    XMLUni::fgDOMErrorHandler, &eh);
 
  //
  // Parse some documents.
  //
 
  parser->release ();
  return 0;
}

The only part left is to create the grammar pool, load the schemas into it, and then start the threads. In Xerces-C++ it is not possible to load the schemas directly into XMLGrammarPool. Instead, we need to create a parser, as we did in the thread function, and use this parser’s loadGrammar() function to populate the grammar pool. This parser is created only for this purpose and is not used to parse any XML documents:

int
main (int argc, char* argv[])
{
  XMLPlatformUtils::Initialize ();
 
  XMLGrammarPool* pool (
    new XMLGrammarPoolImpl (
      XMLPlatformUtils::fgMemoryManager));
 
  // Load the schemas into the grammar pool.
  //
  DOMLSParser* parser (create_parser (pool));
 
  // Your implementation of DOMErrorHandler. It is expected to
  // provide failed(), which returns true once an error has been
  // reported.
  //
  error_handler eh;
  parser->getDomConfig ()->setParameter (
    XMLUni::fgDOMErrorHandler, &eh);
 
  int i (1);
 
  for (; i < argc; ++i)
  {
    const char* s (argv[i]);
    cerr << "loading " << s << endl;
 
    if (!parser->loadGrammar (
           s,
           Grammar::SchemaGrammarType,
           true) || eh.failed ())
    {
      cerr << s << ": error: unable to load" << endl;
      break;
    }
  }
 
  parser->release ();
 
  // If all the schemas loaded successfully, lock the
  // pool and start the threads.
  //
  if (i == argc)
  {
    pool->lockPool ();
 
    // Start the threads passing pool as the argument
    // and wait for them to finish.
 
    pool->unlockPool ();
  }
 
  delete pool;
  XMLPlatformUtils::Terminate ();
}

Note that before using the grammar pool from multiple threads we need to lock it by calling the lockPool() function. This will disallow any modifications to the pool, such as an attempt by one of the threads to cache additional schemas.
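
The load-then-lock discipline can be illustrated without Xerces-C++ using a small stand-in class. Everything in the sketch below (the read_only_cache name and all its members) is hypothetical and only mimics the behaviour described above: entries can be added until lock() is called, after which modifications are rejected and the cache can be read from several threads without further synchronization. It is not meant to show how XMLGrammarPoolImpl is implemented.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a grammar pool: populated by one
// thread, then locked and shared read-only.
//
class read_only_cache
{
public:
  read_only_cache (): locked_ (false) {}

  // Add an entry. Returns false if the cache is locked or the
  // key is already present.
  //
  bool
  add (const std::string& key, const std::string& value)
  {
    if (locked_)
      return false;

    return map_.insert (std::make_pair (key, value)).second;
  }

  // Call lock() before starting the threads and unlock() only
  // after all of them have finished.
  //
  void lock () { locked_ = true; }
  void unlock () { locked_ = false; }

  // Look up an entry. Returns 0 if the key is not found. This
  // function does not modify the cache, so concurrent calls are
  // safe while nobody is adding entries.
  //
  const std::string*
  find (const std::string& key) const
  {
    std::map<std::string, std::string>::const_iterator i (
      map_.find (key));

    return i != map_.end () ? &i->second : 0;
  }

private:
  bool locked_;
  std::map<std::string, std::string> map_;
};
```

The point of the design is that all the writes happen before the threads exist. Once the cache is locked, a late add() fails instead of racing with the readers.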
