Archive for the ‘XML’ Category

Xerces-C++ 3.1.0 released

Tuesday, February 2nd, 2010

Xerces-C++ 3.1.0 was released yesterday. This version includes a number of new features, performance improvements, and a large number of bug fixes, particularly in the XML Schema spec conformance area. For the complete list of changes in this version refer to the official announcement on the project’s mailing list. In this post I am going to cover some of the more interesting new features in more detail.

As with previous versions, this release has been tested on all major platforms and comes with precompiled libraries (16 in total) for various CPU architectures, operating systems, and C++ compilers. For most platforms both 32-bit and 64-bit variants are provided.

The first new feature that I would like to talk more about is the multi-import support. But first some background on the problem that we are trying to solve here. As you may know, XML Schema provides two mechanisms for splitting and reusing schema files: the xs:include and xs:import directives. The include directive is used to include one schema into another when both schemas have the same target namespace. All that you specify inside an include directive is the schema file:

  <xs:include schemaLocation="base.xsd"/>

The import directive is used to import declarations from a schema with one target namespace into a schema with another target namespace. When you use it, you have to specify both the namespace being imported and the schema file:

  <xs:import namespace="http://example.com/base"
             schemaLocation="base.xsd"/>

The include directives are normally used to split a complex schema for a particular XML vocabulary into multiple files, while the import directives are normally used to reuse one XML vocabulary in another.

One challenge that the XML Schema processors have to overcome while handling these directives is duplicate includes and imports. In the case of include the approach is straightforward: re-inclusions are detected and ignored based on the absolute path or URI of the schema file.

For the import directive there are two possible ways to handle this. One way is for the processor to use the target namespace of the vocabulary being imported to detect and ignore duplicates. This approach is simple, fast, and makes perfect sense. After all, who would want to import only half of a vocabulary now, and another half later? While most schema authors would agree with this assessment, there are some that want to do such “half-importing” (in most cases this happens when a single XML vocabulary uses multiple target namespaces). The only way to support this is to use the second alternative, which is to use both the namespace and the schema location to detect duplicates. This approach is more complex and slower.

As if this wasn’t confusing enough, the XML Schema specification doesn’t state which approach should be used. It says that an implementation using either alternative is conforming. While the first approach is cleaner and makes more sense, the consensus among the XML Schema processor developers is to use the second approach. I think the fairly lengthy and regular emails that would be required to explain why certain schemas don’t compile outweigh the difficulties of implementing the second approach.
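To make the difference between the two alternatives concrete, here is a minimal sketch of both duplicate-detection strategies. It is a toy model and not the Xerces-C++ implementation: the import_tracker type and its should_load() function are hypothetical names, and the schema locations are assumed to be already resolved to absolute paths.

```cpp
#include <set>
#include <string>
#include <utility>

// Toy model only: this is not how Xerces-C++ is implemented.
//
struct import_tracker
{
  // If true, duplicates are detected by the (namespace, location)
  // pair (second approach). If false, by the namespace alone
  // (first approach).
  //
  explicit import_tracker (bool use_location)
    : use_location_ (use_location) {}

  // Return true if the schema should be loaded and false if the
  // import is considered a duplicate and is ignored. The location
  // is assumed to be an absolute path or URI.
  //
  bool
  should_load (const std::string& ns, const std::string& loc)
  {
    if (use_location_)
      return seen_ns_loc_.insert (std::make_pair (ns, loc)).second;

    return seen_ns_.insert (ns).second;
  }

private:
  bool use_location_;
  std::set<std::string> seen_ns_;
  std::set<std::pair<std::string, std::string> > seen_ns_loc_;
};
```

With the namespace-only tracker, a second schema file for an already imported namespace is silently skipped, which is exactly the “half-importing” case that fails. With the namespace-and-location tracker it is loaded, and only a repeated import of the same file is ignored.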

Previous versions of Xerces-C++ only fully supported the first approach. With this release the second approach is also supported. To enable it, you will need to set the XMLUni::fgXercesHandleMultipleImports parameter to true. Furthermore, the same logic was extended to the loadGrammar() function as well as the schemaLocation and noNamespaceSchemaLocation attributes. This way you can load several schemas with the same target namespace and/or “add” more declarations with the schemaLocation attributes.

The other new feature worth mentioning is the ability to configure the XML parser’s buffer low water mark (the XMLUni::fgXercesLowWaterMark property). By default, to improve performance, XML parsers in Xerces-C++ don’t parse the data as soon as it becomes available. Instead, the parsers buffer the data until a certain limit is reached, after which they parse all the accumulated data in one go. This works well in most situations except for cases where you are using the SAX/SAX2 interface and would like the parsing events to be triggered as soon as the data becomes available.

For example, imagine an application which reads the XML document to be parsed from a socket. The document is delivered in chunks with potentially long delays. The application wants to process the available data as soon as possible. In this situation we probably don’t want the XML parser sitting and waiting for the next chunk to fill up the buffer. Instead, we would like the available data to be parsed and the corresponding SAX callbacks (startElement(), characters(), etc.) called regardless of how small the chunk is. To achieve such immediate parsing, the parser’s buffer low water mark needs to be set to zero.
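The effect of the low water mark can be pictured with a small simulation. The sketch below is a toy model of the behaviour described above and does not reflect the actual Xerces-C++ internals; the buffering_parser class and its functions are hypothetical. Data is held back until at least low_water_mark bytes have accumulated, and each entry in events() stands for one batch of data handed to the callbacks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: mimics the buffering behaviour, not the real parser.
//
class buffering_parser
{
public:
  explicit buffering_parser (std::size_t low_water_mark)
    : lwm_ (low_water_mark) {}

  // Called when a chunk arrives (for example, from a socket).
  //
  void
  feed (const std::string& chunk)
  {
    buf_ += chunk;

    // Process the data only once enough of it has accumulated.
    //
    if (!buf_.empty () && buf_.size () >= lwm_)
      flush ();
  }

  // Called at the end of input to process whatever is left.
  //
  void
  finish ()
  {
    if (!buf_.empty ())
      flush ();
  }

  // Each entry is the data that was processed in one go.
  //
  const std::vector<std::string>&
  events () const {return events_;}

private:
  void
  flush ()
  {
    events_.push_back (buf_);
    buf_.clear ();
  }

  std::size_t lwm_;
  std::string buf_;
  std::vector<std::string> events_;
};
```

In this model a parser constructed with a low water mark of zero processes every chunk as soon as it is fed, while one constructed with a mark of, say, 100 stays silent on small chunks until finish() is called.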

Multi-threaded XML parsing with Xerces-C++

Monday, January 25th, 2010

The Xerces-C++ examples only show how to parse and validate XML documents in a single-threaded manner. This is mainly due to the lack of a portable and clean way to work with threads across all the supported platforms rather than a lack of support for multi-threaded applications. The straightforward way to parse and validate from multiple threads is to have each thread create a parser, load the schemas, and parse one or more documents. However, this approach is not the most efficient because each thread is going to parse the same set of schemas and keep its own copy of the resulting schema grammar in memory. So if you have 100 threads, your application will parse the same schemas 100 times and hold 100 copies of the same schema grammar. This is definitely wasteful since, after the schemas are loaded, the schema grammar is effectively read-only and can be shared by multiple threads.

There is a little-known way to do this more efficiently in Xerces-C++. It involves creating an XMLGrammarPool object that can then be passed to the parsers in order to first load it with schemas and then use the resulting grammar for validation. The code below uses Xerces-C++ 3-series and shows how to parse XML to DOM. The SAX setup will be similar. Since we will be creating quite a few parsers, it makes sense to factor this operation out into a separate function:

#include <xercesc/dom/DOM.hpp>
#include <xercesc/util/XMLUni.hpp>
#include <xercesc/util/XMLUniDefs.hpp>   // chLatin_L, chLatin_S, chNull
#include <xercesc/util/XMLString.hpp>
#include <xercesc/util/PlatformUtils.hpp>

#include <xercesc/framework/MemBufInputSource.hpp>
#include <xercesc/framework/XMLGrammarPoolImpl.hpp>
#include <xercesc/framework/Wrapper4InputSource.hpp>

#include <xercesc/validators/common/Grammar.hpp>

#include <iostream>   // cerr, endl

using namespace std;
using namespace xercesc;
 
DOMLSParser*
create_parser (XMLGrammarPool* pool)
{
  const XMLCh ls_id [] = {chLatin_L, chLatin_S, chNull};
 
  DOMImplementation* impl (
    DOMImplementationRegistry::getDOMImplementation (ls_id));
 
  DOMLSParser* parser (
    impl->createLSParser (
      DOMImplementationLS::MODE_SYNCHRONOUS,
      0,
      XMLPlatformUtils::fgMemoryManager,
      pool));
 
  DOMConfiguration* conf (parser->getDomConfig ());
 
  // Commonly useful configuration.
  //
  conf->setParameter (XMLUni::fgDOMComments, false);
  conf->setParameter (XMLUni::fgDOMDatatypeNormalization, true);
  conf->setParameter (XMLUni::fgDOMEntities, false);
  conf->setParameter (XMLUni::fgDOMNamespaces, true);
  conf->setParameter (XMLUni::fgDOMElementContentWhitespace, false);
 
  // Enable validation.
  //
  conf->setParameter (XMLUni::fgDOMValidate, true);
  conf->setParameter (XMLUni::fgXercesSchema, true);
  conf->setParameter (XMLUni::fgXercesSchemaFullChecking, false);
 
  // Xerces-C++ 3.1.0 is the first version with working multi
  // import support.
  //
#if _XERCES_VERSION >= 30100
  conf->setParameter (XMLUni::fgXercesHandleMultipleImports, true);
#endif
 
  // Use the loaded grammar during parsing.
  //
  conf->setParameter (XMLUni::fgXercesUseCachedGrammarInParse, true);
 
  // Disable loading schemas via other means (e.g., schemaLocation).
  //
  conf->setParameter (XMLUni::fgXercesLoadSchema, false);
 
  // We will release the DOM document ourselves.
  //
  conf->setParameter (XMLUni::fgXercesUserAdoptsDOMDocument, true);
 
  return parser;
}

Now let’s look at the thread function. Each thread is passed a pointer to the shared grammar pool object which can then be used to create the parser and parse some XML documents:

void*
thread_func (void* arg)
{
  XMLGrammarPool* pool (static_cast<XMLGrammarPool*> (arg));
  DOMLSParser* parser (create_parser (pool));
 
  // Your implementation of DOMErrorHandler.
  //
  error_handler eh;
  parser->getDomConfig ()->setParameter (
    XMLUni::fgDOMErrorHandler, &eh);
 
  //
  // Parse some documents.
  //
 
  parser->release ();
  return 0;
}

The only part left is to create the grammar pool, load the schemas into it, and then start the threads. In Xerces-C++ it is not possible to load the schemas directly into XMLGrammarPool. Instead, we will need to create a parser as we did in the thread function and use this parser’s loadGrammar() function to populate the grammar pool. In effect, this parser is only created for this purpose and is not used to parse any XML documents:

int
main (int argc, char* argv[])
{
  XMLPlatformUtils::Initialize ();
 
  XMLGrammarPool* pool (
    new XMLGrammarPoolImpl (
      XMLPlatformUtils::fgMemoryManager));
 
  // Load the schemas into the grammar pool.
  //
  DOMLSParser* parser (create_parser (pool));
 
  // Your implementation of DOMErrorHandler.
  //
  error_handler eh;
  parser->getDomConfig ()->setParameter (
    XMLUni::fgDOMErrorHandler, &eh);
 
  int i (1);
 
  for (; i < argc; ++i)
  {
    const char* s (argv[i]);
    cerr << "loading " << s << endl;
 
    if (!parser->loadGrammar (
           s,
           Grammar::SchemaGrammarType,
           true) || eh.failed ())
    {
      cerr << s << ": error: unable to load" << endl;
      break;
    }
  }
 
  parser->release ();
 
  // If all the schemas loaded successfully, lock the
  // pool and start the threads.
  //
  if (i == argc)
  {
    pool->lockPool ();
 
    // Start the threads passing pool as the argument
    // and wait for them to finish.
 
    pool->unlockPool ();
  }
 
  delete pool;
  XMLPlatformUtils::Terminate ();
}

Note that before using the grammar pool from multiple threads we need to lock it by calling the lockPool() function. This will disallow any modifications to the pool, such as an attempt by one of the threads to cache additional schemas.
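The load-then-lock pattern can be summarized with a small model. The sketch below is a toy stand-in and not the real XMLGrammarPool interface: the toy_grammar_pool class and its cache(), find(), lock(), and unlock() functions are hypothetical and only illustrate the contract that a locked pool rejects modifications while still allowing lookups.

```cpp
#include <map>
#include <string>

// Toy model of the load-then-lock pattern. The real XMLGrammarPool
// interface is different.
//
class toy_grammar_pool
{
public:
  toy_grammar_pool (): locked_ (false) {}

  // Add a grammar for a namespace. Fails once the pool is locked.
  //
  bool
  cache (const std::string& ns, const std::string& grammar)
  {
    if (locked_)
      return false;

    grammars_[ns] = grammar;
    return true;
  }

  // Read-only lookup. Returns 0 if there is no such grammar.
  //
  const std::string*
  find (const std::string& ns) const
  {
    std::map<std::string, std::string>::const_iterator i (
      grammars_.find (ns));

    return i != grammars_.end () ? &i->second : 0;
  }

  // Must be called before the pool is shared with other threads
  // and after they have all finished, respectively.
  //
  void lock () {locked_ = true;}
  void unlock () {locked_ = false;}

private:
  bool locked_;
  std::map<std::string, std::string> grammars_;
};
```

Once such a pool is locked, the only operations left are read-only lookups. Concurrent read-only access to a standard container is safe as long as nothing modifies it, which is why locking before starting the threads is the important step.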

Running XPath on a C++/Tree object model

Monday, May 18th, 2009

One interesting feature of the C++/Tree mapping in XSD is the ability to maintain an association between C++ object model nodes and corresponding DOM nodes. Consider the following XML document as an example:

<p:directory xmlns:p="http://www.example.com/people">
  <person>
    <first-name>John</first-name>
    <last-name>Doe</last-name>
    <gender>male</gender>
    <age>32</age>
  </person>
 
  <person>
    <first-name>Jane</first-name>
    <last-name>Doe</last-name>
    <gender>female</gender>
    <age>28</age>
  </person>
</p:directory>

Provided we requested the DOM association during parsing, having the person object we can obtain the DOMElement node corresponding to this object. We can also go the other way, that is, having a DOM node from a DOM document associated with a C++/Tree object model we can obtain the corresponding object model node.
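The two-way association can be pictured with a small model that does not depend on XSD or Xerces-C++. In the sketch below, toy_dom_node, toy_person, and tree_node_key are hypothetical names. The DOM-like node carries a user data slot, similar in spirit to DOMNode::getUserData(), and the object model node registers itself in that slot while also remembering its DOM node.

```cpp
#include <map>
#include <string>

// Toy stand-in for a DOM node that can carry user data.
//
struct toy_dom_node
{
  std::string name;
  std::map<std::string, void*> user_data;

  // Return the data stored under the key or 0 if there is none.
  //
  void*
  get_user_data (const std::string& key) const
  {
    std::map<std::string, void*>::const_iterator i (
      user_data.find (key));

    return i != user_data.end () ? i->second : 0;
  }
};

// Hypothetical key under which the object model node is stored.
//
const char tree_node_key[] = "tree-node";

// Toy object model node. It remembers its DOM node (object to DOM)
// and registers itself on that node (DOM to object). It is not
// meant to be copied since the DOM node would keep pointing to
// the original.
//
struct toy_person
{
  std::string first_name;
  toy_dom_node* node;

  toy_person (const std::string& fn, toy_dom_node& n)
    : first_name (fn), node (&n)
  {
    n.user_data[tree_node_key] = this;
  }
};
```

Given a toy_dom_node, casting the result of get_user_data (tree_node_key) back to toy_person* recovers the object model node, which is the same “move up” step used with real DOM nodes.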

One technique that is made possible thanks to the DOM association is the use of XPath queries to locate object model nodes. This is especially useful if you have a deeply nested document and you only need to access a small part of it buried deep inside.

The idea is to run an XPath query on the underlying DOM document, obtain the result as a collection of DOM nodes, and then “move up” from these DOM nodes to the object model nodes. While the DOM implementation provided by Xerces-C++ does not support XPath, there are complementary libraries, such as XQilla, that provide this functionality. The following code fragment shows how to locate all the people from the above XML file that are older than 30. It uses XQilla and the DOM XPath API from Xerces-C++ 2.8.0:

directory& d = ...
 
// Obtain the root element and document corresponding
// to the directory object.
//
DOMElement* root (static_cast<DOMElement*> (d._node ()));
DOMDocument* doc (root->getOwnerDocument ());
 
// Obtain namespace resolver.
//
dom::auto_ptr<XQillaNSResolver> resolver (
  (XQillaNSResolver*)doc->createNSResolver (root));
 
// Set the namespace prefix for the people namespace that
// we can use reliably in XPath expressions regardless of
// what is used in XML documents.
//
resolver->addNamespaceBinding (
  xml::string ("p").c_str (),
  xml::string ("http://www.example.com/people").c_str ());
 
// Create XPath expression.
//
dom::auto_ptr<const XQillaExpression> expr (
  static_cast<const XQillaExpression*> (
    doc->createExpression (
      xml::string ("p:directory/person[age > 30]").c_str (),
      resolver.get ())));
 
// Execute the query.
//
dom::auto_ptr<XPath2Result> r (
  static_cast<XPath2Result*> (
    expr->evaluate (
      doc, XPath2Result::ITERATOR_RESULT, 0)));
 
// Iterate over the result.
//
while (r->iterateNext ())
{
  const DOMNode* n (r->asNode ());
 
  // Obtain the object model node corresponding to
  // this DOM node.
  //
  person* p (
    static_cast<person*> (
      n->getUserData (dom::tree_node_key)));
 
  // Print the data using the object model.
  //
  cout << endl
       << "First  : " << p->first_name () << endl
       << "Last   : " << p->last_name () << endl
       << "Gender : " << p->gender () << endl
       << "Age    : " << p->age () << endl;
}

As you can see the code is littered with casts to XQilla-specific types such as XQillaNSResolver, XQillaExpression, and XPath2Result. This is necessary because the DOM interface in Xerces-C++ 2-series only supports the XPath 1.0 query model and is not sufficient for XPath 2.0 implemented by XQilla.

To make the integration of XQilla with Xerces-C++ cleaner, the Xerces-C++ and XQilla developers came up with an extended DOM XPath interface that accommodates both the XPath 1.0 and 2.0 query models. On the Xerces-C++ side this interface was first made public in version 3.0.0. Soon after that XQilla 2.2.0 was released with an implementation of the new interface. The above code fragment rewritten to use the new interface is shown below:

directory& d = ...
 
// Obtain the root element and document corresponding
// to the directory object.
//
DOMElement* root (static_cast<DOMElement*> (d._node ()));
DOMDocument* doc (root->getOwnerDocument ());
 
// Obtain namespace resolver.
//
dom::auto_ptr<DOMXPathNSResolver> resolver (
  doc->createNSResolver (root));
 
// Set the namespace prefix for the people namespace that
// we can use reliably in XPath expressions regardless of
// what is used in XML documents.
//
resolver->addNamespaceBinding (
  xml::string ("p").c_str (),
  xml::string ("http://www.example.com/people").c_str ());
 
// Create XPath expression.
//
dom::auto_ptr<DOMXPathExpression> expr (
  doc->createExpression (
    xml::string ("p:directory/person[age > 30]").c_str (),
    resolver.get ()));
 
// Execute the query.
//
dom::auto_ptr<DOMXPathResult> r (
  expr->evaluate (
    doc, DOMXPathResult::ITERATOR_RESULT_TYPE, 0));
 
// Iterate over the result.
//
while (r->iterateNext ())
{
  DOMNode* n (r->getNodeValue ());
 
  // Obtain the object model node corresponding to
  // this DOM node.
  //
  person* p (
    static_cast<person*> (
      n->getUserData (dom::tree_node_key)));
 
  // Print the data using the object model.
  //
  cout << endl
       << "First  : " << p->first_name () << endl
       << "Last   : " << p->last_name () << endl
       << "Gender : " << p->gender () << endl
       << "Age    : " << p->age () << endl;
}