Archive for the ‘C++’ Category

Running XPath on a C++/Tree object model

Monday, May 18th, 2009

One interesting feature of the C++/Tree mapping in XSD is the ability to maintain an association between C++ object model nodes and corresponding DOM nodes. Consider the following XML document as an example:

<p:directory xmlns:p="http://www.example.com/people">
  <person>
    <first-name>John</first-name>
    <last-name>Doe</last-name>
    <gender>male</gender>
    <age>32</age>
  </person>
 
  <person>
    <first-name>Jane</first-name>
    <last-name>Doe</last-name>
    <gender>female</gender>
    <age>28</age>
  </person>
</p:directory>

Provided we requested the DOM association during parsing, then, having a person object, we can obtain the DOMElement node corresponding to this object. We can also go the other way: having a DOM node from a DOM document associated with a C++/Tree object model, we can obtain the corresponding object model node.
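
Roughly, requesting the association might look like the following sketch (the parsing function name directory_ and the header people.hxx are assumptions that depend on the schema and compiler options; the keep_dom, own_dom, and dont_initialize flags are part of the C++/Tree mapping):

#include <memory> // std::auto_ptr (the API used by XSD at the time)
#include <xercesc/util/PlatformUtils.hpp>
 
#include "people.hxx" // hypothetical generated header
 
// The Xerces-C++ runtime must stay initialized for as long as the
// associated DOM document is in use, so we manage it ourselves.
//
xercesc::XMLPlatformUtils::Initialize ();
 
{
  // keep_dom maintains the DOM association, own_dom makes the object
  // model own the DOM document, and dont_initialize tells the parsing
  // function not to initialize/terminate the Xerces-C++ runtime.
  //
  std::auto_ptr<directory> d (
    directory_ ("people.xml",
                xml_schema::flags::keep_dom |
                xml_schema::flags::own_dom |
                xml_schema::flags::dont_initialize));
 
  // ... run XPath queries on *d as shown below ...
}
 
xercesc::XMLPlatformUtils::Terminate ();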

One technique that is made possible thanks to the DOM association is the use of XPath queries to locate object model nodes. This is especially useful if you have a deeply nested document and you only need to access a small part of it buried deep inside.

The idea is to run an XPath query on the underlying DOM document, obtain the result as a collection of DOM nodes, and then “move up” from these DOM nodes to the object model nodes. While the DOM implementation provided by Xerces-C++ does not support XPath, there are complementary libraries, such as XQilla, that provide this functionality. The following code fragment shows how to locate all the people from the above XML file that are older than 30. It uses XQilla and the DOM XPath API from Xerces-C++ 2.8.0:

directory& d = ...
 
// Obtain the root element and document corresponding
// to the directory object.
//
DOMElement* root (static_cast<DOMElement*> (d._node ()));
DOMDocument* doc (root->getOwnerDocument ());
 
// Obtain namespace resolver.
//
dom::auto_ptr<XQillaNSResolver> resolver (
  (XQillaNSResolver*)doc->createNSResolver (root));
 
// Set the namespace prefix for the people namespace that
// we can use reliably in XPath expressions regardless of
// what is used in XML documents.
//
resolver->addNamespaceBinding (
  xml::string ("p").c_str (),
  xml::string ("http://www.example.com/people").c_str ());
 
// Create XPath expression.
//
dom::auto_ptr<const XQillaExpression> expr (
  static_cast<const XQillaExpression*> (
    doc->createExpression (
      xml::string ("p:directory/person[age > 30]").c_str (),
      resolver.get ())));
 
// Execute the query.
//
dom::auto_ptr<XPath2Result> r (
  static_cast<XPath2Result*> (
    expr->evaluate (
      doc, XPath2Result::ITERATOR_RESULT, 0)));
 
// Iterate over the result.
//
while (r->iterateNext ())
{
  const DOMNode* n (r->asNode ());
 
  // Obtain the object model node corresponding to
  // this DOM node.
  //
  person* p (
    static_cast<person*> (
      n->getUserData (dom::tree_node_key)));
 
  // Print the data using the object model.
  //
  cout << endl
       << "First  : " << p->first_name () << endl
       << "Last   : " << p->last_name () << endl
       << "Gender : " << p->gender () << endl
       << "Age    : " << p->age () << endl;
}

As you can see, the code is littered with casts to XQilla-specific types such as XQillaNSResolver, XQillaExpression, and XPath2Result. This is necessary because the DOM XPath interface in the Xerces-C++ 2-series only supports the XPath 1.0 query model and is not sufficient for the XPath 2.0 model implemented by XQilla.

To make the integration of XQilla with Xerces-C++ cleaner, the Xerces-C++ and XQilla developers came up with an extended DOM XPath interface that accommodates both the XPath 1.0 and 2.0 query models. On the Xerces-C++ side this interface was first made public in version 3.0.0. Soon after that, XQilla 2.2.0 was released with an implementation of the new interface. The above code fragment rewritten to use the new interface is shown below:

directory& d = ...
 
// Obtain the root element and document corresponding
// to the directory object.
//
DOMElement* root (static_cast<DOMElement*> (d._node ()));
DOMDocument* doc (root->getOwnerDocument ());
 
// Obtain namespace resolver.
//
dom::auto_ptr<DOMXPathNSResolver> resolver (
  doc->createNSResolver (root));
 
// Set the namespace prefix for the people namespace that
// we can use reliably in XPath expressions regardless of
// what is used in XML documents.
//
resolver->addNamespaceBinding (
  xml::string ("p").c_str (),
  xml::string ("http://www.example.com/people").c_str ());
 
// Create XPath expression.
//
dom::auto_ptr<DOMXPathExpression> expr (
  doc->createExpression (
    xml::string ("p:directory/person[age > 30]").c_str (),
    resolver.get ()));
 
// Execute the query.
//
dom::auto_ptr<DOMXPathResult> r (
  expr->evaluate (
    doc, DOMXPathResult::ITERATOR_RESULT_TYPE, 0));
 
// Iterate over the result.
//
while (r->iterateNext ())
{
  DOMNode* n (r->getNodeValue ());
 
  // Obtain the object model node corresponding to
  // this DOM node.
  //
  person* p (
    static_cast<person*> (
      n->getUserData (dom::tree_node_key)));
 
  // Print the data using the object model.
  //
  cout << endl
       << "First  : " << p->first_name () << endl
       << "Last   : " << p->last_name () << endl
       << "Gender : " << p->gender () << endl
       << "Age    : " << p->age () << endl;
}

A profiling war story

Monday, May 4th, 2009

The other day I ran across an XML schema that took the XSD compiler unusually long to translate to C++. While the schema itself was quite large, almost 2MB of definitions spread over 80 files, the compilation time on my machine was excessive: about 160 seconds. We also had another report from several months ago where XSD seemed to run indefinitely while trying to compile a really large schema (15K element declarations across about 200 files). Unfortunately, the reporter couldn’t provide us with that schema for debugging due to distribution restrictions. All we could do was give them a new version of the compiler and have them tell us whether it worked.

I set out to figure out what was going on using the schema I had, suspecting that it was the same problem as the one affecting the other schema. The first step was to understand which part of the compilation process takes so much time. A good profiler can be very helpful in accomplishing this. I personally prefer OProfile, mainly because: 1) it is system-wide, so you can see the complete picture (your application, libraries, even the kernel), 2) it is transparent in the sense that you don’t need a special build of your application or of the libraries it depends on, and 3) it has low overhead, which means that your application doesn’t take 10 times as long to execute when profiling.

I ran OProfile while the XSD compiler was compiling the schema. My first step in analyzing the program execution was to get a flat symbol list that shows how much time is spent in each function. Here is the top of the listing for XSD:

samples  %        image name  symbol name
 
405527   14.4254  libc.so     _int_malloc
334761   11.9081  libc.so     _int_free
287232   10.2174  libc.so     free
285081   10.1409  xsd         Cult::MM::allocate
199565    7.0989  libc.so     malloc
124309    4.4219  xsd         Schema::find
119779    4.2608  xsd         operator delete
104893    3.7312  xsd         std::map<...>::find
94964     3.3780  libc.so     wmemcmp
85689     3.0481  xsd         Cult::MM::Key::size
80308     2.8567  xsd         operator new
62390     2.2193  libc.so     strlen
37687     1.3406  xsd         std::string::_S_create
35205     1.2523  libc.so     memcpy

When you read articles about profiling you are normally presented with a listing similar to the one above where one or two functions at the top account for 90% of the execution time. This, however, doesn’t happen very often for complex programs built by competent engineers. Complex programs perform several operations in a pipeline manner instead of just one algorithmic task (image conversion is a good example of an algorithmic application where you may see a single function at the top taking 90% of the time). Competent engineering means that there are no silly programming or architectural mistakes that result in single-function bottlenecks. Instead, you normally see a listing like the one above: execution time is spread over many functions without any single function standing out as unreasonable. Such information is completely useless in understanding what’s going on in the program.

What we need instead are the aggregate times for each function, that is, the time spent in the function itself plus the time spent in all the functions it calls. We also need to know which other functions each function calls, along with their respective times. With this information we can find which operation in our program takes most of the execution time as well as figure out which parts of this operation are responsible for the slowdown.

The best way to get this information from OProfile is to collect the samples with the call graph information (see the --callgraph option), then convert them to the gprof format with the opgprof command, and finally generate the call graph with the gprof -q command. One limitation of the call graph information is that it is very inaccurate if your application is optimized. To get a useful result you will need to recompile your application with optimization turned off.

The call graph output is quite a bit more complex and can be a lot more verbose compared to the flat symbol list presented above. The gprof call graph format is explained in detail in the gprof documentation.

One way to get a summary of the gprof call graph that just tells us which functions have the largest aggregate times is to run the following sed script on the call graph output:

sed -e '/^\[/p;d' callgraph >summary

Here is the top of the summary listing for XSD:

% time self children called   name
 
99.8      0  645554           main
94.0  16021  591541  588759+  Schema::find
                     3889477
82.6   1024  533243  512793+  <cycle 4 as a whole>
                     36104907
82.4      0  532686  512443   Parser::parse
55.5      8  358965  465329   Node<Element>::trampoline
49.5  27896  291869  265608+  <cycle 1 as a whole>
                     36968
47.9      0  309389  299838   Resolver::resolve_element
42.6  18427  256696  215619   operator new
38.6      1  249727  242001   Resolver::traverse

From this listing it is immediately apparent that the problem is most likely in the Schema::find operation. It takes up 94% of the program execution time and is called half a million times from other functions and almost four million times recursively.

The Schema::find() function looks up namespaces with the specified XML namespace name in the schema graph. To understand how it works, one needs a basic understanding of how the schema graph is organized. Each schema file is represented by a Schema object which contains a list of references to included and imported schema files (also represented as Schema objects) along with the Namespace object which contains the actual type, element, etc., declarations. Schema::find() returns a list of Namespace objects corresponding to a namespace name. To collect this list, it traverses the include/import graph to make sure it visits every Schema object visible, directly or indirectly, from the initial schema.
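
To make the rest of the discussion easier to follow, here is a simplified sketch of what such a lookup might look like. This is illustrative code only, not the actual XSD implementation: the class layout is reduced to the bare minimum and the Context machinery is replaced with a plain flag.

#include <list>
#include <string>
 
struct Namespace
{
  std::string name;
  // type, element, etc., declarations ...
};
 
struct Schema
{
  Schema (): visiting (false) {}
 
  std::list<Schema*> uses;          // included/imported schema files
  std::list<Namespace*> namespaces; // namespaces declared in this file
  bool visiting;                    // set while this schema is being traversed
};
 
// Collect all Namespace objects with the given name that are visible,
// directly or indirectly, from schema s.
//
void
find (Schema& s, const std::string& name, std::list<Namespace*>& result)
{
  if (s.visiting)   // recursive include/import
    return;
 
  s.visiting = true;
 
  for (std::list<Namespace*>::iterator i (s.namespaces.begin ());
       i != s.namespaces.end (); ++i)
  {
    if ((*i)->name == name)
      result.push_back (*i);
  }
 
  // The same schema can be reachable via many include/import paths,
  // in which case it is re-traversed (and its namespaces re-inserted)
  // every time.
  //
  for (std::list<Schema*>::iterator i (s.uses.begin ());
       i != s.uses.end (); ++i)
    find (**i, name, result);
 
  s.visiting = false;
}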

There are two general reasons why Schema::find() could take a disproportionate amount of time: it may be called too many times by other functions, or the problem may be in its implementation (or both). Seeing that Schema::find() is called more than six times more often recursively than from outside, the suspicion falls on the implementation. Here is the fragment of the call graph listing corresponding to this function:

self  children    called       name
 
16021  591541  588759+3889477  Schema::find
 8275  217657  208177/208268    Context::set
 3528  174629  158008/158008    std::list<...>::insert
 3190   45372  65030/65055      Context::remove
  346   47712  43879/43880      std::list<...>::clear
 3500   28218  32082/37821      Scope::find
 3560    9619  20461/20586      Context::count

As you can see the time is dominated by two sets of operations. The first are Context operations which are used to detect recursive import/inclusion (that is, a flag is set in the Schema’s context to indicate that it is being traversed). The second are the list insertion and removal operations. This tells us two things: a lot of Schema objects are being traversed and a lot of Namespace objects are being inserted into the resulting list.

The size of the schema, and in particular the number of files which corresponds to the number of Schema objects, could not by itself explain the long compilation time since there are comparable or even larger schemas that compile much faster. There had to be something else special about this schema. As it turned out, and this is something I just had to figure out on my own since no profiling tool is available for this last step, the schema had a very expansive include/import tree. Most files in this schema included or imported a handful of other files which in turn did the same. The end result was that each schema file was included or imported, directly and indirectly, hundreds of times. That meant that the Schema::find() implementation had to traverse the same files over and over again (thus all the Context operations) and insert the same Namespace objects into the resulting list over and over again (thus all the list operations).

The solution was fairly straightforward. I added a set which keeps track of all the Schema objects that have already been seen in this execution of Schema::find(). The compilation time dropped from 160 seconds to 1.5 seconds, a more than 100-fold speedup. And, as it turned out, this was the same problem that affected the larger schema that we could not get.
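
In terms of the simplified sketch above (again, illustrative rather than the actual XSD code), the fix amounts to threading a set of already-visited schemas through the traversal so that each Schema object, and therefore each Namespace object, is processed exactly once:

#include <set>
 
// Same as before, except that a schema that has already been seen in
// this execution of find() is skipped instead of being re-traversed.
//
void
find (Schema& s,
      const std::string& name,
      std::list<Namespace*>& result,
      std::set<const Schema*>& seen)
{
  if (!seen.insert (&s).second)   // already traversed this schema
    return;
 
  for (std::list<Namespace*>::iterator i (s.namespaces.begin ());
       i != s.namespaces.end (); ++i)
  {
    if ((*i)->name == name)
      result.push_back (*i);
  }
 
  for (std::list<Schema*>::iterator i (s.uses.begin ());
       i != s.uses.end (); ++i)
    find (**i, name, result, seen);
}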

CodeSynthesis XSD/e 3.0.x released

Monday, April 20th, 2009

XSD/e 3.1.0 was released a couple of days ago. In fact, we released 3.0.0 about two months ago but I haven’t talked much about it. This is because after the 3.0.0 release we got quite a bit of very positive feedback along with requests for additional, more advanced features that we had promised to add but hadn’t yet implemented. So we decided to do another quick iteration and release 3.1.0. In this post I will highlight what’s new in both XSD/e 3.0.0 and 3.1.0 (official announcements: XSD/e 3.0.0 and XSD/e 3.1.0).

Prior to the 3.0.0 release, XSD/e only supported the event-driven XML parsing/serialization mode, where you had to process/supply data as the document was being parsed/serialized. While this mode is particularly suitable for mobile and embedded systems due to its low memory consumption, many users asked for an easier-to-use, in-memory, tree-like representation of the data stored in XML. As a result, XSD/e 3.0.0 shipped with a new XML Schema to C++ mapping: C++/Hybrid.

There were a number of challenges that we had to overcome before introducing such a mapping into XSD/e. Unlike general-purpose platforms, embedded systems are often severely constrained by the amount of memory available to the application. In fact, for single-purpose, mass-produced devices such as network modems the goal is to use as little RAM as possible, since every megabyte not present in the device translates into huge savings for the manufacturer.

Thus, the first goal of the new mapping was to provide an in-memory representation of XML data using the least amount of RAM possible. For example, we couldn’t adopt the approach used in C++/Tree, our general-purpose in-memory mapping, where each node in the object model is allocated dynamically, because it wastes too much memory in extra pointers, heap management data, etc. At the same time we couldn’t allocate everything statically either since the copying involved in passing by value may be too expensive for some objects. As a result, the C++/Hybrid mapping divides all types into two categories: fixed-length and variable-length (if you are familiar with the IDL to C++ mapping in CORBA, you probably recognize the concept). Fixed-length types are allocated statically and returned by value while variable-length types are allocated dynamically and returned as pointers. This approach minimizes the memory usage while avoiding expensive copying. Consider the following schema fragment as an example:

<complexType name="point_t">
  <sequence>
    <element name="x" type="float"/>
    <element name="y" type="float"/>
    <element name="z" type="float"/>
  </sequence>
</complexType>
 
<complexType name="series_t">
  <sequence>
    <element name="value" type="int" maxOccurs="unbounded"/>
  </sequence>
</complexType>
 
<complexType name="measure_t">
  <sequence>
    <element name="point" type="point_t"/>
    <element name="series" type="series_t"/>
  </sequence>
</complexType>

The corresponding C++/Hybrid object model is shown below:

  // point_t (fixed-length)
  //
  class point_t
  {
  public:
    float x () const;
    float& x ();
    void x (float);
 
    float y () const;
    float& y ();
    void y (float);
 
    float z () const;
    float& z ();
    void z (float);
 
  private:
    float x_;
    float y_;
    float z_;
  };
 
  // series_t (variable-length)
  //
  class series_t
  {
  public:
    typedef pod_sequence<float> value_sequence;
    typedef value_sequence::iterator value_iterator;
    typedef value_sequence::const_iterator value_const_iterator;
 
    const value_sequence& value () const;
    value_sequence& value ();
 
  private:
    value_sequence value_;
  };
 
  // measure_t (variable-length)
  //
  class measure_t
  {
  public:
    const point_t& point () const;
    point_t& point ();
    void point (const point_t&);
 
    const series_t& series () const;
    series_t& series ();
    void series (series_t*);
 
  private:
    point_t point_;
    series_t* series_;
  };

In the above example the point_t class is fixed-length and is contained by value in the measure_t class. In contrast, series_t contains a sequence of floats, which makes it variable-length (and expensive to copy). Instances of this class are dynamically allocated and stored as pointers in measure_t.
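
As a quick usage sketch based on the interfaces shown above (the push_back() call on the sequence and the ownership transfer by the pointer modifier are my assumptions about the C++/Hybrid interface; check the generated code and documentation for the exact semantics):

measure_t m;
 
// point_t is fixed-length: it is stored inside measure_t by value.
//
m.point ().x (12.3F);
m.point ().y (45.6F);
m.point ().z (78.9F);
 
// series_t is variable-length: it is allocated dynamically and the
// pointer modifier presumably passes ownership to measure_t.
//
series_t* s (new series_t);
s->value ().push_back (28.8F);
s->value ().push_back (29.9F);
s->value ().push_back (27.7F);
m.series (s);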

But even with optimal memory usage, an in-memory mapping may only be practical in an embedded environment for fairly small XML documents. A 100Kb document is trivial by today’s desktop or server standards, but loading such a document into memory all at once on an embedded system may be prohibitively expensive. So on one hand we have the event-driven mode, which uses very little RAM but is harder to use, especially for larger XML vocabularies. On the other hand we have the more convenient in-memory mode, which requires too much memory for all but fairly small documents. In C++/Hybrid we solved this by supporting a hybrid (thus the mapping name), partially in-memory, partially event-driven mode. In this mode your application is supplied (in the case of parsing), or itself supplies (in the case of serialization), the XML document in fragments represented as in-memory object models. The following example will help illustrate how this works. Let’s extend the schema presented above with the data_t type:

<complexType name="data_t">
  <sequence>
    <element name="measure" type="measure_t" 
             maxOccurs="unbounded"/>
  </sequence>
</complexType>
 
<element name="data" type="data_t"/>

The corresponding XML document might look like this:

<data>
  <measure>
    <point>
      <x>12.3</x>
      <y>45.6</y>
      <z>78.9</z>
    </point>
    <series>
      <value>28.8</value>
      <value>29.9</value>
      <value>27.7</value>
    </series>
  </measure>
 
  ...
 
</data>

Let’s assume the XML document above contains a couple of thousand measure records, which makes it too large to load into memory all at once. With C++/Hybrid you can set up parsing/serialization so that your application receives/supplies each measure one by one as an instance of the measure_t class. The depth in the XML document at which you “switch” from event-driven to in-memory processing is arbitrary and is not limited to the top level. For example, if instead of having thousands of measure records we only had a few, but each contained hundreds of thousands of value records, we could have set up parsing/serialization in such a way that the application receives/supplies the point data as an instance of point_t and then each value one by one as a float.

So that was XSD/e 3.0.0. After its release a number of people started using the new mapping and providing us with feedback. It became apparent that a couple of more advanced features that we had left out of the initial C++/Hybrid release were needed. These were added in XSD/e 3.1.0, the major two being support for XML Schema polymorphism and binary serialization.

Support for polymorphism allows C++/Hybrid to handle XML vocabularies that use substitution groups and/or xsi:type dynamic typing. To minimize the generated code size we used a new approach where only certain type hierarchies (automatically detected in case of substitution groups and indicated by the user in case of xsi:type) are treated as polymorphic.

Binary serialization provides an extensible, high-performance mechanism for saving the object model to and loading it from compact binary formats for storage or over-the-wire transfer. Binary representations contain only the data without any meta information or markup. Consequently, saving to and loading from a binary format can be an order of magnitude faster as well as result in a much smaller application footprint compared to parsing and serializing the same data in XML. Plus, the resulting representation is normally several times smaller than the equivalent XML.

Built-in support is provided for XDR (via the Sun RPC API) and CDR (via the ACE library), and custom formats can easily be added. XDR appears to be a particularly good choice for a portable format since it is part of the operating system on most commonly used embedded platforms (for example, Linux, VxWorks, QNX, LynxOS, iPhone OS).

One common use case for binary serialization is an embedded system that needs to consume and/or supply data in XML format but cannot afford to include an XML parser and/or serializer due to performance or footprint constraints. The requirement to use XML may come from the use of existing or third-party desktop/server applications on the other end or from the use of industry-standard, XML-based formats. In this situation a control or gateway application running in a non-embedded environment translates the XML data sent to the embedded systems into a binary representation and translates the binary representation received from the embedded systems back into XML.

There are a number of other interesting features in the C++/Hybrid mapping that I didn’t cover in this post, including:

  • Precise reproduction of the XML vocabulary structure and element order
  • Filtering of XML data during parsing and object model during serialization
  • Customizable object model classes as well as parsing and serialization code

If you would like more information on these and other features, the C++/Hybrid Mapping page is a good starting point.