Archive for the ‘XML’ Category

Running XPath on a C++/Tree object model

Monday, May 18th, 2009

One interesting feature of the C++/Tree mapping in XSD is the ability to maintain an association between C++ object model nodes and corresponding DOM nodes. Consider the following XML document as an example:

<p:directory xmlns:p="http://www.example.com/people">
  <person>
    <first-name>John</first-name>
    <last-name>Doe</last-name>
    <gender>male</gender>
    <age>32</age>
  </person>
 
  <person>
    <first-name>Jane</first-name>
    <last-name>Doe</last-name>
    <gender>female</gender>
    <age>28</age>
  </person>
</p:directory>

Provided we requested the DOM association during parsing, we can obtain the DOMElement node corresponding to a person object. We can also go the other way: given a DOM node from a DOM document associated with a C++/Tree object model, we can obtain the corresponding object model node.
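The two-way association can be pictured with a small standalone sketch. This is not the XSD implementation: dom_node, person, tree_node_key, and person_from_node below are hypothetical stand-ins that only mimic the user-data mechanism (an opaque pointer stored on the DOM node under a well-known key) which the getUserData() calls later in this post rely on.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a DOM node: it can carry opaque user
// data under a key, similar in spirit to DOMNode::setUserData().
struct dom_node
{
  std::string name;
  std::map<std::string, void*> user_data;

  void set_user_data (const std::string& key, void* data)
  {
    user_data[key] = data;
  }

  void* get_user_data (const std::string& key) const
  {
    std::map<std::string, void*>::const_iterator i (user_data.find (key));
    return i != user_data.end () ? i->second : nullptr;
  }
};

// Hypothetical key under which the object model pointer is stored.
const char tree_node_key[] = "tree-node";

// Hypothetical object model node that remembers its DOM node.
struct person
{
  std::string first_name;
  dom_node* node; // Object model -> DOM.

  person (const std::string& fn, dom_node& n)
      : first_name (fn), node (&n)
  {
    assert (n.get_user_data (tree_node_key) == nullptr);
    n.set_user_data (tree_node_key, this); // DOM -> object model.
  }

  // Copying would leave the DOM node pointing at the original only.
  person (const person&) = delete;
  person& operator= (const person&) = delete;
};

// Move "up" from a DOM node to the object model node. Returns a null
// pointer if the node has no associated object.
inline person* person_from_node (const dom_node& n)
{
  return static_cast<person*> (n.get_user_data (tree_node_key));
}
```

With this in place, p.node takes you from the object to its DOM node and person_from_node (n) takes you back, which is all an XPath-based lookup needs.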

One technique that is made possible thanks to the DOM association is the use of XPath queries to locate object model nodes. This is especially useful if you have a deeply nested document and you only need to access a small part of it buried deep inside.

The idea is to run an XPath query on the underlying DOM document, obtain the result as a collection of DOM nodes and then “move up” from these DOM nodes to the object model nodes. While the DOM implementation provided by Xerces-C++ does not support XPath, there are complementary libraries, such as XQilla, that provide this functionality. The following code fragment shows how to locate all the people from the above XML file that are older than 30. It uses XQilla and the DOM XPath API from Xerces-C++ 2.8.0:

directory& d = ...
 
// Obtain the root element and document corresponding
// to the directory object.
//
DOMElement* root (static_cast<DOMElement*> (d._node ()));
DOMDocument* doc (root->getOwnerDocument ());
 
// Obtain namespace resolver.
//
dom::auto_ptr<XQillaNSResolver> resolver (
  (XQillaNSResolver*)doc->createNSResolver (root));
 
// Set the namespace prefix for the people namespace that
// we can use reliably in XPath expressions regardless of
// what is used in XML documents.
//
resolver->addNamespaceBinding (
  xml::string ("p").c_str (),
  xml::string ("http://www.example.com/people").c_str ());
 
// Create XPath expression.
//
dom::auto_ptr<const XQillaExpression> expr (
  static_cast<const XQillaExpression*> (
    doc->createExpression (
      xml::string ("p:directory/person[age > 30]").c_str (),
      resolver.get ())));
 
// Execute the query.
//
dom::auto_ptr<XPath2Result> r (
  static_cast<XPath2Result*> (
    expr->evaluate (
      doc, XPath2Result::ITERATOR_RESULT, 0)));
 
// Iterate over the result.
//
while (r->iterateNext ())
{
  const DOMNode* n (r->asNode ());
 
  // Obtain the object model node corresponding to
  // this DOM node.
  //
  person* p (
    static_cast<person*> (
      n->getUserData (dom::tree_node_key)));
 
  // Print the data using the object model.
  //
  cout << endl
       << "First  : " << p->first_name () << endl
       << "Last   : " << p->last_name () << endl
       << "Gender : " << p->gender () << endl
       << "Age    : " << p->age () << endl;
}

As you can see, the code is littered with casts to XQilla-specific types such as XQillaNSResolver, XQillaExpression, and XPath2Result. This is necessary because the DOM interface in the Xerces-C++ 2-series only supports the XPath 1.0 query model and is not sufficient for XPath 2.0 implemented by XQilla.

To make the integration of XQilla with Xerces-C++ cleaner, the Xerces-C++ and XQilla developers came up with an extended DOM XPath interface that accommodated both XPath 1.0 and 2.0 query models. On the Xerces-C++ side this interface was first made public in version 3.0.0. Soon after that XQilla 2.2.0 was released with the implementation of the new interface. The above code fragment rewritten to use the new interface is shown below:

directory& d = ...
 
// Obtain the root element and document corresponding
// to the directory object.
//
DOMElement* root (static_cast<DOMElement*> (d._node ()));
DOMDocument* doc (root->getOwnerDocument ());
 
// Obtain namespace resolver.
//
dom::auto_ptr<DOMXPathNSResolver> resolver (
  doc->createNSResolver (root));
 
// Set the namespace prefix for the people namespace that
// we can use reliably in XPath expressions regardless of
// what is used in XML documents.
//
resolver->addNamespaceBinding (
  xml::string ("p").c_str (),
  xml::string ("http://www.example.com/people").c_str ());
 
// Create XPath expression.
//
dom::auto_ptr<DOMXPathExpression> expr (
  doc->createExpression (
    xml::string ("p:directory/person[age > 30]").c_str (),
    resolver.get ()));
 
// Execute the query.
//
dom::auto_ptr<DOMXPathResult> r (
  expr->evaluate (
    doc, DOMXPathResult::ITERATOR_RESULT_TYPE, 0));
 
// Iterate over the result.
//
while (r->iterateNext ())
{
  DOMNode* n (r->getNodeValue ());
 
  // Obtain the object model node corresponding to
  // this DOM node.
  //
  person* p (
    static_cast<person*> (
      n->getUserData (dom::tree_node_key)));
 
  // Print the data using the object model.
  //
  cout << endl
       << "First  : " << p->first_name () << endl
       << "Last   : " << p->last_name () << endl
       << "Gender : " << p->gender () << endl
       << "Age    : " << p->age () << endl;
}

CodeSynthesis XSD/e 3.1.0 released

Monday, April 20th, 2009

XSD/e 3.1.0 was released a couple of days ago. In fact, we released 3.0.0 about two months ago but I haven’t talked much about it. This is because after the 3.0.0 release we got quite a bit of very positive feedback along with requests for additional, more advanced features that we promised to add but haven’t yet implemented. So we decided to do another quick iteration and release 3.1.0. In this post I will highlight what’s new in both XSD/e 3.0.0 and 3.1.0 (official announcements: XSD/e 3.0.0 and XSD/e 3.1.0).

Prior to the 3.0.0 release, XSD/e only supported the event-driven XML parsing/serialization mode where you had to process/supply data as the document was being parsed/serialized. While this mode is particularly suitable for mobile and embedded systems due to low memory consumption, many users asked for an easier-to-use, in-memory, tree-like representation of data stored in XML. As a result, XSD/e 3.0.0 shipped with a new XML Schema to C++ mapping: C++/Hybrid.

There were a number of challenges that we had to overcome before introducing such a mapping into XSD/e. Unlike general-purpose platforms, embedded systems are often severely constrained by the amount of memory available to the application. In fact, for single-purpose, mass-produced devices such as network modems the goal is to use as little RAM as possible since every megabyte not present in the device translates into huge savings for the manufacturer.

Thus, the first goal of the new mapping was to provide an in-memory representation of XML data using the least amount of RAM possible. For example, we couldn’t adopt the approach used in C++/Tree, our general-purpose in-memory mapping, where each node in the object model is allocated dynamically, because it wastes too much memory in extra pointers, heap management data, etc. At the same time we couldn’t allocate everything statically either since the copying involved in passing by value may be too expensive for some objects. As a result, the C++/Hybrid mapping divides all types into two categories: fixed-length and variable-length (if you are familiar with the IDL to C++ mapping in CORBA, you probably recognize the concept). Fixed-length types are allocated statically and returned by value while variable-length types are allocated dynamically and returned as pointers. This approach minimizes the memory usage while avoiding expensive copying. Consider the following schema fragment as an example:

<complexType name="point_t">
  <sequence>
    <element name="x" type="float"/>
    <element name="y" type="float"/>
    <element name="z" type="float"/>
  </sequence>
</complexType>
 
<complexType name="series_t">
  <sequence>
    <element name="value" type="int" maxOccurs="unbounded"/>
  </sequence>
</complexType>
 
<complexType name="measure_t">
  <sequence>
    <element name="point" type="point_t"/>
    <element name="series" type="series_t"/>
  </sequence>
</complexType>

The corresponding C++/Hybrid object model is shown below:

  // point_t (fixed-length)
  //
  class point_t
  {
  public:
    float x () const;
    float& x ();
    void x (float);
 
    float y () const;
    float& y ();
    void y (float);
 
    float z () const;
    float& z ();
    void z (float);
 
  private:
    float x_;
    float y_;
    float z_;
  };
 
  // series_t (variable-length)
  //
  class series_t
  {
  public:
    typedef pod_sequence<int> value_sequence;
    typedef value_sequence::iterator value_iterator;
    typedef value_sequence::const_iterator value_const_iterator;
 
    const value_sequence& value () const;
    value_sequence& value ();
 
  private:
    value_sequence value_;
  };
 
  // measure_t (variable-length)
  //
  class measure_t
  {
  public:
    const point_t& point () const;
    point_t& point ();
    void point (const point_t&);
 
    const series_t& series () const;
    series_t& series ();
    void series (series_t*);
 
  private:
    point_t point_;
    series_t* series_;
  };

In the above example the point_t class is fixed-length and contained by value in the measure_t class. In contrast, series_t contains a sequence of ints which makes it variable-length (and expensive to copy). Instances of this class are dynamically allocated and stored as pointers in measure_t.
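What this split means for calling code can be shown with a simplified, standalone sketch. The point, series, and measure types below are hypothetical stand-ins rather than the generated classes: std::vector takes the place of the generated sequence type and std::unique_ptr is used here only to make the ownership transfer explicit.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Fixed-length: small and cheap to copy, so it is held by value.
struct point
{
  float x, y, z;
};

// Variable-length: owns a growable buffer, so it lives on the heap
// and is handed around by pointer.
struct series
{
  std::vector<int> value;
};

class measure
{
public:
  measure () : point_ {0.0f, 0.0f, 0.0f} {}

  const point& get_point () const { return point_; }

  // Copies three floats.
  void set_point (const point& p) { point_ = p; }

  // Takes ownership of a heap-allocated series. No element is copied.
  void set_series (series* s) { series_.reset (s); }

  const series& get_series () const
  {
    assert (series_ != nullptr);
    return *series_;
  }

private:
  point point_;                    // Stored inside measure itself.
  std::unique_ptr<series> series_; // Stored as a pointer.
};
```

Setting the point copies a dozen bytes, while setting the series only transfers a pointer, no matter how many values the sequence holds.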

But even with optimal memory usage, an in-memory mapping may be unusable in an embedded environment for all but very small XML documents. A 100KB document is trivial by today’s desktop or server standards. But loading such a document into memory all at once on an embedded system may be prohibitively expensive. So we have the event-driven mode that uses very little RAM but is harder to use, especially for larger XML vocabularies. And we have the more convenient in-memory mode that requires too much memory for all but fairly small documents. In C++/Hybrid we solved this by supporting a hybrid (thus the mapping name) partially in-memory, partially event-driven mode. In this mode your application is supplied (in case of parsing) or it supplies (in case of serialization) the XML document in fragments represented as in-memory object models. The following example will help illustrate how this works. Let’s extend the schema presented above with the data_t type:

<complexType name="data_t">
  <sequence>
    <element name="measure" type="measure_t" 
             maxOccurs="unbounded"/>
  </sequence>
</complexType>
 
<element name="data" type="data_t"/>

The corresponding XML document might look like this:

<data>
  <measure>
    <point>
      <x>12.3</x>
      <y>45.6</y>
      <z>78.9</z>
    </point>
    <series>
      <value>28</value>
      <value>29</value>
      <value>27</value>
    </series>
  </measure>
 
  ...
 
</data>

Let’s assume the XML document above contains a couple of thousand measure records, which makes it too large to load into memory all at once. With C++/Hybrid you can set up parsing/serialization so that your application receives/supplies each measure one by one as an instance of the measure_t class. The depth in the XML document at which you “switch” from event-driven to in-memory processing is arbitrary and is not limited to the top level. For example, if instead of having thousands of measure records we only had a few but each containing hundreds of thousands of value records, we could have set up parsing/serialization in such a way that the application receives/supplies the point data as an instance of point_t and then each value one by one as int.
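The fragment-at-a-time idea can be sketched without any XML machinery. The code below is only an illustration of the control flow, not the XSD/e API: measure, parse_data, and sum_series are hypothetical names, and the records are synthesized in a loop where a real parser would read them from the XML stream.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical in-memory fragment corresponding to one measure record.
struct measure
{
  float x, y, z;
  std::vector<int> series;
};

// Hypothetical stand-in for the event-driven outer layer: it builds
// one measure at a time, hands it to the callback, and discards it.
inline void
parse_data (std::size_t count,
            const std::function<void (const measure&)>& on_measure)
{
  for (std::size_t i (0); i != count; ++i)
  {
    measure m; // Only this fragment is in memory.
    m.x = m.y = m.z = static_cast<float> (i);
    m.series.assign (3, static_cast<int> (i));
    on_measure (m);
  } // The fragment is destroyed at the end of each iteration.
}

// Example consumer: keeps a running total instead of the records.
inline long
sum_series (std::size_t count)
{
  long total (0);
  parse_data (count, [&total] (const measure& m)
  {
    for (int v: m.series)
      total += v;
  });
  return total;
}
```

Peak memory here is bounded by the size of one measure rather than by the size of the whole document, which is the point of the hybrid mode.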

So that was XSD/e 3.0.0. After its release a number of people started using the new mapping and providing us with feedback. It became apparent that a couple of more advanced features that we left out from the initial C++/Hybrid release were needed. These were added in XSD/e 3.1.0 with the major two being the support for XML Schema polymorphism and binary serialization.

Support for polymorphism allows C++/Hybrid to handle XML vocabularies that use substitution groups and/or xsi:type dynamic typing. To minimize the generated code size we used a new approach where only certain type hierarchies (automatically detected in case of substitution groups and indicated by the user in case of xsi:type) are treated as polymorphic.

Binary serialization provides an extensible, high-performance mechanism for saving the object model to and loading it from compact binary formats for storage or over-the-wire transfer. Binary representations contain only the data without any meta information or markup. Consequently, saving to and loading from a binary format can be an order of magnitude faster as well as result in a much smaller application footprint compared to parsing and serializing the same data in XML. Plus, the resulting representation is normally several times smaller than the equivalent XML.

Built-in support is provided for XDR (via Sun RPC API) and CDR (via the ACE library) and custom formats can be easily added. XDR appears to be a particularly good choice for a portable format since it is part of the operating systems on most commonly-used embedded platforms (for example, Linux, VxWorks, QNX, LynxOS, iPhone OS).
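To give a feel for why such a representation is compact, here is a hand-written sketch of XDR-style encoding. It is neither the code XSD/e generates nor the Sun RPC API. The buffer type and the put_* functions are hypothetical helpers that follow the XDR rules for the three kinds of values used in the earlier example: a 32-bit integer is four big-endian bytes, a float is its IEEE 754 single-precision bit pattern in the same byte order, and a variable-length array is an element count followed by the elements.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

typedef std::vector<unsigned char> buffer;

// A 32-bit integer is written as four bytes, most significant first.
inline void put_uint32 (buffer& b, std::uint32_t v)
{
  b.push_back (static_cast<unsigned char> (v >> 24));
  b.push_back (static_cast<unsigned char> (v >> 16));
  b.push_back (static_cast<unsigned char> (v >> 8));
  b.push_back (static_cast<unsigned char> (v));
}

inline void put_int32 (buffer& b, std::int32_t v)
{
  put_uint32 (b, static_cast<std::uint32_t> (v));
}

// A float is written as its IEEE 754 single-precision bit pattern.
// Assumes the host float is itself a 32-bit IEEE 754 value.
inline void put_float (buffer& b, float v)
{
  static_assert (sizeof (float) == 4, "32-bit float expected");
  std::uint32_t bits;
  std::memcpy (&bits, &v, sizeof (bits));
  put_uint32 (b, bits);
}

// A variable-length array is its element count followed by the elements.
inline void put_series (buffer& b, const std::vector<std::int32_t>& s)
{
  assert (s.size () <= UINT32_MAX);
  put_uint32 (b, static_cast<std::uint32_t> (s.size ()));
  for (std::int32_t v: s)
    put_int32 (b, v);
}
```

Each value in a series takes 4 bytes in this form, while the element <value>28</value> alone takes 17 bytes in XML, before any indentation.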

One common use-case for binary serialization is an embedded system that needs to consume and/or supply data in XML format but cannot afford to include an XML parser and/or serializer due to performance or footprint constraints. The requirement to use XML may come from the use of existing or third party desktop/server applications on the other end or from the use of industry-standard, XML-based formats. In this situation a control or gateway application running in a non-embedded environment translates the XML data sent to the embedded systems to a binary representation and then translates the binary representation received from the embedded systems back to XML.

There are a number of other interesting features in the C++/Hybrid mapping that I didn’t cover in this post, including:

  • Precise reproduction of the XML vocabulary structure and element order
  • Filtering of XML data during parsing and object model during serialization
  • Customizable object model classes as well as parsing and serialization code

If you would like more information on these and other features, the C++/Hybrid Mapping page is a good starting point.

Xerces-C++ 3.0.0 Released

Monday, October 6th, 2008

Quite a few people believed this would never happen, but after many years of development Xerces-C++ 3.0.0 is finally out. This major release includes a large number of new features, bug fixes, and clean-ups. It also happens to break a few interfaces (especially in DOM) so application adjustments may be required. For the complete list of changes in this version refer to the official announcement on the project’s mailing lists. In this post I am going to cover some of the major improvements in more detail.

As with 2.8.0, this release comes with a wide range of precompiled libraries (total 17) for various CPU architectures, operating systems, and C++ compilers. For most platforms 32-bit and 64-bit variants are provided. Note also that while the libraries are built using specific C++ compiler versions, most of them will also work with newer versions of the same compilers. For example, libraries built with GCC 3.4.x will also work with GCC 4.x.y. Similarly, libraries built with Sun C++ 5.7 (Studio 10) will work with Sun C++ 5.8.

The first thing GNU/Linux, UNIX, and Mac OS X users will notice is the new, automake-based build system for these platforms. There is no more XERCESCROOT or runConfigure and all the standard configure options are supported. There are also a number of options specific to Xerces-C++, which can all be viewed by executing configure --help. For Windows users the distribution comes with VC++ project files. In this release a set for VC++ 9.0 (2008) was added. Additionally, project files for VC++ 7.1, 8.0, and 9.0 now include targets to build Xerces-C++ with the ICU library as a character transcoder.

Other infrastructure work includes the removal of deprecated components (DepDOM, COM) as well as project files for unmaintained compilers. The documentation was cleaned up and split into the website and library categories, with the Xerces-C++ distributions now only including the library documentation (build instructions, programming guides, etc). Overall, I believe all of this will get the Xerces-C++ project back on the regular release track with the next release (3.1.0) in about a year.

Now to the new functionality in the library itself. The Xerces-C++ component that got the most work in this release is probably XML Schema. It includes a large number of bug fixes and errata changes. In particular, the long-standing bug that resulted in long execution times and stack overflows on schemas with large minOccurs and maxOccurs values has been fixed. Also, the new interpretation of the ##other namespace designator has been implemented. Related but not limited to XML Schema is the work done to review and clean up all the diagnostic messages issued by Xerces-C++. They were all clarified and now consistently start with a lower-case letter and do not include a period at the end.

Prior to the 3.0.0 release Xerces-C++ included the draft DOM XPath 1 interfaces that were barely usable and required a lot of casting to the implementation when used with XPath 2 processors such as XQilla. In 3.0.0, the DOM XPath interfaces were extended to support both XPath 1 and XPath 2 data models. As a result, the application can now depend only on interfaces. The 2.2.0 release of XQilla, due in a few weeks, will include support for Xerces-C++ 3.0.0. Furthermore, the 3.0.0 release implements the XML Schema subset of XPath 1 in DOM. This allows you to execute basic XPath queries without requiring a separate XPath processor library.

Another major change in Xerces-C++ 3.0.0 is the porting of all public interfaces and a major part of the implementation to use 64-bit safe types. This means that if you design your application to be 64-bit safe (e.g., use std::size_t for indexes, lengths, etc.), then you don’t need to perform any casts when interfacing with Xerces-C++.

Finally, a number of performance-critical parts were optimized for speed in this release. This resulted, for example, in both DOM parsing and XML Schema validation showing about 25%-30% improvement compared to 2.8.0.

When I first started working on the 3.0.0 code base it was quite a mess, with the automake-based build system still unfinished and most of the source code 64-bit ignorant. At that point I decided that we would need to maintain both 2.8.0 and 3.0.0 in parallel in case the 3.0.0 release happened to be a disaster. This is why the Xerces-C++ project website now includes two sections, one for 2.8.0 and one for 3.0.0. As the release manager, my primary goals for Xerces-C++ 3.0.0 became to make it cleaner, easier to build, better tested, as well as to provide better XML Schema support. Two betas later, I think 3.0.0 came out to be a very solid release, better than 2.8.0 in every aspect and, in retrospect, making that website split probably unnecessary. In fact, we were confident enough to build all our XSD 3.2.0 binary distributions with Xerces-C++ 3.0.0.