CodeSynthesis XSD/e 3.0.x released

XSD/e 3.1.0 was released a couple of days ago. In fact, we released 3.0.0 about two months ago but I haven’t talked much about it. This is because after the 3.0.0 release we got quite a bit of very positive feedback along with requests for additional, more advanced features that we promised to add but haven’t yet implemented. So we decided to do another quick iteration and release 3.1.0. In this post I will highlight what’s new in both XSD/e 3.0.0 and 3.1.0 (official announcements: XSD/e 3.0.0 and XSD/e 3.1.0).

Prior to the 3.0.0 release, XSD/e only supported the event-driven XML parsing/serialization mode where you had to process/supply data as the document was being parsed/serialized. While this mode is particularly suitable for mobile and embedded systems due to its low memory consumption, many users asked for an easier-to-use, in-memory, tree-like representation of the data stored in XML. As a result, XSD/e 3.0.0 shipped with a new XML Schema to C++ mapping: C++/Hybrid.

There were a number of challenges that we had to overcome before introducing such a mapping into XSD/e. Unlike general-purpose platforms, embedded systems are often severely constrained by the amount of memory available to the application. In fact, for single-purpose, mass-produced devices such as network modems the goal is to use as little RAM as possible since every megabyte not present in the device translates into huge savings for the manufacturer.

Thus, the first goal of the new mapping was to provide an in-memory representation of XML data using the least amount of RAM possible. For example, we couldn’t adopt the approach used in C++/Tree, our general-purpose in-memory mapping, where each node in the object model is allocated dynamically, because it wastes too much memory in extra pointers, heap management data, etc. At the same time we couldn’t allocate everything statically either since the copying involved in passing by value may be too expensive for some objects. As a result, the C++/Hybrid mapping divides all types into two categories: fixed-length and variable-length (if you are familiar with the IDL to C++ mapping in CORBA, you probably recognize the concept). Fixed-length types are allocated statically and returned by value while variable-length types are allocated dynamically and returned as pointers. This approach minimizes the memory usage while avoiding expensive copying. Consider the following schema fragment as an example:

<complexType name="point_t">
  <sequence>
    <element name="x" type="float"/>
    <element name="y" type="float"/>
    <element name="z" type="float"/>
  </sequence>
</complexType>
 
<complexType name="series_t">
  <sequence>
    <element name="value" type="int" maxOccurs="unbounded"/>
  </sequence>
</complexType>
 
<complexType name="measure_t">
  <sequence>
    <element name="point" type="point_t"/>
    <element name="series" type="series_t"/>
  </sequence>
</complexType>

The corresponding C++/Hybrid object model is shown below:

  // point_t (fixed-length)
  //
  class point_t
  {
  public:
    float x () const;
    float& x ();
    void x (float);
 
    float y () const;
    float& y ();
    void y (float);
 
    float z () const;
    float& z ();
    void z (float);
 
  private:
    float x_;
    float y_;
    float z_;
  };
 
  // series_t (variable-length)
  //
  class series_t
  {
  public:
    typedef pod_sequence<int> value_sequence;
    typedef value_sequence::iterator value_iterator;
    typedef value_sequence::const_iterator value_const_iterator;
 
    const value_sequence& value () const;
    value_sequence& value ();
 
  private:
    value_sequence value_;
  };
 
  // measure_t (variable-length)
  //
  class measure_t
  {
  public:
    const point_t& point () const;
    point_t& point ();
    void point (const point_t&);
 
    const series_t& series () const;
    series_t& series ();
    void series (series_t*);
 
  private:
    point_t point_;
    series_t* series_;
  };

In the above example the point_t class is fixed-length and contained by value in the measure_t class. In contrast, series_t contains a sequence of ints which makes it variable-length (and expensive to copy). Instances of this class are dynamically allocated and stored as pointers in measure_t.
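
To make the fixed-length/variable-length split more concrete, here is a minimal, self-contained sketch of the same idea. It is not the generated code: std::vector stands in for pod_sequence, the demo function is made up for illustration, and the rule that the series (series_t*) modifier takes ownership of the object passed to it is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-length: three floats, cheap to copy, so held and passed by value.
struct point_t
{
  float x, y, z;
};

// Variable-length: owns a growable sequence, so copying is disabled and
// instances live on the heap. std::vector stands in for pod_sequence.
class series_t
{
public:
  series_t () {}

  std::vector<int>& value () {return value_;}
  const std::vector<int>& value () const {return value_;}

private:
  series_t (const series_t&);            // Not copyable.
  series_t& operator= (const series_t&);

  std::vector<int> value_;
};

class measure_t
{
public:
  measure_t (): point_ (), series_ (0) {}
  ~measure_t () {delete series_;}

  const point_t& point () const {return point_;}
  void point (const point_t& p) {point_ = p;} // Copies three floats.

  const series_t& series () const {assert (series_ != 0); return *series_;}

  // Assumption: the modifier takes ownership of the passed object.
  void series (series_t* s) {delete series_; series_ = s;}

private:
  measure_t (const measure_t&);          // Not copyable.
  measure_t& operator= (const measure_t&);

  point_t point_;     // Fixed-length member stored inline.
  series_t* series_;  // Variable-length member stored as a pointer.
};

// Hypothetical usage: returns the number of values stored in the series.
std::size_t demo ()
{
  measure_t m;

  point_t p = {12.3f, 45.6f, 78.9f};
  m.point (p);                 // Copied by value.

  series_t* s = new series_t;  // Allocated dynamically.
  s->value ().push_back (28);
  s->value ().push_back (29);
  s->value ().push_back (27);
  m.series (s);                // m now owns s; no elements are copied.

  return m.series ().value ().size ();
}
```

Passing the series by pointer means that handing it to measure_t costs one pointer assignment regardless of how many values it holds, while the point is small enough that copying it is cheaper than a separate heap allocation.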

But even with optimal memory usage, an in-memory mapping may be unusable in an embedded environment for all but very small XML documents. A 100KB document is trivial by today’s desktop or server standards, but loading such a document into memory all at once on an embedded system may be prohibitively expensive. So we have the event-driven mode, which uses very little RAM but is harder to use, especially for larger XML vocabularies. And we have the more convenient in-memory mode, which requires too much memory for all but fairly small documents. In C++/Hybrid we solved this by supporting a hybrid (hence the mapping name) partially in-memory, partially event-driven mode. In this mode the XML document is handled in fragments represented as in-memory object models: your application is supplied with them during parsing and supplies them during serialization. The following example will help illustrate how this works. Let’s extend the schema presented above with the data_t type:

<complexType name="data_t">
  <sequence>
    <element name="measure" type="measure_t" 
             maxOccurs="unbounded"/>
  </sequence>
</complexType>
 
<element name="data" type="data_t"/>

The corresponding XML document might look like this:

<data>
  <measure>
    <point>
      <x>12.3</x>
      <y>45.6</y>
      <z>78.9</z>
    </point>
    <series>
      <value>28</value>
      <value>29</value>
      <value>27</value>
    </series>
  </measure>
 
  ...
 
</data>

Let’s assume the XML document above contains a couple of thousand measure records, which makes it too large to load into memory all at once. With C++/Hybrid you can set up parsing/serialization so that your application receives/supplies the measure records one by one, each as an instance of the measure_t class. The depth in the XML document at which you “switch” from event-driven to in-memory processing is arbitrary and is not limited to the top level. For example, if instead of thousands of measure records we had only a few, each containing hundreds of thousands of value records, we could set up parsing/serialization so that the application receives/supplies the point data as an instance of point_t and then each value one by one as an int.
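
The shape of this hybrid processing can be sketched without any XML machinery. The code below is only an illustration of the control flow, not the XSD/e API: measure_handler, parse_data, and stats are hypothetical names, parse_data merely stands in for a parser by fabricating records, and measure_t is simplified to a plain struct.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct point_t
{
  float x, y, z;
};

// Simplified in-memory fragment; std::vector stands in for series_t.
struct measure_t
{
  point_t point;
  std::vector<int> series;
};

// Event-driven outer layer: called once per measure element.
// Hypothetical interface, not part of XSD/e.
class measure_handler
{
public:
  virtual ~measure_handler () {}
  virtual void measure (const measure_t&) = 0;
};

// Stand-in for the parser: builds one fragment at a time and hands it
// to the handler. A real parser would fill m from the XML stream.
void parse_data (std::size_t n, measure_handler& h)
{
  for (std::size_t i = 0; i < n; ++i)
  {
    measure_t m; // The only fragment alive at this point.
    m.point.x = m.point.y = m.point.z = static_cast<float> (i);
    m.series.push_back (28);
    m.series.push_back (29);
    m.series.push_back (27);

    h.measure (m);
  } // The fragment is released before the next one is built.
}

// Application code: keeps running totals instead of the whole document.
class stats: public measure_handler
{
public:
  stats (): count_ (0), sum_ (0) {}

  virtual void measure (const measure_t& m)
  {
    ++count_;

    for (std::size_t i = 0; i < m.series.size (); ++i)
      sum_ += m.series[i];
  }

  std::size_t count () const {return count_;}
  long sum () const {return sum_;}

private:
  std::size_t count_;
  long sum_;
};
```

Calling parse_data (2000, s) with a stats instance s walks through two thousand records while never holding more than one measure_t in memory. The application still works with a convenient object model inside each callback.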

So that was XSD/e 3.0.0. After its release a number of people started using the new mapping and providing us with feedback. It became apparent that a couple of the more advanced features that we had left out of the initial C++/Hybrid release were needed. These were added in XSD/e 3.1.0, the two major ones being support for XML Schema polymorphism and binary serialization.

Support for polymorphism allows C++/Hybrid to handle XML vocabularies that use substitution groups and/or xsi:type dynamic typing. To minimize the generated code size we used a new approach where only certain type hierarchies (automatically detected in case of substitution groups and indicated by the user in case of xsi:type) are treated as polymorphic.

Binary serialization provides an extensible, high-performance mechanism for saving the object model to and loading it from compact binary formats for storage or over-the-wire transfer. Binary representations contain only the data without any meta information or markup. Consequently, saving to and loading from a binary format can be an order of magnitude faster than parsing and serializing the same data in XML, and can result in a much smaller application footprint. Plus, the resulting representation is normally several times smaller than the equivalent XML.

Built-in support is provided for XDR (via the Sun RPC API) and CDR (via the ACE library), and custom formats can be easily added. XDR appears to be a particularly good choice for a portable format since it is part of the operating system on most commonly used embedded platforms (for example, Linux, VxWorks, QNX, LynxOS, iPhone OS).
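
To give a feel for how compact such a representation is, here is a hand-rolled sketch of an XDR-style encoding for a single measure record. It does not use the Sun RPC API or the code that XSD/e generates. The measure struct and the put/get/save/load helpers are names made up for this illustration, and the code assumes that float is a 32-bit IEEE 754 value. What it follows from XDR is the layout: 4-byte big-endian units, with a variable-length array written as an element count followed by the elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

typedef std::vector<unsigned char> buffer;

// XDR stores every item in 4-byte big-endian units.
void put_u32 (buffer& b, std::uint32_t v)
{
  b.push_back (static_cast<unsigned char> (v >> 24));
  b.push_back (static_cast<unsigned char> (v >> 16));
  b.push_back (static_cast<unsigned char> (v >> 8));
  b.push_back (static_cast<unsigned char> (v));
}

std::uint32_t get_u32 (const buffer& b, std::size_t& pos)
{
  assert (pos + 4 <= b.size ());
  std::uint32_t v = (std::uint32_t (b[pos]) << 24) |
                    (std::uint32_t (b[pos + 1]) << 16) |
                    (std::uint32_t (b[pos + 2]) << 8) |
                    std::uint32_t (b[pos + 3]);
  pos += 4;
  return v;
}

// Assumes float is a 32-bit IEEE 754 value.
void put_float (buffer& b, float f)
{
  std::uint32_t v;
  std::memcpy (&v, &f, sizeof (v));
  put_u32 (b, v);
}

float get_float (const buffer& b, std::size_t& pos)
{
  std::uint32_t v = get_u32 (b, pos);
  float f;
  std::memcpy (&f, &v, sizeof (f));
  return f;
}

// Simplified, hypothetical stand-in for measure_t.
struct measure
{
  float x, y, z;
  std::vector<std::int32_t> values;
};

void save (buffer& b, const measure& m)
{
  put_float (b, m.x);
  put_float (b, m.y);
  put_float (b, m.z);

  // Variable-length array: element count first, then the elements.
  put_u32 (b, static_cast<std::uint32_t> (m.values.size ()));

  for (std::size_t i = 0; i < m.values.size (); ++i)
    put_u32 (b, static_cast<std::uint32_t> (m.values[i]));
}

void load (const buffer& b, std::size_t& pos, measure& m)
{
  m.x = get_float (b, pos);
  m.y = get_float (b, pos);
  m.z = get_float (b, pos);

  std::uint32_t n = get_u32 (b, pos);
  m.values.clear ();

  for (std::uint32_t i = 0; i < n; ++i)
    m.values.push_back (static_cast<std::int32_t> (get_u32 (b, pos)));
}
```

For a record with three coordinates and three values this encoding takes 28 bytes: 12 for the point, 4 for the element count, and 12 for the values. There are no tag names or text-to-number conversions involved, which is where the size and speed advantages come from.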

One common use-case for binary serialization is an embedded system that needs to consume and/or supply data in XML format but cannot afford to include an XML parser and/or serializer due to performance or footprint constraints. The requirement to use XML may come from the use of existing or third party desktop/server applications on the other end or from the use of industry-standard, XML-based formats. In this situation a control or gateway application running in a non-embedded environment translates the XML data sent to the embedded systems to a binary representation and then translates the binary representation received from the embedded systems back to XML.

There are a number of other interesting features in the C++/Hybrid mapping that I didn’t cover in this post, including:

  • Precise reproduction of the XML vocabulary structure and element order
  • Filtering of XML data during parsing and object model during serialization
  • Customizable object model classes as well as parsing and serialization code

If you would like more information on these and other features, the C++/Hybrid Mapping page is a good starting point.
