XSD 3.3.0 released

XSD 3.3.0 was released yesterday. For an exhaustive list of the new features see the official announcement. In this post I am going to cover a few major features in more detail and include some insights into what motivated their addition.

Besides the new features, XSD 3.3.0 includes a large number of bug fixes and performance improvements. The performance improvements should be especially welcomed by those who have very large and complex schemas (the speedup can be up to 100 times in some cases; for a detailed account of one such optimization see this earlier post).

This release also coincides with the release of Xerces-C++ 3.1.1 which is a bugfix-only release for 3.1.0. Compared to 3.0.x, Xerces-C++ 3.1.x includes a number of new features and a large number of bug fixes, particularly in the XML Schema processor. XSD 3.3.0 has been extensively tested with this version of Xerces-C++ and all the pre-compiled binaries are built with 3.1.1.

This release also adds support for a number of new OS versions (AIX 6, Windows 7/Server 2008) and C++ compiler updates (Visual Studio 2010, GNU g++ 4.5.0, Intel C++ 11, Sun Studio 12.1, and IBM XL C++ 11). In particular, the distribution includes Visual Studio 2010 custom build rule files as well as the project and solution files for all the examples. And if you haven’t had a chance to try Visual Studio 2010 and think that upgrading a solution from previous versions is a smooth process, I am sorry to disappoint you. VS 2010 now uses MSBuild for compilation, and conversion from previous versions is a very slow and brittle process. I had to hand-fix the auto-converted project files on multiple occasions for both Xerces-C++ and XSD.

Configurable application character encoding

We have been getting quite a few emails where someone would try to set a string value in the object model and then get the invalid_utf8_string exception when serializing this object model to XML. This happens because the string value contains a non-ASCII character in some other encoding, usually ISO-8859-1. Since the object model expects all text data to be in UTF-8, such a character would be treated as part of a bogus multi-byte sequence. This was considered a major inconvenience by quite a few users.
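The failure mode described above can be reproduced without XSD. The following is a minimal sketch, not the XSD or Xerces-C++ implementation: is_valid_utf8() and latin1_to_utf8() are hypothetical helper names, and the validity check is simplified (it only looks at the lead/continuation byte structure). It shows why an ISO-8859-1 byte such as 0xE9 looks like the start of a bogus multi-byte sequence, and what the equivalent UTF-8 text looks like.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s has a well-formed UTF-8 byte
// structure. Simplified: overlong forms and surrogates are not rejected.
bool is_valid_utf8 (const std::string& s)
{
  std::size_t i (0), n (s.size ());

  while (i < n)
  {
    unsigned char c (static_cast<unsigned char> (s[i]));
    std::size_t extra; // Number of continuation bytes expected.

    if (c < 0x80)                extra = 0; // 0xxxxxxx (ASCII)
    else if ((c & 0xE0) == 0xC0) extra = 1; // 110xxxxx
    else if ((c & 0xF0) == 0xE0) extra = 2; // 1110xxxx
    else if ((c & 0xF8) == 0xF0) extra = 3; // 11110xxx
    else return false; // Stray continuation byte or invalid lead byte.

    if (extra > n - i - 1)
      return false; // Sequence is truncated.

    for (std::size_t j (1); j <= extra; ++j)
    {
      unsigned char t (static_cast<unsigned char> (s[i + j]));

      if ((t & 0xC0) != 0x80)
        return false; // Not a 10xxxxxx continuation byte.
    }

    i += extra + 1;
  }

  return true;
}

// Hypothetical helper: converts ISO-8859-1 text to UTF-8. Every
// ISO-8859-1 byte maps to the Unicode code point with the same value,
// so bytes >= 0x80 become two-byte UTF-8 sequences.
std::string latin1_to_utf8 (const std::string& s)
{
  std::string r;
  r.reserve (s.size ());

  for (std::size_t i (0); i < s.size (); ++i)
  {
    unsigned char c (static_cast<unsigned char> (s[i]));

    if (c < 0x80)
      r += static_cast<char> (c);
    else
    {
      r += static_cast<char> (0xC0 | (c >> 6));
      r += static_cast<char> (0x80 | (c & 0x3F));
    }
  }

  assert (is_valid_utf8 (r));
  return r;
}
```

In ISO-8859-1 the string "caf\xE9" ends with the single byte 0xE9. Read as UTF-8, that byte announces a three-byte sequence that never arrives, which is the kind of input that would trip the serializer.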

Starting with XSD 3.3.0 you can configure the character encoding that should be used by the object model (--char-encoding). The default is still UTF-8 (for the char character type). But you can also specify iso8859-1, lcp (Xerces-C++ local code page), and custom.

The custom option allows you to support a custom encoding. For this to work you will need to implement the transcoder interface for your encoding (see the libxsd/xsd/cxx/xml/char-* files for examples) and include this implementation’s header at the beginning of the generated header files (see the --hxx-prologue option).

Note also that this mechanism replaces the XSD_USE_LCP macro that was used to select the Xerces-C++ local code page encoding in previous versions of XSD.

Uniform handling of multiple root elements

By default in the C++/Tree mapping you get a set of parsing/serialization functions for the document root element. You can then call one of these functions to parse/serialize the object model. If you have a single root element then this approach works very well. But what if your documents can have varying root elements? This is a fairly common scenario when the schema describes some kind of messaging protocol. The root elements can then correspond to messages, as in balance, withdraw, and deposit.

Prior to XSD 3.3.0, in order to handle such a vocabulary, you would need to first parse the document to DOM, check which root element it has, and then call the corresponding parsing function. Similarly, for serialization, you would have to determine which message it is, and call the corresponding serialization function. If you have tens or hundreds of root elements to handle, writing and maintaining such code manually quickly becomes burdensome.

In XSD 3.3.0 you can instruct the compiler to generate wrapper types instead of parsing/serialization functions for root elements in your vocabulary (--generate-element-type). You can also request the generation of an element map for uniform parsing/serialization of the element types (--generate-element-map). The application code would then look like this:

auto_ptr<xml_schema::element_type> req;
 
// Parse the XML request to a DOM document using
// the parse() function from dom-parse.hxx.
//
xml_schema::dom::auto_ptr<DOMDocument> doc (parse (...));
DOMElement& root (*doc->getDocumentElement ());
 
req = xml_schema::element_map::parse (root);
 
// We can test which request we've got either using RTTI
// or by comparing the element names, as shown below.
//
if (balance* b = dynamic_cast<balance*> (req.get ()))
{
  account_t& a (b->value ());
  ...
}
else if (req->_name () == withdraw::name ())
{
  withdraw& w (static_cast<withdraw&> (*req));
  ...
}
else if (req->_name () == deposit::name ())
{
  deposit& d (static_cast<deposit&> (*req));
  ...
}

For more information on the element types and map see the messaging example in the XSD distribution as well as Section 2.9.1, “Element Types” and Section 2.9.2, “Element Map” in the C++/Tree Mapping User Manual.
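To give a feel for what an element map does conceptually, here is a minimal, self-contained sketch. It is not the generated code or the XSD runtime: the element_type, balance, withdraw, and element_map classes below are simplified stand-ins written for illustration, parse() takes a root element name instead of a DOMElement, and std::unique_ptr is used in place of std::auto_ptr. The idea is simply a registry that maps a root element name to a function creating the matching wrapper type.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Simplified stand-in for the common base of all element types.
struct element_type
{
  virtual ~element_type () {}
  virtual std::string name () const = 0;
};

// Simplified stand-ins for two generated element types.
struct balance: element_type
{
  std::string name () const { return "balance"; }
};

struct withdraw: element_type
{
  std::string name () const { return "withdraw"; }
};

// Registry that maps a root element name to a factory function.
class element_map
{
public:
  typedef std::unique_ptr<element_type> (*factory) ();

  void register_element (const std::string& name, factory f)
  {
    assert (f != 0);
    map_[name] = f;
  }

  // In this sketch an unknown name yields a null pointer.
  std::unique_ptr<element_type> parse (const std::string& root_name) const
  {
    std::map<std::string, factory>::const_iterator i (map_.find (root_name));
    return i != map_.end () ? i->second () : std::unique_ptr<element_type> ();
  }

private:
  std::map<std::string, factory> map_;
};

// Generic factory used to populate the map.
template <typename T>
std::unique_ptr<element_type> create ()
{
  return std::unique_ptr<element_type> (new T);
}

element_map make_map ()
{
  element_map m;
  m.register_element ("balance", &create<balance>);
  m.register_element ("withdraw", &create<withdraw>);
  return m;
}
```

With such a registry the application no longer needs a hand-written chain of checks on the root element name before parsing; it only needs to examine the result, either with RTTI or by comparing names.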

Generation of the detach functions

XSD 3.3.0 adds the --generate-detach option which instructs the compiler to generate detach functions for required elements and attributes. For optional and sequence cardinalities the detach functions are provided by the respective containers (even without this option). These functions allow you to detach a sub-tree from an object model (returned as std::auto_ptr) and then re-attach it, either in the same object model or in a different one, using one of the std::auto_ptr-taking modifiers or constructors, all without making any copies. For more information on this feature, refer to Section 2.8, “Mapping for Local Elements and Attributes” in the C++/Tree Mapping User Manual.
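The ownership transfer behind detach and re-attach can be sketched as follows. This is an illustration under stated assumptions rather than generated code: the person and address classes and the detach_address() name are hypothetical, and std::unique_ptr stands in for std::auto_ptr so that the snippet compiles with current compilers.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical sub-tree type.
struct address
{
  std::string city;
};

// Hypothetical parent type that owns an address sub-tree.
class person
{
public:
  person () {}

  // Constructor that takes ownership of an existing sub-tree.
  explicit person (std::unique_ptr<address> a): address_ (std::move (a)) {}

  bool has_address () const { return address_.get () != 0; }

  const address& addr () const
  {
    assert (has_address ()); // Must not be called after detaching.
    return *address_;
  }

  // Modifier that takes ownership of an existing sub-tree.
  void addr (std::unique_ptr<address> a) { address_ = std::move (a); }

  // Hand the sub-tree over to the caller. No address object is copied;
  // this object no longer has an address afterwards.
  std::unique_ptr<address> detach_address () { return std::move (address_); }

private:
  std::unique_ptr<address> address_;
};
```

Moving a sub-tree from one object to another is then a single expression, p2.addr (p1.detach_address ()), and the address object that ends up in p2 is the very same object that used to live in p1.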

Smaller and faster code for polymorphic schemas

With XSD, schemas that use XML Schema polymorphism features (xsi:type and substitution groups) have to be compiled with the --generate-polymorphic option. This results in two major changes in the generated code: all types are registered in type maps and parsing/serialization of elements has to go through these maps. As a result, such generated code is bigger and generally slower than the non-polymorphic version.

The major drawback of this approach is that it treats all types as potentially polymorphic while in most vocabularies only a handful of types are actually meant to be polymorphic (XML Schema has no way of distinguishing between polymorphic and non-polymorphic types — all types are potentially polymorphic). To address this problem in XSD 3.3.0 we have changed the way the compiler decides which types are polymorphic. Now, unless the --polymorphic-type-all option is specified (in which case the old behavior is used), only type hierarchies that are used in substitution groups or that are explicitly marked with the new --polymorphic-type option are treated as polymorphic.

There are two situations where you might need to use the --polymorphic-type option. The first is when your vocabulary uses the xsi:type-based dynamic typing. In this case the XSD compiler has no way of knowing which types are polymorphic. The second situation involves multiple schema files with one file defining the type and the second including/importing the first file and using the type in a substitution group. In this case the XSD compiler has no knowledge of the substitution group while compiling the first file and, as a result, has no way of knowing that the type is polymorphic. To help you identify the second situation the XSD compiler will issue a warning for each such case. Note also that you only need to specify the base of a polymorphic type hierarchy with the --polymorphic-type option. All the derived types will be assumed polymorphic automatically.
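The rule that only the base of a hierarchy needs to be marked can be modeled in a few lines. The sketch below is not how the XSD compiler is implemented; is_polymorphic(), the base_map typedef, and the type names in the usage note are made up for illustration. A type counts as polymorphic if it, or any type it derives from, is in the set of marked roots (for example, types named with --polymorphic-type or used in substitution groups).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Maps a derived type name to the name of its base type. Types without
// a base have no entry.
typedef std::map<std::string, std::string> base_map;

// Returns true if type or one of its ancestors is a marked root.
bool is_polymorphic (const std::string& type,
                     const base_map& bases,
                     const std::set<std::string>& roots)
{
  std::string t (type);
  std::size_t steps (0);

  for (;;)
  {
    if (roots.count (t) != 0)
      return true;

    base_map::const_iterator i (bases.find (t));

    if (i == bases.end ())
      return false; // Reached the top of the hierarchy unmarked.

    // Derivation chains cannot be cyclic, so the walk visits each
    // entry at most once.
    assert (++steps <= bases.size ());

    t = i->second;
  }
}
```

For example, if manager derives from employee, which derives from person, then marking person alone makes all three polymorphic, while an unrelated invoice type stays non-polymorphic and keeps the smaller, faster code.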

For more information on this change see Section 2.11, “Mapping for xsi:type and Substitution Groups” in the C++/Tree Mapping User Manual.

New examples: embedded, compression, and streaming

A number of new examples have been added in this release with the most interesting ones being embedded, compression, and streaming.

The embedded example shows how to embed the binary representation of the schema grammar into an application and then use it to parse and validate XML documents. It uses the little-known Xerces-C++ feature that allows one to load a number of schemas into the grammar cache and then serialize this grammar cache into a binary representation. The example provides a small utility, xsdbin, that creates this representation and then writes it out as a pair of C++ files containing an array with the binary data. This pair of files is then compiled and linked into the application. The main advantages of this approach over having a set of external schema files are that the application becomes self-sufficient (no need to locate the schema files) and the grammar loading is done from a pre-parsed state which can be much faster for larger schemas.

The compression example shows how to perform on-the-fly compression and decompression of XML documents during serialization and parsing, respectively. It uses the compression functionality provided by the zlib library and writes the data in the standard gzip format.

The streaming example is not really new but it has been significantly reworked. While the in-memory representation offered by C++/Tree is quite convenient, it may not be usable if the XML documents to be parsed or serialized are too big to fit into memory. There is, however, a way to still use C++/Tree which boils down to performing partially in-memory XML processing by only having a portion of the object model in memory at any given time. With this approach we can process parts of the document as they become available as well as handle documents that are too large to fit into memory all at once.

The parsing part in this example is handled by a stream-oriented DOM parser implementation that is built on top of the Xerces-C++ SAX2 parser in the progressive parsing mode. This parser allows us to parse an XML document as a series of DOM fragments which are then converted to object model fragments. Similarly, the serialization part is handled by a stream-oriented DOM serializer implementation that allows us to serialize an XML document as a series of object model fragments.

Improvements in the file-per-type mode

With the introduction of the file-per-type mode in XSD 3.1.0 people started trying to compile very “hairy” (for lack of a better word) schemas. Such schemas contain files that are not legal by themselves (lacking some include or import directives) and that have include/import cycles. Some of these schemas also contain a large number of files that are spread over a multi-level directory hierarchy.

This uncovered the following problem with the file-per-type mode. In this mode the compiler generates a set of C++ source files for each schema type. It also generates C++ files corresponding to each schema file that simply include the header files corresponding to the types defined in this schema. All these files are generated into the same directory. While the compiler automatically resolves conflicts between the generated type files, it assumed that the schema file names would be unique. This proved not to be the case: there are schemas that contain identically named files in different sub-directories.

Working on the fix for this problem made us think that some people might actually prefer to place the generated code for such schemas into sub-directories that model the original schema hierarchy. In order to support this scenario we have added the --schema-file-regex option which, together with the existing --type-file-regex, can be used to place the generated files into subdirectories. For an example that shows how to do this, see the GML 3.2.1 section on the GML Wiki page.
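The name conflict and the sub-directory layout discussed above can be illustrated with a small sketch. It does not use or reproduce the syntax of the --schema-file-regex and --type-file-regex options; flat_name(), hierarchical_name(), and the schema paths in the usage note are hypothetical and only model the two naming strategies.

```cpp
#include <cassert>
#include <string>

// Replace a trailing ".xsd" with ".hxx" (or append ".hxx" if there is
// no such suffix).
std::string header_name (const std::string& path)
{
  const std::string xsd (".xsd");

  if (path.size () >= xsd.size () &&
      path.compare (path.size () - xsd.size (), xsd.size (), xsd) == 0)
    return path.substr (0, path.size () - xsd.size ()) + ".hxx";

  return path + ".hxx";
}

// Everything goes into one directory: the schema's directory part is
// dropped, so identically named schema files collide.
std::string flat_name (const std::string& schema_path)
{
  assert (!schema_path.empty ());

  std::string::size_type p (schema_path.rfind ('/'));
  std::string base (p == std::string::npos
                    ? schema_path
                    : schema_path.substr (p + 1));

  return header_name (base);
}

// The generated file keeps the schema's directory part, so the output
// mirrors the original schema hierarchy.
std::string hierarchical_name (const std::string& schema_path)
{
  assert (!schema_path.empty ());
  return header_name (schema_path);
}
```

Given two schema files gml/base/types.xsd and gml/geometry/types.xsd, the flat strategy maps both to types.hxx, while the hierarchical one yields gml/base/types.hxx and gml/geometry/types.hxx.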
