Archive for the ‘Development’ Category

XSD 3.3.0 released

Thursday, April 29th, 2010

XSD 3.3.0 was released yesterday. For an exhaustive list of the new features see the official announcement. In this post I am going to cover a few major features in more detail and include some insights into what motivated their addition.

Besides the new features, XSD 3.3.0 includes a large number of bug fixes and performance improvements. The performance improvements should be especially welcome to those who have very large and complex schemas (the speedup can be up to 100 times in some cases; for a detailed account of one such optimization see this earlier post).

This release also coincides with the release of Xerces-C++ 3.1.1 which is a bugfix-only release for 3.1.0. Compared to 3.0.x, Xerces-C++ 3.1.x includes a number of new features and a large number of bug fixes, particularly in the XML Schema processor. XSD 3.3.0 has been extensively tested with this version of Xerces-C++ and all the pre-compiled binaries are built with 3.1.1.

This release also adds support for a number of new OS versions (AIX 6, Windows 7/Server 2008) and C++ compiler updates (Visual Studio 2010, GNU g++ 4.5.0, Intel C++ 11, Sun Studio 12.1, and IBM XL C++ 11). In particular, the distribution includes Visual Studio 2010 custom build rule files as well as the project and solution files for all the examples. And if you haven’t had a chance to try Visual Studio 2010 and think that upgrading a solution from previous versions is a smooth process, I am sorry to disappoint you. VS 2010 now uses MSBuild for doing the compilation and conversion from previous versions is a very slow and brittle process. I had to hand-fix the auto-converted project files on multiple occasions for both Xerces-C++ and XSD.

Configurable application character encoding

We have been getting quite a few emails where someone would try to set a string value in the object model and then get the invalid_utf8_string exception when serializing this object model to XML. This happens because the string value contains a non-ASCII character in some other encoding, usually ISO-8859-1. Since the object model expects all text data to be in UTF-8, such a character would be treated as part of a bogus multi-byte sequence. This was considered a major inconvenience by quite a few users.
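To see why this happens, consider “café”: in ISO-8859-1 the é is the single byte 0xE9, while in UTF-8 it is the two-byte sequence 0xC3 0xA9. A lone 0xE9 announces a multi-byte sequence that never materializes, which is roughly what the following simplified check detects (a sketch for illustration, not XSD’s actual validation code):

```cpp
#include <string>

// A simplified UTF-8 well-formedness check: leading-byte
// classes and continuation-byte counts only (overlong and
// surrogate forms are not detected). A sketch, not XSD's
// actual validation code.
static bool valid_utf8 (const std::string& s)
{
  for (std::size_t i = 0; i < s.size ();)
  {
    unsigned char c = static_cast<unsigned char> (s[i]);
    std::size_t n; // expected continuation bytes

    if (c < 0x80) n = 0;                 // ASCII
    else if ((c & 0xE0) == 0xC0) n = 1;  // 2-byte sequence
    else if ((c & 0xF0) == 0xE0) n = 2;  // 3-byte sequence
    else if ((c & 0xF8) == 0xF0) n = 3;  // 4-byte sequence
    else return false;                   // invalid leading byte

    // All continuation bytes must exist and look like 10xxxxxx.
    //
    if (i + n >= s.size ())
      return false;

    for (std::size_t j = 1; j <= n; ++j)
      if ((static_cast<unsigned char> (s[i + j]) & 0xC0) != 0x80)
        return false;

    i += n + 1;
  }

  return true;
}
```

Here valid_utf8 ("caf\xE9") (ISO-8859-1) fails while the proper UTF-8 form passes, which is essentially the situation that triggered invalid_utf8_string.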

Starting with XSD 3.3.0 you can configure the character encoding that should be used by the object model (--char-encoding). The default is still UTF-8 (for the char character type). But you can also specify iso8859-1, lcp (Xerces-C++ local code page), and custom.

The custom option allows you to support a custom encoding. For this to work you will need to implement the transcoder interface for your encoding (see the libxsd/xsd/cxx/xml/char-* files for examples) and include this implementation’s header at the beginning of the generated header files (see the --hxx-prologue option).

Note also that this mechanism replaces the XSD_USE_LCP macro that was used to select the Xerces-C++ local code page encoding in previous versions of XSD.

Uniform handling of multiple root elements

By default in the C++/Tree mapping you get a set of parsing/serialization functions for the document root element. You can then call one of these functions to parse/serialize the object model. If you have a single root element then this approach works very well. But what if your documents can have varying root elements? This is a fairly common scenario when the schema describes some kind of messaging protocol. The root elements then correspond to messages, such as balance, withdraw, and deposit.

Prior to XSD 3.3.0, in order to handle such a vocabulary, you would need to first parse the document to DOM, check which root element it has, and then call the corresponding parsing function. Similarly, for serialization, you would have to determine which message it is, and call the corresponding serialization function. If you have tens or hundreds of root elements to handle, writing and maintaining such code manually quickly becomes burdensome.

In XSD 3.3.0 you can instruct the compiler to generate wrapper types instead of parsing/serialization functions for root elements in your vocabulary (--generate-element-type). You can also request the generation of an element map for uniform parsing/serialization of the element types (--generate-element-map). The application code would then look like this:

auto_ptr<xml_schema::element_type> req;
 
// Parse the XML request to a DOM document using
// the parse() function from dom-parse.hxx.
//
xml_schema::dom::auto_ptr<DOMDocument> doc (parse (...));
DOMElement& root (*doc->getDocumentElement ());
 
req = xml_schema::element_map::parse (root);
 
// We can test which request we've got either using RTTI
// or by comparing the element names, as shown below.
//
if (balance* b = dynamic_cast<balance*> (req.get ()))
{
  account_t& a (b->value ());
  ...
}
else if (req->_name () == withdraw::name ())
{
  withdraw& w (static_cast<withdraw&> (*req));
  ...
}
else if (req->_name () == deposit::name ())
{
  deposit& d (static_cast<deposit&> (*req));
  ...
}

For more information on the element types and map see the messaging example in the XSD distribution as well as Section 2.9.1, “Element Types” and Section 2.9.2, “Element Map” in the C++/Tree Mapping User Manual.
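Under the hood, an element map of this kind boils down to a registry that maps root element names to factory/parser functions. The following self-contained sketch illustrates the idea (hypothetical names, std::unique_ptr in place of the era’s std::auto_ptr; this is not the actual generated XSD code):

```cpp
#include <map>
#include <memory>
#include <string>

// Common base for all root element wrapper types.
struct element_type
{
  virtual ~element_type () {}
  virtual std::string _name () const = 0;
};

struct balance: element_type
{
  std::string _name () const { return "balance"; }
};

struct withdraw: element_type
{
  std::string _name () const { return "withdraw"; }
};

// Element map: root element name -> factory function.
class element_map
{
public:
  typedef std::unique_ptr<element_type> (*factory) ();

  static void
  register_type (const std::string& name, factory f)
  {
    map_[name] = f;
  }

  // Dispatch on the root element name, much like the
  // generated element_map::parse() does on a DOM element.
  static std::unique_ptr<element_type>
  parse (const std::string& root_name)
  {
    std::map<std::string, factory>::const_iterator i =
      map_.find (root_name);
    return i != map_.end ()
      ? i->second ()
      : std::unique_ptr<element_type> ();
  }

private:
  static std::map<std::string, factory> map_;
};

std::map<std::string, element_map::factory> element_map::map_;

static std::unique_ptr<element_type> make_balance ()
{
  return std::unique_ptr<element_type> (new balance);
}

static std::unique_ptr<element_type> make_withdraw ()
{
  return std::unique_ptr<element_type> (new withdraw);
}
```

With such a registry, dispatching on tens or hundreds of root elements becomes a single map lookup instead of a hand-maintained if/else chain.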

Generation of the detach functions

XSD 3.3.0 adds the --generate-detach option which instructs the compiler to generate detach functions for required elements and attributes. For optional and sequence cardinalities the detach functions are provided by the respective containers even without this option. These functions allow you to detach a sub-tree from an object model (returned as std::auto_ptr) and then re-attach it, either in the same object model or in a different one, using one of the std::auto_ptr-taking modifiers or constructors, all without making any copies. For more information on this feature, refer to Section 2.8, “Mapping for Local Elements and Attributes” in the C++/Tree Mapping User Manual.
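The effect of a detach function can be illustrated with a generic, self-contained tree sketch (not the generated XSD API; std::unique_ptr stands in for std::auto_ptr): detaching transfers sole ownership of the sub-tree to the caller, who can then re-attach the same object elsewhere without copying it.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct node
{
  explicit node (const std::string& n): name (n) {}

  std::string name;
  std::vector<std::unique_ptr<node> > children;

  // Detach the i-th child, transferring ownership of the
  // sub-tree to the caller without making any copies.
  std::unique_ptr<node>
  detach (std::size_t i)
  {
    std::unique_ptr<node> r (std::move (children[i]));
    children.erase (children.begin () + i);
    return r;
  }

  // Re-attach a sub-tree, again without copying.
  void
  attach (std::unique_ptr<node> n)
  {
    children.push_back (std::move (n));
  }
};
```

The detached sub-tree is the very same object before and after the transfer; only the owning pointer moves.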

Smaller and faster code for polymorphic schemas

With XSD, schemas that use XML Schema polymorphism features (xsi:type and substitution groups) have to be compiled with the --generate-polymorphic option. This results in two major changes in the generated code: all types are registered in type maps and parsing/serialization of elements has to go through these maps. As a result, such generated code is bigger and generally slower than the non-polymorphic version.

The major drawback of this approach is that it treats all types as potentially polymorphic while in most vocabularies only a handful of types are actually meant to be polymorphic (XML Schema has no way of distinguishing between polymorphic and non-polymorphic types — all types are potentially polymorphic). To address this problem in XSD 3.3.0 we have changed the way the compiler decides which types are polymorphic. Now, unless the --polymorphic-type-all option is specified (in which case the old behavior is used), only type hierarchies that are used in substitution groups or that are explicitly marked with the new --polymorphic-type option are treated as polymorphic.

There are two situations where you might need to use the --polymorphic-type option. The first is when your vocabulary uses the xsi:type-based dynamic typing. In this case the XSD compiler has no way of knowing which types are polymorphic. The second situation involves multiple schema files with one file defining the type and the second including/importing the first file and using the type in a substitution group. In this case the XSD compiler has no knowledge of the substitution group while compiling the first file and, as a result, has no way of knowing that the type is polymorphic. To help you identify the second situation the XSD compiler will issue a warning for each such case. Note also that you only need to specify the base of a polymorphic type hierarchy with the --polymorphic-type option. All the derived types will be assumed polymorphic automatically.

For more information on this change see Section 2.11, “Mapping for xsi:type and Substitution Groups” in the C++/Tree Mapping User Manual.

New examples: embedded, compression, and streaming

A number of new examples have been added in this release with the most interesting ones being embedded, compression, and streaming.

The embedded example shows how to embed the binary representation of the schema grammar into an application and then use it to parse and validate XML documents. It uses the little-known Xerces-C++ feature that allows one to load a number of schemas into the grammar cache and then serialize this grammar cache into a binary representation. The example provides a small utility, xsdbin, that creates this representation and then writes it out as a pair of C++ files containing an array with the binary data. This pair of files is then compiled and linked into the application. The main advantages of this approach over having a set of external schema files are that the application becomes self-sufficient (no need to locate the schema files) and the grammar loading is done from a pre-parsed state which can be much faster for larger schemas.

The compression example shows how to perform on-the-fly compression and decompression of XML documents during serialization and parsing, respectively. It uses the compression functionality provided by the zlib library and writes the data in the standard gzip format.

The streaming example is not really new but it has been significantly reworked. While the in-memory representation offered by C++/Tree is quite convenient, it may not be usable if the XML documents to be parsed or serialized are too big to fit into memory. There is, however, a way to still use C++/Tree which boils down to performing partially in-memory XML processing by only having a portion of the object model in memory at any given time. With this approach we can process parts of the document as they become available as well as handle documents that are too large to fit into memory all at once.

The parsing part in this example is handled by a stream-oriented DOM parser implementation that is built on top of the Xerces-C++ SAX2 parser in the progressive parsing mode. This parser allows us to parse an XML document as a series of DOM fragments which are then converted to object model fragments. Similarly, the serialization part is handled by a stream-oriented DOM serializer implementation that allows us to serialize an XML document as a series of object model fragments.

Improvements in the file-per-type mode

With the introduction of the file-per-type mode in XSD 3.1.0, people started trying to compile very “hairy” (for lack of a better word) schemas. Such schemas contain files that are not legal by themselves (lacking some include or import directives) and that have include/import cycles. Some of these schemas also contain a large number of files spread over a multi-level directory hierarchy.

This uncovered the following problem with the file-per-type mode. In this mode the compiler generates a set of C++ source files for each schema type. It also generates C++ files corresponding to each schema file that simply include the header files corresponding to the types defined in that schema. All these files are generated into the same directory. While the compiler automatically resolves conflicts between the generated type files, it assumed that the schema file names would be unique. This proved not to be the case — there are schemas that contain identically named files in different sub-directories.

Working on the fix for this problem made us think that some people might actually prefer to place the generated code for such schemas into sub-directories that model the original schema hierarchy. In order to support this scenario we have added the --schema-file-regex option which, together with the existing --type-file-regex, can be used to place the generated files into subdirectories. For an example that shows how to do this, see the GML 3.2.1 section on the GML Wiki page.
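The idea behind such a path-rewriting regex can be sketched with std::regex (the pattern below is hypothetical and purely illustrative; the real --schema-file-regex option takes perl-like /pattern/replacement/ expressions on the command line):

```cpp
#include <regex>
#include <string>

// Rewrite a schema file path so that the generated file
// lands in a sub-directory mirroring the original schema
// hierarchy, e.g.:
//
//   /schemas/iso19139/gmd/citation.xsd -> gmd/citation
//
// Hypothetical pattern for illustration only.
static std::string map_schema_file (const std::string& path)
{
  return std::regex_replace (
    path, std::regex ("^.*/([^/]+)/([^/]+)\\.xsd$"), "$1/$2");
}
```

The compiler then appends the usual suffixes (.hxx, .cxx, etc.) to the rewritten stem, so files with identical names in different schema sub-directories no longer collide.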

Two typical multi-threaded programming mistakes

Sunday, February 7th, 2010

A couple of days ago I had to fix two threading-related mistakes that I thought were very typical of people who are new to the “threaded way of thinking”. Here is the simplified version of that code. Don’t read too much into whether this is the best way to implement a string pool:

class string_pool
{
public:
  typedef std::map<std::string, std::size_t> map;
 
  string_pool (const map& ro_map)
    : ro_map_ (ro_map)
  {
  }
 
  std::size_t
  find_or_add (const std::string& s)
  {
    // First check the read-only map.
    //
    map::const_iterator i = ro_map_.find (s);
 
    if (i != ro_map_.end ())
      return i->second;
 
    // Check the writable map.
    //
    i = map_.find (s);
 
    if (i != map_.end ())
      return i->second;
    else
    {
      // We are about to modify the map so we have to
      // synchronize this part.
      //
      auto_lock l (mutex_);
      std::size_t id = ro_map_.size () + map_.size () + 1;
      map_.insert (std::make_pair (s, id));
      return id;
    }
  }
 
private:
  map map_;
  const map& ro_map_;
  mutex mutex_;
};

Can you spot the two problems with the above code? The first one is related to the misconception that the sole purpose of mutexes and similar primitives is to make sure that no two threads try to modify the same data simultaneously. This leads to the incorrect idea that we only need to synchronize the parts of code that modify the data while the parts that only read it can do so freely. On modern CPU architectures mutexes play another important role: they make sure that the changes made by one thread are actually visible to others and visible in the order they were made.

In the above code the problematic place is the unsynchronized check for the existence of the string in the writable map. What are some undesirable things that can happen here? For example, one thread could add a string while others won’t “see” this change and will all try to re-add the same string. Or, worse yet, a thread could only see part of the change. For example, the map_.insert() implementation could allocate a new, uninitialized map entry, update all the house-keeping information, and then initialize it with the passed pair. Imagine what will happen if the reading threads only see the result of the first part of this operation.

The other problem is more subtle. Compare these two code fragments:

  std::size_t id = ro_map_.size () + map_.size () + 1;

  std::size_t n = ro_map_.size ();
  std::size_t id = n + map_.size () + 1;

There is little semantic difference between the two. Now consider what happens when we add synchronization:

  auto_lock l (mutex_);
  std::size_t id = ro_map_.size () + map_.size () + 1;

  std::size_t n = ro_map_.size ();
  auto_lock l (mutex_);
  std::size_t id = n + map_.size () + 1;

Do you see the difference now? In the first fragment we call ro_map_.size(), which doesn’t need any synchronization, while having the mutex locked. This can result in higher contention between threads since we are holding the mutex for longer. While the call to size() here is probably negligible (the actual code I had to fix was calling virtual functions), it underlines a general principle: you should try to execute all the code that doesn’t need synchronization before acquiring the mutex.

Here is how we can fix these two problems in the above code:

  std::size_t
  find_or_add (const std::string& s)
  {
    // First check the read-only map.
    //
    map::const_iterator i = ro_map_.find (s);
 
    if (i != ro_map_.end ())
      return i->second;
 
    // Perform operations that can be done in
    // parallel with other threads.
    //
    std::size_t n = ro_map_.size ();
    map::value_type pair (s, n + 1);
 
    // We are about to access the writable map so
    // synchronize this part.
    //
    auto_lock l (mutex_);
    i = map_.find (s);
 
    if (i != map_.end ())
      return i->second;
    else
    {
      pair.second += map_.size ();
      map_.insert (pair);
      return pair.second;
    }
  }

CLI 1.1.0 released

Tuesday, December 15th, 2009

CLI 1.1.0 is now available from the project’s web page. The automatic usage and documentation generation is by far the biggest new feature in this release. The usage information is nicely formatted during compilation and the documentation can be generated in the HTML and man page formats. For an example of the HTML output, see the CLI Compiler Command Line Manual. You may also want to check the cli.1 man page and the usage information printed by the CLI compiler. For more information on this feature see Section 3.3, “Option Documentation” in the Getting Started Guide.

Other new features in this release are the optional generation of modifiers in addition to accessors, support for erasing the parsed elements from the argv array, and the new scanner interface. The scanner allows one to supply command line arguments from custom sources. It has the following interface:

namespace cli
{
  class scanner
  {
  public:
    virtual bool
    more () = 0;
 
    virtual const char*
    peek () = 0;
 
    virtual const char*
    next () = 0;
 
    virtual void
    skip () = 0;
  };
}
  

The two standard scanner implementations provided by CLI are argv_scanner and argv_file_scanner. The first implementation is a simple scanner for the argv array. The argv_file_scanner implementation provides support for reading command line arguments from the argv array as well as from files specified with command line options. Also, the generated option classes now have a new constructor which accepts the abstract scanner interface. For more information on this feature see Section 3.1, “Option Class Definition” in the Getting Started Guide.
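As a sketch of what a custom source might look like, here is a minimal implementation of the scanner interface that draws arguments from a std::vector (the vector_scanner name is hypothetical; the real argv_file_scanner is considerably more involved):

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace cli
{
  // The abstract scanner interface (as in the generated
  // code; a virtual destructor added for the sketch).
  //
  class scanner
  {
  public:
    virtual ~scanner () {}

    virtual bool
    more () = 0;

    virtual const char*
    peek () = 0;

    virtual const char*
    next () = 0;

    virtual void
    skip () = 0;
  };

  // A custom scanner that supplies command line arguments
  // from a vector of strings.
  //
  class vector_scanner: public scanner
  {
  public:
    explicit vector_scanner (const std::vector<std::string>& v)
        : v_ (v), i_ (0)
    {
    }

    virtual bool
    more () { return i_ < v_.size (); }

    virtual const char*
    peek () { return v_[i_].c_str (); }

    virtual const char*
    next () { return v_[i_++].c_str (); }

    virtual void
    skip () { ++i_; }

  private:
    const std::vector<std::string>& v_;
    std::size_t i_;
  };
}
```

An instance of such a scanner can then be passed to the scanner-taking constructor of a generated option class, exactly as with the standard implementations.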

For a more detailed list of new features and changes in this release see the NEWS file inside the CLI source distribution or the announcement on the cli-users mailing list.