Parsing C++ with GCC plugins, Part 2

May 10th, 2010

By popular demand, here is the second installment in the series of posts on parsing C++ using the new GCC plugin architecture. In the previous post we concentrated on setting up the plugin infrastructure and identifying the point in the compilation sequence where we can perform our own processing. In this post we will see how to work with the GCC AST (abstract syntax tree) in order to access the parsed C++ representation. By the end of this post we will have a plugin implementation that prints the names, types, and source code locations of all the declarations in the translation unit.

First let’s cover a few general things about the GCC internals and AST that are useful to know. The GCC C++ compiler, cc1plus, can only process one file at a time (you can pass several files to the compiler driver, g++, but it simply invokes cc1plus separately for each file). As a result, GCC doesn’t bother with encapsulation and instead makes heavy use of global variables. In fact, most of the “data entry points” are accessible as global variables. We have already seen a few such variables in the previous post, notably, errorcount (the number of compilation errors) and main_input_filename (the name of the file being compiled). Perhaps the most commonly used such variable is global_namespace, which is the root of the AST.

The GCC AST itself is a curious data structure in that it is an implementation of the polymorphic data type idea in C (next time someone tells you that polymorphism works perfectly in C and they don’t need “bloated” C++ for that, show them the GCC AST). The base “handle” for all the AST nodes is the tree pointer type. Because the actual nodes can be of some “extended” types, access to the data stored in the AST nodes is done via macros. All such macros are spelled in capital letters and normally perform two operations: they check that the actual node type is compatible with the request and, if so, they return the data requested. A large number of macros defined for the AST are predicates. That is, they check for a certain condition and return true or false. Such macros normally end with _P.
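For example, here is a minimal sketch (node being some tree variable) that uses a predicate macro to guard an accessor macro:

tree node = ...;

// DECL_P is a predicate: it returns true if node is one of the
// *_DECL nodes. DECL_NAME is an accessor: in a checking-enabled
// GCC build it verifies that node is of a compatible kind before
// returning the data.
//
if (DECL_P (node))
{
  tree name (DECL_NAME (node));
}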

Each tree node in the AST has a tree code of type int which identifies what kind of node it is. To get the tree code you use the TREE_CODE macro. Another useful global variable is tree_code_name, an array of strings containing human-readable tree code names. It is quite handy during development for finding out what kind of tree nodes you are getting, for example:

tree decl = ...;
int tc (TREE_CODE (decl));
cerr << "got " << tree_code_name[tc] << endl;

Each tree node type has a tree code constant defined for it, for example, TYPE_DECL (type declaration), VAR_DECL (variable declaration), ARRAY_TYPE (array type), and RECORD_TYPE (class/struct type). Oftentimes macros that only apply to a specific kind of node have their names start with the corresponding prefix, for example, the DECL_NAME macro can only be used on *_DECL nodes and the TYPE_NAME macro can only be used on *_TYPE nodes.
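As a sketch, given an arbitrary node, we could dispatch on the tree code and use the appropriately prefixed macros like this:

switch (TREE_CODE (node))
{
case VAR_DECL:
case TYPE_DECL:
  {
    tree id (DECL_NAME (node)); // A *_DECL macro on a *_DECL node.
    break;
  }
case RECORD_TYPE:
  {
    tree id (TYPE_NAME (node)); // A *_TYPE macro on a *_TYPE node.
    break;
  }
default:
  break;
}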

To allow the construction of the AST out of the tree nodes, the tree type supports chaining nodes in linked lists. To traverse such lists you would use the TREE_CHAIN macro, for example:

tree decl = ...;
 
for (; decl != 0; decl = TREE_CHAIN (decl))
{
  ...
}

The AST type system also supports two dedicated container nodes: vector (TREE_VEC tree code) and two-value linked list (TREE_LIST tree code). However, these containers are used less often and will be covered as we encounter them.
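For reference, here is a minimal sketch of traversing each container kind (vec and list being hypothetical nodes with the TREE_VEC and TREE_LIST codes, respectively):

// TREE_VEC provides indexed access to its elements.
//
for (int i (0); i < TREE_VEC_LENGTH (vec); ++i)
{
  tree e (TREE_VEC_ELT (vec, i));
}

// Each TREE_LIST node holds two values and is chained to the
// next node just like other tree nodes.
//
for (tree l (list); l != 0; l = TREE_CHAIN (l))
{
  tree purpose (TREE_PURPOSE (l));
  tree value (TREE_VALUE (l));
}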

One major class of nodes in the GCC AST is declarations. A declaration in C++ names an entity in a scope. Examples of declarations include a type declaration, a function declaration, a variable declaration, and a namespace declaration. To get to the declaration’s name we use the DECL_NAME macro. This macro returns a tree node of the IDENTIFIER_NODE type. To get the declaration’s name as const char* we can use the IDENTIFIER_POINTER macro. For example:

tree decl = ...;
tree id (DECL_NAME (decl));
const char* name (IDENTIFIER_POINTER (id));

While most declarations have names, there are certain cases, for example an unnamed namespace declaration, where DECL_NAME can return NULL.

Other macros that are useful when dealing with declarations include TREE_TYPE, DECL_SOURCE_FILE, and DECL_SOURCE_LINE. TREE_TYPE returns the tree node (with one of the *_TYPE tree codes) corresponding to the type of entity being declared. The DECL_SOURCE_FILE and DECL_SOURCE_LINE macros return the file and line information for the declaration.

Let’s now see how we can use all this information to traverse the AST and print some information about the declarations that we encounter. The first thing that we need is a way to get the list of declarations for a namespace. The GCC Internals documentation states that we can call the cp_namespace_decls function to get “the declarations contained in the namespace, including types, overloaded functions, other namespaces, and so forth.” However, this is not the case. With this function you can get to all the declarations except nested namespaces. This is because nested namespace declarations are stored in a different list in the cp_binding_level struct. If you want to know what the cp_binding_level is for, I suggest that you read its description in the GCC headers. Otherwise, you can just treat it as magic and use the following code to access all the declarations in a namespace:

void
traverse (tree ns)
{
  tree decl;
  cp_binding_level* level (NAMESPACE_LEVEL (ns));
 
  // Traverse declarations.
  //
  for (decl = level->names;
       decl != 0;
       decl = TREE_CHAIN (decl))
  {
    if (DECL_IS_BUILTIN (decl))
      continue;
 
    print_decl (decl);
  }
 
  // Traverse namespaces.
  //
  for (decl = level->namespaces;
      decl != 0;
      decl = TREE_CHAIN (decl))
  {
    if (DECL_IS_BUILTIN (decl))
      continue;
 
    print_decl (decl);
    traverse (decl);
  }
}

You may be wondering what the DECL_IS_BUILTIN checks are for. Besides the declarations that come from the file being compiled, the GCC AST also contains a number of implicit declarations for RTTI, exceptions, and static construction/destruction support code as well as compiler builtin declarations. Normally we would want to skip such declarations since we are not interested in them. But feel free to disable the above checks and see what happens.

The print_decl() function is shown below:

void
print_decl (tree decl)
{
  int tc (TREE_CODE (decl));
  tree id (DECL_NAME (decl));
  const char* name (id
                    ? IDENTIFIER_POINTER (id)
                    : "<unnamed>");
 
  cerr << tree_code_name[tc] << " " << name << " at "
       << DECL_SOURCE_FILE (decl) << ":"
       << DECL_SOURCE_LINE (decl) << endl;
}

Let’s now plug this code into the GCC plugin skeleton that we developed last time. All we need to do is add the traverse(global_namespace); call after the following statement in gate_callback():

  //
  // Process AST. Issue diagnostics and set r
  // to 1 in case of an error.
  //
  cerr << "processing " << main_input_filename << endl;
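The relevant part of gate_callback() then looks like this:

  //
  // Process AST. Issue diagnostics and set r
  // to 1 in case of an error.
  //
  cerr << "processing " << main_input_filename << endl;

  traverse (global_namespace);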

We can now try to process some C++ code with our plugin. Let’s try the following few declarations:

void f ();
 
namespace n
{
  class c {};
}
 
typedef n::c t;
int v;

The output from running our plugin on the above code will be something along these lines:

starting plugin
processing test.cxx
var_decl v at test.cxx:9
type_decl t at test.cxx:8
function_decl f at test.cxx:1
namespace_decl n at test.cxx:4
type_decl c at test.cxx:5

When I first started working with the GCC AST, I expected to iterate over declarations in the same order as they appear in the source code. As you can see from the above output, this is clearly not the case. Having multiple declaration lists (for example, names and namespaces in the namespace node) already rules out such ordered iteration, and, as is evident from the above output, the order of declarations within a single list is not preserved either. And it gets worse. Consider the following C++ fragment:

namespace n
{
  class a {};
}
 
void f ();
 
namespace n
{
  class b {};
}

The output from our plugin looks like this:

function_decl f at test.cxx:6
namespace_decl n at test.cxx:2
type_decl b at test.cxx:10
type_decl a at test.cxx:3

What happens is that GCC merges all the namespace declarations for the same namespace into a single AST node.

If you think about what GCC does with the AST, this organization is not really surprising. In the end, all GCC cares about are function bodies for which it needs to generate machine code. And for that the order of declarations is not important. However, if you are going to produce any kind of human-readable information from the AST, then you will probably want this information to be in the declaration order as found in the source code.

There is a way to iterate over declarations in the source code order; however, it requires a bit of extra effort. In a nutshell, the idea is to first collect all the declarations, then sort them according to the source code order, and finally traverse the sorted list. But how can we sort the declarations according to the source code order? We have seen how to get the file name and line information for a declaration, but we cannot compare this information without complete knowledge of the #include hierarchy. To make this work we need to understand how GCC tracks location information in the AST.

Storing file/line/column information with each tree node would require too much memory, so instead GCC stores an instance of the location_t type (currently defined as unsigned int) in tree nodes. The location_t values consist of three bit-fields: the index into the line map, the line offset, and the column number. The line map stores entries that represent contiguous file fragments, that is, file fragments that are not interrupted by #include directives. Line map entries contain information such as the file name and the start line position. Using the location_t value one can look up the line map entry and get the file name, line number (start line plus offset), and column number. One property of the location_t values that we are going to exploit is that locations further down in the translation unit have greater values. As a result, we can create the following container that will automatically keep the declarations that we insert into it in the source code order:

struct decl_comparator
{
  bool
  operator() (tree x, tree y) const
  {
    location_t xl (DECL_SOURCE_LOCATION (x));
    location_t yl (DECL_SOURCE_LOCATION (y));
 
    return xl < yl;
  }
};
 
typedef std::multiset<tree, decl_comparator> decl_set;
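As an aside, if you need the actual file/line/column for a location_t yourself, GCC provides the expand_location() function which performs the line map lookup described above (a quick sketch, decl being some declaration node):

location_t loc (DECL_SOURCE_LOCATION (decl));
expanded_location xloc (expand_location (loc));

cerr << xloc.file << ":" << xloc.line << ":"
     << xloc.column << endl;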

Note that we use a multiset rather than a set since distinct declarations can end up with identical location values, in which case a set would silently drop all but one of them. Now we can implement the collect() function which adds all the declarations into the set:

void
collect (tree ns, decl_set& set)
{
  tree decl;
  cp_binding_level* level (NAMESPACE_LEVEL (ns));
 
  // Collect declarations.
  //
  for (decl = level->names;
       decl != 0;
       decl = TREE_CHAIN (decl))
  {
    if (DECL_IS_BUILTIN (decl))
      continue;
 
    set.insert (decl);
  }
 
  // Traverse namespaces.
  //
  for (decl = level->namespaces;
      decl != 0;
      decl = TREE_CHAIN (decl))
  {
    if (DECL_IS_BUILTIN (decl))
      continue;
 
    collect (decl, set);
  }
}

The new traverse() implementation will then look like this:

void
traverse (tree ns)
{
  decl_set set;
  collect (ns, set);
 
  for (decl_set::iterator i (set.begin ()),
       e (set.end ()); i != e; ++i)
  {
    print_decl (*i);
  }
}

If we now run this new implementation of our plugin on the C++ fragment presented earlier, we will get the following output:

function_decl f at test.cxx:1
type_decl c at test.cxx:5
type_decl t at test.cxx:8
var_decl v at test.cxx:9

Note that now we don’t track namespace declaration nodes since they are merged into one anyway. If you need to recreate the original namespace hierarchy, the best approach is to use the namespace information that can be inferred from declaration nodes using the CP_DECL_CONTEXT macro. For example, the following function returns the namespace name for a declaration:

std::string
decl_namespace (tree decl)
{
  string s, tmp;
 
  for (tree scope (CP_DECL_CONTEXT (decl));
       scope != global_namespace;
       scope = CP_DECL_CONTEXT (scope))
  {
    tree id (DECL_NAME (scope));
 
    tmp = "::";
    tmp += (id != 0
            ? IDENTIFIER_POINTER (id)
            : "<unnamed>");
    tmp += s;
    s.swap (tmp);
  }
 
  return s;
}
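For example, we could change the output statement in the earlier print_decl() implementation to show qualified names (a sketch; for declarations in the global namespace decl_namespace() returns an empty string, which yields names like ::f):

  cerr << tree_code_name[tc] << " "
       << decl_namespace (decl) << "::" << name << " at "
       << DECL_SOURCE_FILE (decl) << ":"
       << DECL_SOURCE_LINE (decl) << endl;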

And that’s it for today. If you have any questions or comments, you are welcome to leave them below. The complete source code for the plugin we have developed in this post is available as the plugin-2.cxx file (it is fun to try to run it on some real C++ source files). In the next post we will talk about types (*_TYPE tree codes) and in particular how to traverse classes.

Parsing C++ with GCC plugins, Part 1

May 3rd, 2010

You have probably heard about the recent release of GCC 4.5.0. One of the new features in this version is the support for plugins. You can now write a shared object (.so) that can be loaded into GCC and hooked into various stages of the compilation process.

In the past couple of months I have been working on a new project (what it’s about is a secret, for now; UPDATE: no longer a secret) that uses GCC and the new plugin feature in order to parse C++ and then generate some code based on it.

Writing a plugin to accomplish this was both fun and frustrating. Fun because GCC has a very rich abstract syntax tree (AST, sometimes called the C++ Tree in the GCC documentation); the amount of information available about the parsed C++ is amazing, and there isn’t much you can’t infer about the code. Frustrating because this AST is very complex and very poorly documented, as is the plugin API. Most of the time I was reading the AST headers to learn the API and studying the GCC compiler source code to understand how to use it.

While there are a few other plugins around (and more will probably be written in the future), most of them concentrate on either optimizations or code generation (a good example of the latter is LLVM’s DragonEgg plugin). The only exception is probably Mozilla’s Dehydra/Treehydra set of plugins. However, Dehydra simply exposes a flattened subset of GCC’s AST as a set of JavaScript objects (for example, there is no namespace or #include information). Treehydra relies on GIMPLE, which is a representation one level below the parsed C++ (towards the machine code).

As a result, there isn’t much information or many source code examples that show how to work with GCC’s C++ AST. And since I have already figured out most of the basics, I was thinking of writing a series of blog posts that show how to use GCC plugins to parse C++. What you do based on this information is up to you. Some of the potential applications include static analysis, (source) code generation, documentation generation, binding to other languages, editor/IDE support, etc. In today’s post I am going to show how to set up the plugin infrastructure for this kind of task. If you would like to read more on this topic, drop a line in the comments and, if there is enough interest, future posts will cover various aspects of working with GCC’s AST.

The GCC plugin API is covered in Chapter 23, “Plugins”, of the GCC Internals documentation. As described in this chapter, there are several compilation events (or phases) that a plugin can register for. Unfortunately, none of the existing events is suitable for the kind of task we want to perform. What we want is to be called just after the AST has been constructed and before any other passes are performed. We don’t want any other passes to run since that would only be a waste of time. All we need is the C++ AST. At first it may seem that PLUGIN_FINISH_UNIT is a good place to run our code. However, a number of passes are performed before it is reached (you can verify this by registering a callback for the PLUGIN_OVERRIDE_GATE event, which lets you see all the passes that are being executed).

One way to achieve what we want is to register a callback for the PLUGIN_OVERRIDE_GATE event. This callback is called before every pass and allows the plugin to decide whether the pass in question should run. The first call to this callback is therefore, by definition, made before any pass has run. We can run our code from this first invocation and then terminate GCC. Here is the skeleton for this callback:

extern "C" void
gate_callback (void* gcc_data, void*)
{
  // If there were errors during compilation,
  // let GCC handle the exit.
  //
  if (errorcount || sorrycount)
    return;
 
  int r (0);
 
  //
  // Process AST. Issue diagnostics and set r
  // to 1 in case of an error.
  //
 
  // Terminate GCC.
  //
  exit (r);
}

errorcount and sorrycount are GCC variables that contain the error counts. The plugin API includes all the internal GCC headers so a plugin can access all the data and call all the functions that the code in the GCC compiler itself can.

Now we have set up the entry point for our plugin in the overall compilation process. There is, however, another thing that we need to take care of: the compiler output. When you execute something like this:

g++ -fplugin=plugin.so -c test.cxx

g++ isn’t the executable that will actually load plugin.so. g++ is a compiler driver that runs several other programs under the hood in order to translate test.cxx to test.o (use the -v option to see what’s actually being executed by g++). It first runs the program called cc1plus which is the actual C++ compiler and which will load the plugin. The output of cc1plus is an assembly file. Once the assembly file is generated, g++ invokes as to translate the assembly file to test.o.

Our plugin alters the GCC compilation process: instead of the assembly file we want to generate something else (or perhaps no output files at all, in the case of a static analysis tool). Do you see the problem now? While our plugin produces some other output, g++ still assumes that an assembly file has been generated, which it will then try to pass to the assembler.

While we can try to invoke cc1plus directly, it is an internal program of GCC and is invoked by g++ with some additional options which we would rather not deal with. Instead, we can ask g++ to produce an assembly file by passing -S instead of -c. In this case g++ is not going to invoke the assembler and nobody will care that the output assembly file does not exist.

So this part is sorted out then. Well, not quite. While we terminate GCC quite early, before any assembly can actually be generated, the output assembly file is still created. To get rid of this file we need to add the following line in our plugin_init():

asm_file_name = HOST_BIT_BUCKET;

HOST_BIT_BUCKET is defined as "/dev/null". Here is the complete source code for the skeleton of our plugin:

// GCC header includes to get the parse tree
// declarations. The order is important and
// doesn't follow any kind of logic.
//
 
#include <stdlib.h>
#include <gmp.h>
 
#include <cstdlib> // Include before GCC poisons
                   // some declarations.
 
extern "C"
{
#include "gcc-plugin.h"
 
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "tree.h"
#include "intl.h"
 
#include "tm.h"
 
#include "diagnostic.h"
#include "c-common.h"
#include "c-pragma.h"
#include "cp/cp-tree.h"
}
 
#include <iostream>
 
using namespace std;
 
int plugin_is_GPL_compatible;
 
extern "C" void
gate_callback (void*, void*)
{
  // If there were errors during compilation,
  // let GCC handle the exit.
  //
  if (errorcount || sorrycount)
    return;
 
  int r (0);
 
  //
  // Process AST. Issue diagnostics and set r
  // to 1 in case of an error.
  //
  cerr << "processing " << main_input_filename << endl;
 
  exit (r);
}
 
extern "C" int
plugin_init (plugin_name_args* info,
             plugin_gcc_version* ver)
{
  int r (0);
 
  cerr << "starting " << info->base_name << endl;
 
  //
  // Parse options if any.
  //
 
  // Disable assembly output.
  //
  asm_file_name = HOST_BIT_BUCKET;
 
  // Register callbacks.
  //
  register_callback (info->base_name,
                     PLUGIN_OVERRIDE_GATE,
                     &gate_callback,
                     0);
  return r;
}

You can compile and try it out like so:

$ g++-4.5 -I`g++-4.5 -print-file-name=plugin`/include \
-fPIC -shared plugin.cxx -o plugin.so
 
$ g++-4.5 -S -fplugin=./plugin.so test.cxx
starting plugin
processing test.cxx

Update: Starting with version 4.7.0, GCC can be built either in C or C++ mode, and starting with version 4.8.0 it is always built as C++. If you try to run the above example using GCC built in C++ mode, you will get an error saying that the plugin cannot be loaded because one or more symbols are undefined. The reason for this error is that now all the GCC symbols have C++ linkage while we include them as extern "C". The solution to this problem is to remove the extern "C" { } block around the include directives at the beginning of our plugin source code (note that the plugin entry points, such as plugin_init() and the callbacks, should still remain extern "C").
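In other words, the beginning of the plugin would then look along these lines (a sketch; note that the exact set and location of the GCC headers may also differ between versions):

#include <stdlib.h>
#include <gmp.h>

#include <cstdlib>

// No extern "C" wrapper around the GCC headers.
//
#include "gcc-plugin.h"

#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "tree.h"
...

int plugin_is_GPL_compatible;

// The entry points themselves still remain extern "C".
//
extern "C" void
gate_callback (void*, void*)
{
  ...
}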

Another option that you will probably want to add to the plugin invocation is -x c++. It tells GCC that what’s being compiled is C++ regardless of the file extension. This is useful if you plan to compile, for example, C++ header files (in this case and without this option, GCC will try to generate a precompiled header instead of an assembly file). Having to remember to specify the two options (-S -x c++) could be quite inconvenient for the users of our plugin.

The plugin can also have options of its own which are specified on the g++ command line in the following form:

-fplugin-arg-<plugin-name>-<key>[=<value>]
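Inside plugin_init(), these options are then available through the info argument (a minimal sketch; info is the plugin_name_args pointer from the skeleton above):

for (int i (0); i < info->argc; ++i)
{
  plugin_argument& a (info->argv[i]);

  // a.key and a.value correspond to <key> and <value> above
  // (a.value is NULL if no value was specified).
  //
  cerr << "option " << a.key << "="
       << (a.value ? a.value : "") << endl;
}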

This form is quite verbose and can become a major inconvenience for the users of our plugin. To address the above two problems it makes sense to create a driver for our plugin, similar to how g++ is a driver for cc1plus. The driver will automatically pass the -S -x c++ -fplugin=./plugin.so options to g++ and convert plugin options to the -fplugin-arg- format before passing them on.

For my project I wrote a plugin driver that uses the following conventions. The driver recognizes the commonly used options such as -I, -D, etc., and passes them to g++ as is. The -x option can be used to pass extra options to g++ (for example, -x -m32). If an argument to -x does not start with ‘-’, then it is treated as the g++ executable name. Everything else is converted to the -fplugin-arg- format and passed as plugin options, which are then handled in the plugin code with the help of cli. So if you execute:

driver -x g++-4.5 -x -m32 --foo bar test.cxx

Then the g++ command line will look like this:

g++-4.5 -m32 -S -x c++ -fplugin=./plugin.so \
-fplugin-arg-plugin-foo=bar test.cxx

And that’s it for today. Remember to drop a line in the comments if you would like to read more about parsing C++ with GCC plugins.

XSD 3.3.0 released

April 29th, 2010

XSD 3.3.0 was released yesterday. For an exhaustive list of the new features see the official announcement. In this post I am going to cover a few major features in more detail and include some insights into what motivated their addition.

Besides the new features, XSD 3.3.0 includes a large number of bug fixes and performance improvements. The performance improvements should be especially welcomed by those who have very large and complex schemas (the speedup can be up to 100 times in some cases; for a detailed account of one such optimization see this earlier post).

This release also coincides with the release of Xerces-C++ 3.1.1 which is a bugfix-only release for 3.1.0. Compared to 3.0.x, Xerces-C++ 3.1.x includes a number of new features and a large number of bug fixes, particularly in the XML Schema processor. XSD 3.3.0 has been extensively tested with this version of Xerces-C++ and all the pre-compiled binaries are built with 3.1.1.

This release also adds support for a number of new OS versions (AIX 6, Windows 7/Server 2008) and C++ compiler updates (Visual Studio 2010, GNU g++ 4.5.0, Intel C++ 11, Sun Studio 12.1, and IBM XL C++ 11). In particular, the distribution includes Visual Studio 2010 custom build rule files as well as the project and solution files for all the examples. And if you haven’t had a chance to try Visual Studio 2010 and think that upgrading a solution from previous versions is a smooth process, I am sorry to disappoint you. VS 2010 now uses MSBuild for doing the compilation and conversion from previous versions is a very slow and brittle process. I had to hand-fix the auto-converted project files on multiple occasions for both Xerces-C++ and XSD.

Configurable application character encoding

We have been getting quite a few emails where someone would try to set a string value in the object model and then get the invalid_utf8_string exception when serializing this object model to XML. This happens because the string value contains a non-ASCII character in some other encoding, usually ISO-8859-1. Since the object model expects all text data to be in UTF-8, such a character is treated as part of a bogus multi-byte sequence. Many users considered this a major inconvenience.

Starting with XSD 3.3.0 you can configure the character encoding that should be used by the object model (--char-encoding). The default is still UTF-8 (for the char character type). But you can also specify iso8859-1, lcp (Xerces-C++ local code page), and custom.
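For example, to have the generated code use ISO-8859-1 (hello.xsd being a hypothetical schema):

$ xsd cxx-tree --char-encoding iso8859-1 hello.xsd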

The custom option allows you to support a custom encoding. For this to work you will need to implement the transcoder interface for your encoding (see the libxsd/xsd/cxx/xml/char-* files for examples) and include this implementation’s header at the beginning of the generated header files (see the --hxx-prologue option).

Note also that this mechanism replaces the XSD_USE_LCP macro that was used to select the Xerces-C++ local code page encoding in previous versions of XSD.

Uniform handling of multiple root elements

By default in the C++/Tree mapping you get a set of parsing/serialization functions for the document root element. You can then call one of these functions to parse/serialize the object model. If you have a single root element then this approach works very well. But what if your documents can have varying root elements? This is a fairly common scenario when the schema describes some kind of messaging protocol. The root elements then correspond to messages, as in balance, withdraw, and deposit.

Prior to XSD 3.3.0, in order to handle such a vocabulary, you would need to first parse the document to DOM, check which root element it has, and then call the corresponding parsing function. Similarly, for serialization, you would have to determine which message it is, and call the corresponding serialization function. If you have tens or hundreds of root elements to handle, writing and maintaining such code manually quickly becomes burdensome.

In XSD 3.3.0 you can instruct the compiler to generate wrapper types instead of parsing/serialization functions for root elements in your vocabulary (--generate-element-type). You can also request the generation of an element map for uniform parsing/serialization of the element types (--generate-element-map). The application code would then look like this:

auto_ptr<xml_schema::element_type> req;
 
// Parse the XML request to a DOM document using
// the parse() function from dom-parse.hxx.
//
xml_schema::dom::auto_ptr<DOMDocument> doc (parse (...));
DOMElement& root (*doc->getDocumentElement ());
 
req = xml_schema::element_map::parse (root);
 
// We can test which request we've got either using RTTI
// or by comparing the element names, as shown below.
//
if (balance* b = dynamic_cast<balance*> (req.get ()))
{
  account_t& a (b->value ());
  ...
}
else if (req->_name () == withdraw::name ())
{
  withdraw& w (static_cast<withdraw&> (*req));
  ...
}
else if (req->_name () == deposit::name ())
{
  deposit& d (static_cast<deposit&> (*req));
  ...
}

For more information on the element types and map see the messaging example in the XSD distribution as well as Section 2.9.1, “Element Types” and Section 2.9.2, “Element Map” in the C++/Tree Mapping User Manual.

Generation of the detach functions

XSD 3.3.0 adds the --generate-detach option which instructs the compiler to generate detach functions for required elements and attributes. For optional and sequence cardinalities the detach functions are provided by the respective containers (and even without this option). These functions allow you to detach a sub-tree from an object model (returned as std::auto_ptr) and then re-attach it either in the same object model or in a different one using one of the std::auto_ptr-taking modifiers or constructors all without making any copies. For more information on this feature, refer to Section 2.8 “Mapping for Local Elements and Attributes” in the C++/Tree Mapping User Manual.
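As a sketch, suppose the generated object model has a type order_t with a required element item of type item_t (hypothetical names). Then detaching and re-attaching the sub-tree without copying would look along these lines:

order_t& o1 = ...;
order_t& o2 = ...;

// Detach the sub-tree corresponding to the required item
// element (the detach_item() function is generated by
// --generate-detach).
//
std::auto_ptr<item_t> p (o1.detach_item ());

// Re-attach it in another object model using the
// std::auto_ptr-taking modifier, without making a copy.
//
o2.item (p);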

Smaller and faster code for polymorphic schemas

With XSD, schemas that use XML Schema polymorphism features (xsi:type and substitution groups) have to be compiled with the --generate-polymorphic option. This results in two major changes in the generated code: all types are registered in type maps and parsing/serialization of elements has to go through these maps. As a result, such generated code is bigger and generally slower than the non-polymorphic version.

The major drawback of this approach is that it treats all types as potentially polymorphic while in most vocabularies only a handful of types are actually meant to be polymorphic (XML Schema has no way of distinguishing between polymorphic and non-polymorphic types — all types are potentially polymorphic). To address this problem in XSD 3.3.0 we have changed the way the compiler decides which types are polymorphic. Now, unless the --polymorphic-type-all option is specified (in which case the old behavior is used), only type hierarchies that are used in substitution groups or that are explicitly marked with the new --polymorphic-type option are treated as polymorphic.

There are two situations where you might need to use the --polymorphic-type option. The first is when your vocabulary uses the xsi:type-based dynamic typing. In this case the XSD compiler has no way of knowing which types are polymorphic. The second situation involves multiple schema files with one file defining the type and the second including/importing the first file and using the type in a substitution group. In this case the XSD compiler has no knowledge of the substitution group while compiling the first file and, as a result, has no way of knowing that the type is polymorphic. To help you identify the second situation the XSD compiler will issue a warning for each such case. Note also that you only need to specify the base of a polymorphic type hierarchy with the --polymorphic-type option. All the derived types will be assumed polymorphic automatically.
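For example, assuming request_t is the root of a polymorphic hierarchy in a hypothetical protocol.xsd:

$ xsd cxx-tree --generate-polymorphic --polymorphic-type request_t protocol.xsd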

For more information on this change see Section 2.11, “Mapping for xsi:type and Substitution Groups” in the C++/Tree Mapping User Manual.

New examples: embedded, compression, and streaming

A number of new examples have been added in this release with the most interesting ones being embedded, compression, and streaming.

The embedded example shows how to embed the binary representation of the schema grammar into an application and then use it to parse and validate XML documents. It uses the little-known Xerces-C++ feature that allows one to load a number of schemas into the grammar cache and then serialize this grammar cache into a binary representation. The example provides a small utility, xsdbin, that creates this representation and then writes it out as a pair of C++ files containing an array with the binary data. This pair of files is then compiled and linked into the application. The main advantages of this approach over having a set of external schema files are that the application becomes self-sufficient (no need to locate the schema files) and the grammar loading is done from a pre-parsed state which can be much faster for larger schemas.

The compression example shows how to perform on-the-fly compression and decompression of XML documents during serialization and parsing, respectively. It uses the compression functionality provided by the zlib library and writes the data in the standard gzip format.

The streaming example is not really new but it has been significantly reworked. While the in-memory representation offered by C++/Tree is quite convenient, it may not be usable if the XML documents to be parsed or serialized are too big to fit into memory. There is, however, a way to still use C++/Tree which boils down to performing partially in-memory XML processing by only having a portion of the object model in memory at any given time. With this approach we can process parts of the document as they become available as well as handle documents that are too large to fit into memory all at once.

The parsing part in this example is handled by a stream-oriented DOM parser implementation that is built on top of the Xerces-C++ SAX2 parser in the progressive parsing mode. This parser allows us to parse an XML document as a series of DOM fragments which are then converted to object model fragments. Similarly, the serialization part is handled by a stream-oriented DOM serializer implementation that allows us to serialize an XML document as a series of object model fragments.

Improvements in the file-per-type mode

With the introduction of the file-per-type mode in XSD 3.1.0 people started trying to compile very “hairy” (for lack of a better word) schemas. Such schemas contain files that are not legal by themselves (lacking some include or import directives) and that have include/import cycles. Some of these schemas also contain a large number of files that are spread over a multi-level directory hierarchy.

This uncovered the following problem with the file-per-type mode. In this mode the compiler generates a set of C++ source files for each schema type. It also generates a C++ file corresponding to each schema file which simply includes the header files corresponding to the types defined in that schema. All these files are generated into the same directory. While the compiler automatically resolves conflicts between the generated type files, it assumed that the schema file names would be unique. This proved not to be the case: there are schemas that contain identically named files in different sub-directories.

Working on the fix for this problem made us think that some people might actually prefer to place the generated code for such schemas into sub-directories that model the original schema hierarchy. In order to support this scenario we have added the --schema-file-regex option which, together with the existing --type-file-regex, can be used to place the generated files into subdirectories. For an example that shows how to do this, see the GML 3.2.1 section on the GML Wiki page.