Xerces-C++ 3.0.0 beta 1 released

March 14th, 2008

I’ve spent the past three weeks preparing the Xerces-C++ 3.0.0 code for the upcoming release, which culminated in the publication of the first beta yesterday. The major change in 3.0.0 compared to the 2-series releases is the new, autotools-based build system for Linux/UNIX platforms. Other improvements in 3.0.0 include:

  • Project files for VC 9
  • Support for the ICU transcoder in VC 7.1, 8, and 9 project files
  • libcurl-based net accessor
  • Support for XInclude
  • Support for a subset of XPath
  • Conformance to the final DOM Level 3 interface specification
  • Ability to provide a custom DOM memory manager
  • Better 64-bit support
  • Cleaned up error messages
  • Better tested, including against W3C XML Schema test suite
  • Removal of deprecated code

My primary goals in this release are to make it cleaner, easier to build, better tested, as well as to provide better XML Schema support. And it does feel that the 3.0.0 codebase is on track to achieve these goals. If you are planning to upgrade to 3.0.0 once the final version is out, I suggest that you give this beta a try and report any problems so that they can be fixed before the final release. For more details on this beta see the official announcement.

CodeSynthesis XSD 3.1.0 released

February 13th, 2008

XSD 3.1.0 was released a couple of days ago. For an exhaustive list of new features see the official announcement. In this post I would like to go into more detail on a few major features, namely the file-per-type compilation mode and configurable identifier naming conventions.

File-per-type compilation mode

First, some background on the kinds of problems this feature is meant to solve. In most cases it is natural to generate one set of source files from each schema file and to map the XML Schema include and import constructs to preprocessor #include directives. However, the XML Schema include and import mechanisms are considerably less strict than #include. For example, you can have two schemas, each with a type that inherits from a base in the other schema (that is, the schemas depend on each other and this dependency involves inheritance). Or you can have a schema that does not include or import definitions for all the types it references. Instead, such a schema relies on being included or imported into another schema which provides the missing definitions (while this can also happen in C++, it is not very common). As a result, it is sometimes impossible to compile the schemas separately and/or map XML Schema include/import to C++ #include. For such situations the file-per-type compilation mode was introduced in addition to the existing file-per-schema mode.
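
To make the mutual-inheritance case concrete, here is a minimal sketch in plain C++. The type names (a_base, a_derived, b_base, b_derived) and the schema names a.xsd and b.xsd are hypothetical and are not produced by XSD; the point is only to show why one header per schema cannot work here while one header per type can.

```cpp
#include <type_traits>

// Hypothetical setup: a.xsd defines a_base and a_derived, b.xsd defines
// b_base and b_derived. a_derived extends b_base, b_derived extends a_base.
//
// With one header per schema, a.hxx would need all of b.hxx before it could
// define a_derived, and b.hxx would need all of a.hxx before it could define
// b_derived. A base class must be complete at the point of inheritance, so
// no include order satisfies both headers.
//
// With one header per type, each definition pulls in only its own base,
// which gives a valid order such as the one below.

struct a_base { int id = 0; };                    // would live in its own header
struct b_base { int id = 0; };                    // would live in its own header

struct a_derived : b_base { int extra_a = 0; };   // needs only b_base
struct b_derived : a_base { int extra_b = 0; };   // needs only a_base

// Small helper that touches members inherited across the two "schemas".
inline int sum_ids(const a_derived& a, const b_derived& b)
{
  return a.id + b.id;
}
```

The order base, base, derived, derived is exactly what cannot be expressed when each schema maps to a single header.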

In the new mode (the --file-per-type command option), the XSD compiler generates a separate set of files for each type defined in XML Schema. It still generates a set of source files corresponding to the schema files which now include the header files for the types and contain parsing and serialization functions. In this compilation mode you only need to compile the root schema for your vocabulary; the code will be automatically generated for all included and imported schemas. If your vocabulary has several root schemas which in turn include or import a common subset of schemas then you will need to specify all these root schemas in a single invocation of the compiler.

One reason why the file-per-schema mode should be preferred whenever possible is the potentially large number of source files that are generated in the file-per-type mode (some of the schemas that we have tested contain 1,000-1,500 types). To minimize the impact of the file-per-type mode on the C++ compilation time, it is a good idea to generate the XML Schema namespace into a separate header file (see the --generate-xml-schema and --extern-xml-schema options) and to set up a precompiled header.

To help deal with the potentially large number of files that the new mode produces, the --file-list option was added to the XSD compiler; it writes the list of generated source files into a file. The --file-list-prologue, --file-list-epilogue, and --file-list-delim options allow you to turn this file into, for example, a makefile fragment with the list of files assigned to a variable. The following GNU make fragment shows how to put all of the above information together:

XSD    := ... # path to the XSD compiler
LIBXSD := ... # path to the XSD runtime library
 
driver:
 
# Schema compilation.
#
xsd      := ... # list of all schema files
xsd_root := ... # root schema(s)
 
-include gen.make
 
gen.make: $(xsd)
  $(XSD) cxx-tree --file-per-type --output-dir gen \
--file-list $@ --file-list-prologue "gen := " --file-list-delim " \\n" \
--extern-xml-schema xml-schema.xsd --cxx-prologue '#include "all.hxx"' \
$(xsd_root)
 
gen/xml-schema.hxx:
  $(XSD) cxx-tree --generate-xml-schema --output-dir gen xml-schema.xsd
 
src := driver.cxx $(filter %.cxx,$(gen))
obj := $(src:.cxx=.o)
 
# Precompiled header.
#
$(obj): gen/all.hxx.gch
 
gen/all.hxx.gch: gen/all.hxx gen/xml-schema.hxx
  $(CXX) -I$(LIBXSD) -o $@ $<
 
# Object code and driver.
#
driver: $(obj) -lxerces-c
  $(CXX) -o $@ $^
 
%.o: %.cxx
  $(CXX) -I$(LIBXSD) -c $< -o $@

The gen/all.hxx file is the precompiled header for the project and could look like this:

#ifndef GEN_ALL_HXX
#define GEN_ALL_HXX
 
#warning precompiled header is not used
 
#include "xml-schema.hxx"
 
#endif // GEN_ALL_HXX

Another interesting aspect of the file-per-type compilation mode is how it is implemented in XSD. A straightforward but complex approach would have been to support this mode in the code generators in addition to the file-per-schema mode. Instead, an internal schema graph transformation was implemented that transforms the semantic graph to make it appear as if each type is in a separate schema file. After this transformation the unchanged code generators are used as in the file-per-schema mode.

Configurable identifier naming conventions

One common objection to using automatic code generation is the difference between the identifier naming conventions used in a project and in the generated code. To address this concern, the XSD compiler allows you to specify a naming convention that should be used in the generated code for the C++/Tree mapping.

The two new options, --type-naming and --function-naming, allow you to select type and function naming conventions from a predefined set of widely-used styles. You can also provide regular expressions to customize or completely override one of the predefined styles.

Available type naming conventions are K&R (for example, test_type), upper-camel-case (for example, TestType), and Java (the same as upper-camel-case). Available function naming conventions are K&R (for example, test_function), lower-camel-case (for example, testFunction), and Java (for example, getTestFunction for accessors and setTestFunction for modifiers).
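
As an illustration of what these styles mean for a single identifier, here is a small sketch that derives the camel-case and Java forms from a K&R-style name. The functions to_camel, java_accessor, and java_modifier are hypothetical helpers written for this post; they are not part of XSD and do not reflect how the compiler implements the conversion (which, as noted above, can be driven by regular expressions).

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Convert a K&R-style identifier (words separated by '_') to camel case.
// If upper_first is true the first letter is capitalized as well
// (upper-camel-case); otherwise it is left as is (lower-camel-case).
inline std::string to_camel(const std::string& name, bool upper_first)
{
  std::string result;
  bool capitalize = upper_first;

  for (char c : name)
  {
    if (c == '_')
    {
      capitalize = true;  // drop the separator, capitalize the next letter
      continue;
    }

    unsigned char u = static_cast<unsigned char>(c);
    result += capitalize ? static_cast<char>(std::toupper(u)) : c;
    capitalize = false;
  }

  return result;
}

// Java-style accessor name: "get" followed by the upper-camel-case name.
inline std::string java_accessor(const std::string& name)
{
  assert(!name.empty());  // an empty identifier has no meaningful accessor
  return "get" + to_camel(name, true);
}

// Java-style modifier name: "set" followed by the upper-camel-case name.
inline std::string java_modifier(const std::string& name)
{
  assert(!name.empty());
  return "set" + to_camel(name, true);
}
```

With these helpers, to_camel("test_type", true) gives TestType, to_camel("test_function", false) gives testFunction, and java_accessor("test_function") gives getTestFunction, matching the examples in the paragraph above.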

For more information see the NAMING CONVENTION section in the XSD Compiler Command Line Manual (man pages).

Xerces-C++ 2.8.0 released

September 2nd, 2007

After two years of development, Xerces-C++ 2.8.0 is finally out. This release doesn’t add any new features compared to 2.7.0 but rather focuses on bug fixes, optimizations, and build system improvements (for new features, 3.0.0 is underway). 2.8.0 is interface-compatible with 2.7.0, which means that all you need to do to take advantage of the improvements is recompile your applications. For the complete list of changes in this version refer to the official Release Information page on the project’s website. I am especially interested in all the XML Schema fixes that allow Xerces-C++ 2.8.0 to handle widely-used schemas and standards such as Geography Markup Language (GML) and COLLADA (COLLAborative Design Activity). In this post I am going to discuss a number of user-visible improvements that many Xerces-C++ users may want to know more about.

Compared to the previous release, Xerces-C++ 2.8.0 comes with a wide range of precompiled libraries (23 in total) for various CPU architectures, operating systems, and C++ compilers. For most platforms both 32 bit and 64 bit versions are provided. Note also that while the libraries are built using specific C++ compiler versions, most of them will also work with newer versions of the same compilers. For example, libraries built with GCC 3.4.x will also work with GCC 4.0.x, 4.1.x, and 4.2.x. Similarly, libraries built with Sun C++ 5.7 (Studio 10) will work with Sun C++ 5.8. For HP-UX on PA-RISC two versions of libraries are provided: one built in the “Classic” mode (-AP) and the other in the “Standard C++” mode (-AA -mt). The latter version’s archive has the _AA suffix. Also new in this release are the 64 bit Windows libraries built with Visual C++ 8.0 (2005).

On the source code level, there are three user-visible changes that I would like to cover in more detail. First, the XML to DOM parsing code was optimized, with the speed gain ranging from 25% to 30% depending on the XML documents used. The SAX2 parser was also improved to allocate additional memory only if the existing buffers cannot be reused. Overall, a statically-linked (see below) Xerces-C++ 2.8.0 library on GNU/Linux parses XML to DOM about 40% faster than a dynamically-linked 2.7.0 thanks to various code optimizations as well as the default optimization level change from -O to -O2 for GCC on GNU/Linux.

The second source-level change that may affect your applications is the exponential growth of memory blocks implemented in the DOM heap. Now the size of memory blocks that the DOM heap allocates at a time grows from 16KB to 128KB as the document requires more memory (in the previous versions that size was fixed at 64KB). This change will help applications with a large number of small XML documents as well as applications that handle very large documents.
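
To see why this helps at both ends of the size range, here is a rough sketch of such a growth policy. It assumes that the block size doubles on each allocation until it reaches the 128KB cap; the doubling factor, the names kInitialBlock, kMaxBlock, next_block_size, and reserved_for, and the simplified accounting are all my assumptions for illustration and are not taken from the Xerces-C++ source.

```cpp
#include <cstddef>

// Bounds quoted in the text above.
constexpr std::size_t kInitialBlock = 16 * 1024;   // first block: 16KB
constexpr std::size_t kMaxBlock     = 128 * 1024;  // growth cap: 128KB

// Size of the block to allocate after a block of size `current`.
// Doubling is an assumption; the text only gives the two bounds.
constexpr std::size_t next_block_size(std::size_t current)
{
  return current * 2 < kMaxBlock ? current * 2 : kMaxBlock;
}

// Total memory reserved by the time `needed` bytes fit, ignoring per-block
// bookkeeping and allocations larger than a single block.
inline std::size_t reserved_for(std::size_t needed)
{
  std::size_t block = kInitialBlock;
  std::size_t total = 0;

  while (total < needed)
  {
    total += block;
    block = next_block_size(block);
  }

  return total;
}
```

Under these assumptions a document that needs only a few kilobytes reserves a single 16KB block rather than 64KB, while a very large document quickly reaches 128KB blocks and so needs fewer allocations than with fixed 64KB blocks.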

Finally, two important bugs have been fixed in the DOM cloning and importing logic. First, when the complete DOM document is being cloned, the NODE_CLONED notification is sent to each node’s user data handler. This allows you to copy the user data into the new document. Second, type information (such as PSVI) that may be associated with DOM nodes is now properly copied when nodes are cloned or imported.

The build system in Xerces-C++ 2.8.0 has also been improved in a number of ways: The Visual Studio 8.0 (2005) project and solution files with support for the 64 bit builds now come with the Xerces-C++ distribution. GCC is now supported on HP-UX and AIX in addition to GNU/Linux and Solaris. The build system now automatically detects aCC3 and aCC6 on HP-UX as well as passes the correct 64 bit options to Sun C++ on SPARC and x86-64.

Furthermore, a new option, -s, was added to the runConfigure script that instructs the build system to build static archives (libxerces-c.a) instead of shared libraries. The code for the static archives is compiled without position-independent options and is therefore faster than the shared library version.

The compilation itself can now be performed in verbose mode which is useful when you want to see the exact options that are passed to the compiler during build. To trigger verbose mode, add VERBOSE=1 to the make command line, for example:

make VERBOSE=1

Xerces-C++ 2.8.0 automatically detects the necessary 64 bit options for most of the platforms and compilers that it supports. For GCC, however, this is not possible because the exact options depend on the CPU architecture and GCC version. GCC on most 32 bit GNU/Linux distributions produces 32 bit code, and on 64 bit distributions it produces 64 bit code. If GCC on your system does not generate the desired code by default, you will need to specify additional compiler and linker options using the runConfigure -z and -l options. For the exact GCC options that will switch the compiler into the desired mode on your architecture, consult the GCC documentation. For the x86-64, PowerPC, and SPARC architectures these options are -m64 (64 bit mode) and -m32 (32 bit mode).

For example, to build 32 bit Xerces-C++ libraries for x86 architecture on a 64 bit GNU/Linux system that by default produces 64 bit code, you will need to use the following runConfigure options:

runConfigure -p linux -c gcc -x g++ -z -m32 -l -m32

GCC on Solaris for both x86-64 and SPARC architectures by default generates 32 bit code. To build 64 bit Xerces-C++ libraries with GCC on Solaris, use the following runConfigure options:

runConfigure -p solaris -c gcc -x g++ -b 64 -z -m64 -l -m64

As another example, to build 32 bit universal libraries for Mac OS X use the following runConfigure options:

runConfigure -p macosx -c gcc -x g++ -z -arch -z i386 \
-z -arch -z ppc -l -arch -l i386 -l -arch -l ppc