Writing 64-bit safe code

October 13th, 2008

There are a number of disadvantages to having code that is unaware of 64-bit platforms. By unaware I mean using 32-bit types such as int and long (in Microsoft land long is 32-bit even in the 64-bit mode) to store memory-related values such as indexes, lengths, sizes, etc. The most obvious disadvantage is the possibility of a user of your application trying to handle a workload that does not fit into the 32-bit memory model. Even if they have a 64-bit machine and recompile your application in the 64-bit mode, the application would still be limited to 32-bit.

There are also less obvious disadvantages that affect you as a developer. You are probably using third-party APIs in your application. As most high-quality APIs and libraries (e.g., UNIX APIs, STL, Boost, etc.) have already been changed or are changing to support 64-bit platforms, you may find yourself having to litter your code with more and more type casts in order to suppress warnings about the potential data loss that some C++ compilers issue:

std::string s = ...;
unsigned int n = static_cast<unsigned int> (s.size ());
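One way to avoid such casts, assuming you are free to change the type of your own variables, is to store the value in the type the library itself uses. A minimal sketch (count_spaces is a hypothetical helper made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count the spaces in a string using the string's
// own size type for both the index and the result, so no cast is needed.
std::string::size_type count_spaces (const std::string& s)
{
  std::string::size_type n = 0;

  for (std::string::size_type i = 0; i < s.size (); ++i)
    if (s[i] == ' ')
      ++n;

  assert (n <= s.size ()); // Cannot have more spaces than characters.
  return n;
}
```

Since both i and n have the same type as the result of s.size (), the comparison and the return do not narrow anything and there is nothing for the compiler to warn about.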

Furthermore, if you are developing a library that is used by other developers then you are running the risk of upsetting those that make their applications 64-bit safe. They are now facing the same type of casting problem when interfacing with your code:

size_t i = ...;
your_container c = ...;
c.at (static_cast<unsigned int> (i));

Finally, as you become more aware of the 64-bit safety issues, every time you are writing an int to hold an index or size, an annoying doubt will cross your mind prompting you to stop and think whether it is possible that someone would need more than 32 bits in this particular case. Firstly, you cannot predict how much RAM computers will have and what people will want to do with that RAM in the future. Do you think in 1995, when Windows 95 was released with the then leading edge Win32 API, Microsoft imagined that only five years later, in 2000, the 64-bit extension to the x86 architecture would be announced and a few years later 64-bit desktop systems would start appearing? Secondly, it is just easier to consistently use 64-bit safe types for all memory-related values without having to stop and analyze individual cases.

The most straightforward way to make your C++ application 64-bit safe is to use the std::size_t (unsigned) and std::ptrdiff_t (signed) types found in the standard C++ cstddef header. (The similarly named ssize_t is a POSIX type declared in sys/types.h; it is not part of standard C++ and there is no std::ssize_t type.) On mainstream platforms these types are 32-bit integers on 32-bit platforms and 64-bit integers on the 64-bit ones. Furthermore, when operating systems and C++ compilers are ported to support 96-bit or 128-bit architectures, you won’t need to change anything in your code.

Use std::size_t for anything that relates directly or indirectly to RAM. This includes indexes, lengths, sizes, offsets, etc. For offsets that can be negative, use std::ptrdiff_t.
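As a rough illustration of this rule, here is a small sketch that uses std::size_t for a length and std::ptrdiff_t, the signed type that the standard cstddef header provides, for an offset that can be negative. The function names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Length of a C string. A length is memory-related and never negative,
// so std::size_t is the natural type.
std::size_t length (const char* s)
{
  assert (s != 0);

  std::size_t n = 0;
  while (s[n] != '\0')
    ++n;

  return n;
}

// Signed distance between two positions in the same array. Subtracting
// two pointers yields std::ptrdiff_t, so no cast is involved.
std::ptrdiff_t offset_between (const char* from, const char* to)
{
  return to - from;
}
```

For example, for const char* s = "hello", length (s) is 5 while offset_between (s + 4, s + 1) is -3.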

One common mistake is to use std::size_t for a file length or offset. These values are not related to RAM and, even on 32-bit systems, can be much greater than what a 32-bit integer can hold (e.g., a disk file can be larger than 4GB). In this situation it may make sense to use a 64-bit integer even on 32-bit platforms.
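A minimal sketch of this idea, assuming a compiler that provides the cstdint header. The file_size_type alias and both function names are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical alias: file lengths and offsets are always 64-bit,
// regardless of the pointer size of the platform.
typedef std::uint64_t file_size_type;

// Return true if the file length is representable as std::size_t, that
// is, if it is at least possible to address the whole file in memory.
bool fits_in_size_t (file_size_type n)
{
  return n <= std::numeric_limits<std::size_t>::max ();
}

// Convert a file length to a memory size. The caller is expected to
// have checked the range first.
std::size_t to_size_t (file_size_type n)
{
  assert (fits_in_size_t (n));
  return static_cast<std::size_t> (n);
}
```

The point of keeping the two types separate is that the conversion from a file length to a memory size becomes an explicit, checked step rather than a silent truncation on 32-bit platforms.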

Some APIs use signed int to return an index with -1 indicating some sort of error or “not found” conditions, for example:

class string_pool
{
public:
  // Return an index of the string or -1 if not found.
  //
  int find (const char*);
};

This approach has two problems. Firstly, it uses a 32-bit int for a memory-related index. Secondly, because negative numbers are reserved for indicating special conditions, this index can only address half of the 32-bit memory space.

One way to resolve the second problem when making this API 64-bit safe is to use ~size_t(0) to indicate the special condition:

#include <cstddef>

class string_pool
{
public:
  static const std::size_t invalid_index = ~std::size_t (0);

  // Return an index of the string or invalid_index if not found.
  //
  std::size_t find (const char*);
};

This works because a valid memory index can only be in the [0, ~size_t(0) - 1] range. The same approach, for example, is used in std::string, where npos plays this role.

Strictly speaking the same reasoning does not apply to sizes since a size can be ~size_t(0). In practice, however, it is not possible to allocate a memory block that takes up the whole address space (there would be no space left for OS, for instance) so this approach can also be used for sizes.
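Going back to the index case, here is one possible way to flesh out string_pool so that the special value can be seen in action. The add() function and the vector-based storage are assumptions made for the sake of the example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class string_pool
{
public:
  static const std::size_t invalid_index = ~std::size_t (0);

  // Add a string to the pool and return its index (hypothetical).
  std::size_t add (const char* s)
  {
    assert (s != 0);
    strings_.push_back (s);
    return strings_.size () - 1;
  }

  // Return an index of the string or invalid_index if not found.
  std::size_t find (const char* s) const
  {
    for (std::size_t i = 0; i < strings_.size (); ++i)
      if (strings_[i] == s)
        return i;

    return invalid_index;
  }

private:
  std::vector<std::string> strings_;
};

// Out-of-class definition in case the constant is bound to a reference.
const std::size_t string_pool::invalid_index;
```

The caller then compares the result of find() to string_pool::invalid_index instead of testing for a negative value.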

The straightforward approach of changing all memory-related values to std::size_t may not work for some situations. The two most notable are binary serialization (e.g., for object persistence) and high memory usage data structures. In the case of binary serialization, the serialized data most likely has to be portable between 32-bit and 64-bit systems. In this case using types that have the same size on all platforms is the easiest route to portability. The C header stdint.h defines a number of such types: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t. C++ TR1 defines the cstdint wrapper header though it may not yet be implemented in all C++ compilers.
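To illustrate the serialization case, the following sketch stores a std::size_t value as a fixed 64-bit integer. The little-endian byte order, the function names, and the use of a byte vector as the buffer are all assumptions of this example, and it needs a compiler that provides cstdint:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Append a size to the buffer as a 64-bit little-endian integer so that
// the serialized layout does not depend on sizeof (std::size_t).
void write_size (std::vector<unsigned char>& buf, std::size_t n)
{
  std::uint64_t v = n;

  for (int i = 0; i < 8; ++i)
    buf.push_back (static_cast<unsigned char> ((v >> (8 * i)) & 0xFF));
}

// Read a size written by write_size() starting at position pos. Return
// false if the value does not fit into std::size_t on this platform.
bool read_size (const std::vector<unsigned char>& buf,
                std::size_t pos,
                std::size_t& n)
{
  assert (pos <= buf.size () && buf.size () - pos >= 8);

  std::uint64_t v = 0;

  for (int i = 0; i < 8; ++i)
    v |= static_cast<std::uint64_t> (buf[pos + i]) << (8 * i);

  if (v > std::numeric_limits<std::size_t>::max ())
    return false;

  n = static_cast<std::size_t> (v);
  return true;
}
```

Note the range check on the reading side: data written on a 64-bit system may contain sizes that a 32-bit reader cannot represent, and it is better to detect this than to truncate silently.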

In high memory usage data structures changing from 32-bit sizes to 64-bit may result in an unacceptably high overhead. Consider, for example, a string table that has to hold millions of short strings in memory. Having a 64-bit (8 bytes) string size might be too high an overhead. If all the strings are known to be at most 255 bytes long then uint8_t might be a better choice for storing sizes in this situation.
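Here is one way such a table could look. This is only a sketch: the class name, the length-prefixed layout, and the linear lookup are choices made to keep the example short, and a real table would probably want a faster way to locate the i-th string:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical table of short strings. Each entry is stored as a single
// length byte followed by the characters, so the per-string overhead is
// 1 byte instead of 8.
class short_string_table
{
public:
  short_string_table (): count_ (0) {}

  void add (const std::string& s)
  {
    assert (s.size () <= 255); // The length must fit into uint8_t.

    data_.push_back (static_cast<std::uint8_t> (s.size ()));
    data_.insert (data_.end (), s.begin (), s.end ());
    ++count_;
  }

  std::size_t size () const {return count_;}
  std::size_t bytes_used () const {return data_.size ();}

  // Return the i-th string by walking the length prefixes.
  std::string get (std::size_t i) const
  {
    assert (i < count_);

    std::size_t pos = 0;
    for (; i != 0; --i)
      pos += std::size_t (1) + data_[pos];

    const std::uint8_t* p = data_.data () + pos;
    return std::string (reinterpret_cast<const char*> (p + 1), p[0]);
  }

private:
  std::vector<std::uint8_t> data_;
  std::size_t count_;
};
```

Note that the table's own counters and positions are still std::size_t; only the per-string length, which is known to be small, uses the narrow type.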

Xerces-C++ 3.0.0 Released

October 6th, 2008

Quite a few people believed this would never happen but after many years of development Xerces-C++ 3.0.0 is finally out. This major release includes a large number of new features, bug fixes, and clean-ups. It also happens to break a few interfaces (especially in DOM) so application adjustments may be required. For the complete list of changes in this version refer to the official announcement on the project’s mailing lists. In this post I am going to cover some of the major improvements in more detail.

As with 2.8.0, this release comes with a wide range of precompiled libraries (total 17) for various CPU architectures, operating systems, and C++ compilers. For most platforms 32-bit and 64-bit variants are provided. Note also that while the libraries are built using specific C++ compiler versions, most of them will also work with newer versions of the same compilers. For example, libraries built with GCC 3.4.x will also work with GCC 4.x.y. Similarly, libraries built with Sun C++ 5.7 (Studio 10) will work with Sun C++ 5.8.

The first thing GNU/Linux, UNIX, and Mac OS X users will notice is the new, automake-based build system for these platforms. There is no more XERCESCROOT or runConfigure and all the standard configure options are supported. There are also a number of options specific to Xerces-C++ which can all be viewed by executing configure --help. For Windows users the distribution comes with VC++ project files. In this release a set for VC++ 9.0 (2008) was added. Additionally, project files for VC++ 7.1, 8.0, and 9.0 now include targets to build Xerces-C++ with the ICU library as a character transcoder.

Other infrastructure work includes the removal of deprecated components (DepDOM, COM) as well as project files for unmaintained compilers. The documentation was cleaned up and split into the website and library categories with the Xerces-C++ distributions now only including the library documentation (build instructions, programming guides, etc.). Overall, I believe all of this will get the Xerces-C++ project back on the regular release track with the next release (3.1.0) in about a year.

Now to the new functionality in the library itself. The Xerces-C++ component that got the most work in this release is probably XML Schema. It includes a large number of bug fixes and errata changes. In particular the long-standing bug that resulted in long execution times and stack overflows on schemas with large minOccurs and maxOccurs values has been fixed. Also the new interpretation of the ##other namespace designator has been implemented. Related but not limited to XML Schema is the work done to review and clean up all the diagnostic messages issued by Xerces-C++. They were all clarified and now consistently start with a lower-case letter and do not include a period at the end.

Prior to the 3.0.0 release Xerces-C++ included the draft DOM XPath 1 interfaces that were barely usable and required a lot of casting to the implementation when used with XPath 2 processors such as XQilla. In 3.0.0, the DOM XPath interfaces were extended to support both XPath 1 and XPath 2 data models. As a result, the application can now depend only on interfaces. The 2.2.0 release of XQilla, due in a few weeks, will include support for Xerces-C++ 3.0.0. Furthermore, the 3.0.0 release implements the XML Schema subset of XPath 1 in DOM. This allows you to execute basic XPath queries without requiring a separate XPath processor library.

Another major change in Xerces-C++ 3.0.0 is the porting of all public interfaces and a major part of the implementation to use 64-bit safe types. This means that if you design your application to be 64-bit safe (e.g., use std::size_t for indexes, lengths, etc.), then you don’t need to perform any casts when interfacing with Xerces-C++.

Finally, a number of performance-critical parts were optimized for speed in this release. This resulted, for example, in both DOM parsing and XML Schema validation showing about 25%-30% improvement compared to 2.8.0.

When I first started working on the 3.0.0 code base it was in quite a mess, with the automake-based build system still unfinished and most of the source code 64-bit ignorant. At that point I decided that we would need to maintain both 2.8.0 and 3.0.0 in parallel in case the 3.0.0 release happened to be a disaster. This is why the Xerces-C++ project website now includes two sections, one for 2.8.0 and one for 3.0.0. As a release manager, my primary goals for Xerces-C++ 3.0.0 became to make it cleaner, easier to build, better tested, as well as to provide better XML Schema support. Two betas later, I think 3.0.0 came out to be a very solid release, better than 2.8.0 in every aspect and, in retrospect, making that website split probably unnecessary. In fact, we were confident enough to build all our XSD 3.2.0 binary distributions with Xerces-C++ 3.0.0.

Are you using a real XML parser?

May 19th, 2008

Recently there’s been a bunch of announcements of new XML parsers that claim to be very fast, very small or both. I also see a lot of people get very enthusiastic about using them in their applications. Just the other day I got an email from a user asking if it was possible to use CodeSynthesis XSD with a light-weight XML parser that he found instead of Xerces-C++. Out of curiosity I checked the parser’s description and there I saw a number of common traits of most new, fast, and small XML parsers these days: no support for DTD (internal subset) and CDATA sections, limited support for character and entity references.

Once I tell people about these problems with their choice of XML parser, some say they don’t care since nobody uses these features anyway. It is true that most of these features are seldom used. You can get away with using a non-conforming XML parser if you control the production of XML you are planning to parse and thus restrict the set of XML constructs that can appear in your documents. This, however, is not a very common situation. If you control both the production and consumption of the data then you might as well choose a more natural (for your application and environment) and efficient (that’s why you chose this new parser, after all) exchange format than XML. A more common scenario is when you have to parse XML supplied by various third parties and once you say your format is XML then all bets are off; it is only a matter of time before someone sends you a perfectly valid XML document that your application won’t be able to handle.

The W3C XML specification defines two types of conforming XML parsers: validating and non-validating (see Section 5.1, “Validating and Non-Validating Processors”). Any parser that wants to be called capable of parsing XML 1.0 documents must at least satisfy the non-validating parser requirements. Besides the expected things like being able to parse elements and attributes as well as making sure they are well-formed, a conforming non-validating XML parser should also support the following:

  • At least UTF-8 and UTF-16 character encodings
  • CDATA sections (<![CDATA[<greeting>Hello, world!</greeting>]]>)
  • Character references (&#x20;)
  • Entity references including predefined (&amp;) and user-defined in the internal DTD subset
  • Parse and check for well-formedness the internal DTD subset
  • Normalize and supply default attribute values according to the internal DTD subset

The internal DTD subset consists of the DTD declarations that are found in the XML document itself as opposed to the external subset which consists of the declarations placed into separate DTD files and referenced from the XML documents. Here is a sample document that uses most of the above features (download it to test your favorite parser: test.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE hello [
<!ENTITY greeting-text "hello">
<!ATTLIST greeting lang NMTOKEN "en">
<!ATTLIST name lang NMTOKEN "en">
]>
<hello>
  <greeting>&greeting-text;</greeting>
  <name lang="  fr  ">tout&#x20;le&#x20;<![CDATA[monde]]></name>
</hello>

Parsing this document with a conforming non-validating XML parser should be equivalent to parsing the following document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<hello>
  <greeting lang="en">hello</greeting>
  <name lang="fr">tout le monde</name>
</hello>
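The lang="  fr  " to lang="fr" change comes from attribute value normalization: for attribute types other than CDATA (here NMTOKEN), the parser discards leading and trailing spaces and collapses runs of spaces into one. The following is a simplified sketch of just that step. It assumes that character and entity references have already been expanded, treats any literal tab, newline, or carriage return as a space, and uses a made-up function name:

```cpp
#include <cassert>
#include <string>

// Simplified normalization of a non-CDATA attribute value: strip
// leading and trailing whitespace and collapse inner runs to one space.
std::string normalize_tokenized (const std::string& v)
{
  std::string r;
  bool pending_space = false;

  for (std::string::size_type i = 0; i < v.size (); ++i)
  {
    char c = v[i];

    if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
    {
      // Remember the gap but only if there is already some text
      // before it. It is emitted only if more text follows.
      pending_space = !r.empty ();
    }
    else
    {
      if (pending_space)
        r += ' ';

      pending_space = false;
      r += c;
    }
  }

  // The result never starts or ends with a space.
  assert (r.empty () || (r[0] != ' ' && r[r.size () - 1] != ' '));
  return r;
}
```

A parser that ignores the internal DTD subset does not know that lang is declared as NMTOKEN, so it will treat the value as CDATA and report it with the surrounding spaces intact.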

Here is the list of C and C++ XML parsers that are either conforming or strive to be conforming:

  • Xerces-C++: C++, DOM, SAX2, validating (DTD and XML Schema)
  • Libxml2: C, DOM, SAX2, Pull, validating (DTD)
  • Expat: C, SAX2, non-validating, small & fast
  • Faxpp: C, Pull, non-validating (no default/normalized attributes), small & fast

And here is a list of parsers that while calling themselves XML parsers, are actually parsers for markup languages that are subsets of XML (based on their description at the time of writing):

  • VTD-XML
  • Rapidxml
  • TinyXML
  • XML parser provided by applied-mathematics.net