Archive for the ‘C++ Compilers’ Category

Microsoft DLL export and C++ templates

Monday, January 18th, 2010

The other day I stumbled upon a really dark corner of the Microsoft dllexport/dllimport machinery. I can vividly see Windows toolchain engineers waking up in the middle of the night from a nightmare where they had to patch yet another crack in this DLL symbol export mess. This one has to do with the interaction of dllexport and C++ templates.

It all started with a user reporting duplicate symbol errors when he tried to split the XSD-generated code into two DLLs. The duplicate symbols were reported when linking the second DLL that depends on the “base” DLL and pointed to the destructor and assignment operator of a template instantiation, let’s say std::vector<int>. There were two additional strange things about this case: the errors only occurred in the debug build, and a number of other users had done a similar thing without ever getting any errors. The fact that the errors only appeared in the debug build got me thinking that in the release build these functions were inlined. The second strange aspect was harder to figure out: there was something special about this particular codebase that caused the error. After some investigation the following code fragment in the first DLL turned out to make the difference (BASE_EXPORT expands to either __declspec(dllexport) or __declspec(dllimport)):

class BASE_EXPORT ints: public std::vector<int>
{
  ...
};

As it turns out (see the end of the General Rules and Limitations article in MSDN), if an exported class inherits from a template instantiation that is not explicitly exported (yes, you can export certain instantiations of a template, see below), then the compiler implicitly applies dllexport to this template instantiation. So the above code fragment exports both the ints class and the std::vector<int> instantiation. On the surface this automatic exporting looks like a good idea. After all, if you export the derived class you will also need to export all its public bases since they are part of the interface. In the case of non-template bases you need to use the export mechanism explicitly, which makes sense. In the case of templates, you don’t want to have to explicitly export every instantiation. Plus, as pointed out in the MSDN article above, it is not always possible.
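
For reference, an export macro such as BASE_EXPORT is commonly defined along the following lines. This is only a sketch, not the actual setup from the user's project: the BASE_BUILD symbol (assumed to be defined only while compiling the “base” DLL) and the total() member are hypothetical names, and on non-Windows platforms the macro simply expands to nothing.

```cpp
#include <cstddef>
#include <vector>

// BASE_BUILD is a hypothetical symbol: define it when compiling the "base"
// DLL itself and leave it undefined in code that merely uses the DLL.
#if defined(_WIN32)
#  ifdef BASE_BUILD
#    define BASE_EXPORT __declspec(dllexport)
#  else
#    define BASE_EXPORT __declspec(dllimport)
#  endif
#else
#  define BASE_EXPORT // No import/export decoration on other platforms.
#endif

// An exported class derived from a template instantiation. According to the
// rule described above, VC++ implicitly exports std::vector<int> as well.
class BASE_EXPORT ints: public std::vector<int>
{
public:
  // Hypothetical member, added only to give the class some interface.
  int total () const
  {
    int r = 0;
    for (std::size_t i = 0; i < size (); ++i)
      r += (*this)[i];
    return r;
  }
};
```

Nothing in this fragment mentions exporting std::vector<int>, which is what makes the resulting duplicate symbols so hard to trace.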

But here is the other half of the picture: in the second DLL there is a source code file that doesn’t know anything about the ints class (that is, it doesn’t include the ints declaration). It also happens to use std::vector<int> in a fairly common way:

void f ()
{
  std::vector<int> v;
 
  ...
}

When the second DLL is linked, we end up with two sets of symbols for std::vector<int>: the first is exported from the “base” DLL and the second set is the result of the template instantiation in the above source code file. Duplicate symbol errors ensue.

At first it might seem puzzling that the same doesn’t happen with ordinary classes that contain inline functions. What if a class is exported from one DLL and then we use it in another? This doesn’t lead to errors even when inline functions are not inlined because in order to use the class we need to include its declaration. Once we do that all of its functions become imported from the first DLL and instead of “instantiating” an inline function the compiler simply uses the imported version from the first DLL. We get errors in the above scenario because when VC++ compiles the source file in the second DLL it has no knowledge of the fact that the functions it is about to instantiate were exported from the “base” DLL which this DLL happens to link to.

In standard C++ the toolchain is required to weed out the duplicate symbols that result from instantiations of the same template. When DLLs are involved, VC++ is unable to meet this requirement.

There is no clean way to work around this. In the scenario described above we can add an explicit import declaration for the std::vector<int> instantiation:

template class __declspec(dllimport) std::vector<int>;
 
void f ()
{
  std::vector<int> v;
 
  ...
}

Normally one would collect such manual imports in one header file and then include this file into every source file in the DLL.
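
Such a header might look as follows. This is a sketch under assumptions: the file name base-imports.hxx and the count_positive() function are made up for illustration, and the import declaration is the one shown above, guarded so that it only takes effect with VC++ (where it is meaningful only when linking against a DLL that really exports std::vector<int>).

```cpp
// base-imports.hxx (hypothetical name): manual imports for the template
// instantiations that the "base" DLL auto-exports. The idea is to include
// this header in every source file of the dependent DLL.
#ifndef BASE_IMPORTS_HXX
#define BASE_IMPORTS_HXX

#include <vector>

#ifdef _MSC_VER
// Same declaration as in the text above.
template class __declspec(dllimport) std::vector<int>;
#endif

#endif // BASE_IMPORTS_HXX

// A source file in the second DLL then uses the instantiation as usual.
// count_positive() is a hypothetical example function.
int count_positive (std::vector<int> const& v)
{
  int n = 0;
  for (std::vector<int>::const_iterator i = v.begin (); i != v.end (); ++i)
    if (*i > 0)
      ++n;
  return n;
}
```

Each additional auto-exported instantiation needs its own line in this header, which is the manual tracking burden discussed next.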

The major issue with this approach, apart from having to manually track imports, is that if you have two independent DLLs that each happen to auto-export std::vector<int> and you need to link to both of them, there is nothing you can do without changing at least one of those DLLs.

It also appears that Microsoft itself suffered from this pitfall, as is evident from the Exporting String Classes Using CStringT article in MSDN. The solution that it describes seems to be specific to this particular case, not that I could understand it fully.

C++ data alignment and portability

Monday, April 6th, 2009

The upcoming version of XSD/e adds support for serializing the object model to a number of binary data representation formats, such as XDR and CDR. It also supports custom binary formats. One person was beta-testing this functionality with the goal of achieving the fastest serialization/deserialization possible. He was willing to sacrifice the wider format portability across platforms as long as it was interoperable between iPhone OS and Mac OS X.

Since both iPhone OS on ARM and Mac OS X on x86 are little-endian and have compatible fundamental type sizes (e.g., int, long, double, etc., except for long double which is not used in XSD/e), the natural first optimization was to make the custom format’s endianness and type sizes those of the target platforms. This allowed optimizations such as reading/writing sequences of fundamental types with a memcpy() call instead of a for loop. After achieving these improvements he then suggested what would seem to be a natural next optimization. If we can handle fundamental types with memcpy(), why can’t we do the same for simple classes that don’t have any pointer members (fixed-length types in the XSD/e object model terms)? When designing a “raw” binary format like this, most people are aware of the type size and endianness compatibility issues. But there is another issue that we need to be aware of if we try this kind of optimization: data alignment compatibility.

First, a quick introduction to data alignment and C++ data structure padding. For a more detailed treatment of this subject, see, for example, Data alignment: Straighten up and fly right. Modern CPUs are capable of reading data from memory in chunks, for example, 2, 4, 8, or 16 bytes at a time. But due to the memory organization, the addresses of these chunks should be multiples of their sizes. If an address satisfies this requirement, then it is said to be properly aligned. The consequences of accessing data via an unaligned address can range from slower execution to program termination, depending on the CPU architecture and operating system.

Now let’s move one level up to C++. The language provides a set of fundamental types of various sizes. To make manipulating variables of these types fast, the generated object code will try to use CPU instructions which read/write the whole data type at once. This in turn means that the variables of these types should be placed in memory in a way that makes their addresses suitably aligned. As a result, besides size, each fundamental type has another property: its alignment requirement. It may seem that the fundamental type’s alignment is the same as its size. This is not generally the case since the most suitable CPU instruction for a particular type may only be able to access a part of its data at a time. For example, a CPU may only be able to read at most 4 bytes at a time so a 64-bit long long type will have a size of 8 and an alignment of 4.

GNU g++ has a language extension that allows you to query a type’s alignment. The following program prints fundamental type sizes and alignment requirements of a platform for which it was compiled:

#include <iostream>
 
using namespace std;
 
template <typename T>
void print (char const* name)
{
  cerr << name
       << " sizeof = " << sizeof (T)
       << " alignof = " << __alignof__ (T)
       << endl;
}
 
int main ()
{
  print<bool>        ("bool          ");
  print<wchar_t>     ("wchar_t       ");
  print<short>       ("short int     ");
  print<int>         ("int           ");
  print<long>        ("long int      ");
  print<long long>   ("long long int ");
  print<float>       ("float         ");
  print<double>      ("double        ");
  print<long double> ("long double   ");
  print<void*>       ("void*         ");
}

The following listing shows the result of running this program on a 32-bit x86 GNU/Linux machine. Notice the size and alignment of the long long, double, and long double types.

bool           sizeof = 1  alignof = 1
wchar_t        sizeof = 4  alignof = 4
short int      sizeof = 2  alignof = 2
int            sizeof = 4  alignof = 4
long int       sizeof = 4  alignof = 4
long long int  sizeof = 8  alignof = 4
float          sizeof = 4  alignof = 4
double         sizeof = 8  alignof = 4
long double    sizeof = 12 alignof = 4
void*          sizeof = 4  alignof = 4

[Actually, when run on this platform, the above program reports the alignment of long long and double as 8. The listing shows 4 instead because that is the alignment specified by the IA32 ABI. Also, if you wrap long long or double in a struct and take the alignment of the resulting type, it will be 4, not 8.]
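
The wrapping trick mentioned in the note can be sketched as follows. This version assumes a C++11 compiler and uses the standard alignof operator instead of the g++ __alignof__ extension; the wrap template and the struct_alignof() helper are hypothetical names introduced for illustration.

```cpp
#include <cstddef>

// Wrap T in a struct so that we query the alignment the compiler uses when
// it lays T out as a data member.
template <typename T>
struct wrap
{
  T value;
};

// Alignment of T when it is the only member of a struct.
template <typename T>
constexpr std::size_t struct_alignof ()
{
  return alignof (wrap<T>);
}

// Relations that hold regardless of the platform.
static_assert (struct_alignof<char> () == 1,
               "char is byte-aligned");
static_assert (sizeof (wrap<long long>) % struct_alignof<long long> () == 0,
               "the size is a multiple of the alignment");
```

According to the note above, struct_alignof<double>() should yield 4 on 32-bit x86 GNU/Linux, while the listing for x86-64 suggests 8 there.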

And the following listing is for 64-bit x86-64 GNU/Linux:

bool           sizeof = 1  alignof = 1
wchar_t        sizeof = 4  alignof = 4
short int      sizeof = 2  alignof = 2
int            sizeof = 4  alignof = 4
long int       sizeof = 8  alignof = 8
long long int  sizeof = 8  alignof = 8
float          sizeof = 4  alignof = 4
double         sizeof = 8  alignof = 8
long double    sizeof = 16 alignof = 16
void*          sizeof = 8  alignof = 8

The C++ compiler also needs to make sure that member variables in a struct or class are properly aligned. For this, the compiler may insert padding bytes between member variables. Additionally, to make sure that each element in an array of a user-defined type is aligned, the compiler may add some extra padding after the last data member. Consider the following struct as an example:

struct foo
{
  bool a;
  short b;
  long long c;
  bool d;
};

The compiler always assumes that an instance of foo will start at an address aligned to the strictest alignment requirement among foo’s members, which is that of long long in our case. This is actually how the alignment requirements of user-defined types are calculated. Assuming we are on x86-64 with short having an alignment of 2 and long long an alignment of 8, to make the b member suitably aligned, the compiler needs to insert an extra byte between a and b. Similarly, to align c, the compiler needs to insert four bytes after b. Finally, to make sure the next element in an array of foos starts at an address aligned to 8, the compiler needs to add seven bytes of padding at the end of struct foo. Here is the actual memory image of this struct with the positions of each member when the object is allocated at an example address 8:

                 // addr  alignment
struct foo       // 8     8
{
  bool a;        // 8     1
  char pad1[1];
  short b;       // 10    2
  char pad2[4];
  long long c;   // 16    8
  bool d;        // 24    1
  char pad3[7];
};               // 32    8  (next element in array)
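
The offsets in this listing can be checked with the standard offsetof macro. The sketch below assumes a C++11 compiler; the foo_layout struct and the layout_of_foo() function are hypothetical helpers, and no specific numbers are hard-coded since they depend on the platform.

```cpp
#include <cstddef>

struct foo
{
  bool a;
  short b;
  long long c;
  bool d;
};

// Member offsets plus the size and alignment of foo on the current platform.
struct foo_layout
{
  std::size_t a, b, c, d, size, align;
};

inline foo_layout layout_of_foo ()
{
  foo_layout l = {offsetof (foo, a), offsetof (foo, b),
                  offsetof (foo, c), offsetof (foo, d),
                  sizeof (foo), alignof (foo)};
  return l;
}

// Platform-independent consequences of the padding rules described above.
static_assert (offsetof (foo, c) % alignof (foo) == 0,
               "c, the most strictly aligned member, is suitably aligned");
static_assert (sizeof (foo) % alignof (foo) == 0,
               "array elements stay aligned");
```

On x86-64 GNU/Linux, layout_of_foo() would be expected to report offsets 0, 2, 8, and 16 with a size of 24 and an alignment of 8, which corresponds to the addresses 8, 10, 16, and 24 in the listing.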

Now back to our question about serializing simple classes with memcpy(). It should be clear by now that to be able to save a user-defined type with memcpy() on one platform and then load it on another, the two platforms not only need to have fundamental types of the same sizes and be of the same endianness, but they also need to be alignment-compatible. Otherwise, the positions of members inside the type and even the size of the type itself can differ. And this is exactly what happens if we try to move the data corresponding to foo between x86 and x86-64 even though the types used in the struct are of the same size. Here is what the padded memory image of foo on x86 looks like:

struct foo
{
  bool a;
  char pad1[1];
  short b;
  long long c;
  bool d;
  char pad2[3];
};

Because the alignment of long long on this platform is 4, padding between b and c is no longer necessary and padding at the end of the struct is 3 bytes instead of 7. The size of this struct is 16 bytes on x86 and 24 bytes on x86-64.
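
One way to keep memcpy()-style speed for individual members without depending on padding is to copy each member separately. The following is a minimal sketch of that idea, not the XSD/e implementation: the buffer typedef and the put(), get(), save(), and load() functions are hypothetical, and the code still assumes that both platforms agree on fundamental type sizes and endianness.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct foo
{
  bool a;
  short b;
  long long c;
  bool d;
};

typedef std::vector<unsigned char> buffer;

// Append the bytes of a single fundamental-type value to the buffer.
template <typename T>
void put (buffer& buf, T const& x)
{
  unsigned char const* p = reinterpret_cast<unsigned char const*> (&x);
  buf.insert (buf.end (), p, p + sizeof (T));
}

// Read a value back from position pos and advance pos. There is no bounds
// checking; the buffer is assumed to come from save().
template <typename T>
void get (buffer const& buf, std::size_t& pos, T& x)
{
  std::memcpy (&x, buf.data () + pos, sizeof (T));
  pos += sizeof (T);
}

// Write the members one by one so that no padding ends up in the image.
buffer save (foo const& f)
{
  buffer buf;
  put (buf, f.a);
  put (buf, f.b);
  put (buf, f.c);
  put (buf, f.d);
  return buf;
}

foo load (buffer const& buf)
{
  foo f;
  std::size_t pos = 0;
  get (buf, pos, f.a);
  get (buf, pos, f.b);
  get (buf, pos, f.c);
  get (buf, pos, f.d);
  return f;
}
```

With the type sizes from the listings above, the saved image is 12 bytes on both x86 and x86-64, at the cost of four small copies instead of one.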

[For those curious about Mac OS X on x86 and iPhone OS on ARM, they are alignment-compatible, as long as you don’t use long double which has different sizes on the two platforms.]

Virtual inheritance overhead in g++

Thursday, April 17th, 2008

By now every C++ engineer worth her salt knows that virtual inheritance is not free. It has object code, runtime (both CPU and memory), as well as compilation time and memory overheads (for an in-depth discussion on how virtual inheritance is implemented in C++ compilers see “Inside the C++ Object Model” by Stanley Lippman). In this post I would like to consider the object code as well as compilation time and memory overheads since in modern C++ implementations these are normally sacrificed for the runtime speed and can present major surprises. Unlike existing studies on this subject, I won’t bore you with “academic” metrics such as per class or per virtual function overhead or synthetic tests. Such metrics and tests have two main problems: they don’t give a feeling of the overhead experienced by real-world applications and they don’t factor in the extra code necessary to account for the lack of functionality otherwise provided by virtual inheritance.

It is hard to come by non-trivial applications that can provide the same functionality with and without virtual inheritance. I happened to have access to such an application and what follows is a quick description of the problem virtual inheritance was used to solve. I will then present some measurements of the overhead by comparing to the same functionality implemented without virtual inheritance.

The application in question is XSD/e, a validating XML parser/serializer generator for embedded systems. Given a definition of an XML vocabulary in XML Schema it generates a parser skeleton (C++ class) for each type defined in that vocabulary. Types in XML Schema can derive from each other and if two types are related by inheritance then it is often desirable to be able to reuse the base parser implementation in the derived one. To support this requirement, the current implementation of XSD/e uses the C++ mixin idiom that relies on virtual inheritance:

// Parser skeletons. Generated by XSD/e.
//
struct base
{
  virtual void
  foo () = 0;
};
 
struct derived: virtual base
{
  virtual void
  bar () = 0;
};
 
// Parser implementations. Hand-written.
//
struct base_impl: virtual base
{
  virtual void
  foo ()
  {
    ...
  }
};
 
struct derived_impl: virtual derived,
                     base_impl
{
  virtual void
  bar ()
  {
    ...
  }
};
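
To experiment with the mixin fragment above, the elided bodies can be filled in. The version below is a sketch with made-up content: the log member, the virtual destructor, and the run_mixin() function are additions for illustration and are not part of the generated code.

```cpp
#include <string>

// Parser skeletons (same shape as above).
struct base
{
  virtual ~base () {} // Added so that the sketch is safe to extend.
  virtual void foo () = 0;
};

struct derived: virtual base
{
  virtual void bar () = 0;
};

// Parser implementations with hypothetical bodies that record the calls.
struct base_impl: virtual base
{
  std::string log;

  virtual void foo () { log += "foo;"; }
};

struct derived_impl: virtual derived, base_impl
{
  virtual void bar () { log += "bar;"; }
};

// Call both functions through the skeleton interface and return the log.
inline std::string run_mixin ()
{
  derived_impl di;
  derived& d = di;
  d.foo (); // Resolved to base_impl::foo() via the shared virtual base.
  d.bar ();
  return di.log;
}
```

The point of the idiom is that derived_impl does not mention foo() at all, yet the call through derived ends up in base_impl.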

This approach works well but we quickly found out that for large vocabularies with hundreds of types the resulting object code produced by g++ was unacceptably large. Furthermore, on a schema with a little more than a thousand types, g++ with optimization turned on (-O2) ran out of memory on a machine with 2GB of RAM.

After some analysis we determined that virtual inheritance was to blame. To resolve this problem we developed an alternative, delegation-based implementation reuse method (it will appear in the next release of XSD/e) that is almost as convenient to use as the mixin approach, because all the support code is automatically generated by the XSD/e compiler. The idea behind the delegation-based approach is illustrated in the following code fragment:

// Parser skeletons. Generated by XSD/e.
//
struct base
{
  virtual void
  foo () = 0;
};
 
struct derived: base
{
  derived (base* impl)
    : impl_ (impl)
  {
  }
 
  virtual void
  bar () = 0;
 
  virtual void
  foo ()
  {
    assert (impl_);
    impl_->foo ();
  }
 
private:
  base* impl_;
};
 
// Parser implementations. Hand-written.
//
struct base_impl: base
{
  virtual void
  foo ()
  {
    ...
  }
};
 
struct derived_impl: derived
{
  derived_impl ()
    : derived (&base_impl_)
  {
  }
 
  virtual void
  bar ()
  {
    ...
  }
 
private:
  base_impl base_impl_;
};
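
The delegation fragment above can be made compilable in the same way. This is again a sketch: the log members, the impl() accessor, the virtual destructor, and the run_delegation() function are hypothetical additions for illustration.

```cpp
#include <cassert>
#include <string>

// Parser skeletons (same shape as above).
struct base
{
  virtual ~base () {} // Added so that the sketch is safe to extend.
  virtual void foo () = 0;
};

struct derived: base
{
  explicit derived (base* impl): impl_ (impl) {}

  virtual void bar () = 0;

  // Forward to the base parser implementation.
  virtual void foo ()
  {
    assert (impl_);
    impl_->foo ();
  }

private:
  base* impl_;
};

// Parser implementations with hypothetical bodies that record the calls.
struct base_impl: base
{
  std::string log;

  virtual void foo () { log += "foo;"; }
};

struct derived_impl: derived
{
  // Only the address of base_impl_ is taken here; the member itself is
  // constructed after the derived base subobject.
  derived_impl (): derived (&base_impl_) {}

  virtual void bar () { log += "bar;"; }

  base_impl const& impl () const { return base_impl_; }

  std::string log;

private:
  base_impl base_impl_;
};

// Call both functions through the skeleton interface and combine the logs.
inline std::string run_delegation ()
{
  derived_impl di;
  derived& d = di;
  d.foo (); // Forwarded to base_impl_.foo().
  d.bar ();
  return di.impl ().log + di.log;
}
```

Note that in this sketch copying a derived_impl object would leave impl_ pointing into the original object, so real code would need to handle or disable copying.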

The size-optimized (-Os) and stripped test executable built for the above-mentioned thousand-type schema using virtual inheritance is 15MB in size. It also takes 19 minutes to build and peak memory usage of the C++ compiler is 1.6GB. For comparison, the same executable built using the delegation-based approach is 3.7MB in size, takes 14 minutes to build, and peak memory usage is 348MB. That’s right, the executable is 4 times smaller. Note also that the generated parser skeletons are not just a bunch of pure virtual function signatures. They include XML Schema validation, data conversion, and dispatch code. The measurements also showed that the runtime performance of the two reuse approaches is about the same (most likely because g++ performs a similar delegation under the hood, except that it has to handle all possible use cases, hence the object code overhead).