Archive for the ‘C++’ Category

C++ data alignment and portability

Monday, April 6th, 2009

The upcoming version of XSD/e adds support for serializing the object model to a number of binary data representation formats, such as XDR and CDR. It also supports custom binary formats. One person was beta-testing this functionality with the goal of achieving the fastest serialization/deserialization possible. He was willing to sacrifice the wider format portability across platforms as long as it was interoperable between iPhone OS and Mac OS X.

Since both iPhone OS on ARM and Mac OS X on x86 are little-endian and have compatible fundamental type sizes (e.g., int, long, double, etc., except for long double, which is not used in XSD/e), the natural first optimization was to match the custom format’s endianness and type sizes to those of the target platforms. This allowed optimizations such as reading/writing sequences of fundamental types with a memcpy() call instead of a for loop. After achieving these improvements, he suggested what would seem like a natural next optimization: if we can handle fundamental types with memcpy(), why can’t we do the same for simple classes that don’t have any pointer members (fixed-length types in the XSD/e object model terms)? When designing a “raw” binary format like this, most people are aware of the type size and endianness compatibility issues. But there is another issue that we need to be aware of if we attempt this kind of optimization: data alignment compatibility.
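The sequence optimization can be sketched as follows (the buffer layout and function name are hypothetical, not XSD/e’s actual API; it is only valid because the format’s endianness and type sizes are defined to match the target platforms):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Append n doubles to a raw byte buffer with one memcpy() call
// instead of a per-element loop.
void write_doubles (std::vector<char>& buf, const double* v, std::size_t n)
{
  if (n == 0)
    return;

  std::size_t pos = buf.size ();
  buf.resize (pos + n * sizeof (double));
  std::memcpy (&buf[pos], v, n * sizeof (double));
}
```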

First, a quick introduction to data alignment and C++ data structure padding. For a more detailed treatment of this subject see, for example, Data alignment: Straighten up and fly right. Modern CPUs read data from memory in chunks, for example, 2, 4, 8, or 16 bytes at a time. Due to the memory organization, the addresses of these chunks should be multiples of their sizes. An address that satisfies this requirement is said to be properly aligned. The consequences of accessing data via an unaligned address can range from slower execution to program termination, depending on the CPU architecture and operating system.

Now let’s move one level up to C++. The language provides a set of fundamental types of various sizes. To make manipulating variables of these types fast, the generated object code will try to use CPU instructions which read/write the whole data type at once. This in turn means that the variables of these types should be placed in memory in a way that makes their addresses suitably aligned. As a result, besides size, each fundamental type has another property: its alignment requirement. It may seem that the fundamental type’s alignment is the same as its size. This is not generally the case since the most suitable CPU instruction for a particular type may only be able to access a part of its data at a time. For example, a CPU may only be able to read at most 4 bytes at a time so a 64-bit long long type will have a size of 8 and an alignment of 4.

GNU g++ has a language extension that allows you to query a type’s alignment. The following program prints fundamental type sizes and alignment requirements of a platform for which it was compiled:

#include <iostream>
 
using namespace std;
 
template <typename T>
void print (char const* name)
{
  cerr << name
       << " sizeof = " << sizeof (T)
       << " alignof = " << __alignof__ (T)
       << endl;
}
 
int main ()
{
  print<bool>        ("bool          ");
  print<wchar_t>     ("wchar_t       ");
  print<short>       ("short int     ");
  print<int>         ("int           ");
  print<long>        ("long int      ");
  print<long long>   ("long long int ");
  print<float>       ("float         ");
  print<double>      ("double        ");
  print<long double> ("long double   ");
  print<void*>       ("void*         ");
}

The following listing shows the result of running this program on a 32-bit x86 GNU/Linux machine. Notice the size and alignment of the long long, double, and long double types.

bool           sizeof = 1  alignof = 1
wchar_t        sizeof = 4  alignof = 4
short int      sizeof = 2  alignof = 2
int            sizeof = 4  alignof = 4
long int       sizeof = 4  alignof = 4
long long int  sizeof = 8  alignof = 4
float          sizeof = 4  alignof = 4
double         sizeof = 8  alignof = 4
long double    sizeof = 12 alignof = 4
void*          sizeof = 4  alignof = 4

[Actually, running the above program on x86 reports the alignment of long long and double as 8. This is, however, not their real alignment: the IA32 ABI specifies that it should be 4. Also, if you wrap long long or double in a struct and take the alignment of the resulting type, it will be 4, not 8.]

And the following listing is for 64-bit x86-64 GNU/Linux:

bool           sizeof = 1  alignof = 1
wchar_t        sizeof = 4  alignof = 4
short int      sizeof = 2  alignof = 2
int            sizeof = 4  alignof = 4
long int       sizeof = 8  alignof = 8
long long int  sizeof = 8  alignof = 8
float          sizeof = 4  alignof = 4
double         sizeof = 8  alignof = 8
long double    sizeof = 16 alignof = 16
void*          sizeof = 8  alignof = 8

The C++ compiler also needs to make sure that member variables in a struct or class are properly aligned. For this, the compiler may insert padding bytes between member variables. Additionally, to make sure that each element in an array of a user-defined type is aligned, the compiler may add some extra padding after the last data member. Consider the following struct as an example:

struct foo
{
  bool a;
  short b;
  long long c;
  bool d;
};

The compiler always assumes that an instance of foo will start at an address aligned to the strictest alignment requirement of all of foo’s members, which is long long in our case. This is actually how the alignment requirements of user-defined types are calculated. Assuming we are on x86-64, where short has an alignment of 2 and long long an alignment of 8, to make the b member suitably aligned the compiler needs to insert an extra byte between a and b. Similarly, to align c, the compiler needs to insert four bytes after b. Finally, to make sure the next element in an array of foos starts at an address aligned to 8, the compiler needs to add seven bytes of padding at the end of struct foo. Here is the actual memory image of this struct with the positions of each member when the object is allocated at an example address 8:

                 // addr  alignment
struct foo       // 8     8
{
  bool a;        // 8     1
  char pad1[1];
  short b;       // 10    2
  char pad2[4];
  long long c;   // 16    8
  bool d;        // 24    1
  char pad3[7];
};               // 32    8  (next element in array)
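We can verify this layout with offsetof; a small sketch (the commented values are what g++ produces on x86-64 and will differ on platforms with other alignments):

```cpp
#include <cstddef>

struct foo
{
  bool a;
  short b;
  long long c;
  bool d;
};

// Where the compiler actually placed each member.
const std::size_t foo_off_b = offsetof (foo, b); // 2  on x86-64
const std::size_t foo_off_c = offsetof (foo, c); // 8  on x86-64
const std::size_t foo_off_d = offsetof (foo, d); // 16 on x86-64
const std::size_t foo_size  = sizeof (foo);      // 24 on x86-64
```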

Now back to our question about serializing simple classes with memcpy(). It should be clear by now that to be able to save a user-defined type with memcpy() on one platform and then load it on another, the two platforms not only need to have fundamental types of the same sizes and be of the same endianness, but they also need to be alignment-compatible. Otherwise, the positions of members inside the type and even the size of the type itself can differ. And this is exactly what happens if we try to move the data corresponding to foo between x86 and x86-64, even though the types used in the struct are of the same size. Here is what the padded memory image of foo on x86 looks like:

struct foo
{
  bool a;
  char pad1[1];
  short b;
  long long c;
  bool d;
  char pad2[3];
};

Because the alignment of long long on this platform is 4, padding between b and c is no longer necessary and padding at the end of the struct is 3 bytes instead of 7. The size of this struct is 16 bytes on x86 and 24 bytes on x86-64.
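If whole-struct images must cross platforms, one defensive option is to record the layout in the stream header and verify it on load; a hypothetical sketch (the fingerprint struct and function names are illustrative):

```cpp
#include <cstddef>

struct foo
{
  bool a;
  short b;
  long long c;
  bool d;
};

// Hypothetical layout fingerprint: the writer stores it in the stream
// header, the reader compares it with its own before trusting
// memcpy'd images of foo.
struct foo_layout
{
  std::size_t size;
  std::size_t off_b, off_c, off_d;
};

foo_layout local_foo_layout ()
{
  foo_layout l;
  l.size  = sizeof (foo);
  l.off_b = offsetof (foo, b);
  l.off_c = offsetof (foo, c);
  l.off_d = offsetof (foo, d);
  return l;
}

bool layout_compatible (const foo_layout& remote)
{
  foo_layout local = local_foo_layout ();
  return local.size  == remote.size  &&
         local.off_b == remote.off_b &&
         local.off_c == remote.off_c &&
         local.off_d == remote.off_d;
}
```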

[For those curious about Mac OS X on x86 and iPhone OS on ARM, they are alignment-compatible, as long as you don’t use long double which has different sizes on the two platforms.]

Parallel compilation from the command line

Sunday, March 29th, 2009

I often need to compile a bunch of generated C++ files from the command line to make sure everything compiles cleanly and to run a quick test. These are C++ data binding files generated by XSD and XSD/e from various XML schemas. There can be quite a few files (like a thousand) or there can be a handful of them, but each is quite large and takes a long time to compile. I used to run g++ from the command line, which compiles one file at a time:

g++ -c *.cxx

This was slow. Annoyingly, I also had three more CPU cores idling while they could have been used to speed things up. If it were a longer-term project, then I would have created a makefile for it and run GNU make in parallel (-j 4). But these were just quick tests and creating a makefile for each set of schemas that I test seemed like a chore (after the initial test the schemas are added to a test repository where a set of special scripts and makefiles are used to automatically check for regressions before each release).

What would then be nice is a single, generic makefile that I could use to compile a set of files in parallel. Ideally, it shouldn’t be much more typing than the simple g++ invocation above. It is quite easy to come up with such a makefile for GNU make:

ifeq ($(src),)
src := *.cxx
endif
 
obj := $(patsubst %.cxx,%.o,$(wildcard $(src)))
 
.PHONY: all clean
all: $(obj)
 
clean:
  rm -f $(obj)
 
%.o: %.cxx
  $(CXX) $(CPPFLAGS) $(CXXFLAGS) -c $< -o $@

Now I can compile all the C++ files in a directory in parallel:

make -f ~/makefile -j 4

I can also specify a custom set of C++ files using the src variable:

make -f ~/makefile -j 4 src="test*.cxx driver.cxx"

shared_ptr: where is the counter

Sunday, January 11th, 2009

There are two common ways of allocating a reference counter that is used to track object ownership in a smart pointer, such as shared_ptr. The oldest method is to place the counter into the object itself, usually via a special base class. Such a counter is sometimes called intrusive. The more recent approach, used, for example, in Boost, is to allocate the counter on the heap independently from the object. There are also two less commonly used methods of allocating the counter. The first is to allocate the counter as part of the memory block used for the object itself. The second uses the number of smart pointer instances pointing to the object as an implicit reference count. Each of the four approaches has a number of drawbacks which I will examine in more detail below. The last two methods pose some implementation challenges which are also discussed below.

We will use the following four main criteria to evaluate each approach: existing practices compatibility, ease of use, performance, and concurrency support. By existing practices compatibility I mean changes that a user may have to make to the design and implementation of their classes in order to support shared_ptr. The ease of use criterion refers to the complexity of rules governing the use of shared_ptr. Performance refers to memory and speed overhead incurred by the sharing semantics. Finally, concurrency support examines how well each approach can be implemented in a multi-threaded environment.

We will start with the intrusive counter approach in which the reference counter is part of the object itself. The following code snippet outlines the essence of this method:

struct shared_base
{
  void inc_ref ();
  void dec_ref ();
 
private:
  size_t counter_;
};
 
class type: public shared_base
{
  ...
};
 
shared_ptr<type> x (new type);

The existing practices compatibility of this approach is quite poor. Each class that needs to be shared has to inherit from shared_base which may be impossible to do for third-party code. Furthermore, the designer of a class now needs to decide whether its users may want to share the objects of this class. If such a class can be used for both shared objects and for non-shared or stack-allocated objects, in the latter two cases the counter will be a waste of space.

The intrusive counter approach is easy to use in the client code. In particular, this approach is “naked” pointer compatible. That is, the object that is referenced by two or more shared_ptr instances can be passed around as a normal pointer, for example:

shared_ptr<type> x (new type); // counter_ == 1
shared_ptr<type> y (x);        // counter_ == 2
type* p = x.release ();        // counter_ == 2
shared_ptr<type> z (p);        // counter_ == 2

There is no speed or memory overhead in this approach (except for non-shared and stack-allocated objects, as mentioned above). The thread-safe version is normally implemented using atomic operations.
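Filling in the outline, a minimal single-threaded sketch might look like this (the thread-safe version would use atomic operations on counter_; assignment is omitted):

```cpp
#include <cstddef>

struct shared_base
{
  shared_base (): counter_ (1) {}
  virtual ~shared_base () {}

  void inc_ref () { ++counter_; }
  void dec_ref () { if (--counter_ == 0) delete this; }

private:
  std::size_t counter_;
};

template <typename T>
class shared_ptr
{
public:
  // The counter starts at 1 in shared_base, so constructing from a
  // "naked" pointer does not increment it.
  explicit shared_ptr (T* p = 0): p_ (p) {}
  shared_ptr (const shared_ptr& x): p_ (x.p_) { if (p_) p_->inc_ref (); }
  ~shared_ptr () { if (p_) p_->dec_ref (); }

  // Give up ownership without touching the counter.
  T* release () { T* r = p_; p_ = 0; return r; }

  T* get () const { return p_; }
  T* operator-> () const { return p_; }

private:
  shared_ptr& operator= (const shared_ptr&); // not needed for this sketch

  T* p_;
};
```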

Allocating the counter on the heap independently from the object has good existing practices compatibility. There is no need to make any special changes to classes, and third-party code can be used without any restrictions. The only problem arises when a class implementation needs to manipulate the reference count, for example, a class that needs to register itself in a map that holds shared_ptr instances.

The external counter approach is easy to use in the client code. However, using “naked” pointers for certain operations can cause problems since there is no way to obtain the counter from a normal pointer to the object. To overcome this, shared_ptr with an external counter usually does not provide release() and instead provides the weak_ptr class template to emulate normal pointers. This can complicate the client code.

There is a substantial speed and memory overhead in this approach since the counter requires a separate allocation on the heap. This can be a serious problem for applications that allocate a large number of relatively small objects. In such situations the counter overhead can become comparable to that of the object since heap implementations usually allocate a memory block of certain minimum size even if the requested size is smaller. The thread-safe version is normally implemented using atomic operations.
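For comparison, a minimal single-threaded sketch of the external counter scheme (weak_ptr and assignment omitted):

```cpp
#include <cstddef>

template <typename T>
class shared_ptr
{
public:
  // The counter lives in its own heap block, separate from the object.
  explicit shared_ptr (T* p): p_ (p), counter_ (new std::size_t (1)) {}

  shared_ptr (const shared_ptr& x): p_ (x.p_), counter_ (x.counter_)
  {
    ++*counter_;
  }

  ~shared_ptr ()
  {
    if (--*counter_ == 0)
    {
      delete p_;
      delete counter_;
    }
  }

  std::size_t use_count () const { return *counter_; }
  T* operator-> () const { return p_; }

private:
  shared_ptr& operator= (const shared_ptr&); // not needed for this sketch

  T* p_;
  std::size_t* counter_;
};
```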

The logical improvement to the previous approach’s inefficiency is to somehow allocate the counter in the same memory block as the object itself. For example, we could overload operator new to allocate extra space and place the counter before the memory block occupied by the object. The shared_ptr class template could then offset the pointer to the object to access the counter. The following example shows how we could use such an implementation:

shared_ptr<type> x (new (shared) type);
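A naive sketch of such an operator new and counter lookup (the share tag type and the two-word offset are illustrative; a real implementation would also provide the matching placement operator delete and respect the strictest alignment):

```cpp
#include <cstddef>
#include <new>

struct share { };
const share shared = share ();

// Allocate room for the counter in front of the object. Two size_t's
// are used (instead of one) to keep the object suitably aligned for
// types with alignment up to 2 * sizeof (size_t).
void* operator new (std::size_t n, const share&)
{
  std::size_t* p =
    static_cast<std::size_t*> (::operator new (n + 2 * sizeof (std::size_t)));
  *p = 1; // initial reference count
  return p + 2;
}

// Locate the counter by offsetting the object pointer.
std::size_t& counter (void* obj)
{
  return *(static_cast<std::size_t*> (obj) - 2);
}
```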

There are a number of challenges that such an implementation would have to overcome when locating the counter based on the object pointer. The problem with simply offsetting the passed pointer is that it may not point to the beginning of the memory region occupied by the object. Consider the following code as an example:

struct base1
{
  int a;
};
 
struct base2
{
  int b;
};
 
struct derived: base1, base2
{
};
 
int main ()
{
  derived d;
  base1* b1 = &d;
  base2* b2 = &d;
 
  cerr << "derived : " << &d << endl
       << "base1   : " << b1 << endl
       << "base2   : " << b2 << endl;
}

On my machine the above example prints:

derived : 0xffb3ba34
base1   : 0xffb3ba34
base2   : 0xffb3ba38

As you can see, the base2 object is offset 4 bytes from the start of the memory region occupied by the derived object. Consider now what happens if the user writes the following and we simply offset the passed pointer to get to the counter:

shared_ptr<base2> x (new (shared) derived);

What we need then is a way to get to the top of the memory region occupied by an object even if we are given a pointer to one of its base sub-objects. One interesting and relatively unknown feature of dynamic_cast can help us with this. The standard specifies that dynamic_cast to void* returns a pointer to the most derived object pointed to by the argument. While this is exactly what we need, dynamic_cast only works on polymorphic types (classes that have virtual functions).
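Here is a small sketch of this technique applied to the multiple inheritance example above (the bases are given virtual destructors to make the types polymorphic):

```cpp
struct base1 { virtual ~base1 () {} int a; };
struct base2 { virtual ~base2 () {} int b; };
struct derived: base1, base2 {};

// dynamic_cast to void* returns a pointer to the start of the most
// derived object, even if p points into the middle of it.
void* most_derived (base2* p)
{
  return dynamic_cast<void*> (p);
}
```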

What about non-polymorphic types? When we pass a base pointer to shared_ptr, the object will get deleted via this base pointer. Unless the base type has a virtual destructor (which would make it a polymorphic type), this operation results in undefined behavior since the compiler has no way of calling the derived destructor or of finding where the memory block occupied by the object starts. In other words, in the case of non-polymorphic types, it is only valid to instantiate shared_ptr with the most-derived type. And in that case we don’t need dynamic_cast and can offset the pointer directly.

Another potential problem with using dynamic_cast arises when the class tries to manipulate the counter in its constructor. During the base type construction, the dynamic_cast mechanism is not yet fully functional and an attempt to get a pointer to the most derived object will fail. In this case, however, the class is specifically designed to be shared and reference counted (since the implementation manipulates the counter) and therefore intrusive reference counting can be used without any disadvantages.

The handling of polymorphic and non-polymorphic objects as well as those with intrusive counters can be done transparently with a little bit of meta-programming, as shown in the sample implementation.

The existing practices compatibility of this approach is fairly good since no changes are required to classes to support reference counting. The only problem arises when an object to be shared is allocated by a function (e.g., by a factory object or the clone() function). In this case the function needs to be extended to support sharing, for example:

class type
{
public:
  virtual type*
  clone (share s = exclusive) const
  {
    return new (s) type (*this);
  }
};
 
type x;
type* y = x.clone ();
shared_ptr<type> z = x.clone (shared);

Or, if this interface is to be used by third-party code, something more generic may be more appropriate, for example:

struct allocator
{
  virtual void*
  allocate (size_t) = 0;
};
 
class type
{
public:
  virtual type*
  clone (allocator* a = 0) const
  {
    return a
      ? new (a->allocate (sizeof (type))) type (*this)
      : new type (*this);
  }
};
 
struct shared_allocator: allocator
{
  virtual void*
  allocate (size_t n)
  {
    return operator new (n, shared);
  }
};
 
extern shared_allocator* shared_alloc;
 
type x;
type* y = x.clone ();
shared_ptr<type> z = x.clone (shared_alloc);

If such a function is provided by third-party code then this modification may not be possible and the only general solution is to override the default operator new(size_t) to always allocate the counter. This will waste 2 * sizeof(size_t) bytes per allocation in the non-shared cases.

This approach is easy to use in the client code. Similar to the intrusive counter it is “naked” pointer compatible. There is no speed or memory overhead and the thread-safe version is normally implemented using atomic operations.

The final method of allocating the counter that I will discuss also tries to improve the memory overhead of the external counter approach. Instead of allocating the counter on the heap, this method uses the number of shared_ptr instances pointing to a particular object as an implicit reference counter. To achieve this all shared_ptr instances pointing to the same object are arranged into a doubly-linked, circular list. When a shared_ptr instance is copied, the copy adds itself to the list. When a shared_ptr instance is destroyed, it removes itself from the list. If it was the last entry in the list, it deletes the pointed-to object. To return the reference count for an object, the implementation counts the number of entries in the list.
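A minimal single-threaded sketch of this scheme (again with weak_ptr and assignment omitted):

```cpp
#include <cstddef>

template <typename T>
class shared_ptr
{
public:
  explicit shared_ptr (T* p = 0): p_ (p)
  {
    prev_ = next_ = this;
  }

  shared_ptr (const shared_ptr& x): p_ (x.p_)
  {
    // Insert this copy into the circular list right after x.
    prev_ = const_cast<shared_ptr*> (&x);
    next_ = x.next_;
    x.next_ = this;
    next_->prev_ = this;
  }

  ~shared_ptr ()
  {
    if (next_ == this)
      delete p_;          // last entry in the list
    else
    {
      prev_->next_ = next_;
      next_->prev_ = prev_;
    }
  }

  // The reference count is implicit: walk the list and count entries.
  std::size_t use_count () const
  {
    std::size_t n = 1;
    for (shared_ptr* i = next_; i != this; i = i->next_)
      ++n;
    return n;
  }

  T* operator-> () const { return p_; }

private:
  shared_ptr& operator= (const shared_ptr&); // not needed for this sketch

  T* p_;
  mutable shared_ptr* prev_;
  mutable shared_ptr* next_;
};
```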

This approach has the same advantages and disadvantages in the existing practices compatibility and ease of use areas as the external counter method (that is, it also requires weak_ptr). It eliminates the counter allocation overhead and therefore has better performance. The major disadvantage of this approach is the difficulty of providing an efficient implementation in a multi-threaded environment due to the linked list operations involved.

The following table summarizes the strong and weak points of each of the four counter placement approaches discussed above:

                    intrusive  external  object memory  implicit
existing practices  poor       best      good           best
ease of use         best       good      best           good
performance         best       poor      best           best
concurrency         best       best      best           poor