Archive for January, 2009

shared_ptr: where is the counter

Sunday, January 11th, 2009

There are two common ways of allocating the reference counter that tracks object ownership in a smart pointer such as shared_ptr. The oldest method is to place the counter into the object itself, usually via a special base class; such a counter is called intrusive. The more recent approach, used, for example, in Boost, is to allocate the counter on the heap independently of the object. There are also two less commonly used methods. The first is to allocate the counter as part of the memory block used for the object itself. The second uses the number of smart pointer instances pointing to the object as an implicit reference count. Each of the four approaches has a number of drawbacks, which I examine in more detail below. The last two methods also pose some implementation challenges, which are discussed as well.

We will use the following four main criteria to evaluate each approach: existing practices compatibility, ease of use, performance, and concurrency support. By existing practices compatibility I mean changes that a user may have to make to the design and implementation of their classes in order to support shared_ptr. The ease of use criterion refers to the complexity of rules governing the use of shared_ptr. Performance refers to memory and speed overhead incurred by the sharing semantics. Finally, concurrency support examines how well each approach can be implemented in a multi-threaded environment.

We will start with the intrusive counter approach in which the reference counter is part of the object itself. The following code snippet outlines the essence of this method:

struct shared_base
{
  void inc_ref ();
  void dec_ref ();

  size_t counter_;
};

class type: public shared_base
{
  ...
};

shared_ptr<type> x (new type);

The existing practices compatibility of this approach is quite poor. Each class that needs to be shared has to inherit from shared_base which may be impossible to do for third-party code. Furthermore, the designer of a class now needs to decide whether its users may want to share the objects of this class. If such a class can be used for both shared objects and for non-shared or stack-allocated objects, in the latter two cases the counter will be a waste of space.

The intrusive counter approach is easy to use in the client code. In particular, this approach is “naked” pointer compatible. That is, the object that is referenced by two or more shared_ptr instances can be passed around as a normal pointer, for example:

shared_ptr<type> x (new type); // counter_ == 1
shared_ptr<type> y (x);        // counter_ == 2
type* p = x.release ();        // counter_ == 2
shared_ptr<type> z (p);        // counter_ == 2

There is no speed or memory overhead in this approach (except for non-shared and stack-allocated objects, as mentioned above). The thread-safe version is normally implemented using atomic operations.
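To make the atomic variant concrete, here is a minimal sketch of a thread-safe intrusive counter using C++11 std::atomic. The names shared_base, inc_ref, and dec_ref follow the snippet above; the atomic details and the bool return from dec_ref (signaling the last reference) are my assumptions, not part of the original design:

```cpp
#include <atomic>
#include <cstddef>

struct shared_base
{
  shared_base (): counter_ (1) {}
  virtual ~shared_base () {}

  void inc_ref ()
  {
    // Incrementing an existing reference needs no ordering guarantees.
    counter_.fetch_add (1, std::memory_order_relaxed);
  }

  // Returns true if this was the last reference, in which case the
  // caller is responsible for deleting the object.
  bool dec_ref ()
  {
    return counter_.fetch_sub (1, std::memory_order_acq_rel) == 1;
  }

  std::atomic<std::size_t> counter_;
};
```

The acquire-release ordering on the decrement ensures that all writes to the object made by other threads are visible to the thread that performs the final decrement and destroys the object.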

Allocating the counter on the heap independently from the object has good existing practices compatibility. There is no need to make any special changes to classes, and third-party code can be used without any restrictions. The only problem arises when a class implementation needs to manipulate the reference count itself, for example, to register the object in a map that stores shared_ptr instances.

The external counter approach is easy to use in the client code. However, using “naked” pointers for certain operations can cause problems since there is no way to obtain the counter from a normal pointer to the object. To overcome this, shared_ptr with an external counter usually does not provide release() and instead provides the weak_ptr class template to emulate normal pointers. This can complicate the client code.

There is a substantial speed and memory overhead in this approach since the counter requires a separate allocation on the heap. This can be a serious problem for applications that allocate a large number of relatively small objects. In such situations the counter overhead can become comparable to that of the object since heap implementations usually allocate a memory block of certain minimum size even if the requested size is smaller. The thread-safe version is normally implemented using atomic operations.
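The essence of the external counter approach can be sketched as follows. This is a simplified illustration, not Boost's actual implementation: it omits weak_ptr support, custom deleters, and thread safety, but it shows the separate heap allocation that causes the overhead discussed above:

```cpp
#include <cstddef>

template <typename T>
class shared_ptr
{
public:
  explicit shared_ptr (T* p = 0)
      : p_ (p), counter_ (new std::size_t (1)) {}

  shared_ptr (const shared_ptr& x)
      : p_ (x.p_), counter_ (x.counter_)
  {
    ++*counter_;
  }

  ~shared_ptr ()
  {
    if (--*counter_ == 0)
    {
      delete p_;
      delete counter_; // The counter is a separate heap allocation.
    }
  }

  std::size_t use_count () const { return *counter_; }
  T* get () const { return p_; }

private:
  shared_ptr& operator= (const shared_ptr&); // Not implemented.

  T* p_;
  std::size_t* counter_;
};
```

Every construction from a raw pointer performs a second heap allocation just for the counter, which is where the per-object overhead for small objects comes from.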

The logical improvement to the previous approach’s inefficiency is to allocate the counter in the same memory block as the object itself. For example, we could overload operator new to allocate extra space and place the counter just before the memory occupied by the object. The shared_ptr class template could then offset the object pointer to access the counter. The following example shows how such an implementation could be used:

shared_ptr<type> x (new (shared) type);
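One way such an operator new overload could look is sketched below. The share tag type and the fixed 2 * sizeof (size_t) offset (mentioned later in the text) are assumptions for illustration; a real implementation would have to deal with alignment more carefully:

```cpp
#include <cstddef>
#include <new>

enum share { shared }; // Only the shared tag is modeled in this sketch.

// Allocate room for the counter in front of the object. The offset of
// 2 * sizeof (std::size_t) is assumed to preserve the object's
// alignment on common platforms.
void* operator new (std::size_t n, share)
{
  const std::size_t offset = 2 * sizeof (std::size_t);
  char* block = static_cast<char*> (::operator new (n + offset));
  *reinterpret_cast<std::size_t*> (block) = 1; // Initial reference count.
  return block + offset;
}

// Matching placement delete, called if the constructor throws.
void operator delete (void* p, share)
{
  ::operator delete (static_cast<char*> (p) - 2 * sizeof (std::size_t));
}
```

With this overload, new (shared) type returns a pointer that sits 2 * sizeof (std::size_t) bytes past the start of the allocated block, and shared_ptr can reach the counter by subtracting that offset.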

There are a number of challenges that such an implementation would have to overcome when locating the counter based on the object pointer. The problem with simply offsetting the passed pointer is that it may not point to the beginning of the memory region occupied by the object. Consider the following code as an example:

struct base1
{
  int a;
};

struct base2
{
  int b;
};

struct derived: base1, base2
{
};

int main ()
{
  derived d;

  base1* b1 = &d;
  base2* b2 = &d;

  cerr << "derived : " << &d << endl
       << "base1   : " << b1 << endl
       << "base2   : " << b2 << endl;
}

On my machine the above example prints:

derived : 0xffb3ba34
base1   : 0xffb3ba34
base2   : 0xffb3ba38

As you can see, the base2 object is offset 4 bytes from the start of the memory region occupied by the derived object. Consider now what happens if the user writes the following and we simply offset the passed pointer to get to the counter:

shared_ptr<base2> x (new (shared) derived);

Here the pointer passed to shared_ptr refers to the base2 sub-object, which lies 4 bytes past the start of the memory block, so simply offsetting it would miss the counter.

What we need, then, is a way to get to the top of the memory region occupied by an object even if we are given a pointer to one of its base sub-objects. One interesting and little-known feature of dynamic_cast can help us here. The standard specifies that dynamic_cast to void* returns a pointer to the most derived object pointed to by the argument. While this is exactly what we need, dynamic_cast only works on polymorphic types (classes that have virtual functions).
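The following self-contained snippet demonstrates this property of dynamic_cast; the class names mirror the earlier example, with virtual destructors added to make the types polymorphic:

```cpp
struct base1 { virtual ~base1 () {} int a; };
struct base2 { virtual ~base2 () {} int b; };
struct derived: base1, base2 {};

// Given a pointer to any polymorphic base sub-object, recover the
// address of the most derived object it belongs to.
void* most_derived (base2* p)
{
  return dynamic_cast<void*> (p);
}
```

Even though a base2* points into the middle of a derived object, dynamic_cast<void*> returns the address of the complete object, which is exactly where the counter-locating logic needs to start from.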

What about non-polymorphic types? When we pass a base pointer to shared_ptr, the object will be deleted via this base pointer. Unless the base type has a virtual destructor (which would make it a polymorphic type), this operation has undefined behavior, since the compiler has no way of calling the derived destructor or of finding where the memory block occupied by the object starts. In other words, for non-polymorphic types it is only valid to instantiate shared_ptr with the most-derived type. And in that case we don’t need dynamic_cast and can offset the pointer directly.

Another potential problem with using dynamic_cast arises when the class tries to manipulate the counter in its constructor. During the base type construction, the dynamic_cast mechanism is not yet fully functional and an attempt to get a pointer to the most derived object will fail. In this case, however, the class is specifically designed to be shared and reference counted (since the implementation manipulates the counter) and therefore intrusive reference counting can be used without any disadvantages.

The handling of polymorphic and non-polymorphic objects as well as those with intrusive counters can be done transparently with a little bit of meta-programming, as shown in the sample implementation.
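One way this dispatch could be expressed, sketched here with std::is_polymorphic from C++11 (the actual sample implementation referenced above may use different machinery):

```cpp
#include <type_traits>

// For polymorphic types, dynamic_cast<void*> finds the most derived
// object; this overload is only instantiated for such types.
template <typename T>
void* find_top (T* p, std::true_type /* polymorphic */)
{
  return dynamic_cast<void*> (p);
}

// For non-polymorphic types the passed pointer must already refer to
// the most derived object, so it can be used directly.
template <typename T>
void* find_top (T* p, std::false_type /* non-polymorphic */)
{
  return p;
}

// Dispatch on whether T is polymorphic at compile time.
template <typename T>
void* find_top (T* p)
{
  return find_top (p, typename std::is_polymorphic<T>::type ());
}
```

Because the tag-dispatched overloads are separate templates, the dynamic_cast expression is never instantiated for non-polymorphic types, for which it would not compile.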

The existing practices compatibility of this approach is fairly good since no changes are required to classes to support reference counting. The only problem arises when an object to be shared is allocated by a function (e.g., by a factory object or the clone() function). In this case the function needs to be extended to support sharing, for example:

class type
{
public:
  virtual type*
  clone (share s = exclusive) const
  {
    return new (s) type (*this);
  }
};

type x;
type* y = x.clone ();
shared_ptr<type> z (x.clone (shared));

Or, if this interface is to be used by third-party code, something more generic may be more appropriate, for example:

struct allocator
{
  virtual void*
  allocate (size_t) = 0;
};

class type
{
public:
  virtual type*
  clone (allocator* a = 0) const
  {
    return a
      ? new (a->allocate (sizeof (type))) type (*this)
      : new type (*this);
  }
};

struct shared_allocator: allocator
{
  virtual void*
  allocate (size_t n)
  {
    return operator new (n, shared);
  }
};

extern shared_allocator* shared_alloc;

type x;
type* y = x.clone ();
shared_ptr<type> z (x.clone (shared_alloc));

If such a function is provided by third-party code, this modification may not be possible, and the only general solution is to override the default operator new(size_t) to always allocate the counter. This wastes 2 * sizeof(size_t) bytes per allocation in the non-shared case.

This approach is easy to use in the client code. Similar to the intrusive counter it is “naked” pointer compatible. There is no speed or memory overhead and the thread-safe version is normally implemented using atomic operations.

The final method of allocating the counter that I will discuss also tries to improve the memory overhead of the external counter approach. Instead of allocating the counter on the heap, this method uses the number of shared_ptr instances pointing to a particular object as an implicit reference counter. To achieve this all shared_ptr instances pointing to the same object are arranged into a doubly-linked, circular list. When a shared_ptr instance is copied, the copy adds itself to the list. When a shared_ptr instance is destroyed, it removes itself from the list. If it was the last entry in the list, it deletes the pointed-to object. To return the reference count for an object, the implementation counts the number of entries in the list.
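The linked-list technique can be sketched as follows. This is a simplified, single-threaded illustration (the name linked_ptr and the details are assumptions); it omits weak_ptr support and assignment:

```cpp
#include <cstddef>

template <typename T>
class linked_ptr
{
public:
  explicit linked_ptr (T* p = 0): p_ (p)
  {
    next_ = prev_ = this; // A circular list of one.
  }

  linked_ptr (const linked_ptr& x): p_ (x.p_)
  {
    // Insert ourselves into the circular list right after x.
    next_ = x.next_;
    prev_ = const_cast<linked_ptr*> (&x);
    x.next_->prev_ = this;
    x.next_ = this;
  }

  ~linked_ptr ()
  {
    if (next_ == this)
      delete p_;            // We were the last entry in the list.
    else
    {
      prev_->next_ = next_; // Unlink ourselves from the list.
      next_->prev_ = prev_;
    }
  }

  // The reference count is implicit: count the list entries.
  std::size_t use_count () const
  {
    std::size_t n = 1;
    for (const linked_ptr* i = next_; i != this; i = i->next_)
      ++n;
    return n;
  }

  T* get () const { return p_; }

private:
  linked_ptr& operator= (const linked_ptr&); // Not implemented.

  T* p_;
  mutable linked_ptr* next_;
  mutable linked_ptr* prev_;
};
```

Note that copying and destroying a pointer each touch two neighboring list nodes, which is precisely why making this approach both lock-free and efficient in a multi-threaded environment is hard.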

This approach has the same advantages and disadvantages as the external counter method in the existing practices compatibility and ease of use areas (that is, it also requires weak_ptr). It eliminates the counter allocation overhead and therefore has better performance. The major disadvantage of this approach is the difficulty of providing an efficient implementation in a multi-threaded environment due to the linked list operations involved.

The following table summarizes the strong and weak points of each of the four counter placement approaches discussed above:

                     intrusive   external   object memory   implicit
existing practices   poor        best       good            best
ease of use          best        good       best            good
performance          best        poor       best            best
concurrency          best        best       best            poor

Fun and dull of making development tools

Sunday, January 4th, 2009

Writing software development tools, such as compilers and IDEs, is often considered the top of the software development food chain. You get to create products for like-minded people who are able to appreciate elegant design and quality implementation. As a result, the development tools business is highly competitive and attracts the best minds in the field.

From the outside it may seem that creating this type of software is pure fun. On the inside, however, there are a lot of dull moments that lead to a quality release. Without things like testing, examples, and documentation, the product will be buggy and nobody except the authors will know how to use it. A couple of weeks ago I was working on a feature of medium complexity for CodeSynthesis XSD and decided to capture a high-level list of all the things I had to take care of. This should give anyone a practical idea of the amount of detail involved in creating software development tools.

  1. First I gathered the use-cases and requirements from each interested customer. In this case I had two.
  2. Then I generalized the requirements and came up with an overall design. Most of the time the solutions proposed by customers only cover their particular use-case. I had to think of other similar things that people may want to do and come up with a general enough design to cover all reasonable scenarios. This is also why I prefer to wait until at least two people ask for any non-trivial feature. This way you get two “projections” which makes it easier to “see” the general feature behind them.
  3. After that I ran the overall design by each interested customer to make sure they were satisfied with the solution. This took the form of sample code fragments and explanations that showed how the proposed feature covers their use-cases.
  4. Once all the parties were satisfied with the overall design I had to think it through in more detail. In particular, I had to consider how this feature would interact with other features already in the product. In this case I needed to consider support for polymorphism and customizable naming conventions. While it turned out support for polymorphism didn’t require anything extra, I needed to make changes to the naming convention code.
  5. Once the design had been thought through in detail, it was time to write the code. This involved adding new command line options, updating the name processor, and adding support for the new feature in the code generators.
  6. Once the feature was implemented, I had to add a test to the test suite to make sure everything worked as expected.
  7. Next I added an example which shows how to use the most important aspects of the new feature. Besides implementing the sample code itself, creating an example involves the following main tasks (we have a “new example” checklist):
    1. Write a README file describing the example.
    2. Write a Makefile for the UNIX build system.
    3. Create VC++ 7.1, 8.0, and 9.0 project/solution files.
  8. Since new command line options were added, I needed to update the usage information and man pages.
  9. I then added a section to the User Manual documenting the new feature as well as a note to the Getting Started Guide with a reference to the new section in the Manual.
  10. I then added the information about the new feature to the NEWS file with references to the man pages, the new section in the Manual, and the new example.
  11. Finally, I built pre-release binaries of the product for the two customers so that they could test and/or start using the new feature.