shared_ptr: where is the counter

There are two common ways of allocating a reference counter that is used to track object ownership in a smart pointer, such as shared_ptr. The oldest method is to place the counter into the object itself, usually via a special base class. Such a counter is sometimes called intrusive. The more recent approach which is used, for example, in Boost, is to allocate the counter on the heap independently from the object. There are also two less commonly used methods of allocating the counter. The first is to allocate the counter as part of the memory block used for the object itself. The second uses the number of automatic pointers pointing to the object as an implicit reference count. Each of the four approaches has a number of drawbacks which I will examine in more detail below. The last two methods pose some implementation challenges which are also discussed below.

We will use the following four main criteria to evaluate each approach: existing practices compatibility, ease of use, performance, and concurrency support. By existing practices compatibility I mean changes that a user may have to make to the design and implementation of their classes in order to support shared_ptr. The ease of use criterion refers to the complexity of rules governing the use of shared_ptr. Performance refers to memory and speed overhead incurred by the sharing semantics. Finally, concurrency support examines how well each approach can be implemented in a multi-threaded environment.

We will start with the intrusive counter approach in which the reference counter is part of the object itself. The following code snippet outlines the essence of this method:

struct shared_base
{
  void inc_ref ();
  void dec_ref ();
 
private:
  size_t counter_;
};
 
class type: public shared_base
{
  ...
};
 
shared_ptr<type> x (new type);

The existing practices compatibility of this approach is quite poor. Each class that needs to be shared has to inherit from shared_base which may be impossible to do for third-party code. Furthermore, the designer of a class now needs to decide whether its users may want to share the objects of this class. If such a class can be used for both shared objects and for non-shared or stack-allocated objects, in the latter two cases the counter will be a waste of space.

The intrusive counter approach is easy to use in the client code. In particular, this approach is “naked” pointer compatible. That is, the object that is referenced by two or more shared_ptr instances can be passed around as a normal pointer, for example:

shared_ptr<type> x (new type); // counter_ == 1
shared_ptr<type> y (x);        // counter_ == 2
type* p = x.release ();        // counter_ == 2
shared_ptr<type> z (p);        // counter_ == 2

There is no speed or memory overhead in this approach (except for non-shared and stack-allocated objects, as mentioned above). The thread-safe version is normally implemented using atomic operations.

Allocating the counter on the heap independently from the object has good existing practices compatibility. There is no need to make any special changes to classes and third party code can be used without any restrictions. The only problem arises when a class implementation needs to manipulate the reference count. For example, a class that needs to register itself in a map that uses shared_ptr.

The external counter approach is easy to use in the client code. However, using “naked” pointers for certain operations can cause problems since there is no way to obtain the counter from a normal pointer to the object. To overcome this, shared_ptr with an external counter usually does not provide release() and instead provides the weak_ptr class template to emulate normal pointers. This can complicate the client code.

There is a substantial speed and memory overhead in this approach since the counter requires a separate allocation on the heap. This can be a serious problem for applications that allocate a large number of relatively small objects. In such situations the counter overhead can become comparable to that of the object since heap implementations usually allocate a memory block of certain minimum size even if the requested size is smaller. The thread-safe version is normally implemented using atomic operations.

The logical improvement to the previous approach’s inefficiency is to somehow allocate the counter in the same memory block as the object itself. For example, we could overload operator new which will allocate an extra space and place the counter before the memory block occupied by the object. The shared_ptr class template could then offset the pointer to object to access the counter. The following example shows how we could have used such an implementation:

shared_ptr<type> x (new (shared) type);

There are a number of challenges that such an implementation would have to overcome when locating the counter based on the object pointer. The problem with simply offsetting the passed pointer is that it may not point to the beginning of the memory region occupied by the object. Consider the following code as an example:

struct base1
{
  int a;
};
 
struct base2
{
  int b;
};
 
struct derived: base1, base2
{
};
 
int main ()
{
  derived d;
  base1* b1 = &d;
  base2* b2 = &d;
 
  cerr << "derived : " << &d << endl
       << "base1   : " << b1 << endl
       << "base2   : " << b2 << endl;
}

On my machine the above example prints:

derived : 0xffb3ba34
base1   : 0xffb3ba34
base2   : 0xffb3ba38

As you can see, the base2 object is offset 4 bytes from the start of the memory region occupied by the derived object. Consider now what happens if the user writes the following and we simply offset the passed pointer to get to the counter:

shared_ptr<base2> x (new (shared) derived);

What we need then is a way to get to the top of the memory region occupied by an object even if we are given a pointer to one of its base sub-objects. One interesting and relatively unknown feature of dynamic_cast can help us with this. The standard specifies that dynamic_cast to void* returns a pointer to the most derived object pointed to by the argument. While this is exactly what we need, dynamic_cast only works on polymorphic types (classes that have virtual functions).

What about non-polymorphic types? When we pass a base pointer to shared_ptr, the object will get deleted via this base pointer. Unless the base type has a virtual destructor (which then makes it a polymorphic type), this operation has undefined behavior since the compiler has no way of calling the derived destructor as well as finding where the memory block, occupied by the object, starts. In other words, in case of non-polymorphic types, it is only valid to use the most-derived type to instantiate shared_ptr. And in that case we don’t need dynamic_cast and can offset the pointer directly.

Another potential problem with using dynamic_cast arises when the class tries to manipulate the counter in its constructor. During the base type construction, the dynamic_cast mechanism is not yet fully functional and an attempt to get a pointer to the most derived object will fail. In this case, however, the class is specifically designed to be shared and reference counted (since the implementation manipulates the counter) and therefore intrusive reference counting can be used without any disadvantages.

The handling of polymorphic and non-polymorphic objects as well as those with intrusive counters can be done transparently with a little bit of meta-programming, as shown in the sample implementation.

The existing practices compatibility of this approach is fairly good since no changes are required to classes to support reference counting. The only problem arises when an object to be shared is allocated by a function (e.g., by a factory object or the clone() function). In this case the function needs to be extended to support sharing, for example:

class type
{
  virtual type*
  clone (share s = exclusive) const
  {
    return new (s) type (*this);
  }
};
 
type x;
type* y = x.clone ();
shared_ptr<type> z = x.clone (shared);

Or, if this interface is to be used by third-party code, something more generic may be more appropriate, for example:

struct allocator
{
  virtual void*
  allocate (size_t) = 0;
};
 
class type
{
  virtual type*
  clone (allocator* a = 0) const
  {
    return a
      ? new (a->allocate ()) type (*this)
      : new type (*this);
  }
};
 
struct shared_allocator
{
  virtual void*
  allocate (size_t n)
  {
    return operator new (n, shared);
  }
};
 
extern shared_allocator* shared_alloc;
 
type x;
type* y = x.clone ();
shared_ptr<type> z = x.clone (shared_alloc);

If such a function is provided by a third-party code then this modification may not be possible and the only general solution is to override default operator new(size_t) to always allocate the counter. This will waste 2 * sizeof(size_t) bytes per allocation for the non-shared cases.

This approach is easy to use in the client code. Similar to the intrusive counter it is “naked” pointer compatible. There is no speed or memory overhead and the thread-safe version is normally implemented using atomic operations.

The final method of allocating the counter that I will discuss also tries to improve the memory overhead of the external counter approach. Instead of allocating the counter on the heap, this method uses the number of shared_ptr instances pointing to a particular object as an implicit reference counter. To achieve this all shared_ptr instances pointing to the same object are arranged into a doubly-linked, circular list. When a shared_ptr instance is copied, the copy adds itself to the list. When a shared_ptr instance is destroyed, it removes itself from the list. If it was the last entry in the list, it deletes the pointed-to object. To return the reference count for an object, the implementation counts the number of entries in the list.

This approach has the same advantages and disadvantages in the existing practices compatibility and ease of use areas as the external counter method (that is, it also requires weak_ptr). It eliminates the counter allocation overhead and therefore has better performance. The major disadvantage of this approach is the difficulty of providing an efficient implement in a multi-threaded environment due to the linked list operations involved.

The following table summarizes the strong and weak points of each of the four counter placement approaches discussed above:

intrusive external object memory implicit
existing practices poor best good best
ease of use best good best good
performance best poor best best
concurrency best best best poor

Comments are closed.