Do we need std::buffer?

Or, boost::buffer for starters?

A few days ago I was again wishing that there was a standard memory buffer abstraction in C++. I have already had to invent my own classes for XSD and XSD/e (XML Schema to C++ compilers) where they are used for mapping the XML Schema hexBinary and base64Binary types to C++. Now I have the same problem in ODB (an ORM system for C++) where I need a suitable C++ type for representing database BLOB types. This time I have decided against creating another copy of my own buffer class and instead use the poor man’s “standard” buffer, std::vector<char>, with its unnatural interface and all.

The abstraction I am wishing for is a simple class for encapsulating the memory management of a raw memory buffer plus providing a few common operations, such as memcpy, memset, etc. So instead of writing this:

class person
{
public:
  person (char* key_data, std::size_t key_size)
    : key_size_ (key_size)
  {
    key_data_ = new char[key_size];
    std::memcpy (key_data_, key_data, key_size);
  }
 
  ~person ()
  {
    delete[] key_data_;
  }
 
  ...
 
  char* key_data_;
  std::size_t key_size_;
};

Or having to create yet another custom buffer class, we could do this:

class person
{
public:
  person (char* key_data, std::size_t key_size)
    : key_ (key_data, key_size)
  {
  }
 
  ...
 
  std::buffer key_;
};

Above I called vector<char> a poor man’s “standard” buffer. But what exactly is wrong with using it to manage a memory buffer? While it works reasonably well functionally, the interface is unnatural and some operations may not be as efficient as we would expect from a memory buffer. Let’s examine the most prominent examples of these issues.

The first problem is with how we access the underlying memory. The C++ standard defect report (DR) 464 added the data() member function to std::vector which returns a pointer to the buffer. However, there are still compilers in use that do not support this, notably GCC 3.4 and VC++ 2008/9.0. As a result, if you want your code to be portable, you will need to use the much less intuitive &b.front() expression:

vector<char> b = ...
memcpy (out, &b.front (), b.size ());

There is also a subtle issue with using front(). While it appears to be legal to call data() on an empty buffer (as long as we don’t dereference the returned pointer), it is illegal to call front(). This means that you may have to handle an empty buffer as a special case, further complicating your code:

vector<char> b = ...
memcpy (out, (b.empty () ? 0 : &b.front ()), b.size ());
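
One way to keep this special case out of the calling code is to wrap it in a small helper. The following is only a sketch: buffer_data() and copy_out() are hypothetical names, not part of any library. It also skips the memcpy call entirely for an empty vector since, as far as I know, the C standard does not permit a null pointer argument there even when the size is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: pointer to the first byte, or a null pointer if
// the vector is empty. Never calls front() on an empty vector.
inline const char* buffer_data (const std::vector<char>& b)
{
  return b.empty () ? 0 : &b.front ();
}

// Hypothetical helper: copy the contents of b to out and return the
// number of bytes copied. out must have room for b.size () bytes.
inline std::size_t copy_out (char* out, const std::vector<char>& b)
{
  assert (out != 0 || b.empty ());

  // Skip memcpy for an empty vector so that it never sees a null source.
  if (!b.empty ())
    std::memcpy (out, buffer_data (b), b.size ());

  return b.size ();
}
```

With such helpers the caller can write copy_out (out, b) without caring whether b is empty.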

The initialization of a buffer is also inconvenient and potentially inefficient. Let’s say we want to have an uninitialized buffer of 1024 bytes which we plan to fill in later. There is no way to do that with vector<char>. The best we can do is to have every byte initialized:

vector<char> b (1024); // Zero-initialized buffer.
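
For comparison, here is a minimal sketch of what an uninitialized allocation could look like when we manage the memory ourselves. The raw_buffer class and the fill_from() function are hypothetical and exist only for this illustration. The key point is that new char[n] (without trailing parentheses) leaves the bytes uninitialized, so nothing is spent on zeroing memory that is about to be overwritten.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical minimal owner of an uninitialized block of bytes.
class raw_buffer
{
public:
  // new char[size] does not zero the memory; the contents are
  // indeterminate until we write to them.
  explicit raw_buffer (std::size_t size)
    : data_ (size != 0 ? new char[size] : 0), size_ (size) {}

  ~raw_buffer () { delete[] data_; }

  char* data () { return data_; }
  const char* data () const { return data_; }
  std::size_t size () const { return size_; }

private:
  raw_buffer (const raw_buffer&);            // Copying is omitted from
  raw_buffer& operator= (const raw_buffer&); // this sketch.

  char* data_;
  std::size_t size_;
};

// Fill the buffer later from some source. Copies at most b.size ()
// bytes and returns the number of bytes copied.
inline std::size_t fill_from (raw_buffer& b, const void* src, std::size_t n)
{
  assert (src != 0 || n == 0);

  std::size_t c (n < b.size () ? n : b.size ());

  if (c != 0)
    std::memcpy (b.data (), src, c);

  return c;
}
```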

If we want to create a buffer initialized with contents of a memory fragment, the interface we have to use is cumbersome:

vector<char> b (data, data + size);

What we want to write instead is this:

buffer b (data, size);

This initialization is also potentially inefficient. Depending on the quality of the implementation, std::vector may end up using a for loop instead of memcpy to copy the data. In fact, that’s exactly how it is done in GCC 4.5 and VC++ 2010/10.0 (Correction: as was pointed out in the comments, both GCC 4.5 and VC++ 10 optimize the case where the vector element is POD).

So I think it is quite clear that while vector<char> is workable, it is not particularly convenient or efficient.

Also, as it turns out this is not the first time I am playing with the idea of a dedicated buffer class in C++. A couple of months ago I started a thread on the Boost developer mailing list trying to see if there would be any interest in a simple buffer library in Boost. The result wasn’t very encouraging. The thread quickly splintered into discussions of various special-purpose, buffer-like data structures that people have in their applications.

On the other hand, I mentioned the buffer class at BoostCon 2011 to a couple of Boost users and got very positive responses, along the “If it were there we would use it!” lines. That’s when I got the idea of writing this article in an attempt to get feedback from the broader C++ community rather than from just the hard-core Boost developers (only they can withstand the boost-dev mailing list traffic).

While the above discussion should give you a pretty good idea about the kind of buffer class I am talking about, below I am going to show a proposed interface and provide a complete, header-only implementation (released under the Boost license), in case you would like to give it a try.

class buffer
{
public:
  typedef std::size_t size_type;
  static const size_type npos = -1;
 
  ~buffer ();
 
  explicit buffer (size_type size = 0);
  buffer (size_type size, size_type capacity);
  buffer (const void* data, size_type size);
  buffer (const void* data, size_type size, size_type capacity);
  buffer (void* data, size_type size, size_type capacity,
          bool assume_ownership);
 
  buffer (const buffer&);
  buffer& operator= (const buffer&);
 
  void swap (buffer&);
  char* detach ();
 
  void assign (const void* data, size_type size);
  void assign (void* data, size_type size, size_type capacity,
               bool assume_ownership);
  void append (const buffer&);
  void append (const void* data, size_type size);
  void fill (char value = 0);
 
  size_type size () const;
  bool size (size_type);
  size_type capacity () const;
  bool capacity (size_type);
  bool empty () const;
  void clear ();
 
  char* data ();
  const char* data () const;
 
  char& operator[] (size_type);
  char operator[] (size_type) const;
  char& at (size_type);
  char at (size_type) const;
 
  size_type find (char, size_type pos = 0) const;
  size_type rfind (char, size_type pos = npos) const;
 
private:
  char* data_;
  size_type size_;
  size_type capacity_;
  bool free_;
};
 
bool operator== (const buffer&, const buffer&);
bool operator!= (const buffer&, const buffer&);

Most of the interface should be self-explanatory. The last overloaded constructor allows us to create a buffer by reusing an existing memory block. If the assume_ownership argument is true, then the buffer object will free the memory using delete[]. The detach() function is the mirror side of this functionality in that it allows us to detach the underlying memory block and reuse it in some other way. After the call to detach() the buffer object becomes empty and we should eventually free the returned memory using delete[]. The size() and capacity() modifiers return true to indicate that the underlying buffer address has changed, in case we cached it somewhere.
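
To make the ownership and reallocation semantics described above more concrete, here is a reduced sketch covering only a few of the members. It is an illustration written for this discussion under my own assumptions (for example, that the capacity never shrinks and that a non-owned block is left alone on reallocation); it is not the header-only implementation mentioned earlier, and copying is left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative subset of the proposed interface only.
class buffer
{
public:
  typedef std::size_t size_type;

  explicit buffer (size_type size = 0)
    : data_ (size != 0 ? new char[size] : 0),
      size_ (size), capacity_ (size), free_ (true) {}

  buffer (const void* data, size_type size)
    : data_ (size != 0 ? new char[size] : 0),
      size_ (size), capacity_ (size), free_ (true)
  {
    if (size != 0)
      std::memcpy (data_, data, size);
  }

  // Reuse an existing block; free it with delete[] only if we own it.
  buffer (void* data, size_type size, size_type capacity,
          bool assume_ownership)
    : data_ (static_cast<char*> (data)),
      size_ (size), capacity_ (capacity), free_ (assume_ownership)
  {
    assert (size <= capacity);
  }

  ~buffer () { if (free_) delete[] data_; }

  // Hand the block over to the caller and become empty.
  char* detach ()
  {
    char* r (data_);
    data_ = 0;
    size_ = 0;
    capacity_ = 0;
    return r;
  }

  // Grow the capacity. Returns true if the block address has changed.
  bool capacity (size_type c)
  {
    if (c <= capacity_)
      return false;

    char* d (new char[c]);

    if (size_ != 0)
      std::memcpy (d, data_, size_);

    if (free_)
      delete[] data_;

    data_ = d;
    capacity_ = c;
    free_ = true; // The new block is always ours.
    return true;
  }

  // Change the size, growing the capacity if necessary.
  bool size (size_type s)
  {
    bool r (capacity (s));
    size_ = s;
    return r;
  }

  size_type size () const { return size_; }
  size_type capacity () const { return capacity_; }
  char* data () { return data_; }

private:
  buffer (const buffer&);            // Copying is omitted from
  buffer& operator= (const buffer&); // this sketch.

  char* data_;
  size_type size_;
  size_type capacity_;
  bool free_;
};
```

Code that caches b.data () would check the result of b.size (n) or b.capacity (n) and refresh the cached pointer when it is true.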

So, do you think we need something like this in Boost and perhaps in the C++ standard library? Do you like the proposed interface?

17 Responses to “Do we need std::buffer?”

  1. Adam Jordan Says:

    I find using std::string to be an effective means of doing many of these operations as well. It is still unnatural, but it works as well.

    Another note, the best way I find to get the start of the buffer is &vectorinstance[0]. This also works with std::string and is cross platform.

  2. Boris Kolpackov Says:

    Adam, the major problem with using std::string as a buffer is the special meaning of '\0'. In other words, you cannot store binary data that contains zero byte values in the string.

    And, yes, &b[0] and &b.front () are essentially the same, including the fact that you cannot do b[0] on an empty buffer/string.

  3. Arseny Kapoulkine Says:

    Boris, '\0' does not have a special meaning in std::string - it only has it for the c_str()-returned pointer (even then, I don’t think implementations are permitted to truncate string data in c_str(), i.e. [c_str(); c_str() + length()) should have the entire contents. Or you can use data()).

    So you can definitely store binary data in std::string, the question is - why would you want to :)

  4. foobar Says:

    Boris, who told you '\0' has any special meaning in std::string? You can store whatever you want in it, you are just guaranteed that c_str() will have that additional '\0' appended. You are free to have '\0' at any pos

  5. Anonymous Says:

    > std::vector may end up using a for loop instead of memcpy to copy the data. In fact, that’s exactly how it is done in GCC 4.5 and VC++ 2010/10.0

    Are you sure VC10 does not memcpy PODs? I remember Stephan T. Lavavej explained this implementation optimization detail in a Channel 9 video. Maybe that was limited to std::copy, can’t remember.

  6. Boris Kolpackov Says:

    Arseny, yes, I agree, with a bit of care you can use std::string to store binary data. I guess the bigger issue with std::string, that std::vector does not have, is the fact that you cannot modify the underlying buffer directly. For example, you cannot use the recv() socket function to receive the data directly into a buffer managed by std::string.

    However, this is not the point that I am trying to make in this post. The point is that we shouldn’t be forced to coerce these types into doing things for which they were not designed.

  7. Kim Gräsman Says:

    Looks interesting… Off the top of my head:

    > bool size (size_type);
    > bool capacity (size_type);

    These two feel a bit weird — I’d prefer resize(size_t) to size(size_t), but not sure what would be a good mutator for capacity… realloc? Do size and capacity need to be different for a buffer?

    find() seems redundant: you can always use std::find(b.data(), b.data() + b.size(), 'x').

    Would it make sense to expose pointer “iterators” with begin/end? There may be a number of std algorithms that apply to buffers directly, e.g. std::transform, std::find, etc.

    Also, making it movable for C++0x would be a nice touch, I think.

  8. Boris Kolpackov Says:

    I just re-checked the VC++ 10 implementation and it appears it does have the POD optimization.

  9. Norbert Says:

    std::vector<char> v(1024); initializes memory, ok. But what would
    std::vector<char> v; v.reserve(1024); do?

    OK, it’s more text to write, but wouldn’t that give you an uninitialized buffer of size 1024?

  10. Michael S. Says:

    I don’t particularly like the use of char* type to represent the data (buffer memory). You’re already using both char* and void* which is confusing.

    The buffer data is not a sequence of characters. If anything it is a sequence/block of bytes.

    I’ll suggest introducing a typedef to represent that, e.g.

    class buffer
    {
    public:
    typedef void * bufptr; // or bufdata or memdata or memptr or …

    };

  11. Marco Craveiro Says:

    Hi Boris,

    how does this buffer class relate to the Asio buffer:

    http://www.boost.org/doc/libs/1_47_0/boost/asio/buffer.hpp

    it would be nice to have only one…

    Cheers

    Marco

  12. Boris Kolpackov Says:

    Kim, thanks for the feedback on the buffer interface.

    I agree the size() and capacity() modifiers are a bit unconventional. The alternatives would be to follow the std container/string naming and call them resize() and reserve(). I think it is a good idea to have both size and capacity; the buffer may contain only so much data (size) but you may want to allocate a larger chunk of memory than what is currently needed (capacity) in order to be able to grow without reallocations.

    Regarding the find() redundancy, I disagree: b.find('x') is terser than find(b.data(), b.data() + b.size(), 'x'). Also std::string has find() even though you can use std::find() with it as well.

    Regarding iterators, yes, I initially added them using the raw pointer as the iterator type. This may not be the preferred approach for std containers so it may have to be wrapped into a class-type. But I agree, iterators are a good idea.

    Same about movable: the C++0x version should definitely support this. I wonder if there is a portable way to detect the C++0x mode at compile time?

  13. Boris Kolpackov Says:

    Norbert, no,

    std::vector<char> v;
    v.reserve(1024);

    will give a vector of size 0 and of capacity 1024 (or greater). Strictly speaking, you still cannot access the underlying buffer because the vector is empty. Also, if later you do something like v.resize(1024) or v.insert(v.end(), data, data + size), all your data will be overwritten.

  14. Boris Kolpackov Says:

    Michael, I disagree. char* is an idiomatic way to represent a sequence of bytes in C and C++ programs. The buffer interface only uses void* as input type in order to allow you to populate the buffer with data of some other type without requiring an explicit conversion.

    Changing the underlying buffer type from char* to void* would make using the buffer much less convenient. For example, to get the pointer to the 10th byte, now we can write:

    char* p = b.data () + 10;

    If the underlying buffer type were void*, we would have had to write this instead:

    void* p = static_cast<char*> (b.data ()) + 10;

    Also note that char* will always implicitly convert to void* so if you want you can use the buffer class as if the underlying buffer type were void*:

    void* p = b.data ();

  15. Boris Kolpackov Says:

    Marco, boost::asio::buffer is a wrapper (it is actually a function, not a class) that gives a unified (but limited) interface to various underlying buffer representations (e.g., {void*, size_t} tuple, std::vector, etc) so that they can all be used with ASIO. In particular, it does not manage the underlying memory. So in this sense, the buffer class I am proposing would be just one of the underlying buffers that boost::asio::buffer would wrap:

    buffer b(128);
    sock.receive(boost::asio::buffer(b));

  16. Kim Gräsman Says:

    Boris,

    reserve, that’s it! Makes sense to follow that convention, I think.

    I don’t really see why I would use a buffer over a vector if I needed a growable thing, but you may have a point.

    std::string is often criticized for its excessively wide interface, so I was trying to save you from that ;-)

    Another thing occurred to me: detach() sounds a lot like auto_ptr’s release(). I like detach better, but it feels like a Microsoftism; their ATL library uses it consistently.

    For what it’s worth,
    - Kim

  17. loodot_chris Says:

    i agree that char * is a good approach.