Archive for the ‘ORM’ Category

Extended Database to C++ Type Mapping in ODB

Wednesday, July 18th, 2012

When it comes to development tools, I view the features they provide as being of two kinds. The majority are of the first kind: they simply do something useful for the user of the tool. But the ones I really like are features that help people help themselves in ways I might not have foreseen. The upcoming ODB 2.1.0 release has just such a feature.

In case you are not familiar with ODB, it is an object-relational mapping (ORM) system for C++. It allows you to persist C++ objects to a relational database without having to deal with tables, columns, or SQL, and without manually writing any of the mapping code. ODB natively supports SQLite, PostgreSQL, MySQL, Oracle, and Microsoft SQL Server.

To understand this new feature let’s first get some background on the problem. As you probably know, these days all relational databases support pretty much the same set of “core” SQL data types: integers, floating point types, strings, binary, date-time, etc. Each database, of course, has its own names for these types, but they provide more or less the same functionality across all the vendors. For each database ODB provides native support for all the core SQL types. By native I mean that the data is exchanged with the database in the most efficient, binary format. ODB also allows you to map any core SQL type to any C++ type, so we can map TEXT to std::string, QString, or my_string (the first two mappings are provided by default).
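
For example, to store a string member in a column of the TEXT type, all we have to do is specify the database type for that member (a minimal sketch; the person class and its members are made up for illustration):

#include <string>

#pragma db object
class person
{
public:
  #pragma db id auto
  unsigned long id;

  // Ask ODB to use the TEXT SQL type for this member instead of
  // the default mapping for std::string.
  #pragma db type("TEXT")
  std::string bio;
};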

This all sounds nice and simple and that would have been the end of the story if all that modern databases supported were core SQL types. However, most modern databases also support a slew of extended SQL types. Things like spatial types, user-defined types, arrays, XML, the kitchen sink, etc, etc (Ok, I don’t think any database supports that last one, yet). Here is a by no means complete list that should give you an idea about the vast and varying set of extended types available in each database supported by ODB:

MySQL
  • Spatial types (GEOMETRY, GEOGRAPHY)
SQLite
  • NUMERIC
  • Spatial types (GEOMETRY, GEOGRAPHY) [spatialite extension]
PostgreSQL
  • NUMERIC
  • XML
  • JSON
  • HSTORE (key-value store) [hstore extension]
  • Geometric types
  • Network address types
  • Enumerated types
  • Arrays
  • Range types
  • Composite types
  • Spatial types (GEOMETRY, GEOGRAPHY) [PostGIS extension]
Oracle
  • ANY
  • XML
  • MEDIA
  • Arrays (VARRAY, table type)
  • User-defined types
  • Spatial types (GEOMETRY, GEOGRAPHY)
SQL Server
  • XML
  • Alias types
  • CLR types
  • Spatial types (GEOMETRY, GEOGRAPHY)

When people first started using ODB, the core SQL types were sufficient. But now, as projects become more ambitious, we have started getting questions about using extended SQL types in ODB. For example, ODB will handle std::vector<int> for us, but it will do it in a portable manner, which means it will create a separate, JOINed table to store the vector elements. On the other hand, if we are using PostgreSQL, it would be much cleaner to map it to a single column of the integer array type (INTEGER[]). Clearly we needed a way to support extended SQL types in ODB.
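
For reference, the portable version above corresponds to a persistent class along these lines (a sketch; without any extra pragmas ODB stores the vector elements in a separate table joined to the object table by its id):

#include <vector>

#pragma db object
class object
{
public:
  #pragma db id auto
  unsigned long id;

  // Mapped to a separate, JOINed table rather than to a single column.
  std::vector<int> array;
};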

The straightforward way to add this support would have been to handle extended types the same way we handle the core ones. That is, for each type implement a mapping that uses native database format. However, as types become more complex (e.g., arrays, user-defined types) so do the methods used to access them in the database-native format. In fact, for some databases, this format is not even documented and the only way to understand how things are represented is to study the database source code!

So the straightforward way appears to be very laborious and not very robust. What other options do we have? The idea that is implemented in ODB came from the way the OpenGIS specification handles reading and writing of spatial values (GEOMETRY, GEOGRAPHY). OpenGIS specifies the Well-Known Text (WKT) and Well-Known Binary (WKB) formats for representing spatial values. For example, point(10, 20) in WKT is represented as the "POINT(10 20)" string. Essentially, what OpenGIS did is define an interface for the spatial SQL types in terms of one of the core SQL types (text or binary). OpenGIS also defines a pair of functions for converting between, say, WKT and GEOMETRY values (GeomFromText/AsText).

As it turns out, this idea of interfacing with an extended SQL type using one of the core ones can be used to handle pretty much any extended type mentioned in the list above. In the vast majority of cases all we need to do is cast one value to another.

So in order to support extended SQL types, ODB allows us to map them to one of the core SQL types, normally a string or a binary. Given the text or binary representation of the data we can then extract it into our chosen C++ data type and thus establish a mapping between an extended database type and its C++ equivalent.

The mapping between an extended type and a core SQL type is established with the map pragma:

#pragma db map type(regex) as(subst) to(subst) from(subst)

The type clause specifies the name of the database type that we are mapping, which we will call the mapped type from now on. The as clause specifies the name of the database type that we are mapping the mapped type to. We will call it the interface type from now on. The optional to and from clauses specify the database conversion expressions between the mapped type and the interface type. They must contain the special (?) placeholder which will be replaced with the actual value to be converted.

The name of the mapped type is actually a regular expression pattern so we can match a class of types, instead of just a single name. We will see how this can be useful in a moment. Similarly, the name of the interface type as well as the to/from conversion expressions are actually regex pattern substitutions.

Let’s now look at a concrete example that shows how all this fits together. Earlier I mentioned std::vector<int> and how it would be nice to map it to PostgreSQL INTEGER[] instead of creating a separate table. Let’s see what it takes to arrange such a mapping.

In PostgreSQL the array literal has the {n1,n2,...} form. As it turns out, if we cast an array to TEXT, then we will get a string in exactly this format. Similarly, Postgres is happy to convert a string in this form back to an array with a simple cast. With this knowledge, we can take a stab at the mapping pragma:

#pragma db map type("INTEGER\\[\\]") \
               as("TEXT")            \
               to("(?)::INTEGER[]")  \
               from("(?)::TEXT")

In plain English this pragma essentially says this: map INTEGER[] to TEXT. To convert from TEXT to INTEGER[], cast the value to INTEGER[]. To convert the other way, cast the value to TEXT. exp::TEXT is a shorter, Postgres-specific notation for CAST(exp AS TEXT).

The above pragma will do the trick if we always spell the type as INTEGER[]. However, INTEGER [] or INTEGER[123] are also valid. If we want to handle all the one-dimensional arrays of integers, then the regex support I mentioned above comes in very handy:

#pragma db map type("INTEGER *\\[(\\d*)\\]") \
               as("TEXT")                    \
               to("(?)::INTEGER[$1]")        \
               from("(?)::TEXT")

With the above pragma we can now have a persistent class that contains std::vector<int> mapped to INTEGER[]:

// test.hxx
//
#ifndef TEST_HXX
#define TEST_HXX
 
#include <vector>
 
#pragma db map type("INTEGER *\\[(\\d*)\\]") \
               as("TEXT")                    \
               to("(?)::INTEGER[$1]")        \
               from("(?)::TEXT")
 
#pragma db object
class object
{
public:
  #pragma db id auto
  unsigned long id;
 
  #pragma db type("INTEGER[]")
  std::vector<int> array;
};
#endif

Ok, that’s one half of the puzzle. The other half is to implement the conversion between std::vector<int> and the "{n1,n2,...}" text representation. For that we need to provide a value_traits specialization for the std::vector<int> C++ type and the TEXT PostgreSQL type. value_traits is the ODB customization mechanism I mentioned earlier that allows us to map any C++ type to any core SQL type. Here is a sample implementation which should be pretty easy to follow. I’ve instrumented it with a few print statements so that we can see what’s going on at runtime.

// traits.hxx
//
#ifndef TRAITS_HXX
#define TRAITS_HXX
 
#include <vector>
#include <sstream>
#include <iostream>
#include <cstring> // std::memcpy
 
#include <odb/pgsql/traits.hxx>
 
namespace odb
{
  namespace pgsql
  {
    template <>
    class value_traits<std::vector<int>, id_string>
    {
    public:
      typedef std::vector<int> value_type;
      typedef value_type query_type;
      typedef details::buffer image_type;
 
      static void
      set_value (value_type& v,
                 const details::buffer& b,
                 std::size_t n,
                 bool is_null)
      {
        v.clear ();
 
        if (!is_null)
        {
          char c;
          std::string s (b.data (), n);
          std::cerr << "in: " << s << std::endl;
          std::istringstream is (s);
 
          is >> c; // '{'
 
          for (c = static_cast<char> (is.peek ());
               c != '}';
               is >> c)
          {
            v.push_back (int ());
            is >> v.back ();
          }
        }
      }
 
      static void
      set_image (details::buffer& b,
                 std::size_t& n,
                 bool& is_null,
                 const value_type& v)
      {
        is_null = false;
        std::ostringstream os;
 
        os << '{';
 
        for (value_type::const_iterator i (v.begin ()),
               e (v.end ());
             i != e;)
        {
          os << *i;
 
          if (++i != e)
            os << ',';
        }
 
        os << '}';
 
        const std::string& s (os.str ());
        std::cerr << "out: " << s << std::endl;
        n = s.size ();
 
        if (n > b.capacity ())
          b.capacity (n);
 
        std::memcpy (b.data (), s.c_str (), n);
      }
    };
  }
}
#endif

Ok, now that we have both pieces of the puzzle, let’s put everything together. The first step is to compile test.hxx (the file that defines the persistent class) with the ODB compiler. At this stage we need to include traits.hxx (the file that defines the value_traits specialization) into the generated header file. We use the --hxx-epilogue option for that. Here is a sample ODB command line:

odb -d pgsql -s --hxx-epilogue '#include "traits.hxx"' test.hxx

Let’s also create a test driver that stores the object in the database and then loads it back. Here we want to see two things: the SQL statements that are being executed and the data that is being sent to and from the database:

// driver.cxx
//
#include <odb/transaction.hxx>
#include <odb/pgsql/database.hxx>
 
#include "test.hxx"
#include "test-odb.hxx"
 
using namespace std;
using namespace odb::core;
 
int main ()
{
  odb::pgsql::database db ("odb_test", "", "odb_test");
 
  object o;
  o.array.push_back (1);
  o.array.push_back (2);
  o.array.push_back (3);
 
  transaction t (db.begin ());
  t.tracer (stderr_tracer);
 
  unsigned long id (db.persist (o));
  db.load (id, o);
 
  t.commit ();
}

Now we can build and run our test driver:

g++ -o driver driver.cxx test-odb.cxx -lodb-pgsql -lodb
psql -U odb_test -d odb_test -f test.sql
./driver

The output of the test driver is shown below. Notice how the conversion expressions that we specified in the mapping pragma ended up in the SQL statements that ODB executed in order to persist and load the object.

out: {1,2,3}
INSERT INTO object(id,array) VALUES(DEFAULT,$2::INTEGER[]) RETURNING id
SELECT object.id,object.array::TEXT FROM object WHERE object.id=$1
in: {1,2,3}

For more information on custom database type mapping support in ODB refer to Section 12.6, “Database Type Mapping Pragmas” in the ODB manual. Additionally, the odb-tests package contains a set of tests in the <database>/custom directories that, for each database, show how to provide custom mapping for a variety of SQL types.

While the 2.1.0 release is still several weeks out, if you would like to give the new type mapping support a try, you can use the 2.1.0.a1 pre-release.

ODB 2.0.0 released

Wednesday, May 2nd, 2012

ODB 2.0.0 was released today.

In case you are not familiar with ODB, it is an object-relational mapping (ORM) system for C++. It allows you to persist C++ objects to a relational database without having to deal with tables, columns, or SQL, and without manually writing any of the mapping code. ODB natively supports SQLite, PostgreSQL, MySQL, Oracle, and Microsoft SQL Server.

This release packs a number of major new features, including support for C++11, polymorphism, and composite object ids, as well as a few backwards-incompatible changes (thus the major version bump). We have also added GCC 4.7 and Clang 3.0 to the list of compilers that we use for testing each release. Specifically, the ODB compiler has been updated to be compatible with the GCC 4.7 series plugin API. There is also an interesting addition (a free proprietary license) to the licensing terms. As usual, below I am going to examine these and other notable new features in more detail. For the complete list of changes, see the official ODB 2.0.0 announcement.

C++11 support

This is a big feature so I wrote a separate post about C++11 support in ODB a couple of weeks ago. It describes in detail what is now possible when using ODB in the C++11 mode. Briefly, this release adds integration with the new C++11 standard library components, specifically smart pointers and containers. We can now use std::unique_ptr and std::shared_ptr as object pointers (their lazy versions are also provided). On the containers front, support was added for std::array, std::forward_list, and the unordered containers.
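
As a quick illustration, here is what a persistent class might look like in C++11 mode (a sketch; the class and member names are illustrative, and the code has to be compiled, including by the ODB compiler, in C++11 mode):

#include <memory>
#include <string>
#include <unordered_set>

// Use std::unique_ptr as the object pointer for this class.
//
#pragma db object pointer(std::unique_ptr)
class person
{
public:
  #pragma db id auto
  unsigned long id;

  std::string name;

  // C++11 unordered containers are now supported as data members.
  std::unordered_set<std::string> emails;
};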

One C++11 language feature that really stands out when dealing with query results is the range-based for-loop. Compare the C++98 way:

 
typedef odb::query<employee> query;
typedef odb::result<employee> result;
 
result r (db.query<employee> (query::first == "John"));
 
for (result::iterator i (r.begin ()); i != r.end (); ++i)
  cout << i->first () << ' ' << i->last () << endl;
 

To the C++11 way:

 
typedef odb::query<employee> query;
 
auto r (db.query<employee> (query::first == "John"));
 
for (employee& e: r)
  cout << e.first () << ' ' << e.last () << endl;
 

If you are interested in more information on C++11 support, do read that post; it has much more detail and more code samples.

Polymorphism support

Another big feature in this release is support for polymorphism. Now we can declare a persistent class hierarchy as polymorphic and then persist, load, update, erase, and query objects of derived classes using their base class interfaces. Consider this hierarchy as an example:

#pragma db object polymorphic pointer(std::shared_ptr)
class person
{
  ...
 
  virtual void print () = 0;
 
  std::string first_;
  std::string last_;
};
 
#pragma db object
class employee: public person
{
  ...
 
  virtual void print ()
  {
    cout << (temporary_ ? "temporary" : "permanent")
         << " employee " << first_ << ' ' << last_;
  }
 
  bool temporary_;
};
 
#pragma db object
class contractor: public person
{
  ...
 
  virtual void print ()
  {
    cout << "contractor " << first_ << ' ' << last_
         << ' ' << email_;
  }
 
  std::string email_;
};

Now we can work with the employee and contractor objects polymorphically using their person base class:

unsigned long id1, id2;
 
// Persist.
//
{
  shared_ptr<person> p1 (new employee ("John", "Doe", true));
  shared_ptr<person> p2 (new contractor ("Jane", "Doe", "j@d.eu"));
 
  transaction t (db.begin ());
  id1 = db.persist (p1); // Stores employee.
  id2 = db.persist (p2); // Stores contractor.
  t.commit ();
}
 
// Load.
//
{
  shared_ptr<person> p;
 
  transaction t (db.begin ());
  p = db.load<person> (id1); // Loads employee.
  p = db.load<person> (id2); // Loads contractor.
  t.commit ();
}
 
// Update.
//
{
  shared_ptr<person> p;
  shared_ptr<employee> e;
 
  transaction t (db.begin ());
 
  e = db.load<employee> (id1);
  e->temporary (false);
  p = e;
  db.update (p); // Updates employee.
 
  t.commit ();
}
 
// Erase.
//
{
  shared_ptr<person> p;
 
  transaction t (db.begin ());
  p = db.load<person> (id1); // Loads employee.
  db.erase (p);              // Erases employee.
  db.erase<person> (id2);    // Erases contractor.
  t.commit ();
}

Polymorphic behavior is also implemented in queries, for example:

 
typedef odb::query<person> query;
 
transaction t (db.begin ());
 
auto r (db.query<person> (query::last == "Doe"));
 
for (person& p: r) // Can be employee or contractor.
  p.print ();
 
t.commit ();
 

The above query will select person objects that have the Doe last name, that is, any employee or contractor with this name. While the result set is defined in terms of the person interface, the actual objects (i.e., their dynamic types) that it will contain are employee or contractor. Given the above persist() calls, here is what this code fragment will print:

permanent employee John Doe
contractor Jane Doe j@d.eu

There are several alternative ways to map a polymorphic class hierarchy to a relational database model. ODB implements the so-called table-per-difference mapping where each derived class is mapped to a separate table that contains only the columns corresponding to the data members added by this derived class. This approach is believed to strike the best balance between flexibility, performance, and space efficiency. In the future we will consider supporting other mappings (e.g., table-per-hierarchy), depending on user demand.

For more detailed information on polymorphism support, refer to Chapter 8, “Inheritance” in the ODB Manual. There is also the inheritance/polymorphism example in the odb-examples package.

Composite object ids

ODB now supports composite object ids (translated to composite primary keys in the relational database). For example:

#pragma db value
class name
{
  ...
 
  std::string first_;
  std::string last_;
};
 
#pragma db object
class person
{
  ...
 
  #pragma db id
  name name_;
};
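
Given such a class, the composite id is used in database operations just like a simple one. For example (a sketch, assuming name has a constructor taking the first and last names and db is an open database instance):

name john ("John", "Doe");

transaction t (db.begin ());

person* p (db.load<person> (john));       // Load by composite id.
db.erase<person> (name ("Jane", "Doe"));  // Erase by composite id.

t.commit ();

delete p;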

For more information on this feature, refer to Section 7.2.1, “Composite Object Ids” in the ODB manual as well as the composite example in the odb-examples package.

Optional session support

The most important backwards-incompatible change in this release is making session support optional (the other has to do with database operation callbacks; see the official announcement for details). As you may remember, a session is a persistent object cache that is often useful for minimizing the number of database operations and may be required in order to load some bidirectional object relationships.

With ODB we try to follow the “you don’t pay for things you don’t use” principle. So support for things that are not needed by all applications (e.g., query) is not included in the generated code by default. This is particularly important for mobile/embedded applications that need to minimize code size as well as memory and CPU usage. Session support was an exception to this rule and we’ve decided to fix that in this release.

Now there are several ways to enable or disable session support for persistent classes. It can be done on a per-object basis or at the namespace level using the new session pragma. It can also be enabled by default for all objects using the --generate-session ODB compiler option. Thus, to get the old behavior where all the objects were session-enabled, simply add --generate-session to your ODB compiler command line. For more information, refer to Chapter 10, “Session” in the ODB manual.
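
For example, to enable the object cache for an individual persistent class (a sketch based on the new session pragma; the class itself is illustrative):

#include <string>

#pragma db object session
class person
{
public:
  #pragma db id auto
  unsigned long id;

  std::string name;
};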

Free proprietary license

To conclude, I would also like to mention a change to the ODB licensing terms. In addition to all the licensing options we currently have (open source and commercial proprietary licenses), we now offer a free proprietary license for small object models. This license allows you to use ODB in a proprietary (closed-source) application free of charge and without any of the GPL restrictions provided that the amount of the generated database support code does not exceed 10,000 lines. The ODB compiler now includes the --show-sloc command line option that can be used to show the amount of code being generated.
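
For example, to check how much database support code would be generated for a header (person.hxx here is just a placeholder), we can run something along these lines:

odb -d pgsql --show-sloc person.hxx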

How much is 10,000 lines? While it depends on the optional features used (e.g., query support, views, containers, etc.), as a rough guide, 10,000 lines of code are sufficient to handle an object model with 10-20 persistent classes, each with half a dozen data members.

For more information on the free proprietary license, including a Q&A section, refer to the ODB Licensing page.

Explicit SQL DELETE vs ON DELETE CASCADE

Thursday, April 12th, 2012

Let’s say we have several tables in an SQL database referencing each other with foreign key constraints. If we need to delete a row in one of these tables as well as all other rows that reference it, then we have two options. The first option is to execute an explicit DELETE statement for each table that contains referencing rows and then finish by deleting the referenced row (this order is important if we don’t want to violate any foreign key constraints). The other option is to declare our foreign keys as ON DELETE CASCADE. This clause tells the database to automatically delete the referencing row if the row it references is deleted.

One important limitation of the first approach is that you need to know the primary key (here I assume that foreign keys reference primary keys in the target tables) of the referenced row in order to be able to delete all the referencing rows. Oftentimes this limitation makes the second approach the only practical option. For example, if we want to delete several rows matching certain non-primary-key criteria, then to use the first approach we would first have to execute a SELECT statement that returns all the primary keys matching these criteria and then iterate over the keys and execute a set of DELETE statements for each of them. The same can be achieved with ON DELETE CASCADE by executing just one DELETE statement.

The question that I got the other day was this: if both approaches can be used, which one is better? But before trying to define what is better and answering this question, let me first give you some background on what triggered it. This came up during my work on ODB, an object-relational mapping (ORM) system for C++. ODB allows you to persist C++ objects to a number of relational databases without having to deal with tables, columns, or SQL, and without manually writing any of the mapping code.

In ODB, as in other ORMs, certain OOP constructs are mapped to additional tables. The two most notable ones are containers inside objects as well as object inheritance hierarchies. These additional tables are “linked” to the primary table (i.e., the object table in case of a container and the root object table in case of an inheritance hierarchy) with foreign key constraints. If you request ODB to generate the database schema for you, then ODB will add the ON DELETE CASCADE clause to such foreign keys.

At the same time, ODB provides two ways to erase an object state from the database. If we know the object id (primary key), then we can use the erase() function. Alternatively, to erase an object or multiple objects that match a certain criteria, we can call erase_query(). As you may have guessed, erase_query() is exactly the case where relying on the ON DELETE CASCADE clause is the only practical option. For the erase() case, however, either approach can be used. And we are back to the original question: which one is better?
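
In code, the two ways of erasing an object state look roughly like this (a sketch; the object class, its name member, and the id variable are illustrative):

typedef odb::query<object> query;

transaction t (db.begin ());

// Erase by object id (primary key); here either approach could
// be used under the hood.
//
db.erase<object> (id);

// Erase all objects matching a condition; this is where relying on
// ON DELETE CASCADE is the only practical option.
//
db.erase_query<object> (query::name == "obsolete");

t.commit ();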

First, let’s define better. One fairly natural definition would be faster is better. That is, whichever approach deletes objects faster is better. However, after some consideration, we may realize that other criteria, such as server load, can be equally or even more important. That is, whichever approach results in less server load is better. Server load itself can be defined in many ways, for example, as the CPU, memory, or disk resources needed to complete the task. For our case, speed and CPU resources used seemed like the most relevant metrics, so that’s what I measured.

I used ODB to quickly create a benchmark that I could run against various databases. The resulting database schema consists of three tables, with the second and third referencing the first. For each row in the first table, both the second and the third tables each contain five rows. The benchmark first populates the database with 20,000 rows in the first table (100,000 rows in each of the second and third tables). It then proceeds to delete all the rows in a randomized order using either the three explicit DELETE statements or a single statement that relies on the ON DELETE CASCADE mechanism.
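
In ODB terms, such a schema can be produced with an object along these lines (a sketch, not the actual benchmark code; each container member is mapped to its own table that references the object table with a foreign key):

#include <vector>

#pragma db object
class object
{
public:
  #pragma db id auto
  unsigned long id;

  // Each vector becomes a separate table referencing this object's
  // table; in the benchmark each holds five elements per object.
  std::vector<int> data1;
  std::vector<int> data2;
};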

Note also that ODB uses low-level C APIs to access each database, so there is no "wrapper overhead" involved. All statements are prepared once and then reused. All the databases except SQLite are accessed remotely and are running on the same server machine connected to the client machine via gigabit ethernet (see the ODB Benchmark Results post for more information on the setup).

For all the databases except SQLite, the ON DELETE CASCADE approach is faster than explicit DELETE. I cannot publish actual times (so that you could compare different databases) due to licensing restrictions. So, instead, the following table shows the speedup (or slowdown, in case of SQLite) and consumed CPU resources decrease (or increase, in case of SQLite) for each database:

Database     Speedup   CPU decrease
MySQL        45%       25%
Oracle       28%       11%
PostgreSQL   56%       55%
SQL Server   17%       22%
SQLite       -11%      -75%

My interpretation of these results is as follows: while the ON DELETE CASCADE approach is not necessarily faster intrinsically (it requires extra work by the database server), the overhead of executing a separate statement over a network connection dwarfs this extra work in comparison. One confirmation of this is the result for SQLite (its statement execution overhead is just a function call). The other is the result of running this benchmark with both the database and the client on the same machine (only some of the databases were tested):

Database     Speedup (remote)   Speedup (local)
MySQL        45%                43%
Oracle       28%                3%
PostgreSQL   56%                40%

While the ON DELETE CASCADE approach is still faster, for Oracle, for example, there is almost no difference compared to explicit DELETE.