[odb-users] HDF5 support?

Daniels, Marcus G mdaniels at lanl.gov
Mon Feb 18 10:17:04 EST 2013


On Feb 18, 2013, at 12:56 AM, Boris Kolpackov wrote:


Besides the work, is there any reason not to consider adding support
for HDF5?

I looked a bit into it, and it appears that HDF5 is more of a file
format rather than a (relational) database. While right now ODB is
exclusively about relational databases, the idea of also supporting
file formats (XML, JSON, etc.) seems like a natural extension and
crossed our minds on several occasions. I think HDF5 will fit into
this model (i.e., file format vs database) quite well.

I'm not aware of anyone who uses HDF5 just as a file format anymore.
I know of some cases of HDF4 files being written by multiple independent implementations, but HDF5 is a more complex standard, so I think everyone goes through the HDF Group's library.
Documenting the format itself would be more like documenting the on-disk layout of a filesystem. (And HDF5 is a lot like a userspace filesystem.)



While Postgres has array types, HDF5 is designed for parallel computing
environments where terabyte-scale datasets are read and written in
native format in parallel. (I don't see existing RDBMSs competing in
this arena.)

I wonder how this parallelism will fit into ODB? I don't believe HDF5
has a notion of transactions.

The kind of parallelism I have in mind would be decomposition of very large arrays.
That is, suppose an X/Y/Z 3-D space is physically decomposed along the Z dimension on a large RAID array, and a query is made to collect a subspace of it. Further suppose that several servers are each associated with a different part of the Z dimension (each chunk of Z on one server, spread over several hard drives). Collecting the data for a partial Z chunk would then proceed in parallel. With HDF5, this would be coordinated across the ranks of an MPI job.
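To make the decomposition concrete, here is a minimal sketch (names and helper are mine, not from any real HDF5 program) of the arithmetic that decides which Z planes a given MPI rank owns. In an actual HDF5 program these start/count values would go into H5Sselect_hyperslab() on the file dataspace before a collective H5Dread()/H5Dwrite():

```cpp
#include <cstddef>

// Hypothetical sketch: the slab of the Z dimension a given MPI rank
// owns when an NX x NY x NZ dataset is decomposed along Z.
struct Slab {
  std::size_t z_start; // first Z plane owned by this rank
  std::size_t z_count; // number of Z planes owned by this rank
};

// Divide nz planes as evenly as possible among nranks; the first
// nz % nranks ranks each get one extra plane.
Slab z_slab(std::size_t nz, int nranks, int rank) {
  std::size_t base  = nz / static_cast<std::size_t>(nranks);
  std::size_t extra = nz % static_cast<std::size_t>(nranks);
  std::size_t r     = static_cast<std::size_t>(rank);
  Slab s;
  s.z_count = base + (r < extra ? 1 : 0);
  s.z_start = r * base + (r < extra ? r : extra);
  return s;
}
```

With 10 planes over 4 ranks this gives slabs of 3, 3, 2, and 2 planes starting at 0, 3, 6, and 8, so every plane is owned by exactly one rank.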

How does HDF5 ensure the ACID properties of the data?

I don't think it has mechanisms for this. There are some proposals, e.g.:

http://www.hdfgroup.org/pubs/rfcs/Metadata_Journaling_RFC.pdf

One use case for HDF5 is large simulations, like ocean and climate models, that periodically dump their state to disk.
This is done (1) to facilitate analysis, and (2) so that if the system crashes, the time invested in the simulation is not lost.
The parallel access isn't like a database with lots of autonomous agents attached to it; it's one large, coordinated parallel agent.

The kind of data that's stored may be arrays of structures or structures of arrays, depending on the calculation.
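For anyone unfamiliar with the distinction, here is a hypothetical illustration (the particle fields are my own example) of the same data in both layouts. Which one a simulation dumps depends on how its inner loops traverse the data:

```cpp
#include <vector>

// Array of structures (AoS): all fields of one particle are
// contiguous, which suits loops that touch a whole particle at once.
struct Particle { double x, y, z, mass; };
using ParticlesAoS = std::vector<Particle>;

// Structure of arrays (SoA): each field is its own contiguous array,
// which suits (and vectorizes better for) loops that sweep a single
// field across all particles.
struct ParticlesSoA {
  std::vector<double> x, y, z, mass;
};
```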

Advantages HDF5 has over using multiple raw binary files include that it enforces type discipline over the data and provides a single file that is self-describing.
Unfortunately, it is fairly tedious to describe the types through its API (tedious in the same way as derived types in MPI), and it would be neat if all that were necessary was to write down a C++ class.
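To show the sort of bookkeeping I mean, here is a sketch in plain C++ (the struct and table are my own hypothetical example, not HDF5 code) of the per-member (name, byte offset) table a compound datatype description boils down to. With the real API, each entry becomes a call like H5Tinsert(tid, "x", HOFFSET(Sample, x), H5T_NATIVE_DOUBLE); the wish is that this table could be derived from the class definition itself, much as ODB derives SQL mappings:

```cpp
#include <cstddef> // offsetof

// A hypothetical record a simulation might want to store.
struct Sample {
  int    id;
  double x;
  double y;
};

// One entry per member, spelled out by hand: exactly the information
// an HDF5 compound type (or an MPI derived type) needs.
struct MemberDesc {
  const char* name;
  std::size_t offset;
};

const MemberDesc sample_layout[] = {
  {"id", offsetof(Sample, id)},
  {"x",  offsetof(Sample, x)},
  {"y",  offsetof(Sample, y)},
};
```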

Cheers,

Marcus

