C++ equivalent of Python code for HDF5: can a 2D std::vector be written to an HDF5 file like a 2D std::array, or in a simpler way in general?

This post is about converting HDF5 code from Python to C++ while keeping the C++ part as simple as possible.

General HDF5 workflow in the program:

  • Generate a .h5 file with one or more datasets.
  • If the .h5 file exists, open it for writing/reading; if not, create it and then write/read.
  • Each dataset in the file contains data that comes from a 2D collection of variable size.
  • The 2D collection to insert into the dataset is returned by a function, and its exact size is not known in advance (at compile time). In some parts of the program only the number of rows is known in advance, and in other parts only the number of columns. This is why the C++ code below tests different “2-dimensional” data structures: C++ offers more options for the different cases than Python does.

Main question for this post:

What is the C++ equivalent of the following Python section (full code below):

f[dset_name].resize(f[dset_name].shape[0] + a_2d.shape[0], axis=0)
f[dset_name][-a_2d.shape[0]:] = a_2d
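
A minimal sketch of the bookkeeping behind those two Python lines, assuming a 2D dataset that grows along axis 0. The names `extent_t`, `AppendPlan` and `plan_row_append` are hypothetical and not part of HDF5; the sketch only computes the numbers and leaves out the HDF5 calls themselves.

```cpp
#include <array>
#include <cstdint>

// Stand-in for HDF5's hsize_t so the sketch compiles without the HDF5 headers.
using extent_t = std::uint64_t;

// Everything needed to append a block of rows to a 2D dataset.
struct AppendPlan {
    std::array<extent_t, 2> new_dims;  // shape after growing, cf. resize(..., axis=0)
    std::array<extent_t, 2> offset;    // where the new block starts, cf. [-n:]
    std::array<extent_t, 2> count;     // shape of the new block
};

// Hypothetical helper: plans the append of `new_rows` rows to a dataset
// whose current shape is `current_dims` (rows, columns).
AppendPlan plan_row_append(const std::array<extent_t, 2>& current_dims, extent_t new_rows)
{
    AppendPlan plan;
    plan.new_dims = {current_dims[0] + new_rows, current_dims[1]};
    plan.offset   = {current_dims[0], 0};
    plan.count    = {new_rows, current_dims[1]};
    return plan;
}
```

With the HDF5 C++ API, `new_dims` would typically be passed to `H5::DataSet::extend`, and `count` and `offset` to `H5::DataSpace::selectHyperslab` on the file dataspace, before calling `H5::DataSet::write` with a memory dataspace and that file dataspace. A dataset can only be extended if it was created with a chunked layout and a maximum size that allows growth (`H5S_UNLIMITED` along the growing axis); h5py sets up chunking automatically when `maxshape` is given.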

Python code to convert to C++:

import numpy as np
import h5py


file_name = "f_1"
dset_name = "dset_1"

# list-of-lists representing a function call that returns a list-of-lists with variable size.
l_2d = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]

f = h5py.File(f"{file_name}.h5", "a")
# f.flush()

# Create dataset
if dset_name not in f.keys():
  f.create_dataset(dset_name, (0, 3), maxshape=(None, 3), dtype="i")
  # f.flush()

# Write data
a_2d = np.array(l_2d)

f[dset_name].resize(f[dset_name].shape[0] + a_2d.shape[0], axis=0)
f[dset_name][-a_2d.shape[0]:] = a_2d

f.flush()
f.close()

C++ code, progress so far:

// Here are comments for context with some of the tests made while working
// to achieve the Python equivalence.

#include <array>
#include <iostream>
#include <string>
#include <vector>
#include "H5Cpp.h"

using namespace H5;  // needed for the unqualified `Exception` and `*IException` names used below

const H5std_string FILE_NAME("f.h5");
const H5std_string DATASET_NAME("dset_1");

int main()
{
  // int                               a_2d[4][3] = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}};
  // std::array<std::array<int, 3>, 4> a_2d{{ {1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4} }};
  std::vector<std::array<int, 3>>      a_2d{{1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}};
  // std::array<std::vector<int>, 4>   a_2d{{ {1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4} }};
  // std::vector<std::vector<int>>     a_2d{{1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}};
  
  try
  {
    Exception::dontPrint();
    
    H5::H5File f(FILE_NAME, H5F_ACC_TRUNC);  // AFAIK `H5F_ACC_TRUNC` wouldn't be the equivalence for the `"a"` mode in `h5py.File(f"{file_name}", "a")`
    
    // ... Here, extra process for vector-of-vectors and array-of-vectors. More details below ...
    
    hsize_t dims[2];
    dims[0] = 4;  // a_2d.size();
    dims[1] = 3;  // a_2d[0].size();
    H5::DataSpace dspace(2, dims);
    
    H5::DataSet dset = f.createDataSet(DATASET_NAME, H5::PredType::NATIVE_INT32, dspace);  // `H5::PredType::STD_I32BE`
    
    // dset.write(a_2d, H5::PredType::NATIVE_INT32);     // for C-style arrays
    dset.write(a_2d.data(), H5::PredType::NATIVE_INT32); // for non C-style arrays
    
    dspace.close();  // After checking some HDF5 C++ API examples in github,
    dset.close();    // it is not clear exactly what is needed about the
                     // `close` and `flush`
    
    // f.flush();  // Not tested because it asks for `H5F_scope_t scope`
    f.close();
    
  }

  catch (FileIException error)
  {
      error.printErrorStack();
      return -1;
  }

  catch (DataSetIException error)
  {
      error.printErrorStack();
      return -1;
  }

  catch (DataSpaceIException error)
  {
      error.printErrorStack();
      return -1;
  }

  return 0; // successfully terminated
}

About the 2d-collection

For now I prefer to work with C++ built-in data structures. The C++ code works most directly with the following “2D” data structures:

  • C-style array-of-arrays.
  • array-of-arrays (std::array of std::array).
  • vector-of-arrays (std::vector of std::array). After some tests with small data like the example here, this collection seems to work with the same HDF5 code as a std::array of std::array, without any extra transformation or explicit copy. I say “seems” because HDFView shows the content of the generated .h5 file exactly as it does for std::array of std::array, but I’m not sure whether this data structure is equally “adequate”.
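
A small sketch of why the vector-of-arrays case plausibly behaves like the array-of-arrays case: std::vector stores its elements contiguously, and on common implementations a std::array<int, 3> is exactly three ints with no padding, so the bytes behind `a_2d.data()` already form a row-major 4x3 block. `Row` and `raw_row_major_copy` are hypothetical names, and the no-padding assumption is checked with a static_assert rather than taken for granted.

```cpp
#include <array>
#include <cstring>
#include <type_traits>
#include <vector>

using Row = std::array<int, 3>;

// Assumptions checked at compile time: a row is exactly three ints
// and can be copied byte by byte.
static_assert(sizeof(Row) == 3 * sizeof(int), "Row has padding");
static_assert(std::is_trivially_copyable<Row>::value, "Row must be trivially copyable");

// Copies the raw bytes behind rows.data() into a flat vector<int>.
// These are the same bytes that a call reading from rows.data() would see.
std::vector<int> raw_row_major_copy(const std::vector<Row>& rows)
{
    std::vector<int> flat(rows.size() * 3);
    if (!rows.empty())
        std::memcpy(flat.data(), rows.data(), rows.size() * sizeof(Row));
    return flat;
}
```

If the static_assert holds on your platform, the flat copy comes out in row-major order, which matches what HDFView displays.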

With the following 2D data structures, however, the only way I got it to work was by copying the original values into a new collection, such as an array-of-arrays or hvl_t:

  • array-of-vectors (std::array of std::vector)
  • vector-of-vectors (std::vector of std::vector)

Sample of “extra process for vector-of-vectors and array-of-vectors”:

    hsize_t dim(a_2d.size());  // tests about the error in `hvl_t vl[dim]`: uint64_t hsize_t dim(a_2d.size());
    H5::DataSpace dspace(1, &dim);  // H5::DataSpace dspace(2, &dim);
    H5::VarLenType dtype(H5::PredType::NATIVE_INT32);
    H5::DataSet dset(f.createDataSet(DATASET_NAME, dtype, dspace));
    hvl_t vl[4];  // hvl_t vl[dim];  // error about the type.
    for (hsize_t i=0; i<dim; i++)
    {
      vl[i].len = a_2d[i].size();
      vl[i].p   = &a_2d[i][0];
    }
    dset.write(vl, dtype);

Regarding the “extra process for vector-of-vectors and array-of-vectors”, its disadvantage is that it relies on an explicit copy of the original collection. This section is just a test; the real program will place each element of the sub-collections into a separate table cell.
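
When all rows are known to have the same length, one alternative to the hvl_t route is to copy the vector-of-vectors into a single contiguous buffer and write that as an ordinary 2D dataset. This is only a sketch of the copy step; `flatten_rectangular` is a hypothetical helper, and it still makes the explicit copy mentioned above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copies a rectangular vector-of-vectors into one
// contiguous row-major buffer. Throws if the rows differ in length.
std::vector<int> flatten_rectangular(const std::vector<std::vector<int>>& rows)
{
    const std::size_t cols = rows.empty() ? 0 : rows.front().size();
    std::vector<int> flat;
    flat.reserve(rows.size() * cols);
    for (const auto& row : rows) {
        if (row.size() != cols)
            throw std::invalid_argument("rows have different lengths");
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}
```

The resulting `flat.data()` can be written with dims `{rows.size(), cols}`, in the same way as the array-of-arrays case.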


Notes:

I’m still looking for information (documentation, online references) on the other missing parts of the C++ code, so for now we can focus on the main question.



About the HDF5 C++ API

When searching online for HDF5 C++ API documentation (this and other parts of https://support.hdfgroup.org), for code examples from the documentation, and for HDF5 information in general (another example here), I notice that the content for the C++ API is not always as complete as the content for the HDF5 C API. As a small illustration, compare the same documentation example in C and in Python with its C++ equivalent: the C++ example does not contain the parts that close the dataspace and the file. I mention this mainly to highlight that it would be great if the documentation examples had a 1:1 equivalence across the languages, covering basic aspects that appear in this post, such as the C and C++ equivalent of the Python “a” mode (“Read/write if exists, create otherwise”) or when to use the flush functionality.


Thank you for the help!

I would recommend creating your own “2D” array class that holds the data in a std::vector. Provide accessors such as at(index), operator[](index), and value_type get(row, col), and also provide ErrorType write(hid_t location, const std::string& name) and ErrorType read(hid_t location, const std::string& name).
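
A rough sketch of what such a class could look like, assuming row-major storage and a column count fixed at construction. `Matrix2D` and its member names are hypothetical, and the HDF5 read/write members are left out so the sketch compiles without the HDF5 headers.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical minimal 2D container backed by one contiguous std::vector.
template <typename T>
class Matrix2D {
public:
    explicit Matrix2D(std::size_t cols) : cols_(cols) {}

    std::size_t rows() const { return cols_ == 0 ? 0 : data_.size() / cols_; }
    std::size_t cols() const { return cols_; }

    // Appends one row; the row length must match the column count.
    void append_row(const std::vector<T>& row)
    {
        if (row.size() != cols_)
            throw std::invalid_argument("row length does not match column count");
        data_.insert(data_.end(), row.begin(), row.end());
    }

    // Bounds-checked element access in row-major order.
    const T& at(std::size_t row, std::size_t col) const
    {
        if (row >= rows() || col >= cols_)
            throw std::out_of_range("Matrix2D::at");
        return data_[row * cols_ + col];
    }

    // Pointer to the contiguous buffer, e.g. for a dataset write call.
    const T* data() const { return data_.data(); }

private:
    std::size_t cols_;
    std::vector<T> data_;
};
```

Because the number of rows can grow at run time while the buffer stays contiguous, `data()` together with `rows()` and `cols()` is all a single dataset write needs.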

Eigen has some API that mimics the 2D row/col access but not the HDF5 read/write.

Here is some older code that we use to read/write std::vector to HDF5 files.

Depending on your application and data access patterns the “array of arrays” may or may not work well. You will definitely find it more tedious to write those “array of arrays” into a single contiguous HDF5 data set.

In general, in C++ you do know the size of the dataset, even if it has already run into the stack, though a debugger and a good understanding of memory layout help. Funny business aside, use containers. There are a few linear algebra libraries out there, and some C++ solutions (dmsc h5cpp, highfive, etc.) support matrices.

#include <Eigen/Core>
#include <h5cpp/all>
#include <filesystem>
#include <iostream>

int main() {
    using matrix_t = Eigen::Matrix<int, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
    matrix_t M(5, 2);
    M << 1, 1,
         2, 2,
         3, 3,
         4, 4,
         5, 5;

    h5::fd_t fd = std::filesystem::exists("my-file.h5")
        ? h5::open("my-file.h5", H5F_ACC_RDWR)
        : h5::create("my-file.h5", H5F_ACC_TRUNC);
    h5::write(fd, "dataset", M);
}

Then you would get something like this, which is boring and probably not what you want…

HDF5 "my-file.h5" {
GROUP "/" {
   DATASET "dataset" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 5, 2 ) / ( 5, 2 ) }
      DATA {
      (0,0): 1, 1,
      (1,0): 2, 2,
      (2,0): 3, 3,
      (3,0): 4, 4,
      (4,0): 5, 5
      }
   }
}
}

Now this ragged array is different, and less mundane so to speak… in this model you do not quite know the shape of the object up front, and possibly not even the rank.

HDF5 "my-file.h5" {
GROUP "/" {
   DATASET "ragged" {
      DATATYPE  H5T_VLEN { H5T_STD_I32LE}
      DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): (1, 2, 3, 4, 5), (7, 8), (9)
      }
   }
}
}

Here is how you do it…

int main() {
    h5::fd_t fd = std::filesystem::exists("my-file.h5")
        ? h5::open("my-file.h5", H5F_ACC_RDWR)
        : h5::create("my-file.h5", H5F_ACC_TRUNC);
   
    std::vector<std::vector<int>> ragged = {
        {1,2,3,4,5},   {7,8},  {9}};
    h5::write(fd, "ragged",ragged);
}

And the library that can do it for you: vargalabs/h5cpp on GitHub (C++17 templates between [stl::vector | armadillo | eigen3 | ublas | blitz++] and HDF5 datasets).

Sequence containers

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::vector<T> | linear_value_dataset | rank-1 of T’s native HDF5 type |
| std::deque<T> | linear_value_dataset | rank-1 of T’s native HDF5 type (staged through vector) |
| std::list<T> | linear_value_dataset | rank-1 of T’s native HDF5 type (staged through vector) |
| std::forward_list<T> | linear_value_dataset | rank-1 of T’s native HDF5 type (staged through vector) |
| std::valarray<T> | linear_value_dataset | rank-1 of T’s native HDF5 type |
| std::array<T, N> (non-char T) | array_element | scalar dataspace with H5T_ARRAY[N] element type |
| T[N] (C array, non-char T) | array_element | scalar dataspace with H5T_ARRAY[N] element type |

Associative containers

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::set<T> / std::multiset<T> | linear_value_dataset | rank-1 of T (iter-staging through vector buffer) |
| std::unordered_set<T> / multiset | linear_value_dataset | rank-1 of T |
| std::map<K, V> / multimap | key_value_dataset | rank-1 of H5T_COMPOUND { K key; V value; } |
| std::unordered_map<K, V> / multimap | key_value_dataset | rank-1 of H5T_COMPOUND { K key; V value; } |

Tuples and pairs

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::pair<K, V> (top-level) | scalar | scalar dataspace, H5T_COMPOUND { K first; V second; } |
| std::tuple<Ts...> (top-level) | scalar | scalar dataspace, H5T_COMPOUND { Ts...; } |
| std::vector<std::pair<K, V>> | key_value_dataset | rank-1 of H5T_COMPOUND { K first; V second; } |
| std::vector<std::tuple<Ts...>> | linear_value_dataset (composite element) | rank-1 of H5T_COMPOUND { Ts...; } |

Strings

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::string / std::string_view | vlen_text_dataset (scalar form) | scalar dataspace with H5T_C_S1, H5T_VARIABLE |
| char* / const char* (top-level) | vlen_text_dataset | scalar H5T_C_S1, H5T_VARIABLE |
| char[N] / std::array<char, N> | fixed_length_string | scalar H5T_C_S1 with H5Tset_size(N) |
| std::vector<std::string> | vlen_text_dataset | rank-1 of H5T_C_S1, H5T_VARIABLE (VLEN string per element) |
| std::vector<std::array<char, N>> | fls_dataset | rank-1 of H5T_C_S1 + H5Tset_size(N) (no VLEN; flat bytes) |

Ragged / vlen

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::vector<std::vector<T>> | ragged_vlen_dataset | rank-1 of H5T_VLEN { T } (hvl_t relay) |

Containers of arrays

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::vector<std::array<T, N>> (non-char) | array_dataset | rank-1 of H5T_ARRAY[N] of T |
| std::vector<T[N]> (equivalent) | array_dataset | rank-1 of H5T_ARRAY[N] of T |

Numeric

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::complex<T> (top-level) | scalar | scalar H5T_COMPOUND { T real; T imag; } |
| std::vector<std::complex<T>> | linear_value_dataset | rank-1 of H5T_COMPOUND { T real; T imag; } |

Smart pointers (memory-region overloads)

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::unique_ptr<T[]> | (forwarded to raw pointer) | rank-N (caller-supplied h5::count); element type = T’s native |
| std::shared_ptr<T[]> | (forwarded to raw pointer) | same (h5cpp/H5Mmemory_io.hpp mapper) |

Views (C++23)

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::mdspan<T, Extents, …> | (forwarded via view) | rank from Extents; element type = T’s native |

Gated on __cpp_lib_mdspan >= 202207L (libstdc++ ≥ 15 / libc++ ≥ 19).

Brace-list convenience

| C++ type | storage_representation_t | HDF5 on-disk shape |
|---|---|---|
| std::initializer_list<T> (via h5::awrite(parent, name, {a,b,c})) | linear_value_dataset | rank-1 of T (materialised internally) |

@steven.varga Thank you for your detailed response.

I’m currently studying the information you posted; it is very interesting.

There are some aspects I would like to comment on, but for now please let me clarify that at the moment I am trying to work with built-in C++ data structures and with the “official” HDF5 library and APIs such as H5Cpp.h.


I’m totally new to HDF5 and I have some questions about your response, mainly about containers-of-other-containers for numeric values (int, float, etc.):

  1. In the table you posted I see std::vector<std::array<T, N>> but not std::array<std::array<T, N>, M>. Is that because, from the HDF5 point of view, std::array<std::array<T, N>, M> would be “equivalent” to std::array<T, N>, perhaps because an array-of-arrays gets flattened or something like that?

  2. I’m trying to find information about the meaning of “HDF5 on-disk shape” in the table in your answer, for example to better understand a comparison like std::array<std::array<T, N>, M> vs std::vector<std::array<T, N>>. Does “rank-1” mean less efficient to handle/write/read, or what exactly does “rank-1” mean?

  3. How would one work with the std::array<std::vector<T>, M> data structure?

I also updated the main post with more details and more context.

Thank you for your time!

For the official C++ library, please contact The HDF Group directly. I can only speak for my own contribution.

Your questions about container persistence are right on the money. There is more going on behind the scenes, mostly related to how C++ organizes data in memory.

To set the context: an elementary datatype is a fixed-length sequence of bits, interpreted according to the target architecture. For example, std::uint64_t is 64 bits, or 8 bytes. Many such 8-byte values may be stored contiguously, or they may be scattered across memory. When the data is contiguous, no scatter/gather operation is needed. When it is not, things become suboptimal.

This is why the H5CPP type system tries to block constructs that look harmless but are actually foot-guns. For example, std::array<std::vector<T>, K> may look at first glance like a performant array of Ts, but in reality it is an array of vector objects sitting side by side. Each vector has its own data() pointer, and each points to a separate memory region. In other words, the type hides an extra level of indirection.
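
The extra level of indirection can be made visible with a small check, sketched here under the assumption that the outer container and its first row are non-empty. `first_element_is_inline` is a hypothetical helper, not part of H5CPP: it tests whether the first element lives inside the bytes of the outer object or has to be reached through a pointer.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Returns true when the first element of the first row is stored inside
// the bytes of the outer object itself, i.e. no pointer has to be followed.
// Assumes `outer` and its first row are non-empty.
template <typename Outer>
bool first_element_is_inline(const Outer& outer)
{
    const auto begin = reinterpret_cast<std::uintptr_t>(&outer);
    const auto end   = begin + sizeof(Outer);
    const auto elem  = reinterpret_cast<std::uintptr_t>(&outer[0][0]);
    return elem >= begin && elem < end;
}
```

For a std::array<std::array<int, 3>, 2> the check succeeds, because the ints are part of the object. For a std::array<std::vector<int>, 2> it fails, because the object holds only the vector handles and the ints sit in separately allocated memory.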

Funny enough, the underlying HDF5 machinery would work, but as a field expert I say: stop the madness. Use the right container and make the intended layout explicit. For an actual fixed-size array-of-arrays, use something like:

std::array<std::array<T, N>, K>

That maps naturally to the expected HDF5 H5T_ARRAY type. So, to the business: array-of-array works fine. Blame me for missing it in the documentation. You can use it safely for more than rank 2.

#include <filesystem>
#include <array>
#include <cstdint>
#include <h5cpp/all>

int main() {
    h5::fd_t fd = std::filesystem::exists("my-file.h5")
        ? h5::open("my-file.h5", H5F_ACC_RDWR)
        : h5::create("my-file.h5", H5F_ACC_TRUNC);

    std::array<std::array<std::uint64_t, 6>, 2> array_of_array = {{
        {{11, 12, 13, 14, 15, 16}},
        {{21, 22, 23, 24, 25, 26}}
    }};

    h5::write(fd, "array of array", array_of_array);
}

HDF5 "my-file.h5" {
  GROUP "/" {
     DATASET "array of array" {
        DATATYPE  H5T_ARRAY { [2] H5T_ARRAY { [6] H5T_STD_U64LE } }
        DATASPACE  SCALAR
        DATA {
        (0): [ [ 11, 12, 13, 14, 15, 16 ],
               [ 21, 22, 23, 24, 25, 26 ] ]
        }
     }
  }
}

C arrays work as well, up to rank 7, but you either need the h5cpp-compiler, which is supported on Windows, macOS, and Linux, or you have to hand-roll the type descriptor.

So rank is a mathematical concept, while how data is represented and stored is language- and platform-dependent. How this maps to physical memory, as opposed to the virtual memory seen by an application, is the responsibility of the operating system and memory manager. In HDF5 there are also two related ideas that are easy to mix up:

  • the dataspace rank, which describes the dimensionality of the dataset;
  • the array datatype rank, which describes the dimensionality of an H5T_ARRAY element.

In the example above, the dataset has a scalar dataspace, but the scalar element itself is a two-dimensional array datatype.

Informally:

rank 0: single element, of any type
rank 1: sequence / vector of elements
rank 2: matrix
rank 3: cube
rank 4+: you get the idea

It would be impolite to ask the rank of std::vector<std::vector<T>>.
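
For nested std::array types the rank can be read off the type itself, which a small trait illustrates. `fixed_rank` is a hypothetical name and not part of HDF5 or H5CPP; it counts nested fixed-size std::array levels and treats everything else as rank 0, since a std::vector carries no extent in its type.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical trait: any type that is not a std::array counts as rank 0.
template <typename T>
struct fixed_rank : std::integral_constant<std::size_t, 0> {};

// Each std::array level adds one to the rank of its element type.
template <typename T, std::size_t N>
struct fixed_rank<std::array<T, N>>
    : std::integral_constant<std::size_t, 1 + fixed_rank<T>::value> {};

static_assert(fixed_rank<std::uint64_t>::value == 0, "single element");
static_assert(fixed_rank<std::array<int, 3>>::value == 1, "sequence");
static_assert(fixed_rank<std::array<std::array<std::uint64_t, 6>, 2>>::value == 2, "matrix");
static_assert(fixed_rank<std::vector<std::vector<int>>>::value == 0, "no fixed extents");
```

The last assertion is the polite answer for std::vector<std::vector<T>>: each row may have a different length, so no fixed rank-2 shape exists to report.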
