This post is about converting HDF5 code from Python to C++, while keeping the C++ part as simple as possible.
General HDF5 workflow in the program:
- Generate a .h5 file with one or more datasets.
- Write/read if the .h5 file exists; otherwise create the .h5 file and then write/read.
- Each dataset in the file will contain data that comes from a 2d-collection with variable size.
- The 2d-collection to insert into the dataset is returned by a function, and its exact size is not known in advance (at compile time). In some parts of the program only the number of rows is known in advance, and in other parts only the number of columns. This is why the C++ code below shows tests with different “2-dimensional” data structures: C++ offers more options for the different cases than Python.
Main question for this post:
What is the C++ equivalent of the following Python section (full code below)?
f[dset_name].resize(f[dset_name].shape[0] + a_2d.shape[0], axis=0)
f[dset_name][-a_2d.shape[0]:] = a_2d
Python code to convert to C++:
import numpy as np
import h5py
file_name = "f_1"
dset_name = "dset_1"
# list-of-lists representing a function call that returns a list-of-lists with variable size.
l_2d = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
f = h5py.File(f"{file_name}.h5", "a")
# f.flush()
# Create dataset
if dset_name not in f.keys():
    f.create_dataset(dset_name, (0, 3), maxshape=(None, 3), dtype="i")
# f.flush()
# Write data
a_2d = np.array(l_2d)
f[dset_name].resize(f[dset_name].shape[0] + a_2d.shape[0], axis=0)
f[dset_name][-a_2d.shape[0]:] = a_2d
f.flush()
f.close()
C++ code, progress so far:
// Here are comments for context with some of the tests made while working
// to achieve the Python equivalence.
#include <array>
#include <iostream>
#include <string>
#include <vector>
#include "H5Cpp.h"
const H5std_string FILE_NAME("f.h5");
const H5std_string DATASET_NAME("dset_1");
int main()
{
    // int a_2d[4][3] = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}};
    // std::array<std::array<int, 3>, 4> a_2d{{ {1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4} }};
    std::vector<std::array<int, 3>> a_2d{{1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}};
    // std::array<std::vector<int>, 4> a_2d{{ {1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4} }};
    // std::vector<std::vector<int>> a_2d{{1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}};
    try
    {
        // Names are qualified with `H5::` because there is no `using namespace H5;`.
        H5::Exception::dontPrint();
        H5::H5File f(FILE_NAME, H5F_ACC_TRUNC); // AFAIK `H5F_ACC_TRUNC` wouldn't be the equivalence for the `"a"` mode in `h5py.File(f"{file_name}", "a")`
        // ... Here, extra process for vector-of-vectors and array-of-vectors. More details below ...
        hsize_t dims[2];
        dims[0] = 4; // a_2d.size();
        dims[1] = 3; // a_2d[0].size();
        H5::DataSpace dspace(2, dims);
        H5::DataSet dset = f.createDataSet(DATASET_NAME, H5::PredType::NATIVE_INT32, dspace); // `H5::PredType::STD_I32BE`
        // dset.write(a_2d, H5::PredType::NATIVE_INT32); // for C-style arrays
        dset.write(a_2d.data(), H5::PredType::NATIVE_INT32); // for non C-style arrays
        dspace.close(); // After checking some HDF5 C++ API examples in github,
        dset.close();   // it is not clear exactly what is needed about the
                        // `close` and `flush`
        // f.flush(); // Not tested because it asks for `H5F_scope_t scope`
        f.close();
    }
    catch (H5::FileIException& error)
    {
        error.printErrorStack();
        return -1;
    }
    catch (H5::DataSetIException& error)
    {
        error.printErrorStack();
        return -1;
    }
    catch (H5::DataSpaceIException& error)
    {
        error.printErrorStack();
        return -1;
    }
    return 0; // successfully terminated
}
About the 2d-collection
For now I prefer to work with C++ built-in data structures, and the most direct way to make the C++ code work has been with the following “2d” data structures:
- C-style array-of-arrays.
- array-of-arrays (std::array of std::array).
- vector-of-arrays (std::vector of std::array). After some tests with small-size data like in the example here, this collection seems to work with the same HDF5 code as for a std::array of std::array, without needing extra transformation or explicit copy. I say “seems” because when opening the generated .h5 file with HDFView, it shows the same content as when working with std::array of std::array, but I’m not sure if this data structure is equally “adequate”.
With the following other 2d data structures, however, the only way to make it work has been to copy the original collection values into a new collection such as an array-of-arrays or hvl_t:
- array-of-vectors (std::array of std::vector)
- vector-of-vectors (std::vector of std::vector)
Sample of “extra process for vector-of-vectors and array-of-vectors”:
hsize_t dim(a_2d.size()); // tests about the error in `hvl_t vl[dim]`: uint64_t hsize_t dim(a_2d.size());
H5::DataSpace dspace(1, &dim); // H5::DataSpace dspace(2, &dim);
H5::VarLenType dtype(H5::PredType::NATIVE_INT32);
H5::DataSet dset(f.createDataSet(DATASET_NAME, dtype, dspace));
hvl_t vl[4]; // hvl_t vl[dim]; // error about the type.
for (hsize_t i = 0; i < dim; i++)
{
    vl[i].len = a_2d[i].size();
    vl[i].p = &a_2d[i][0];
}
dset.write(vl, dtype);
About the “extra process for vector-of-vectors and array-of-vectors”: I can see it has the disadvantage of requiring an extra intermediate collection built from the original one (in the sample above, the array of hvl_t entries, one per row). This section is only a test; the real program will place each element of the sub-collections into a separate table cell.
Notes:
I’m still looking for information (documentation, online references) about the other missing parts of the C++ code, so for now we could focus on the main question.
About the HDF5 C++ API
When searching online for HDF5 C++ API documentation (this and other parts in https://support.hdfgroup.org), for code examples from the documentation, and for HDF5 information in general (another example here), I notice that the content about the C++ API is not always as complete as the content for the HDF5 C API. As an illustration, compare the same documentation example in C and in Python with the equivalent example in C++: the C++ example doesn’t contain the parts that close the dataspace and the file. I mention this mainly to highlight that it would be great if the documentation examples had a 1:1 equivalence across languages, showing the basic aspects raised in this post, such as the C and C++ equivalent of the Python “a” mode (“Read/write if exists, create otherwise”), when to use the flush functionality, etc.
Thank you for the help!
