Multidimensional variable-length arrays (std::vector<std::vector<std::vector<any dtype>>>)


#1

Hi,

Is there a recommended way of saving a multidimensional variable length vector using HDF5 libraries?

I’ve seen a page here that explains how to manage Variable Length Array Datatypes.

I would like to use the same strategy, but I was wondering how to apply it to multidimensional vectors, such as a vector of vectors of vectors.
I would also like to manage such a structure as chunked, in order to save and load it in smaller chunks of data.

In my specific case I want to save and load a dataset composed of three dimensions: [time steps, cortical columns, active cells].

The number of active cells can vary in each cortical column, and I want to load only some time steps per load operation, since the dataset could be huge and impossible to load in one shot.

In a future implementation the number of cortical columns could also vary per time step, but that is not urgent right now.

Thanks in advance!


#2

IMHO: the next H5CPP release will support this out of the box. I already have working code; it needs cleanup, so expect a few weeks, as I am held up with other related activities coordinating a more general case: paving the road for scalable, arbitrarily complex standard-layout C++ classes.
If this option appeals to you, please elaborate on the use case, either on this forum or in an online meetup, to coordinate your request with more complex ideas currently being discussed with Fiber5 author Werner Benger.

As for the time steps: they are worth considering in a separate discussion. AFAIK time can easily be encoded in floats thanks to recent developments in C++ (Howard Hinnant’s date library), allowing you to use a homogeneous datatype for both time and data, which I find applicable in machine learning/optimisation. The non-homogeneous encoding is already supported by compiler-assisted reflection: no coding is required on your behalf, and the HDF5 compound datatype descriptor is generated for you.

Partial I/O and a practical petascale approach are available as properties of the HDF5 data system: single FS, parallel FS, KITA VOL plugin, …; see examples of H5CPP for modern C++ here.

best wishes:
steven


#3

ugh, a vector<vector> is just about the worst data structure ever. My recommendation is simply not to do that. It requires multiple memory reallocations even on minor operations, and it will perform badly. I even had problems with the presumably simpler situation of a std::vector<std::string> with “just” 80,000 entries. It’s easy to build such data structures in C++, but that’s more a problem than a benefit, since performance is hell.

Instead, I’d rather recommend placing all the data in contiguous memory and referencing into it. C++20 provides std::span<>, which allows referencing data out of a bigger chunk, so no data copying or reallocation is needed.

Structures such as vector<vector> can then be achieved via index lists into a contiguous memory chunk. Reallocation and insertion/deletion of members is of course problematic; if that needs to be done frequently, then maybe something like a list<list<list<>>> is the better data structure anyway. Still, “flattening” that data structure for I/O would be computationally expensive.

Nevertheless, HDF5 certainly can handle variable-length datatypes, and there would be ways to store a vector<vector<vector<>>> directly. It would just be performance hell, and I’d rather avoid variable length at nearly all costs if possible; that starts at the C++ level, actually at C’s malloc() level, so it’s not even HDF5’s fault.

Maybe Steven’s H5CPP library provides some magic, though; that’s something to investigate, as it goes beyond some limitations of the C API.


#4

Thank you Steven, I will return to this task in a few months. I hope the support for ragged arrays is ready by then.

Best!

Dario


#5

Thanks Werner, my problem is not how to manage ragged arrays at run time, but rather how best to save them.

Best!

Dario