I’m prototyping some code to read and write our particle simulation input and output to an HDF5 file in a specific structure (H5Part) that I could open in visualisation software (ParaView). The format contains one dataset per value (x, y, z, vx, …), each a 1D array of a primitive type.
The simulation program stores the data in an array of structs:
struct {
    double x;
    double y;
    double z;
    int    id;
    double vx;
    double vy;
    double vz;
    /* ... other properties, some internal, which shouldn't go to the h5 file ... */
};
What’s the canonical way to read/write between this structure in memory and the HDF5 file?
I did some searching and found some ideas to start with, but some pointers to push me in the right direction would be welcome.
Here is what I thought of:
1. Brute-force approach: loop over my array of particles and read/write individual elements from/to each dataset. I think I can get this to work, but it looks mighty inefficient.
2. Use dataspaces to apply a stride and offset and access the correct place in the array of structs for each element of a dataset (see the sketch after this list). If I understand correctly, this is designed to access arrays of the same type while skipping some elements, so I’m not sure I can reach all the offsets in my structure with its heterogeneous datatypes.
3. Use compound datatypes: it looks pretty elegant to define my memory mapping to point to the correct member, but from my understanding I can only use compound datatypes on both sides of the read/write operation, and I think my data format needs arrays of primitive types. Is it possible to map a compound type with a single double member to, for instance, an F64?
4. Write the data out as an array of compounds and write a ParaView reader plugin (also a big task).
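For concreteness, here is roughly what I imagine option 2 would look like, assuming the struct above is given a tag (struct Particle here), that sizeof(struct Particle) is a multiple of sizeof(double), and that the member is double-aligned; dset_vx, particles and nparticles are placeholders, error checking omitted:

#include <stddef.h> /* offsetof */
#include <hdf5.h>

/* View the struct array as a flat 1D array of doubles and select every
   k-th double, starting at the member's offset. */
hsize_t k        = sizeof(struct Particle) / sizeof(double);
hsize_t flat_len = (hsize_t)nparticles * k;
hid_t   mspace   = H5Screate_simple(1, &flat_len, NULL);

hsize_t start  = offsetof(struct Particle, vx) / sizeof(double);
hsize_t stride = k;
hsize_t count  = nparticles;
H5Sselect_hyperslab(mspace, H5S_SELECT_SET, &start, &stride, &count, NULL);

/* Read the whole 1D "vx" dataset straight into the struct array. */
H5Dread(dset_vx, H5T_NATIVE_DOUBLE, mspace, H5S_ALL, H5P_DEFAULT, particles);
H5Sclose(mspace);

Note that this double-based view can't address the int id member, which is what makes me doubt the approach for a heterogeneous struct.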
I might be missing something obvious, as I’m exploring the documentation and prototyping, so any suggestion is welcome.
This might be a case for a user-defined datatype conversion. You can register a conversion function compound <-> double of type H5T_conv_t via H5Tregister. In your case, this function would extract (on read) or update (on write) the desired struct field. The only ugliness left would be that you’d have to call it six times (once for each component x, y, z, vx, vy, vz). Since the core of the datatype “conversion” is an assignment, there might be a way to wing it with H5Dgather or H5Dscatter; however, that would get you in trouble if the field types were different (endianness or size).
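If it helps, here is a rough sketch of the H5Dgather variant for writing one member; it relies on the same flat-view trick and alignment caveats as your option 2 (struct Particle, dset_x, particles and nparticles are placeholders, error checking omitted):

/* Select the x member of each struct through a flat view of doubles. */
hsize_t k        = sizeof(struct Particle) / sizeof(double);
hsize_t flat_len = (hsize_t)nparticles * k;
hid_t   space    = H5Screate_simple(1, &flat_len, NULL);
hsize_t start    = offsetof(struct Particle, x) / sizeof(double);
hsize_t stride   = k;
hsize_t count    = nparticles;
H5Sselect_hyperslab(space, H5S_SELECT_SET, &start, &stride, &count, NULL);

/* With a NULL callback, H5Dgather packs the whole selection into dst_buf. */
double *scratch = malloc(nparticles * sizeof(double));
H5Dgather(space, particles, H5T_NATIVE_DOUBLE,
          nparticles * sizeof(double), scratch, NULL, NULL);
H5Dwrite(dset_x, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, scratch);
free(scratch);
H5Sclose(space);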
I agree that registering a custom conversion function is the best solution here. It should be possible to register only one conversion function (or two, if you want to be able to read back into the compound). You can register a soft conversion between a compound datatype and an atomic datatype (such as double). The conversion function will then, in H5T_CONV_INIT, check that the compound has exactly one member and that the member’s type is exactly the same as the destination atomic type, and note the member’s offset in the compound. In H5T_CONV_CONV it will simply copy the data as appropriate. Optionally, you could extend it to work with atomic datatypes that aren’t identical but are compatible, by calling H5Tconvert on the compound member during H5T_CONV_CONV. This could be a good candidate for something to build into HDF5 in a future release.
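A minimal sketch of that design, assuming the struct from the original post (tagged struct Particle here) and leaving out error cleanup:

#include <string.h>
#include <stddef.h>
#include <hdf5.h>

/* Soft conversion: one-member compound -> matching atomic type. */
static herr_t
compound_to_atomic(hid_t src_id, hid_t dst_id, H5T_cdata_t *cdata,
                   size_t nelmts, size_t buf_stride, size_t bkg_stride,
                   void *buf, void *bkg, hid_t dxpl)
{
    switch (cdata->command) {
    case H5T_CONV_INIT: {
        /* Accept only a one-member compound whose member type equals dst;
           failing here makes the library keep looking for another path. */
        if (H5Tget_class(src_id) != H5T_COMPOUND || H5Tget_nmembers(src_id) != 1)
            return -1;
        hid_t  mtype = H5Tget_member_type(src_id, 0);
        htri_t same  = H5Tequal(mtype, dst_id);
        H5Tclose(mtype);
        if (same <= 0)
            return -1;
        cdata->need_bkg = H5T_BKG_NO; /* nothing to preserve in this direction */
        return 0;
    }
    case H5T_CONV_CONV: {
        size_t src_size = H5Tget_size(src_id);
        size_t dst_size = H5Tget_size(dst_id);
        size_t moff     = H5Tget_member_offset(src_id, 0);
        size_t sstride  = buf_stride ? buf_stride : src_size;
        size_t dstride  = buf_stride ? buf_stride : dst_size;
        /* Repack in place, walking upward: a destination element never
           overtakes a not-yet-read source because dst_size <= src_size. */
        for (size_t i = 0; i < nelmts; i++)
            memmove((char *)buf + i * dstride,
                    (char *)buf + i * sstride + moff, dst_size);
        return 0;
    }
    case H5T_CONV_FREE:
        return 0;
    default:
        return -1;
    }
}

At program start you would register it once; soft conversions are matched by datatype class (COMPOUND -> FLOAT here), and the H5T_CONV_INIT check rejects pairs the function can’t handle:

hid_t one = H5Tcreate(H5T_COMPOUND, sizeof(double));
H5Tinsert(one, "v", 0, H5T_NATIVE_DOUBLE);
H5Tregister(H5T_PERS_SOFT, "cmpd->atomic", one, H5T_NATIVE_DOUBLE,
            compound_to_atomic);

To write the vx member you would then describe memory as a compound of the struct’s size with a single member and let H5Dwrite trigger the conversion (this assumes the file datatype is bit-identical to native double, per the exact-match check above):

hid_t mem = H5Tcreate(H5T_COMPOUND, sizeof(struct Particle));
H5Tinsert(mem, "vx", offsetof(struct Particle, vx), H5T_NATIVE_DOUBLE);
H5Dwrite(dset_vx, mem, H5S_ALL, H5S_ALL, H5P_DEFAULT, particles);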
Solution 2 could work, but it’s inelegant, and it only works if the struct size is a multiple of the element size and the member’s offset within the struct is a multiple of the element size.
Another option would be to define a custom atomic datatype that’s identical to a native double but with padding added at the beginning and end to make it the same size as the struct, with the numerical data in the same location as the member you want to select. This might go through the slow bitwise conversion routines, though, and risks tripping the newly added file integrity checks; as long as the datatype description isn’t written to the file, that should be fine.
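If I read the datatype functions correctly, that padded type could be built roughly like this for a little-endian IEEE 754 native double (struct Particle stands for the struct in question, positions are bit offsets, and this is an untested sketch):

/* A "double" padded out to the size of the struct, with the significant
   bits located where the vx member lives. */
size_t b = 8 * offsetof(struct Particle, vx); /* bit offset of the value */
hid_t  t = H5Tcopy(H5T_NATIVE_DOUBLE);
H5Tset_size(t, sizeof(struct Particle));      /* pad out to struct size */
H5Tset_offset(t, b);                          /* first significant bit */
H5Tset_fields(t, b + 63, b + 52, 11, b, 52);  /* sign, exponent, mantissa */

/* Writing with this memory type picks each value straight out of the
   struct array, with no intermediate copy in user code. */
H5Dwrite(dset_vx, t, H5S_ALL, H5S_ALL, H5P_DEFAULT, particles);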
I forgot to add: perhaps the simplest option would be to copy the selected members from the struct to an intermediate buffer that is simply an array of doubles, and use that for the HDF5 I/O, though this of course costs some extra memory.
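In code this is just (placeholder names, error checking omitted):

/* Simplest route: stage one member in a plain array of doubles. */
double *tmp = malloc(nparticles * sizeof(double));
for (size_t i = 0; i < nparticles; i++)
    tmp[i] = particles[i].x;
H5Dwrite(dset_x, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, tmp);
free(tmp);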
Thanks a lot for the answers. I did some reading and prototyped the conversion method, trying to implement the file-reading direction (from an H5 file with multiple linear arrays to one array of structs in memory).
If my understanding is correct, the data gets read into a temporary conversion buffer. In my H5T_conv_t function, I then need to read that array out as its primitive type into another temporary buffer, so that I can write it back into the same conversion buffer in the array-of-structs layout, setting the member I just read. The result then gets copied to the output array that was passed to the read call.
I can then go on with the next input array. But it looks like writing to the output array overwrites the other members of the struct. From what I understand of the docs, that’s what the background buffer is for: it should hold the previous contents of the structs. That means one more copy of the data. And since the buffers default to 1 MB, the background buffering would not work if I have more elements than that; I would have to allocate a background buffer the size of my whole dataset, in addition to the output array.
The docs hint that I might be able to subvert the background buffer and simply use it as output storage (with H5T_BKG_YES), but the documentation is a bit sparse on this subject, and it would still mean one more memory buffer the same size as the output buffer.
Is my understanding correct? I was hoping the conversion function would let me read data from an input buffer and write it to the right places in the output buffer, in another layout, without all those intermediate copies and allocations; but it looks to me like that is not possible.
As I have more members in my struct and my datasets might need to scale to bigger sizes, it looks like I’m much better off reading each dataset in smaller chunks into temporary arrays and populating my array of structs from there, as nfortne2 suggests.
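Something along these lines is what I have in mind (CHUNK and the other names are placeholders, error checking omitted):

/* Read one dataset slice by slice and scatter each slice into the structs. */
const hsize_t CHUNK = 16384;
double *tmp = malloc(CHUNK * sizeof(double));
hid_t fspace = H5Dget_space(dset_vx);
hsize_t total;
H5Sget_simple_extent_dims(fspace, &total, NULL);
for (hsize_t start = 0; start < total; start += CHUNK) {
    hsize_t count = (total - start < CHUNK) ? total - start : CHUNK;
    hid_t mspace = H5Screate_simple(1, &count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    H5Dread(dset_vx, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, tmp);
    for (hsize_t i = 0; i < count; i++)
        particles[start + i].vx = tmp[i];
    H5Sclose(mspace);
}
H5Sclose(fspace);
free(tmp);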
Type conversion is always performed in place; the background buffer is never used as the output buffer. It should be possible to forgo the background buffer entirely when converting from a compound to an atomic type: simply memmove (or memcpy, if you can verify the regions don’t overlap) each element from its location in the tconv buffer viewed as an array of compounds to its location in the tconv buffer viewed as an array of atomic types, in order of increasing offset. For converting from atomic to compound you’ll need the background buffer (H5T_BKG_YES), because otherwise you’d overwrite the other compound members. In that case the background buffer is initialized with the existing (compound) destination data, so you’ll want to memcpy each atomic element from the tconv buffer into the correct place in the background buffer (viewed as an array of compounds), then memcpy the background buffer back into the tconv buffer afterwards.
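As a fragment, the H5T_CONV_CONV branch for the atomic -> one-member-compound direction might look like this (packed buffers assumed, i.e. buf_stride and bkg_stride of 0; H5T_CONV_INIT would set cdata->need_bkg = H5T_BKG_YES):

size_t src_size = H5Tget_size(src_id);             /* atomic, e.g. 8 bytes */
size_t dst_size = H5Tget_size(dst_id);             /* sizeof(struct)       */
size_t moff     = H5Tget_member_offset(dst_id, 0);
/* bkg already holds the existing destination structs; drop each value in. */
for (size_t i = 0; i < nelmts; i++)
    memcpy((char *)bkg + i * dst_size + moff,
           (char *)buf + i * src_size, src_size);
/* The converted result must end up in the tconv buffer. */
memcpy(buf, bkg, nelmts * dst_size);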
If you want to support differing atomic datatypes this of course gets a bit more complicated.
Type conversion in HDF5 always imposes an overhead, so the fastest way to transfer the data in this case is likely an intermediate buffer of doubles maintained by your app (as I suggested in my second message), provided you have the memory to do so.
Thanks for the explanation about the background buffer; I hadn’t understood that it is actually initialised with the existing destination data! That makes the model much clearer.
But as you say, it still has some overhead. I’ll experiment with the other approach of going through a buffer in my app.