Collective VDS creation

Hi

I am trying to use VDS in my application.

We have data scattered across thousands of processes, but each process does not own a single hyperslab of the data. It actually owns a somewhat random collection of pieces, each of which is a hyperslab. To write this data to a file collectively, I first tried to make each process select a combination of hyperslabs using H5Scombine_hyperslab. That would be the perfect tool for us, but it does not work: it fails with “feature not supported” when using H5S_SELECT_APPEND.

Now I am trying to reproduce this behaviour using VDS: all pieces are written to a dataset out of order, and another dataset (virtual this time) maps all the pieces back into the right order. It works when using one process!

The problem: I am unable to create this VDS in a collective way.

More precisely, I would like to prepare the mapping individually (each process owns a dataset creation property list with a virtual layout mapping its own pieces), then call H5Dcreate2 collectively. The creation property lists from all processes should be merged into the VDS. Unfortunately, I could not find any way to tell H5Dcreate2 to work collectively: it ends up crashing at flush time.
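
To make this concrete, here is (roughly) what each process prepares on its own. This is only a sketch: the dataset name “/scratch”, the 2-D extents, and the function name are placeholders, not part of my actual code:

```c
#include "hdf5.h"

/* Sketch: build a DCPL mapping one of this rank's pieces from the
 * scratch dataset into its proper place in the virtual dataset. */
hid_t prepare_my_dcpl(const hsize_t vdims[2], const hsize_t sdims[2],
                      const hsize_t v_start[2], const hsize_t s_start[2],
                      const hsize_t count[2])
{
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);
    hid_t vspace = H5Screate_simple(2, vdims, NULL); /* virtual extent */
    hid_t sspace = H5Screate_simple(2, sdims, NULL); /* source extent  */

    /* Where the piece belongs in the virtual dataset ...             */
    H5Sselect_hyperslab(vspace, H5S_SELECT_SET, v_start, NULL, count, NULL);
    /* ... and where it was actually written in the scratch dataset.  */
    H5Sselect_hyperslab(sspace, H5S_SELECT_SET, s_start, NULL, count, NULL);

    /* "." means the source dataset lives in the same file as the VDS. */
    H5Pset_virtual(dcpl, vspace, ".", "/scratch", sspace);

    H5Sclose(sspace);
    H5Sclose(vspace);
    return dcpl;
}
```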

I know it would be possible to have one process create this VDS independently of the other processes, but the number of pieces is so large that it would take too much time and memory.

-> Is it possible to create a VDS collectively?

H5S_SELECT_APPEND is for point selections. What combination do you want?

G.

Ok, I didn’t realize append was for point selections. What I would like is a combination of hyperslabs (in my case 2D or 3D blocks) that are scattered a bit everywhere in the full array.

I tried H5S_SELECT_OR and it does not throw any error. However, the pieces of data end up arranged very disorderly. It’s as if the order in which hyperslabs are combined (with H5Scombine_hyperslab) does not match the order in which the data buffer is written. Is there any documentation explaining how the data should be ordered when combining selections?

For hyperslab selections, everything is in, what I would call, dataspace order. To be specific, let’s say we have a dataspace of rank 4 and extent (10, 20, 30, 40). Think of the dataspace order as an order on the grid points (i,j,k,l), where 0 <= i < 10, etc. In other words, for any two grid points (i,j,k,l) and (u,v,w,z), we can tell which of the two precedes the other in dataspace order. This is the order in which you would iterate over the grid points in a nested (depth 4) loop with the last dimension’s loop index changing the fastest, and so on.

When combining hyperslab selections with the usual set operations (union, intersection, etc.), the resulting set of grid points is always processed in dataspace order. There are many reasons for doing that, not least efficiency. If you want a prescribed (non-dataspace) order, you will have to resort to point selections, which are much less efficient.
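
For illustration, here is a minimal point-selection sketch with a prescribed order; the names and sizes are made up:

```c
#include "hdf5.h"

/* Write 3 elements of a 2-D dataset in a prescribed, non-dataspace
 * order using a point selection. With H5Sselect_elements, the i-th
 * coordinate pair is paired with the i-th buffer element, whatever
 * the order. */
void write_points(hid_t dset, hid_t fspace, const int buf[3])
{
    /* Prescribed order: buf[0] -> (5,5), buf[1] -> (0,7), buf[2] -> (2,1). */
    hsize_t coords[3][2] = { {5, 5}, {0, 7}, {2, 1} };
    hsize_t nelem = 3;
    hid_t   mspace = H5Screate_simple(1, &nelem, NULL);

    H5Sselect_elements(fspace, H5S_SELECT_SET, 3, &coords[0][0]);
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);
    H5Sclose(mspace);
}
```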

If you can arrange your hyperslab selections and buffer in dataspace order, then the union should work fine.
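
And here is a minimal sketch of the union case, again with made-up names and sizes; note that the buffer is laid out in dataspace order, not block by block:

```c
#include "hdf5.h"

/* Union of two 2x2 blocks in an 8x8 dataset via H5S_SELECT_OR. The
 * write buffer holds the elements of BOTH blocks in dataspace order,
 * not in the order the blocks were OR'ed together. */
int main(void)
{
    hsize_t dims[2]   = {8, 8};
    hsize_t count[2]  = {2, 2};
    hsize_t start1[2] = {0, 0};  /* first block: rows 0-1, cols 0-1  */
    hsize_t start2[2] = {4, 4};  /* second block: rows 4-5, cols 4-5 */
    hsize_t nelem     = 8;       /* 2 blocks x 4 elements            */
    int     buf[8];

    hid_t file   = H5Fcreate("union.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dset   = H5Dcreate2(file, "data", H5T_NATIVE_INT, fspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t mspace = H5Screate_simple(1, &nelem, NULL);

    /* Union of the two blocks: SET the first, then OR in the second. */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start1, NULL, count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_OR,  start2, NULL, count, NULL);

    /* Here block 1 (rows 0-1) wholly precedes block 2 (rows 4-5) in
     * dataspace order, so buf[0..3] land in block 1 and buf[4..7] in
     * block 2. If the blocks shared rows, their elements would
     * interleave row by row instead. */
    for (int i = 0; i < 8; i++)
        buf[i] = i;
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```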

G.

Ok, I think I understand. But reordering little bits of cubes in this manner is quite a daunting task, especially when these cubes are scattered somewhat randomly.

Alternatively, what about collective VDS creation, as in the original question? Has this been considered? Would it make sense in my situation?

Unless they’re overlapping, it’s not that difficult: just look at the smallest and largest grid points (the corners of each block).
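
Something along these lines (a sketch; the block record and the rank-3 layout are assumptions), with the caveat that sorting whole blocks is only sufficient when the sorted blocks don’t interleave in dataspace order:

```c
#include <stdlib.h>

/* Hypothetical record describing one piece in a rank-3 dataspace:
 * its smallest corner, its extent, and its slice of the buffer. */
typedef struct {
    unsigned long long start[3]; /* smallest corner (grid point) */
    unsigned long long count[3]; /* elements per dimension       */
    const void        *data;     /* this piece's buffer portion  */
} block_t;

/* Compare two blocks by their smallest corner in dataspace
 * (row-major) order: the first dimension is the most significant. */
static int cmp_blocks(const void *a, const void *b)
{
    const block_t *x = a, *y = b;
    for (int d = 0; d < 3; d++) {
        if (x->start[d] < y->start[d]) return -1;
        if (x->start[d] > y->start[d]) return  1;
    }
    return 0;
}

/* Sort the blocks so that OR-ing their hyperslabs and concatenating
 * their buffers matches dataspace order. This whole-block sort is
 * only valid if the blocks do not interleave, e.g. each block
 * occupies a disjoint range of the slowest dimension. */
void order_blocks(block_t *blocks, size_t n)
{
    qsort(blocks, n, sizeof *blocks, cmp_blocks);
}
```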

I will let a more competent colleague answer that question. I’m a little pessimistic, though. The problem is the logic: having multiple MPI ranks make rank-dependent selections against an existing dataset is well-defined, but merging rank-dependent VDS mappings is ambiguous, even if we fell back to dataspace order, because a dataset’s definition must not depend on an MPI communicator. In this case we are, in a way, talking about a dataset that doesn’t exist but that we implicitly pretend is well-defined. But maybe I’m completely off base…

G.

For the VDS question: to create a VDS collectively, every process needs to add every mapping. This conforms to the way collective metadata writes in HDF5 generally work: every process is assumed to make exactly the same (metadata write) calls with exactly the same parameters. It would be possible to implement what you are describing as a high-level routine that calls H5Pget_virtual and then does an allgather on the VDS mappings, but for now it’s easiest to handle it in the user application; a rough sketch of the idea follows below.
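
Here is one way that could look, simplified so that each rank describes its mappings as fixed-size block records rather than extracting them with the H5Pget_virtual* calls; all names, extents, and the record layout are assumptions of this sketch:

```c
#include <stdlib.h>
#include <mpi.h>
#include "hdf5.h"

/* Hypothetical fixed-size mapping record for a 2-D VDS; the layout
 * is an assumption of this sketch, not an HDF5 structure. Raw bytes
 * are exchanged, so a homogeneous system is assumed. */
typedef struct {
    hsize_t v_start[2], v_count[2]; /* block in the virtual dataset */
    hsize_t s_start[2], s_count[2]; /* block in the source dataset  */
} map_rec_t;

/* Every rank contributes its own records; after the allgather every
 * rank holds all records and adds every mapping, so the (collective)
 * H5Dcreate2 call sees identical metadata on every rank. `file` must
 * have been opened collectively with the MPI-IO file driver. */
hid_t create_vds_collectively(hid_t file, MPI_Comm comm,
                              const map_rec_t *mine, int nmine,
                              const hsize_t vdims[2], const hsize_t sdims[2])
{
    int nranks, nbytes = nmine * (int)sizeof(map_rec_t), ntotal = 0;
    MPI_Comm_size(comm, &nranks);
    int *counts = malloc(nranks * sizeof *counts);
    int *displs = malloc(nranks * sizeof *displs);

    /* Gather every rank's byte count, then every rank's records. */
    MPI_Allgather(&nbytes, 1, MPI_INT, counts, 1, MPI_INT, comm);
    for (int r = 0; r < nranks; r++) {
        displs[r] = ntotal;
        ntotal   += counts[r];
    }
    map_rec_t *all = malloc(ntotal);
    MPI_Allgatherv(mine, nbytes, MPI_BYTE,
                   all, counts, displs, MPI_BYTE, comm);

    /* Every rank now adds every mapping to its own DCPL. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    int nrec = ntotal / (int)sizeof(map_rec_t);
    for (int m = 0; m < nrec; m++) {
        hid_t vspace = H5Screate_simple(2, vdims, NULL);
        hid_t sspace = H5Screate_simple(2, sdims, NULL);
        H5Sselect_hyperslab(vspace, H5S_SELECT_SET,
                            all[m].v_start, NULL, all[m].v_count, NULL);
        H5Sselect_hyperslab(sspace, H5S_SELECT_SET,
                            all[m].s_start, NULL, all[m].s_count, NULL);
        H5Pset_virtual(dcpl, vspace, ".", "/scratch", sspace);
        H5Sclose(sspace);
        H5Sclose(vspace);
    }

    /* Identical collective create on every rank. */
    hid_t fullspace = H5Screate_simple(2, vdims, NULL);
    hid_t vds = H5Dcreate2(file, "virtual", H5T_NATIVE_INT, fullspace,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Sclose(fullspace);
    H5Pclose(dcpl);
    free(all);
    free(displs);
    free(counts);
    return vds;
}
```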

It’s also worth noting that there are some other limitations to parallel VDS I/O:

  • “printf” style mappings are not supported
  • The VDS must be opened collectively
  • When using separate source file(s), the source file(s) cannot be opened by the library while the VDS is open.
  • When using separate source file(s), data cannot be written through the VDS unless the mapping is equivalent to 1 process per source file
  • All I/O is independent internally (possible performance penalty)
  • Each rank does an independent open of each source file it accesses (possible performance penalty)

I should also note that VDS is not currently tested in the parallel regression test suite so there may be other issues.

Thank you for the answers. It looks like collective VDS creation is not going to be possible. With 5000 ranks opening a VDS, I fear the performance may be insufficient, so I will fall back to hyperslab combinations.