H5Dread_multi for distinct datasets

Good afternoon,

We’re looking at H5Dread_multi to read from many (thousands or more) datasets in a single call. Reading the [RFC], the new feature seems very promising. To my understanding, the idea is to compute a list of offsets and sizes to read, sort the list by file offset (possibly removing overlapping regions or joining reads separated by small gaps), and read collectively using MPI I/O. HDF5 then converts the bytes into what the user expects (decompression, endianness conversion, etc.).
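For concreteness, here is a minimal sketch of what I understand a call to look like (HDF5 1.14 or newer). The dataset handles, buffers, and element type are made-up placeholders, not taken from our actual code:

```c
/* Minimal sketch of a two-dataset H5Dread_multi call (HDF5 >= 1.14).
 * dset_a, dset_b, buf_a, buf_b are hypothetical: already opened/allocated. */
#include "hdf5.h"

herr_t read_two(hid_t dset_a, hid_t dset_b, double *buf_a, double *buf_b)
{
    hid_t dsets[2]   = { dset_a, dset_b };
    hid_t mtypes[2]  = { H5T_NATIVE_DOUBLE, H5T_NATIVE_DOUBLE };
    hid_t mspaces[2] = { H5S_ALL, H5S_ALL };  /* whole dataset in memory */
    hid_t fspaces[2] = { H5S_ALL, H5S_ALL };  /* whole dataset in the file */
    void *bufs[2]    = { buf_a, buf_b };

    /* One call covers both datasets; with a collective transfer property
       list instead of H5P_DEFAULT this becomes one collective operation. */
    return H5Dread_multi(2, dsets, mtypes, mspaces, fspaces, H5P_DEFAULT, bufs);
}
```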

To a naive reader, none of this requires the following condition:

All datasets must be in the same HDF5 file, and each unique dataset may only be listed once. If this function is called collectively in parallel, each rank must pass exactly the same list of datasets in dset_id, though the other parameters may differ.

Our use case is that each MPI rank wants to read mostly distinct datasets, and therefore doesn’t naturally satisfy the restriction above. The details of how many datasets there are and how big each one is are described here [1]; feel free to ask for further information as needed.

We have the following questions:

  1. Is H5Dread_multi intended to work in our use case? If yes:
    1. How would we use it? One idea would be to exchange the names of all datasets, open all datasets on every MPI rank, and tell almost every MPI rank to read 0 elements (see the sketch after this list). I’m skeptical of this approach, because in some ways it grows the problem in proportion to the number of MPI ranks just to benefit from improved access patterns on the parallel filesystem.
    2. Can HIDs be MPI_Allgathered, or are they only valid in the process that created them?
  2. If not: can this feature be extended to support the case of many small groups? Are there any plans to do so?
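To make question 1.1 concrete, here is roughly what the “read 0 elements” workaround could look like, assuming every rank has agreed on the same list of dataset names. The names need[], bufs[], names[] and the collective transfer property list are assumptions for the sketch, not a recommendation:

```c
/* Sketch of the "read 0 elements" idea: every rank opens the same datasets,
 * but a rank that does not need dataset i selects nothing in its dataspaces.
 * need[], bufs[], names[] are hypothetical inputs; element type assumed double. */
#include <stdlib.h>
#include "hdf5.h"

void read_shared_list(hid_t file, const char *names[], size_t n,
                      const int need[], void *bufs[], hid_t dxpl_collective)
{
    hid_t *dsets   = malloc(n * sizeof(hid_t));
    hid_t *mtypes  = malloc(n * sizeof(hid_t));
    hid_t *mspaces = malloc(n * sizeof(hid_t));
    hid_t *fspaces = malloc(n * sizeof(hid_t));

    for (size_t i = 0; i < n; i++) {
        dsets[i]   = H5Dopen2(file, names[i], H5P_DEFAULT); /* same list on all ranks */
        mtypes[i]  = H5T_NATIVE_DOUBLE;
        fspaces[i] = H5Dget_space(dsets[i]);
        mspaces[i] = H5Scopy(fspaces[i]);
        if (!need[i]) {                    /* this rank contributes 0 elements */
            H5Sselect_none(fspaces[i]);
            H5Sselect_none(mspaces[i]);
        }
    }

    H5Dread_multi(n, dsets, mtypes, mspaces, fspaces, dxpl_collective, bufs);

    /* ... close dataspaces and datasets, free the arrays ... */
}
```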

Thank you for your time and help.

Here’s one of the classical references on the matter, “Combining I/O Operations for Multiple Array Variables in Parallel NetCDF”, for some background.

I think this is an assumption to simplify the implementation. Since you can combine multiple selections (via union, etc.), why would you have two separate selections for the same dataset?

This is just a restatement of “collectively,” which applies everywhere.

Yes, but given your use case “that each MPI rank wants to read mostly distinct datasets,” the construction of the handle arrays, etc., would be rather tedious, as you’ve pointed out.

No, they are library instance (process) local.

I can’t answer that question, and others should chime in.

G.

From a performance perspective, your best bet is to have global points and structure datasets plus descriptors (offsets and ranges). You can dress that up w/ a navigational structure for situations where people expect to find groups and small (virtual) datasets, but you want to keep those link chases out of the I/O path. With the descriptors in memory (and the mapping to path names), you can at least sort and fuse the accesses on a per-rank basis or implement a fancier scheme where you might do a global sort, etc.
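The per-rank sort-and-fuse over those descriptors is just a few lines; something along these lines (the extent struct and the gap threshold are illustrative, not part of any HDF5 API):

```c
/* Illustrative per-rank "sort and fuse" over (offset, size) descriptors.
 * The extent type and gap threshold are assumptions for the sketch. */
#include <stdlib.h>
#include <stdint.h>

typedef struct { uint64_t offset; uint64_t size; } extent_t;

static int cmp_offset(const void *a, const void *b)
{
    const extent_t *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sort extents by file offset and merge neighbours that overlap or are
   separated by at most `gap` bytes; returns the fused count. */
size_t sort_and_fuse(extent_t *e, size_t n, uint64_t gap)
{
    if (n == 0) return 0;
    qsort(e, n, sizeof *e, cmp_offset);

    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        uint64_t end_prev = e[out].offset + e[out].size;
        uint64_t end_cur  = e[i].offset + e[i].size;
        if (e[i].offset <= end_prev + gap) {      /* overlap or small gap: fuse */
            if (end_cur > end_prev)
                e[out].size = end_cur - e[out].offset;
        } else {
            e[++out] = e[i];                      /* start a new fused extent */
        }
    }
    return out + 1;
}
```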

G.

Thank you for both responses. We’ll certainly try the global structure, independent of the discussion here, precisely because it will allow the kind of optimizations you’re describing.

What prompted the question is that, in my mental image of H5Dread_multi, we’ll reach a stage where each MPI rank has a list of (offset, size) pairs to read, all referring to the same file. Note that at this stage notions such as datasets have been lost. The kinds of optimization we’re talking about could conceptually also be applied to this list.
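At the MPI-I/O level, that mental image corresponds to something roughly like the following. This is only a sketch of the concept, not a claim about what HDF5 actually does internally; the function and its parameters are made up:

```c
/* Sketch of the mental model at the MPI-I/O layer (not HDF5 internals):
 * each rank turns its sorted (offset, length) list into a derived datatype
 * and issues one collective read. Ranks may pass different or empty lists. */
#include <mpi.h>

int read_extents(MPI_File fh, int n, const int lengths[],
                 const MPI_Aint offsets[], void *buf, int total_bytes)
{
    MPI_Datatype ftype;
    MPI_Type_create_hindexed(n, lengths, offsets, MPI_BYTE, &ftype);
    MPI_Type_commit(&ftype);

    /* The file view exposes only this rank's byte ranges. */
    MPI_File_set_view(fh, 0, MPI_BYTE, ftype, "native", MPI_INFO_NULL);
    int rc = MPI_File_read_all(fh, buf, total_bytes, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_Type_free(&ftype);
    return rc;
}
```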

Given the restrictions imposed by HDF5 and your response, I suspect there’s an important flaw in my mental image of how HDF5 or MPI-I/O works internally.

Once again, thank you for the responses. They help us understand how we’d use the feature.