VDS dataset mapping to multiple files in SWMR mode

Hi, I have done some initial testing and posted a gist of the code here: Parallel data writing a · GitHub
I also made a Stack Overflow post about this here: python - Dynamically update h5py VDS from data written to in parallel -- multiple writer multiple reader case - Stack Overflow

We have a process that creates multiple HDF5 files, one per task; each file contains the same datasets, and at the end of processing the files are joined by index reference into a single VDS file. We do not know ahead of time what the final extent along the growing axis will be. We were looking into doing this in SWMR mode so that we can read the datasets while they are being written. This is possible, but only by introducing a dummy dimension whose length equals the number of part files, which then has to be flattened after reading to produce a "joined" dataset (see the sketch below). The main problem is that you cannot have multiple virtual data sources growing the dataset along the same dimension.
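Roughly, the workaround looks like this; a minimal sketch, assuming each part file source_<i>.h5 holds a 1-D, unlimited dataset named test_data (the file and dataset names, and the exact layout construction, follow my gist only loosely):

    import h5py
    import numpy as np

    num_files = 4
    UNLIMITED = h5py.UNLIMITED

    # One row per part file; only the second axis can grow.
    vds_layout = h5py.VirtualLayout(shape=(num_files, 0), dtype='f8',
                                    maxshape=(num_files, None))

    for f_cnt in range(num_files):
        with h5py.File(f'source_{f_cnt}.h5', 'r') as h5sf:
            v_source = h5py.VirtualSource(h5sf['test_data'])
            # Each source file grows its own row of the 2-D VDS.
            vds_layout[f_cnt, :UNLIMITED] = v_source[:UNLIMITED]

    with h5py.File('vds.h5', 'w', libver='latest') as h5f:
        h5f.create_virtual_dataset('test_data', vds_layout, fillvalue=np.nan)

    # Read side: flatten the dummy per-file axis by hand. Rows shorter
    # than the longest one come back padded with the fill value.
    with h5py.File('vds.h5', 'r', libver='latest') as h5f:
        joined = h5f['test_data'][:].reshape(-1)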

I would love to be corrected if I am missing something! Otherwise, I would just like to post this here as a desire from the community. Thank you for all of your hard work.

You can have multiple unlimited source selections in the same VDS. From reading the StackOverflow link, it looks like the problem is that you're not seeing the VDS dimension adjust to match the extended source datasets? For this to happen you currently need to call H5Dget_space(). I am not an h5py expert, so I am not sure whether the line

    data = ds[:]

will trigger a call to H5Dget_space(). Perhaps you need to check ds.shape? Again, I don't know whether h5py will just return a cached extent or not.

There is certainly an argument that this functionality should be added to H5Drefresh(). I'll have to think about whether that should be done. However, it seems that the user/app/h5py will need to call H5Dget_space() either way to get the new extent, so maybe h5py needs to call H5Dget_space() in the dataset refresh() method?
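In h5py terms, the reader loop I have in mind would be something like this; an untested sketch, assuming the virtual file is vds.h5 with a dataset test_data:

    import time
    import h5py

    with h5py.File('vds.h5', 'r', libver='latest') as h5f:
        ds = h5f['test_data']
        while True:
            ds.refresh()     # H5Drefresh(); also drops h5py's cached metadata
            print(ds.shape)  # whether this re-queries the extent is the open question
            time.sleep(1)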


Neil, thank you for your response! More specifically, the problem is here:

    for f_cnt in range(num_files):
        # Define virtual sources and map them into the layout
        with h5py.File(f'source_{f_cnt}.h5', 'r') as h5sf:
            ds = h5sf['test_data']
            v_source = h5py.VirtualSource(ds)
            vds_layout[f_cnt, :UNLIMITED] = v_source[:UNLIMITED]

And even more specifically, the problem is this line:

    vds_layout[f_cnt, :UNLIMITED] = v_source[:UNLIMITED]

You can see in both the Stack Overflow solution and in my gist that, as far as I can tell, I am forced to have an axis strictly tied to the number of part files. What would be ideal is a single unlimited axis that can map to multiple virtual sources. You can map multiple virtual sources to different slices along the same axis, but I believe you have to know the exact indices ahead of time to do that (see the sketch just below).
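For contrast, here is the fixed-index version that does work when every per-file length is already known (the sizes below are placeholders):

    # Works only because each file's final length is known up front.
    file_sizes = [1000, 1000, 1000]  # placeholder lengths
    sine_layout = h5py.VirtualLayout(shape=(sum(file_sizes),), dtype='f8')

    current_offset = 0
    for f_cnt, size in enumerate(file_sizes):
        source = h5py.VirtualSource(f'output_{f_cnt}.h5', 'sine_data',
                                    shape=(size,))
        sine_layout[current_offset:current_offset + size] = source
        current_offset += size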

By contrast, here is an example of what the desired unlimited version might look like:

    for f_cnt in range(num_files):
        with h5py.File(f'output_{f_cnt}.h5', 'r', libver="latest", swmr=True) as h5sf:
            sine_ds = h5sf['sine_data']
            sine_source = h5py.VirtualSource(sine_ds)

            # Map this file's unlimited data to a slice starting at current_offset
            sine_layout[current_offset:current_offset + UNLIMITED] = sine_source[:UNLIMITED]

            # Update offset for next file
            current_offset += file_sizes[f_cnt]

When I try this, I get:

    ValueError: Invalid mapping selections (virtual and source space selections have different numbers of elements)

My guess is that the error occurs because I am trying to map UNLIMITED elements starting at a specific offset, and the VDS has no way to resolve where one unlimited mapping ends and the next begins.

Are you trying to create a 1-D VDS with multiple unlimited mappings?

Yes! Multiple virtual sources for a single axis that can grow indefinitely. It would be like an MWMR (multiple writer, multiple reader) mode for the VDS. The join logic would have to happen behind the scenes on the HDF5 side, so that we would no longer need a dimension strictly tied to the number of tasks/part files. This would be very useful for high-throughput data acquisition, where multiple processes write chunks to separate files but consumers want to see a single continuous data stream without implementing their own concatenation logic.

So the library would automatically adjust the start point of each mapping when a mapping before it changes size?