Independent datasets for MPI processes. Progress?


#1

TL;DR

A previous post mentioned that it may become possible to have independent datasets for independent processes, reducing the need for collective calls, such as when extending a dataset.

That post is now 10 years old. Has anything changed on this front? The link in the previous post no longer works.

Full story

I have two separate processes generating data: one rapidly generates values from an oscilloscope, the other generates larger data more slowly from a camera. They therefore loop at different speeds.

I would like to write the data from each process into its own dataset, which includes extending the dataset, and extending is a collective call. Since the two processes never touch each other’s dataset, can this be made independent?

Mainly this is desired to save time by preventing the processes from waiting for each other.


#2

The behaviour didn’t change with HDF5 v1.10.6; here is a test case using H5CPP and a combination of C API calls.

Did you consider using ZeroMQ to collect your events and record them? If you are interested, let me know…
best: steve

MWE: mpi-extend

Creates a container and rank-many datasets with random sizes using collective calls. Once the rig is set up, by default it extends a single dataset with a collective call to demonstrate fitness of the rig, but not the actual problem. By adjusting/uncommenting the lines marked with NOTE: one can trigger the code relevant to the OP’s question.

identification:

The MWE indeed shows that with PHDF5 v1.10.6 all processes must participate in the H5Dset_extent call; otherwise the program hangs indefinitely.

output:

mpic++ -o mpi-extend.o   -std=c++17 -O3  -I/usr/local/include -Wno-narrowing -c mpi-extend.cpp
mpic++ mpi-extend.o -lz -ldl -lm  -lhdf5 -o mpi-extend	
srun -n 4 -w io ./mpi-extend
[rank]	2	[total elements]	0
[dimensions]	current: {346,0}	maximum: {346,inf}
[selection]	start: {0,0}	end:{345,inf}
[rank]	2	[total elements]	0
[dimensions]	current: {346,0}	maximum: {346,inf}
[selection]	start: {0,0}	end:{345,inf}
[rank]	2	[total elements]	0
[dimensions]	current: {346,0}	maximum: {346,inf}
[selection]	start: {0,0}	end:{345,inf}
[rank]	2	[total elements]	0
[dimensions]	current: {346,0}	maximum: {346,inf}
[selection]	start: {0,0}	end:{345,inf}
{346,0}{346,0}{346,0}{346,0}
h5ls -r mpi-extend.h5
/                        Group
/io-00                   Dataset {346, 400/Inf}
/io-01                   Dataset {465, 0/Inf}
/io-02                   Dataset {136, 0/Inf}
/io-03                   Dataset {661, 0/Inf}

workarounds:

  • ZeroMQ based solution with a single writer thread

requirements:

  • h5cpp v1.10.6-1
  • PHDF5 C base library; no high-level API or built-in C++ API is needed (works with: parallel)
  • C++17 or later compiler

#3

I’m not sure I understand the Full story: when you refer to your two data-generating processes, do you mean processes in general, MPI processes, separate MPI applications, or something else?

Are the processes independent in the sense that no synchronization is needed?

OK, they acquire data at different rates, and you would like to put the data into different data sets.
What kind of data rates are you looking at, and what’s the connectivity? Are the devices hooked up to the same host? Local storage or NAS?

Dataset extension a la H5Dset_extent is collective, provided we are using MPI. But then we are jumping to threads:

Threads, processes,…, which one?

If they are independent, why not write the datasets to different HDF5 files and then stitch them together with a stub file that contains just two external links? Or you can merge/repack if you need a single physical file (and compress or make the datasets contiguous, depending on what you are looking for). If you don’t like the (temporary) duplication, then Steven’s ZMQ suggestion is a fine option.

If you have time, this sounds like another good topic for our HDF Clinic (next Tuesday).

G.


#4

I am talking about MPI processes. Please ignore my slip-up mentioning threads.

Here is a flow chart of the kind of design I have in mind:

Each dataset is written to by a single process, even if the file is shared by both processes.

Combined-1


#5

Nice figure! (yEd?) Assuming I understand it correctly, my main concern would be the tight coupling between the two processes. (You were planning to use the MPI-IO VFD, right?) For example, at the moment, the dataset extension must be a collective operation. In other words, the ‘Oscilloscope’ rank(s) must participate in the extension of the ‘Cam Data’ dataset and the ‘Camera 1’ rank(s) must participate in the extension of the ‘Scope Data’ dataset, at least as long as those datasets reside in the same file.

You could get around this by placing those datasets into separate HDF5 files and then just have two external links in ‘Rec1.h5’. In that case, you wouldn’t need the MPI-IO VFD, unless there were multiple ‘oscilloscope’ or ‘camera’ ranks. In that case, you should consider opening those files on separate MPI communicators. Does that make sense? G.


#6

Thanks @gheber , all made on draw.io.

MPI-IO VFD: Yes (I think…?) - I have compiled the parallel version of HDF5 using MS-MPI.

Using a file per process

  1. If we were to use a separate file per process, would we still need a PHDF5 build, or would a standard HDF5 build do? Assuming we want a driver per process, how do we make sure there is no interference…?

File structure

We are planning to do a lot of sequential recordings and then access them all through virtual datasets (VDS).

My main concern was keeping the directory as clean as possible and making it easy to access recordings individually; hence my desire to use a single file.

External Links

I think external links might be a solution, or part of one. We could still access each recording layer via a single file.

  1. Would we still have two separate files? I take it we would end up with a structure like the one below:

ExternalLinks

and so a single entry point using Virtual files and External Paths would look something like:

ExternalLinksVirtual

Virtual datasets

  1. However, could we just use virtual datasets generated after the recording sequence? Doubling the number of files in the directory isn’t so bad… and it looks like it’s happening either way.

External Links vs Virtual datasets?

Not that it’s a competition…

  1. Apart from the obvious lack of unified datasets, are there any inherent advantages/disadvantages to using external links rather than virtual datasets directly?

    a. Do either of these options affect I/O speed for chunked datasets?
    b. How will either of these fail?

    • When the file is missing.
    • When a file exists but is corrupted.

Specifically, will this completely break (and require restructuring to be usable without the missing/corrupted file)?


#7

Yes, no problem. You can use MPI, and even MPI-IO (without HDF5), and still use the sequential HDF5 VFDs (POSIX, core, etc.). File per process is fine. Unless you are using the MPI-IO VFD, you can just go with the non-parallel build.


#8

I’m glad to see you are taking the time to plan and think these things through. You clearly have a data life cycle in mind (planning, acquisition, pre-processing, analysis, …) and, as you’ve seen already, different stages often have competing requirements. Keep in mind that HDF5 alone can’t solve all problems for you.

Keeping the directory as clean as possible is a good goal, but does that mean “at all times” and “at all costs,” or “eventually”? Planning for how you will manage and evolve your assets is part of the life cycle. Sometimes a simple tool such as h5repack does the trick; sometimes you’ll need more than that.


#9

I’m not sure I understand your concept of “virtual files.” Do you mean virtual datasets (VDS)? (VDS is the ability to have datasets that are backed by dataset elements of a compatible type stored in other datasets, including datasets in other files.) Assuming that’s the case, the differences would be mostly in how you locate and access the data. With external links you can quickly locate individual recordings by following the link structure. With VDS, since you are dealing with combined datasets, you’d need a separate look-up structure to tell you which hyperslab or selection represents a particular recording. (You can query the VDS, but a separate lookup might be more convenient.) On the upside, reading across recordings, if that were a use case for you, would be very straightforward with VDS (single read), but a little more cumbersome with external links (iterations + read ops).

  1. I wouldn’t expect a huge performance difference. Access to the dataset elements (in chunks) is the same in both cases; any difference would have to come from the underlying metadata operations. Maybe link traversals and opening external files might be a tad slower than accessing VDS metadata, but that metadata will be in cache (eventually). You can whip up a quick benchmark if you really want to know. I think convenience in supporting your use case(s) is more important for now.
  2. VDS might be a little more forgiving with missing data. Corrupted files will be an issue either way. I think defensive programming and proper error handling will go a long way toward creating a robust solution, i.e., one that doesn’t fall apart when it encounters a missing or corrupt file.

G.


#10

Summary:

shared file - a file opened across multiple processes using PHDF5.
(not sure this is the correct terminology, but this is what I mean)

Answer

  • Any modification to the structure of a shared file must be a collective call made by all processes.
    • Having several processes modify datasets independently within a shared file is not possible with PHDF5.
    • There are no plans to implement this functionality.

The real question appears to be: why do you need to do this?

For me the answer was tidiness and ease of access, for post-processing/data analytics.

In which case: do we need it to be tidy all the time, or can we tidy up later?

Potential solutions/workarounds

Clean up afterwards

  • Create a file per process and unify the data afterwards, either via:
    • VDS: virtual datasets (access all the datasets as one continuous dataset)
    • external links (access a single dataset in one file from another)
    • the h5repack command-line tool, to copy HDF5 files and change chunking/compression
    • repacking yourself, for more complex operations (merging datasets/files, post-processing, …)

Independent processes, single file

  • Collect data and have a single writer thread/process
    • MPI - have each process send data to the writer process using MPI_Send
      • for large data this is slow
      • communication with each extra process is done individually: more operations, less speed… (unless some tree-structured collection is used)
      • MPI with windowed (shared) memory - share memory between processes to eliminate the redundant copying/sending of MPI_Send (only for local memory on individual nodes?)
      • MPI is designed for computer clusters: good comms between nodes, etc.
    • ZeroMQ - an asynchronous messaging library, essentially an alternative to MPI
      • collect your events and record them
        • better suited to this than MPI? it is asynchronous
      • designed for distributed computing:
        • also has much better fault tolerance (errors do not cause system-wide failure by default…)

Outcome

In the name of KISS I have decided to go with the clean-up option, particularly since some post-processing is going to be required anyway…

Thanks to @gheber and @steven for their help!