A previous post mentioned that it may become possible to have independent datasets for independent processes, reducing the need for collective calls such as when extending a dataset.
That post is now 10 years old. Has anything changed on this front? The link in the previous post is no longer working.
I have two separate processes generating data: one generates values rapidly from an oscilloscope, the other generates larger data more slowly from a camera. They therefore loop at different speeds.
I would like to write the data from each process into its own dataset, which includes extending the dataset, a collective call. Since the two processes do not interact with each other’s dataset, can this be made independent?
Mainly this is desired to save time by preventing the processes from waiting for each other.
Did you consider using ZeroMQ to collect your events and record them? If you are interested, let me know…
The MWE creates a container, and each rank creates many datasets with random sizes using collective calls. Once the rig is set up, by default it extends a single dataset with a collective call to demonstrate the fitness of the rig, but not the actual problem. By adjusting/uncommenting the lines marked with NOTE: one can trigger the code relevant to the OP’s question.
The MWE indeed shows that, for parallel HDF5 v1.10.6, all processes must participate in the H5Dset_extent call; otherwise the program will hang indefinitely.
I’m not sure I understand the full story. When you say
do you mean processes in general, or MPI processes, separate MPI applications, or something else?
Are the processes independent in the sense that no synchronization is needed?
OK, they acquire data at different rates, and you would like to put the data into different data sets.
What kind of data rates are you looking at, and what’s the connectivity? Are the devices hooked up to the same host? Local storage or NAS?
Dataset extension a la H5Dset_extent is collective, provided we are using MPI. But then we are jumping to threads:
Threads, processes,…, which one?
If they are independent, why not write the datasets into different HDF5 files and then stitch them together with a stub file that contains just two external links? Or you can just merge/repack, if you need a single physical file (and compress or make the datasets contiguous, depending on what you are looking for). If you don’t like the (temporary) duplication, then Steven’s ZMQ suggestion is a fine option.
If you have time, this sounds like another good topic for our HDF Clinic (next Tuesday).
Nice figure! (yEd?) Assuming I understand it correctly, my main concern would be the tight coupling between the two processes. (You were planning to use the MPI-IO VFD, right?) For example, at the moment, the dataset extension must be a collective operation. In other words, the ‘Oscilloscope’ rank(s) must participate in the extension of the ‘Cam Data’ dataset and the ‘Camera 1’ rank(s) must participate in the extension of the ‘Scope Data’ dataset, at least as long as those datasets reside in the same file. You could get around this by placing those datasets into separate HDF5 files and then just have two external links in ‘Rec1.h5’. In that case, you wouldn’t need the MPI-IO VFD, unless there were multiple ‘oscilloscope’ or ‘camera’ ranks. In that case, you should consider opening those files on separate MPI communicators. Does that make sense? G.
MPI-IO VFD: Yes (I think…?) I have compiled the parallel version of HDF5 using MS MPI.
Using a file per process
If we were to use a separate file per process, would we still need a PHDF5 build or would we want a standard HDF5 build? Assuming we want a driver per process, how do we make sure there is no interference…?
We are planning to do a lot of sequential recordings and then access them all through virtual datasets (VDS).
My main concern was keeping the directory as clean as possible and making it easy to access recordings individually, hence my desire to use a single file.
I think external links might be a solution, or part of one. We could still access each recording layer via a single file.
Would we still have two separate files? I take it we would end up with a structure like below:
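Presumably, per recording, something like this (names illustrative):

```
Rec1_scope.h5    <- written by the oscilloscope process
Rec1_camera.h5   <- written by the camera process
Rec1.h5          <- stub containing two external links
```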
and so a single entry point using Virtual files and External Paths would look something like:
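Perhaps along these lines (all names illustrative):

```
Rec1.h5                                      <- single entry point
 ├── scope   -> Rec1_scope.h5:/scope_data    (external link)
 └── camera  -> Rec1_camera.h5:/cam_data     (external link)
```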
However, could we just use a virtual dataset generated after the recording sequence? Doubling the number of files in the directory isn’t so bad… and it looks like it’s happening either way.
Yes, no problem. You can use MPI and even MPI-IO (without HDF5) and use the sequential HDF5 VFDs (POSIX, core, etc.). File per process is fine. Unless you are using the MPI-IO VFD, you can just go with the non-parallel build.
I’m glad to see you are taking the time to plan and think these things through. You clearly have a data life cycle in mind (planning, acquisition, pre-processing, analysis, …) and, as you’ve seen already, different stages often have competing requirements. Keep in mind that HDF5 alone can’t solve all problems for you.
To keep the directory as clean as possible is a good goal, but does that mean “at all times” and “at all cost,” or “eventually”? Planning for how you will manage and evolve your assets is part of the life cycle. Sometimes a simple tool such as h5repack does the trick; sometimes you’ll need more than that.
I’m not sure I understand your concept of “virtual files.” Do you mean virtual datasets (VDS)? (VDS is the ability to have datasets that are backed by dataset elements of a compatible type stored in other datasets, including datasets in other files.) Assuming that’s the case, the differences would be mostly in how you locate and access the data. With external links you can quickly locate individual recordings by following the link structure. With VDS, since you are dealing with combined datasets, you’d need a separate look-up structure to tell you which hyperslab or selection represents a particular recording. (You can query the VDS, but a separate lookup might be more convenient.) On the upside, reading across recordings, if that were a use case for you, would be very straightforward with VDS (single read), but a little more cumbersome with external links (iterations + read ops).
I wouldn’t expect a huge performance difference. Access to the dataset elements (in chunks) is the same in both cases; any difference would have to come from the underlying metadata operations. Maybe link traversals and opening external files might be a tad slower than accessing VDS metadata, but that metadata will be in cache (eventually). You can whip up a quick benchmark if you really wanna know. I think convenience in supporting your use case(s) is more important for now.
VDS might be a little more forgiving with missing data. Corrupted files will be an issue either way. I think defensive programming and proper error handling will go a long way toward creating a robust solution, i.e., one that doesn’t fall apart when it encounters a missing or corrupt file.