For a data processing application, I was considering the use of multiple readers on a single HDF5 file, plus a single writer appending to the same file. The readers and writer are separate processes (e.g., via the Python multiprocessing package). The readers read from existing datasets that do not change, perform computations, and notify the writer of new data; the writer writes the new data to a separate dataset that is not being read by any other process.
For this situation, do I absolutely need SWMR, or is it sufficient to disable file locking (i.e., HDF5_USE_FILE_LOCKING=FALSE) and run without SWMR?
I suspect not using SWMR could be dangerous, but I’ll let someone who is more familiar with the HDF5 library’s internals give a definitive answer.
If you would like more flexibility with multi-process applications, you might want to look into HSDS. It supports not just the SWMR pattern; having multiple writers is no problem either.
I think it should work as long as you’re very careful that the writer process does not modify anything needed by the readers. Make sure to flush the file with H5Fflush (or the h5py equivalent) before the readers open it. As you mentioned, the writer should only write to its dataset and perform no other operations on the file, and likewise the readers should only read from theirs and perform no other operations on the file. This is not an officially supported usage, so if it doesn’t work for whatever reason, it wouldn’t be considered a bug.
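To make that concrete, here is a rough sketch of the pattern, assuming file locking is disabled via the environment variable from your question and that the resizable writer dataset already exists; the file and dataset names are just placeholders:

```python
import os
# Assumption: disable file locking before h5py/HDF5 is loaded,
# as in the HDF5_USE_FILE_LOCKING=FALSE idea from the question.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py
import numpy as np

FILE = "data.h5"         # placeholder file name
WRITE_DSET = "results"   # dataset only the writer ever touches
READ_DSET = "inputs"     # pre-existing, static dataset the readers use

def writer_append(chunk):
    """Append a chunk to the writer-only dataset and flush it to disk."""
    with h5py.File(FILE, "a") as f:
        dset = f[WRITE_DSET]              # created beforehand with maxshape=(None, ...)
        n = dset.shape[0]
        dset.resize(n + len(chunk), axis=0)
        dset[n:] = chunk
        f.flush()                         # h5py equivalent of H5Fflush

def reader_compute(start, stop):
    """Open read-only, read a slice of the static dataset, and compute on it."""
    with h5py.File(FILE, "r") as f:
        block = f[READ_DSET][start:stop]
    return float(np.sum(block))
```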
Thanks for your thoughts on this use-case! I see that HSDS supports multi-reader and multi-writer - I will definitely explore this option in the future. For now, we want to prove out file-based HDF5 for our data processing.
We will give this multi-reader/single-writer scenario a try even though it is not officially supported. We can also just enable SWMR. In the future, will the SWMR VFD work cover this scenario?
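For comparison, my understanding is that the SWMR route with h5py looks roughly like the sketch below (placeholder names; the writer must use libver='latest' and enter SWMR mode only after all objects have been created):

```python
import h5py
import numpy as np

# --- writer process (sketch) ---
f = h5py.File("data.h5", "w", libver="latest")
dset = f.create_dataset("results", shape=(0,), maxshape=(None,), dtype="f8")
f.swmr_mode = True                    # no new objects may be created after this point
for chunk in (np.random.rand(100) for _ in range(10)):
    n = dset.shape[0]
    dset.resize(n + len(chunk), axis=0)
    dset[n:] = chunk
    dset.flush()                      # makes the new rows visible to SWMR readers
f.close()

# --- reader process (sketch) ---
f = h5py.File("data.h5", "r", libver="latest", swmr=True)
dset = f["results"]
dset.refresh()                        # pick up rows appended since the file was opened
print(dset.shape)
f.close()
```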
Hi, @alan113696 !
Would you please explain how you are going to implement this part in your workflow?
What hardware (e.g., i386, amd64, arm64) / OS / compiler do you use for your workflow?
We will be using Python 3.12 and h5py on RHEL 8, amd64. We plan to use Python multiprocessing to parallelize the computations, with shared memory to share buffers. Communication between processes will use queues. We might also look at Dask to see whether it can reduce complexity and improve performance.
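Roughly, the layout we have in mind looks like the sketch below (names are placeholders, the shared-memory buffers are omitted for brevity, the "inputs" dataset is assumed to exist already, and file locking is assumed to be disabled as discussed above):

```python
import multiprocessing as mp

import h5py
import numpy as np

FILE = "data.h5"   # placeholder

def reader_worker(task_q, result_q):
    """Read a slice of the static dataset, compute, and hand the result to the writer."""
    for start, stop in iter(task_q.get, None):      # None is the shutdown sentinel
        with h5py.File(FILE, "r") as f:
            block = f["inputs"][start:stop]
        result_q.put(block.mean(axis=0))

def writer_process(result_q, n_results):
    """Single writer: append each result to a dataset no other process touches."""
    with h5py.File(FILE, "a") as f:
        dset = f.create_dataset("results", shape=(0, 8), maxshape=(None, 8), dtype="f8")
        for _ in range(n_results):
            row = result_q.get()
            n = dset.shape[0]
            dset.resize(n + 1, axis=0)
            dset[n] = row
            f.flush()

if __name__ == "__main__":
    task_q, result_q = mp.Queue(), mp.Queue()
    tasks = [(i * 100, (i + 1) * 100) for i in range(8)]
    for t in tasks:
        task_q.put(t)
    readers = [mp.Process(target=reader_worker, args=(task_q, result_q)) for _ in range(4)]
    writer = mp.Process(target=writer_process, args=(result_q, len(tasks)))
    for p in readers + [writer]:
        p.start()
    for _ in readers:
        task_q.put(None)               # one sentinel per reader
    for p in readers + [writer]:
        p.join()
```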
Hi, @alan113696 !
Thank you for sharing your plan!
I think this document has very good advice, if you haven’t read it yet:
Multiprocessing package - torch.multiprocessing — PyTorch 2.6 documentation
Also, please let me know if your data processing application is part of an AI system, because it could be a good use case for the IOWarp project.
Best regards,