HDF5 and parallelism with fork(2)


#1

Hello!

In https://github.com/h5py/h5py/issues/934, a user opens an HDF5 file read-only, forks, and then accesses the data concurrently from the child processes.
Unfortunately, this doesn’t work reliably, because the lseek+read combinations used in HDF5 are not atomic: the forked processes inherit a shared file offset, so concurrent seeks and reads can interleave. Still, fork followed by concurrent reads seems like an appealing and easy-to-use paradigm, free of complicated MPI or other kinds of inter-process coordination.

But what if the HDF5 library used the atomic
https://linux.die.net/man/3/pread
(and maybe https://linux.die.net/man/3/pwrite, for consistency)
whenever this interface is available?
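
To make the race concrete, here is a minimal standalone sketch (plain POSIX, not actual HDF5 code; the file name is made up). After fork() the parent and child share one file offset, so an lseek()+read() pair in one process can be interleaved by the other, whereas pread() passes the offset explicitly and leaves the shared offset untouched:

```c
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.h5", O_RDONLY);   /* hypothetical file name */
    if (fd < 0)
        return 1;

    if (fork() == 0) {
        char buf[4096];

        /* Racy after fork(): the other process may move the shared
         * file offset between these two calls. */
        lseek(fd, 1 << 20, SEEK_SET);
        read(fd, buf, sizeof buf);

        /* Atomic alternative: the offset is an explicit argument and
         * the shared file offset is never touched. */
        pread(fd, buf, sizeof buf, 1 << 20);
        _exit(0);
    }

    close(fd);
    return 0;
}
```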

Maybe that should be submitted as a low-urgency/experimental improvement suggestion?

Best wishes in 2019,
Andrey Paramonov


#2

While the use of pread and pwrite may make some sense, I am just curious whether the solution to “…opens HDF5 file read-only, forks, and starts access…” is simply to change the order of operations to “…forks, opens HDF5 file read-only and starts accessing…”? I mean, why is it so important for an HDF5 file handle to maintain consistency across forks when the underlying standard interfaces (fopen/fread/fwrite/fclose or open/read/write/close) don’t do that either?

Do we know if there are any performance implications of pread/pwrite vs. read/write? Do we know if most implementations of pread/pwrite just turn around and use read/write? It seems like reducing two system calls (e.g. seek followed by read) to one (pread) would be a performance benefit. But, I honestly don’t know.
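
For what it’s worth, the reordering I have in mind would look roughly like this sketch against the public HDF5 C API (file name and error handling are illustrative only): fork first, then let each child open its own handle, so no HDF5 library state crosses the fork:

```c
#include <unistd.h>
#include <sys/wait.h>
#include <hdf5.h>

int main(void)
{
    for (long i = 0; i < 4; i++) {
        if (fork() == 0) {
            /* Each child gets its own, independent HDF5 handle. */
            hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
            /* ... H5Dopen2()/H5Dread() as usual ... */
            H5Fclose(file);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)   /* parent waits for all children */
        ;
    return 0;
}
```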


#3

Hello,

And Happy New Year!

> While the use of pread and pwrite may make some sense, I am just curious
> whether the solution to “…opens HDF5 file read-only, forks, and starts
> access…” is simply to change the order of operations to “…forks, opens
> HDF5 file read-only and starts accessing…”?

Personally I can only speculate, but it seems reasonable when the operation
in a forked process happens conditionally, depending on other content of the
HDF5 file. In that case, re-opening the file may incur a performance penalty
and sub-optimal caching; theoretically, forking with the file already open
should be faster on modern OSes.
Hopefully the original h5py issue submitter chimes in!

> I mean, why is it so important for an HDF5 file handle to maintain
> consistency across forks when the underlying standard interfaces
> (fopen/fread/fwrite/fclose or open/read/write/close) don’t do that
> either?

You are right that lseek+read is a standard interface, but pread is
a standard interface as well :wink: HDF5 could use the latter to inherit
its atomicity.

> Do we know if there are any performance implications of pread/pwrite
> vs. read/write? Do we know if most implementations of pread/pwrite just
> turn around and use read/write? It seems like reducing two system calls
> (e.g. seek followed by read) to one (pread) would be a performance
> benefit. But, I honestly don’t know.

I believe it wouldn’t be any slower, only a bit less portable.
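
If someone wanted to check, a crude micro-benchmark along these lines would do (this assumes a Linux/POSIX system and an existing test file; the file name and block size are made up, and the page cache will dominate the numbers):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    int fd = open("testfile.bin", O_RDONLY);   /* hypothetical test file */
    if (fd < 0)
        return 1;
    off_t size = lseek(fd, 0, SEEK_END);
    if (size < 4096)
        return 1;

    char buf[4096];
    const int n = 100000;

    /* N random 4 KiB reads with lseek()+read() ... */
    double t0 = now();
    for (int i = 0; i < n; i++) {
        off_t off = (rand() % (size / 4096)) * 4096;
        lseek(fd, off, SEEK_SET);
        read(fd, buf, sizeof buf);
    }
    /* ... and the same number of random reads with pread(). */
    double t1 = now();
    for (int i = 0; i < n; i++) {
        off_t off = (rand() % (size / 4096)) * 4096;
        pread(fd, buf, sizeof buf, off);
    }
    double t2 = now();

    printf("lseek+read: %.3f s, pread: %.3f s\n", t1 - t0, t2 - t1);
    close(fd);
    return 0;
}
```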

Best wishes,
Andrey Paramonov


#4

pread is meant to give multi-threaded programs an atomic seek+read on a shared file descriptor. The same goes for pwrite.
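
For illustration, a minimal sketch of that multi-threaded case (the file name is hypothetical): several threads share one descriptor and call pread() with explicit offsets, so there is no shared-offset race and no locking is needed around the reads.

```c
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static int fd;   /* one descriptor shared by all threads */

static void *reader(void *arg)
{
    long idx = (long)arg;
    char buf[4096];

    /* Each thread reads its own 4 KiB block; the offset is an argument,
     * so concurrent calls do not interfere with each other. */
    pread(fd, buf, sizeof buf, idx * 4096L);
    return NULL;
}

int main(void)
{
    fd = open("data.h5", O_RDONLY);   /* hypothetical file name */
    if (fd < 0)
        return 1;

    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, reader, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);

    close(fd);
    return 0;
}
```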

Ger