Hi all,
I was wondering whether the HDF5 1.8.x branch is going to be kept going, or whether it is recommended to move to 1.10.x?
I'm asking because, as we all know, SWMR needs flock(), and SWMR cannot be disabled at compile time (I don't need it in my day-to-day use).
One of the clusters I run on has a Lustre file system. However, the admins have deemed file locking too expensive and have disabled it. Here's the mount information:
mds01ib@o2ib1:mds02ib@o2ib1:/scratch on /lustre/janus_scratch type lustre (rw,noauto,_netdev)
So when I run a very simple test that creates an HDF5 file with version 1.10.0 on this file system, it fails:
janus-compile1 ~$ ./test /lustre/janus_scratch/tibr1099/foo.h5
HDF5-DIAG: Error detected in HDF5 (1.10.0) thread 0:
#000: H5F.c line 491 in H5Fcreate(): unable to create file
major: File accessibilty
minor: Unable to open file
#001: H5Fint.c line 1168 in H5F_open(): unable to lock the file or initialize file structure
major: File accessibilty
minor: Unable to open file
#002: H5FD.c line 1821 in H5FD_lock(): driver lock request failed
major: Virtual File Layer
minor: Can't update object
#003: H5FDsec2.c line 939 in H5FD_sec2_lock(): unable to flock file, errno = 38, error message = 'Function not implemented'
major: File accessibilty
minor: Bad file ID accessed
Unable to open: /lustre/janus_scratch/tibr1099/foo.h5: -1
1
When I strace the program I see it's because flock() failed:
open("/lustre/janus_scratch/tibr1099/foo.h5", O_RDWR) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
close(3) = 0
open("/lustre/janus_scratch/tibr1099/foo.h5", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
flock(3, LOCK_EX|LOCK_NB) = -1 ENOSYS (Function not implemented)
close(3) = 0
By contrast, when I trace the same program built with version 1.8.15, no flock() call is made:
open("/lustre/janus_scratch/tibr1099/foo.h5", O_RDWR) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
close(3) = 0
open("/lustre/janus_scratch/tibr1099/foo.h5", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
brk(0x235a000) = 0x235a000
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f17252b8000
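The failing flock() call from the 1.10.0 trace can be reproduced without HDF5, which may be handy for checking other mount points. Here is a minimal sketch in Python, assuming a Linux system where the standard fcntl module is available; probe_flock is a hypothetical helper name of my own, not part of HDF5. It makes the same LOCK_EX|LOCK_NB request on a scratch file in the given directory and reports the errno name if the request is refused.

```python
import errno
import fcntl
import os
import tempfile


def probe_flock(directory):
    """Try flock(LOCK_EX | LOCK_NB) on a scratch file inside `directory`.

    Returns (True, None) if the lock was granted, otherwise
    (False, errno_name), e.g. (False, "ENOSYS") on a mount without flock.
    probe_flock is a hypothetical helper, not an HDF5 function.
    """
    fd, path = tempfile.mkstemp(prefix="flock_probe_", dir=directory)
    try:
        try:
            # Same request that appears in the 1.10.0 strace output.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True, None
    finally:
        # Always clean up the scratch file, whether or not the lock worked.
        os.close(fd)
        os.unlink(path)
```

Calling probe_flock("/lustre/janus_scratch/tibr1099") on the file system above should, if the sketch is right, report the same ENOSYS failure that strace shows.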
So my long-winded example leads to three questions.
1) Do other HPC sites enable flock() on Lustre? If so, is it only localflock, so as not to have the burden of a cluster-wide flock?
2) Is there a path forward for sites that don't enable flock?
3) Is there the opposite of H5Fstart_swmr_write?
Thanks!
Tim
test.f90 (1.12 KB)