Parallel HDF5 file locking error

I'm having some issues with file locking on a parallel filesystem, which I
believe are related to the problem described here:
https://lists.hdfgroup.org/pipermail/hdf-forum_lists.hdfgroup.org/2011-February/004254.html

I tried the suggestion of disabling romio_ds_read and romio_ds_write, but
this doesn't fix the problem. I've also tried setting H5Pset_sieve_buf_size
to 0 to disable data sieving in HDF5 itself, but that doesn't work, either.

Here's a short snippet of the relevant code:

hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "romio_ds_read", "disable");
MPI_Info_set(info, "romio_ds_write", "disable");
H5Pset_sieve_buf_size(fapl_id, 0);
H5Pset_fapl_mpio(fapl_id, comm, info);

hid_t f_id = H5Fcreate("test_file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

It's on this last line that the program hangs, and eventually MPI_ABORT
gets called. Note that I get the same problem regardless of whether I try
to create a new file or open an existing one:

hid_t f_id = H5Fopen("test_file.h5", H5F_ACC_RDWR, fapl_id);

Other relevant information:

1. Filesystem is NFS v3 with noac off (can't change)
2. Tested with HDF5 1.8.13 and 1.8.14
3. Tested with OpenMPI 1.6.5 and 1.10.2

What am I missing? I saw that there's an option to disable filesystem
atomicity, but this requires that the file already be opened, and I can't
even get that far.
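
For reference, I'm assuming the call in question is H5Fset_mpi_atomicity; it
only operates on a file handle that has already been opened, which is exactly
the step that hangs for me:

/* hypothetical usage: only possible once H5Fopen/H5Fcreate has succeeded */
hid_t f_id = H5Fopen("test_file.h5", H5F_ACC_RDWR, fapl_id);
H5Fset_mpi_atomicity(f_id, 0);   /* 0 = non-atomic (relaxed) access mode */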

I also know that my code does work on different computers with different
filesystems and/or mount options.

Hi Andrew,

Since it’s going to be hard to replicate this, could you provide more information by attaching a debugger and showing a backtrace of where the hang is?

Thanks,
Mohamad

Here's the trace I get:

#0 0x000000382f2dba48 in fcntl () from /lib64/libc.so.6
#1 0x00002aaab1138a58 in ADIOI_Set_lock () from mca_io_romio.so
#2 0x00002aaab111a7f2 in ADIOI_NFS_Fcntl () from mca_io_romio.so
#3 0x00002aaab1110b5a in mca_io_romio_dist_MPI_File_get_size () from mca_io_romio.so
#4 0x00002aaab110ecd2 in mca_io_romio_file_get_size () from mca_io_romio.so
#5 0x00002aaaab22e184 in PMPI_File_get_size () from libmpi.so.12
#6 0x00002aaaaabe5a8f in H5FD_mpio_open () from libhdf5_debug.so.9.0.0
#7 0x00002aaaaabce8ec in H5FD_open () from libhdf5_debug.so.9.0.0
#8 0x00002aaaaabb4969 in H5F_open () from libhdf5_debug.so.9.0.0
#9 0x00002aaaaabacedd in H5Fcreate () from libhdf5_debug.so.9.0.0
#10 0x000000000040a09a in main () at main.cpp:70

GDB for some reason isn't giving me any line number information for HDF5,
so I looked up the lines where these calls happen:

#6 H5FDmpio.c:1081
#7 H5FD.c:991

--
Andrew Ho

If you are stuck with NFS, then I would recommend against using parallel access to that file system. You only have one server anyway.

It's likely you are trying to develop on this system, then deploy somewhere else, right? But there's no tuning that can eliminate the file size check.

For this system you're probably better off without the MPI-IO transfer property.

==rob

Yes, I am developing on this system and deploying on a different one. I'm
not too interested in tuning on this system right now because, for my
current code, I/O time is significantly less than simulation time. However,
I am using multiple compute nodes.

If I open an H5 file with serial HDF5, can I write the metadata from one
node (setting up groups/attributes/datasets), and then later open the file
from multiple nodes and write only the actual data?

--
Andrew Ho

Yes, you should be able to do that, as long as you set the dataset allocation time to EARLY when creating the datasets in serial mode (the default is not EARLY in the non-parallel case):
https://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetAllocTime
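
EARLY allocation ensures the raw data space already exists in the file, so the
parallel writers never need to modify metadata. A rough sketch of the two phases
(the dataset name, dimensions, and buffer are placeholders; fapl_id is the
MPI-IO access property list from your earlier snippet):

/* Serial phase, one rank only: create groups/attributes/datasets with EARLY allocation */
hsize_t dims[1] = {1000};
hid_t file  = H5Fcreate("test_file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);   /* allocate file space at creation */
hid_t space = H5Screate_simple(1, dims, NULL);
hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);
H5Dclose(dset); H5Sclose(space); H5Pclose(dcpl); H5Fclose(file);

/* Parallel phase, all ranks: reopen with the MPI-IO fapl and write raw data only */
hid_t f_id = H5Fopen("test_file.h5", H5F_ACC_RDWR, fapl_id);
hid_t d_id = H5Dopen2(f_id, "data", H5P_DEFAULT);
/* each rank would normally select its own hyperslab of the file dataspace here */
H5Dwrite(d_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
H5Dclose(d_id); H5Fclose(f_id);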

While I don’t see any issues with the approach, it is not very well tested in HDF5, so if you find bugs please report them to our Helpdesk.

Thanks,
Mohamad
