ADIOI Lock problems on NFS and Panasas

Hello everyone,

We cannot use parallel HDF5 on any of our systems. The processes either crash or hang, even though the same codes work fine with sequential HDF5.

On NFS, we are getting:

ADIOI_Set_lock:: No locks available
ADIOI_Set_lock:offset 69744, length 256
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 124
File locking failed in ADIOI_Set_lock(fd 25,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 25.
If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).

On Panasas:

ADIOI_PANFS_RESIZE: Rank 13: Resize failed: requested=46996328 actual=9187464.

We are using Intel 12.1.4, mvapich 1.6 (tested with 1.8 and 1.9 as well), and HDF5 1.8.10.

Is this a known problem, and do you know of any workarounds that don't require turning off the parallel capabilities of HDF5?

Any suggestions you may have will be appreciated!

Thanks,
-Mehmet

On Fri, Apr 19, 2013 at 12:47:40PM -0400, Mehmet Belgin wrote:

> Hello everyone,
>
> We cannot use parallel HDF5 on any of our systems. The processes either crash or hang, even though the same codes work fine with sequential HDF5.
>
> On NFS, we are getting:
>
> ADIOI_Set_lock:: No locks available
> ADIOI_Set_lock:offset 69744, length 256
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 124
> File locking failed in ADIOI_Set_lock(fd 25,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 25.
> If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).

NFS is tricky to get right, and often requires turning off any and all
caching. Let's set the NFS issue aside for now.
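
(For reference, the mount setup that advisory message asks for would
look something like this /etc/fstab entry, where the server name and
paths are placeholders for whatever your site uses:

    nfsserver:/export/scratch  /scratch  nfs  vers=3,noac,hard  0 0

The 'noac' part is the "no attribute caching" option the message
mentions.)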

> On Panasas:
>
> ADIOI_PANFS_RESIZE: Rank 13: Resize failed: requested=46996328 actual=9187464.
>
> We are using Intel 12.1.4, mvapich 1.6 (tested with 1.8 and 1.9 as well), and HDF5 1.8.10.
>
> Is this a known problem, and do you know of any workarounds that don't require turning off the parallel capabilities of HDF5?

Parallel HDF5 works in a lot of other environments, but I don't have
any experience with Panasas or the Panasas-contributed ADIO driver.

> Any suggestions you may have will be appreciated!

The simplest workaround will be to select other ROMIO drivers.
- When accessing the Panasas file system, try prefixing the file name
  you pass to HDF5 with "ufs:". This will unfortunately turn off any
  Panasas-specific optimizations, but lots of folks use the default
  "unix file system" driver (see the sketch below).
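
Roughly like this (just a sketch: the path is a placeholder, and
error checking is omitted):

    /* route parallel HDF5 through MPI-IO, but ask ROMIO for its
     * generic UFS driver by prefixing the file name with "ufs:" */
    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* file access property list that sends HDF5 through MPI-IO */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* "ufs:" goes straight through to ROMIO and bypasses the
         * Panasas ad_panfs driver; the path itself is a placeholder */
        hid_t file = H5Fcreate("ufs:/panfs/scratch/test.h5",
                               H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }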

Also, mvapich 1.6 is based on an ancient version of ROMIO. If you've
got any way to use mvapich2, there are undoubtedly some fixes that
might make your life better.

Perhaps you meant 'mvapich2'... I've seen mvapich 1 "in the wild"
enough times, though, that I thought I should double-check.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Rob,

Thanks a lot for your suggestions; I will try them and keep the list updated. And yes, I meant mvapich2... I too heard about unconfirmed mvapich sightings in the wild, so thanks for checking :)

Oh, also, I installed a no-romio version of mvapich2 and recompiled HDF5 with it, but I am still seeing the same problem :(

-Mehmet

On Mon, May 06, 2013 at 11:02:31AM -0400, Mehmet Belgin wrote:

> Oh, also, I installed a no-romio version of mvapich2 and recompiled HDF5 with it, but I am still seeing the same problem :(

no-romio? You'll need *some* implementation of MPI-IO for parallel
HDF5 to work, and the problems you are seeing are manifesting
themselves in ROMIO-specific error messages.

I think you might have to get Panasas involved here, since the problem
seems to be below the HDF5 layer in the software stack. As the ROMIO
maintainer, I'll be happy to work with Panasas to fix any bugs in
their ad_panfs driver. If you do contact Panasas, please keep me
CCed. I'll be happy to incorporate any fixes they think are needed.
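
If you want to narrow it down before you do, a bare MPI-IO program
along these lines (the path is a placeholder, and error checking is
omitted) should exercise the same collective-write-plus-resize path
that the ADIOI_PANFS_RESIZE message comes from, with HDF5 out of the
picture entirely:

    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        char buf[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 'a' + (rank % 26), sizeof(buf));

        /* placeholder path: point this at your Panasas mount */
        MPI_File_open(MPI_COMM_WORLD, "/panfs/scratch/testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* each rank writes a disjoint block collectively... */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(buf), buf,
                              sizeof(buf), MPI_CHAR, MPI_STATUS_IGNORE);

        /* ...then everyone resizes the file; MPI_File_set_size is
         * the collective call that lands in the ADIO resize hook */
        MPI_File_set_size(fh, (MPI_Offset)1048576);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

If that fails the same way, you've got a nice small test case to
hand to Panasas.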

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA