Hang for MPI HDF5 in parallel on an NFS system

I am a novice when it comes to file systems, so please bear with me.

We have some parallel HDF5 applications built on top of MPI-IO that have always worked as expected on Lustre file systems, but hang on some of our NFS file systems. They seem to hang while closing the file, even if we just open and close a file collectively. The following tests all hang (time out) or fail:

            88 - MPI_TEST_testphdf5 (Timeout)
            95 - MPI_TEST_testphdf5_selnone (Timeout)
            96 - MPI_TEST_testphdf5_cngrpw-ingrpr (Timeout)
            97 - MPI_TEST_testphdf5_cschunkw (Timeout)
            98 - MPI_TEST_testphdf5_ccchunkw (Timeout)
            103 - MPI_TEST_t_mpi (Timeout)
            104 - MPI_TEST_t_bigio (Timeout)
            106 - MPI_TEST_t_pflush1 (Timeout)
            107 - MPI_TEST_t_pflush2 (Failed)
            112 - MPI_TEST_t_shapesame (Timeout)
            113 - MPI_TEST_t_filters_parallel (Timeout)

This was built using HDF5-1.10.5 with OpenMPI-4.0.1 and intel-19.0.1. This problem seems to be independent of compiler and OpenMPI version, but doesn’t show up with MPICH. The system admins said they “underwent some upgrades to the NFS filers and TOSS” on the system that now works.

Do you have any suggestions on how to diagnose this problem? Or what NFS settings I should ask the system admins to forward along that may be causing this problem? Any guidance is appreciated!

Hello, I got a little more info from our system admin: it may just be that the hardware on our filers is too out of date. What specs should I expect as a minimum requirement for this to work?

Did you read the prerequisites for parallel HDF5 regarding file systems? AFAIK, NFS is not an MPI-IO capable file system; Lustre, OrangeFS, BeeGFS, … are.
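If you want something concrete to check on the NFS mount itself, one candidate is whether POSIX byte-range locks work there, since ROMIO documents that it relies on fcntl() locks when running on NFS. Whether that is what bites you here is only a guess on my part. Below is a minimal sketch in plain C (no MPI or HDF5 needed); the function name probe_fcntl_lock and the file path you pass in are hypothetical, chosen just for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: take and then release a write lock on the first
 * `len` bytes of `path`, creating the file if it does not exist.
 * Returns 0 if both the lock and the unlock succeed, -1 otherwise
 * (errno is left as set by the call that failed). */
int probe_fcntl_lock(const char *path, off_t len)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;    /* exclusive (write) lock */
    fl.l_whence = SEEK_SET; /* l_start is relative to the start of the file */
    fl.l_start = 0;
    fl.l_len = len;

    /* F_SETLK does not wait for a conflicting lock held by someone else. */
    int rc = fcntl(fd, F_SETLK, &fl);
    if (rc != -1) {
        fl.l_type = F_UNLCK; /* release the same byte range */
        rc = fcntl(fd, F_SETLK, &fl);
    }

    int saved_errno = errno; /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;

    return (rc != -1) ? 0 : -1;
}
```

Call it once with a path on the NFS mount that hangs and once with a path on the mount that works, e.g. probe_fcntl_lock("/path/on/nfs/lock_probe.tmp", 1024), and compare. A return of -1 (check errno), or a call that never comes back, would be something specific to take to the system admins; a clean 0 on both mounts would suggest looking elsewhere.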

Hope it helps