I am a novice when it comes to file systems, so please bear with me.
We have some parallel HDF5 applications built on top of MPI-IO that have always worked as expected on Lustre file systems, but hang on some of our NFS file systems. They seem to hang while closing the file, even if we just open and close a file collectively. The following tests from the HDF5 parallel test suite all time out, except one that fails outright:
88 - MPI_TEST_testphdf5 (Timeout)
95 - MPI_TEST_testphdf5_selnone (Timeout)
96 - MPI_TEST_testphdf5_cngrpw-ingrpr (Timeout)
97 - MPI_TEST_testphdf5_cschunkw (Timeout)
98 - MPI_TEST_testphdf5_ccchunkw (Timeout)
103 - MPI_TEST_t_mpi (Timeout)
104 - MPI_TEST_t_bigio (Timeout)
106 - MPI_TEST_t_pflush1 (Timeout)
107 - MPI_TEST_t_pflush2 (Failed)
112 - MPI_TEST_t_shapesame (Timeout)
113 - MPI_TEST_t_filters_parallel (Timeout)
This was built using HDF5-1.10.5 with OpenMPI-4.0.1 and intel-19.0.1. This problem seems to be independent of compiler and OpenMPI version, but doesn’t show up with MPICH. The system admins said they “underwent some upgrades to the NFS filers and TOSS” on the system that now works.
Do you have any suggestions on how to diagnose this problem, or on which NFS settings I should ask the system admins to send me because they might be causing it? Any guidance is appreciated!
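In case it helps to compare the working and non-working systems, here is a minimal sketch of how I could collect the mount options for the directory the tests write to. It assumes Linux with a readable /proc/mounts; the helper name `mount_entry_for` and the sample paths are made up for illustration. My guess is that locking and attribute-caching options (for example `nolock`, `noac`, `actimeo`) are the ones worth comparing, but that is only an assumption on my part.

```python
import os
import sys


def mount_entry_for(path, mounts_text):
    """Return (device, mount_point, fs_type, options) for the mount holding path.

    mounts_text is the content of /proc/mounts, where each line reads
    "device mount_point fs_type options dump pass".
    Hypothetical helper, not part of HDF5 or any MPI library.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        device, mount_point, fs_type, options = fields[:4]
        # Match only on a path-component boundary, so that a mount at
        # /nfs/proj does not claim a file under /nfs/project.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on a tie, the later entry wins.
            if best is None or len(mount_point) >= len(best[1]):
                best = (device, mount_point, fs_type, options.split(","))
    return best


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    with open("/proc/mounts") as f:
        entry = mount_entry_for(target, f.read())
    if entry is None:
        print("no mount found for", target)
    else:
        device, mount_point, fs_type, options = entry
        print("device:", device)
        print("mount point:", mount_point)
        print("type:", fs_type)
        print("options:", ", ".join(options))
```

Running this with the test output directory as its argument on both the hanging system and the working one should give two option lists that can be compared side by side.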