Running CGNS with 10 MPI processes, I am getting a hang when trying to close the file. It works fine with 1-9 processes, but hangs with 10 or more. Here is the traceback on ranks 0-7:
Stack Trace
C opal_sys_timer_get_cycles, FP=7fffffff92a0
C opal_timer_linux_get_usec_sys_timer, FP=7fffffff92a0
C opal_progress, FP=7fffffff92c0
C mca_pml_cm_recv, FP=7fffffff9680
C ...llreduce_intra_recursivedoubling, FP=7fffffff9780
C mca_coll_cuda_allreduce, FP=7fffffff97f0
C PMPI_Allreduce, FP=7fffffff9840
C ..._romio314_dist_MPI_File_set_size, FP=7fffffff9880
C mca_io_romio314_file_set_size, FP=7fffffff98a0
C PMPI_File_set_size, FP=7fffffff98e0
H5FD_get_mpio_atomicity, FP=7fffffff9910
H5FD_truncate, FP=7fffffff9940
H5F__dest, FP=7fffffff9970
H5F_try_close, FP=7fffffff9db0
H5F__close_cb, FP=7fffffff9dd0
H5I_dec_ref, FP=7fffffff9e10
H5I_dec_app_ref, FP=7fffffff9e30
H5F__close, FP=7fffffff9e50
H5Fclose, FP=7fffffff9e70
ADFH_Database_Close, FP=7fffffff9fa0
cgio_close_file, FP=7fffffff9fc0
cg_close, FP=7fffffff9fe0
And here is the trace at the same time for ranks 8-9:
Stack Trace
psm2_poll, FP=7fffffff9630
psm2_mq_ipeek2, FP=7fffffff9650
C ompi_mtl_psm2_progress, FP=7fffffff96e0
C opal_progress, FP=7fffffff9700
C ompi_request_default_wait, FP=7fffffff97a0
C ompi_coll_base_barrier_intra_bruck, FP=7fffffff9810
C PMPI_Barrier, FP=7fffffff9850
H5AC__run_sync_point, FP=7fffffff98f0
H5AC__flush_entries, FP=7fffffff9910
H5AC_dest, FP=7fffffff9940
H5F__dest, FP=7fffffff9970
H5F_try_close, FP=7fffffff9db0
H5F__close_cb, FP=7fffffff9dd0
H5I_dec_ref, FP=7fffffff9e10
H5I_dec_app_ref, FP=7fffffff9e30
H5F__close, FP=7fffffff9e50
H5Fclose, FP=7fffffff9e70
ADFH_Database_Close, FP=7fffffff9fa0
cgio_close_file, FP=7fffffff9fc0
cg_close, FP=7fffffff9fe0
This is with HDF5-1.10.5 compiled with intel-18.0 and intel-openmpi-2.1, but I see similar behavior with openmpi-3.0. This run uses H5Pset_file_space_strategy(g_propfilecreate, H5F_FSPACE_STRATEGY_FSM_AGGR, 1, (hsize_t)1); I also see similar behavior without that call, although there are times when it runs correctly and then a subsequent run hangs.
I haven’t seen this with HDF5-1.10.4. Any ideas?