Hang in H5Fclose


#1

Running CGNS with 10 MPI processes, I am getting a hang when trying to close the file. It works fine with 1-9 processes but hangs with 10 or more. Here is the traceback on ranks 0-7:

Stack Trace
C    opal_sys_timer_get_cycles, FP=7fffffff92a0
C    opal_timer_linux_get_usec_sys_timer, FP=7fffffff92a0
C    opal_progress,       FP=7fffffff92c0
C    mca_pml_cm_recv,     FP=7fffffff9680
C    ...llreduce_intra_recursivedoubling, FP=7fffffff9780
C    mca_coll_cuda_allreduce, FP=7fffffff97f0
C    PMPI_Allreduce,      FP=7fffffff9840
C    ..._romio314_dist_MPI_File_set_size, FP=7fffffff9880
C    mca_io_romio314_file_set_size, FP=7fffffff98a0
C    PMPI_File_set_size,  FP=7fffffff98e0
     H5FD_get_mpio_atomicity, FP=7fffffff9910
     H5FD_truncate,       FP=7fffffff9940
     H5F__dest,           FP=7fffffff9970
     H5F_try_close,       FP=7fffffff9db0
     H5F__close_cb,       FP=7fffffff9dd0
     H5I_dec_ref,         FP=7fffffff9e10
     H5I_dec_app_ref,     FP=7fffffff9e30
     H5F__close,          FP=7fffffff9e50
     H5Fclose,            FP=7fffffff9e70
     ADFH_Database_Close, FP=7fffffff9fa0
     cgio_close_file,     FP=7fffffff9fc0
     cg_close,            FP=7fffffff9fe0

And here is the trace at the same time for ranks 8-9:

Stack Trace
     psm2_poll,           FP=7fffffff9630
     psm2_mq_ipeek2,      FP=7fffffff9650
C    ompi_mtl_psm2_progress, FP=7fffffff96e0
C    opal_progress,       FP=7fffffff9700
C    ompi_request_default_wait, FP=7fffffff97a0
C    ompi_coll_base_barrier_intra_bruck, FP=7fffffff9810
C    PMPI_Barrier,        FP=7fffffff9850
     H5AC__run_sync_point, FP=7fffffff98f0
     H5AC__flush_entries, FP=7fffffff9910
     H5AC_dest,           FP=7fffffff9940
     H5F__dest,           FP=7fffffff9970
     H5F_try_close,       FP=7fffffff9db0
     H5F__close_cb,       FP=7fffffff9dd0
     H5I_dec_ref,         FP=7fffffff9e10
     H5I_dec_app_ref,     FP=7fffffff9e30
     H5F__close,          FP=7fffffff9e50
     H5Fclose,            FP=7fffffff9e70
     ADFH_Database_Close, FP=7fffffff9fa0
     cgio_close_file,     FP=7fffffff9fc0
     cg_close,            FP=7fffffff9fe0

This is with HDF5-1.10.5 compiled with intel-18.0 and intel-openmpi-2.1, but I see similar behavior with openmpi-3.0. The file is created using H5Pset_file_space_strategy(g_propfilecreate, H5F_FSPACE_STRATEGY_FSM_AGGR, 1, (hsize_t)1);. I also see similar behavior without that call, although sometimes a run will complete correctly and then a subsequent run will hang.
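For context, this is roughly how that strategy gets set on the file creation property list; a minimal sketch, not a full program (the fcpl name is hypothetical, the H5P calls are the standard HDF5-1.10 API):

```c
#include "hdf5.h"

/* Sketch only: create a file creation property list and request the
 * FSM_AGGR free-space strategy with persistent free-space tracking,
 * as in the call quoted above. Error handling is omitted. */
hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_FSM_AGGR,
                           1,            /* persist free-space info */
                           (hsize_t)1);  /* free-space section threshold */
/* ... pass fcpl to H5Fcreate(), then H5Pclose(fcpl) ... */
```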

I haven’t seen this with HDF5-1.10.4. Any ideas?


#2

A little more detail. I rebuilt with debug turned on and traced it down to H5FD_mpio_truncate. The problem seems to be the check at line 2030, if (size != needed_eof). Here size = 105907546 on all ranks, but needed_eof is off-by-one on some ranks. This causes some ranks to enter the if and others to skip it, causing a hang.


#3

Hi Greg,

We entered a bug report for this (HDFFV-10748) and are investigating it.

-Barbara


#4

This is not consistent either… It will hang consistently for a while, then for some reason start working, and then it works for several tries. I am not sure what triggers the working -> not-working -> working transitions.


#5

Smells like possibly uninitialized memory.