After migrating from hdf5 1.8.17 to 1.10.2, one of my routines is extremely slow. I have narrowed the issue down to the following fortran test case.
Using serial hdf5, I open and then close a single file repeatedly. Most cycles are fast (~0.01 sec), but every 1 in 50 or so, the operation is long (~10 seconds, up to 60 seconds). The issue does not manifest with 1.8.17. The sample code is:
! ----------------------------------------------------------------------
! -- Measures the time to repeatedly open and close an hdf5 file
! -- Some cycles are long with hdf5 1.10.2, none are long with 1.8.17
! ----------------------------------------------------------------------
program main
use hdf5
implicit none
integer, parameter :: wp = selected_real_kind(15,307)
! -- Some hardcoded parameters
character(*), parameter :: fname = 'tmp.h5'
integer, parameter :: N_cycles = 400
real(wp), parameter :: warn_time = 1._wp ! -- Warn if cycle is longer than this
integer :: i, iher, my_cr, my_cm, ct_cur, ct_prv, ct_start, ct_end
real(wp) :: my_time
integer(HID_T) :: id_file
call h5open_f(iher)
if (iher/=0) stop 2
! -- Create file, overwrite if it already exists
call h5fcreate_f(fname, H5F_ACC_TRUNC_F, id_file, iher)
if (iher/=0) stop 2
call h5fclose_f(id_file, iher)
if (iher/=0) stop 2
! -- Initialize timing
call system_clock(count_rate = my_cr, count_max = my_cm)
call system_clock(ct_start)
ct_cur = ct_start
! -- Main loop to open/close N times
do i=1,N_cycles
! -- Open with read/write then close
call h5fopen_f(fname,H5F_ACC_RDWR_F,id_file,iher)
if (iher/=0) stop 2
call h5fclose_f(id_file, iher)
if (iher/=0) stop 2
ct_prv = ct_cur
call system_clock(ct_cur)
! -- Compute time of this cycle
my_time = real(ct_cur-ct_prv,wp) / real(my_cr,wp)
! -- Output warning if we have a long cycle time
if (my_time > warn_time) then
write(*,'(a,i5,a,f10.5)') &
' -- Iteration ', i, ' had long cycle time (seconds): ', my_time
endif
enddo
! -- We have completed all cycles, measure total time
call system_clock(ct_end)
my_time = real(ct_end-ct_start,wp) / real(my_cr,wp)
write(*,'(a,f10.5)') ' -- Complete, total time (seconds): ', my_time
end program main
I compile the code with ifort -I$HDF5_HOME/include -L$HDF5_HOME/lib -lhdf5_fortran main.f90 -o main
, where $HDF5_HOME
is the installation directory.
Using 1.10.2, a typical output is:
chaud106@ln0003 [~/reaqct/Test_h5open/post] % h5ls --version
h5ls: Version 1.10.2
chaud106@ln0003 [~/reaqct/Test_h5open/post] % ifort -I$HDF5_HOME/include -L$HDF5_HOME/lib -lhdf5_fortran main.f90 -o main
chaud106@ln0003 [~/reaqct/Test_h5open/post] % ./main
-- Iteration 98 had long cycle time (seconds): 18.01450
-- Iteration 99 had long cycle time (seconds): 13.86730
-- Iteration 100 had long cycle time (seconds): 13.60210
-- Iteration 101 had long cycle time (seconds): 10.00920
-- Iteration 140 had long cycle time (seconds): 12.47370
-- Iteration 198 had long cycle time (seconds): 22.01530
-- Iteration 199 had long cycle time (seconds): 8.00830
-- Iteration 200 had long cycle time (seconds): 7.67160
-- Iteration 201 had long cycle time (seconds): 9.06810
-- Iteration 202 had long cycle time (seconds): 6.68700
-- Iteration 240 had long cycle time (seconds): 15.99550
-- Iteration 296 had long cycle time (seconds): 30.01490
-- Iteration 297 had long cycle time (seconds): 5.99770
-- Iteration 298 had long cycle time (seconds): 6.00180
-- Iteration 300 had long cycle time (seconds): 8.00110
-- Iteration 339 had long cycle time (seconds): 18.82060
-- Iteration 394 had long cycle time (seconds): 23.76350
-- Iteration 395 had long cycle time (seconds): 3.38680
-- Iteration 397 had long cycle time (seconds): 8.00930
-- Iteration 398 had long cycle time (seconds): 5.66050
-- Iteration 400 had long cycle time (seconds): 8.00710
-- Complete, total time (seconds): 256.84700
Using 1.8.17, I have never observed a long cycle. The typical output is:
chaud106@ln0003 [~/reaqct/Test_h5open/post] % h5ls --version
h5ls: Version 1.8.17
chaud106@ln0003 [~/reaqct/Test_h5open/post] % ifort -I$HDF5_HOME/include -L$HDF5_HOME/lib -lhdf5_fortran main.f90 -o main
chaud106@ln0003 [~/reaqct/Test_h5open/post] % ./main
-- Complete, total time (seconds): 0.08220
I am compiling my test code and HDF5 using Intel 18.0.0. I have only observed this behavior on one high-performance file system, I can provide details if necessary. I was unable to replicate the behavior on a different lustre-based system or NFS-based system.
I’ve noticed that the long cycles occur near multiples of 100 (iterations 98 through 101). While this toy code is being run, the entire machine is extremely sluggish. The result seems independent of the size/contents of the hdf5 file - this test cases uses an empty file, but I got similar results on a 2 GB production database. Finally, I read an hdf5 whitepaper on rapid open/close slowdown, but it seemed to be only for parallel operations.
I’m aware that rapidly opening and closing a file in this way is non-optimal, and I have reworked my code so that this issue has less impact. However, given that the behavior on 1.8.17 is so different, I think it is worth investigating.