I’m working with a Fortran code that writes output with HDF5 and is parallelized with MPI. I submit the job on an HPC system with different numbers of MPI tasks. HDF5 writes successfully when I use 6 or fewer MPI tasks, but with more (say 8) it hangs indefinitely in the h5dwrite_f call. My software stack is as follows:
hdf5-1.12.0 and hdf5-1.8.21 (tried both versions; same hang)
The line where it fails is here:
!.....write local array data
call h5dwrite_f(dset_id, hdf5realtype, &
& f(1,1,1,1,1), &
& mem_sizes, error, &
& file_space_id=filespace, &
& mem_space_id=memspace)
! write(*,*) 'After'
call hdf_show_status("h5dwrite_f", trim(datasetname), error)
Some of the relevant information (e.g., the filespace and memspace extents) is shown here:
Segment 3 MBytes 57.68 Time 48.67 Bandwidth MB/s 1.19
0 Filespace size 3 2 150 150 150
0 Filespace subsize 3 2 0 0 0
0 Filespace start 0 0 0 0 0
0 Memspace size 3 2 158 83 83
0 Memspace subsize 3 2 0 0 0
0 Memspace start 0 0 1 1 1
Can you use a debugger to get a backtrace showing where in the HDF5 library it’s hanging? The Fortran APIs are just wrappers around the C library.
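One way to capture an HDF5 API trace is via the `HDF5_DEBUG` environment variable, assuming your library was configured with tracing enabled (debug builds usually are). A minimal sketch, where `./your_program` is a placeholder for your executable:

```shell
# Request API-call tracing from the HDF5 library (requires a build
# configured with --enable-trace) and capture stderr to a log file.
export HDF5_DEBUG=trace
mpiexec -n 8 ./your_program 2> hdf5_trace.log
```

The last successfully-completed call before the hang should then be visible at the end of the log.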
I tried using VTune, but the code just crashed. Let me give another tool a try and I’ll get back to you with the output.
I’ve managed to get the API trace going. Where the code hangs I’ve got the following output:
H5Dwrite(dset=83886080 (dset), mem_type=50331742 (dtype), mem_space=67108867 (dspace), file_space=67108866 (dspace), dxpl=167772178 (genprop list), buf=0x2ae4fe4f9240)
Notably, this is the first line that doesn’t have “SUCCESS” after it. I can’t attach the full out/error files as I’m a new user, so I’ve put them at this Google Drive link:
https://drive.google.com/drive/folders/1whPIaoyKNeAUiqzhwJvcgn1mN2Li1CMd?usp=sharing
I could not determine the cause from the logs provided. Could you run the program interactively, attach gdb to a hanging process, and get a backtrace? Could you also provide a simple reproducer?
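Attaching to a stuck rank can be done roughly as follows (a sketch; `./your_program` and the output filename are placeholders, and you need to run this on the node where the hung rank lives):

```shell
# Find the PID of one stuck MPI rank and dump backtraces for all its
# threads, non-interactively, to a file.
pid=$(pgrep -f your_program | head -n 1)
gdb -p "$pid" -batch -ex "thread apply all bt" > rank_backtrace.txt
```

If several ranks hang, repeating this for two or three of them usually shows whether they are all stuck in the same MPI collective.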
Can you give more details about the HDF5 build and provide config.log? Which compiler and MPI implementation are you using (Open MPI or MPICH), and does the hang occur with both?
We have had issues with older versions of Open MPI hanging. If you are using Open MPI, you could try a different I/O back-end:
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.0.2)
mpiexec --mca io romio321
I was able to resolve the issue by setting HDF5_DEBUG=all and finding that it was getting hung up in MPIO. I turned off collective I/O for MPI in the HDF5 code, and it then wrote successfully. I’m not sure of the minutiae of why that works, but it writes consistently now!
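For anyone hitting the same hang: switching from collective to independent MPI-IO amounts to passing a dataset-transfer property list to the write. A minimal sketch, assuming the integer(HID_T) variable `plist_id` is declared alongside the other handles (names are illustrative):

```fortran
! Create a dataset-transfer property list and request independent
! (non-collective) MPI-IO, then pass it to the write via xfer_prp.
call h5pcreate_f(H5P_DATASET_XFER_F, plist_id, error)
call h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_INDEPENDENT_F, error)
call h5dwrite_f(dset_id, hdf5realtype, f(1,1,1,1,1), mem_sizes, error, &
     & file_space_id=filespace, mem_space_id=memspace, xfer_prp=plist_id)
call h5pclose_f(plist_id, error)
```

Using H5FD_MPIO_COLLECTIVE_F instead restores the original collective behavior, so the two modes are easy to compare.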
Thank you for all of your help