h5dwrite_f hangs for large number of MPI tasks


#1

I’m working with a Fortran code that writes its output with HDF5 and is parallelized with MPI. I submit the job on an HPC cluster with different numbers of MPI tasks. HDF5 writes successfully with 6 or fewer MPI tasks, but with more (say 8) it hangs indefinitely in the h5dwrite_f call. My software stack is as follows:

ifort 17.0
hdf5-1.12.0 and hdf5-1.8.21 (attempted using both versions, same hang)
hypre-2.11.2
mpi 2017.1.132

The line where it fails is here:

!.....write local array data
      call h5dwrite_f(dset_id,hdf5realtype, &
     &                f(1,1,1,1,1),            &
     &                mem_sizes,error,         &
     &                file_space_id=filespace, &
     &                mem_space_id=memspace,   &
     &                xfer_prp=dxpl)
!      write(*,*), 'After'
      call hdf_show_status("h5dwrite_f", trim(datasetname), error)

Some of the dataspace information (e.g., the memspace) is shown here:

Segment 3 MBytes    57.68 Time  48.67 Bandwidth MB/s     1.19
   0 vector
   0 Filespace size          3     2   150   150   150
   0 Filespace subsize       3     2     0     0     0
   0 Filespace start         0     0     0     0     0
   0  Memspace size          3     2   158    83    83
   0  Memspace subsize       3     2     0     0     0
   0  Memspace start         0     0     1     1     1

#2

Can you use a debugger to get a backtrace showing where in the HDF5 library it’s hanging? The Fortran APIs are just wrappers around the C library.

Scot


#3

I tried using VTune but the code just crashed. Let me give another one a try and I can get back to you with the output.


#5

I’ve managed to get the API trace going. Where the code hangs I’ve got the following output:

H5Dwrite(dset=83886080 (dset), mem_type=50331742 (dtype), mem_space=67108867 (dspace), file_space=67108866 (dspace), dxpl=167772178 (genprop list), buf=0x2ae4fe4f9240)

Notably, this is the first line in the trace that isn’t followed by “SUCCESS”. I can’t attach the full output and error files as I’m a new user, but I’ve put them in this Google Drive folder.

[Google Drive Link](https://drive.google.com/drive/folders/1whPIaoyKNeAUiqzhwJvcgn1mN2Li1CMd?usp=sharing)


#6

I could not determine the cause from the logs provided. Is it possible to run the program interactively and then attach to a hanging process with gdb and then do a backtrace? Can you provide a simple reproducer?

Can you give more details about the HDF5 build and provide config.log? Which compiler and MPI implementation are you using (Open MPI or MPICH), and do both show the same hang?

We have had issues with older versions of Open MPI hanging. If you are using Open MPI, you could try a different I/O back end:

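The available io components appear in the ompi_info output, for example:

ompi_info
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.0.2)

You can then select ROMIO at launch:

mpiexec --mca io romio321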


#7

Hi Scot,

I was able to resolve the issue by setting HDF5_DEBUG=all and finding that it was getting hung up in MPIO. I turned off collective I/O for MPI with HDF5 in the code, and the write then completed successfully. I’m not sure of the details that made it work or not work, but it consistently writes now!
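
For reference, here is a minimal sketch of one common way to turn off collective I/O in the Fortran API: set the dataset transfer property list to independent mode before passing it to h5dwrite_f. The thread does not show the actual change, so the subroutine name make_independent_dxpl is hypothetical, and the sketch assumes a parallel HDF5 build with MPI and h5open_f already initialized.

```fortran
! Sketch only: create a dataset transfer property list that requests
! independent (non-collective) MPI-IO transfers.
! Assumption: HDF5 was built with parallel support, and MPI_Init and
! h5open_f have already been called by the caller.
subroutine make_independent_dxpl(dxpl, error)
  use hdf5
  implicit none
  integer(hid_t), intent(out) :: dxpl   ! transfer property list handle
  integer,        intent(out) :: error  ! 0 on success, -1 on failure

  ! Create an empty dataset transfer property list.
  call h5pcreate_f(H5P_DATASET_XFER_F, dxpl, error)
  if (error /= 0) return

  ! Request independent transfers instead of H5FD_MPIO_COLLECTIVE_F.
  call h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_INDEPENDENT_F, error)
end subroutine make_independent_dxpl
```

The resulting dxpl would be passed as xfer_prp=dxpl, as in the h5dwrite_f call in the first post, and released with h5pclose_f once the write is done. Independent mode lets each rank write on its own, which usually costs some bandwidth compared with collective mode but does not require every rank to take part in the same MPI-IO call.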

Thank you for all of your help.