h5fclose_f hangs, or everything runs but produces segmentation faults afterwards

I wasn't sure whether I should revive the following thread, but I *think* I'm having a different issue:
http://hdf-forum.184993.n3.nabble.com/Fwd-H5Fclose-hangs-when-using-parallel-HDF5-td4027003.html

I'm using hdf5-1.8.12 in my code on a PBS cluster. The code works on two different Mac OS X machines (2 and 8 cores). It also runs on the cluster with 4 procs (1 node) or 16 procs (2 nodes, 8 procs/node), but after it finishes, and all the *.h5 data files have been written without error, I get a segmentation fault. With other node/proc configurations, such as 32 procs (4 nodes, 8 procs/node), the application hangs when h5fclose_f is called. I have included a truncated version of the subroutine that contains the call to h5fclose_f. Also, the code runs without error in any node/proc configuration when I disable the HDF5 output.

Any insight or advice will be much appreciated!

==========================================

   integer(HID_T) :: file_id ! File identifier
   integer(HID_T) :: memspace ! Memory dataspace identifier
   integer(HID_T) :: dset_id ! Dataset identifier
   integer(HID_T) :: plist_id ! Property list identifier
   integer(HID_T) :: filespace ! Dataspace identifier in file
   integer(HSIZE_T) :: array_dims(4) ! dimensions of the arrays
   integer :: error ! Error flag

   integer(HSIZE_T), dimension(2) :: chunk_dims

   integer(HSIZE_T), dimension(2) :: block
   integer(HSIZE_T), dimension(2) :: stride
   integer(HSIZE_T), dimension(2) :: count
   integer(HSSIZE_T), dimension(2) :: offset

   integer :: i,j,k,isc

   ! initialize the HDF5 Fortran interface
   call h5open_f(error)

   ! setup file access property list with parallel I/O access
   call h5pcreate_f(H5P_FILE_ACCESS_F,plist_id, error)
   call h5pset_fapl_mpio_f(plist_id, comm, MPI_INFO_NULL, error)

   ! create new h5 file collectively
   write(filen,'(i4.4)') nfile
   filename ='HDF5-2D/Field_'//filen//'.h5'
   nfile = nfile + 1
   call h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)
   call h5pclose_f(plist_id,error)

   ! Grid
   do i=1,nx
     xhdf(i) = xm(i+imin-1)
   end do
   do j=1,ny
     yhdf(j) = ym(j+jmin-1)
   end do
   array_dims(1) = Nx
   call h5ltmake_dataset_float_f(file_id, "X", 1, array_dims, xhdf(1:Nx), error)
   array_dims(1) = Ny
   call h5ltmake_dataset_float_f(file_id, "Y", 1, array_dims, yhdf(1:Ny), error)

   ! Lagrangian particle data
   array_dims(1) = npart
   chunk_dims(1) = npart_proc(irank) !npart_
   if (irank.eq.1) then
      offset(1) = 0
   else
      offset(1) = sum(npart_proc(1:(irank-1)))
   end if
   stride(1) = 1
   count(1) = 1
   block(1) = chunk_dims(1)

   ! Particle id
   call h5screate_simple_f(1, array_dims, filespace, error)
   call h5screate_simple_f(1, chunk_dims, memspace, error)
   call h5pcreate_f(H5P_DATASET_CREATE_F, plist_id, error)
   ! call h5pset_chunk_f(plist_id, 1, chunk_dims, error)
   call h5dcreate_f(file_id, "part_id", H5T_NATIVE_REAL, filespace, &
                       dset_id, error, plist_id)
   call h5sclose_f(filespace, error)
   call h5dget_space_f(dset_id, filespace, error)
   call h5sselect_hyperslab_f (filespace, H5S_SELECT_SET_F, offset, count, error, &
                                  stride, block)
   call h5pcreate_f(H5P_DATASET_XFER_F,plist_id,error)
   call h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_INDEPENDENT_F,error)
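   ! independent I/O is requested for the raw data transfer; h5dcreate_f,
   ! h5dclose_f and h5fclose_f remain collective calls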
   do i=1,npart_
      bufferp(i) = real(part(i)%id,WP)
   end do
   call h5dwrite_f(dset_id, H5T_NATIVE_REAL, bufferp(:), array_dims, error, &
                      file_space_id = filespace, mem_space_id = memspace, xfer_prp = plist_id)
   call h5sclose_f(filespace,error)
   call h5sclose_f(memspace,error)
   call h5dclose_f(dset_id,error)
   call h5pclose_f(plist_id,error)

   !!!! There are 7 more Lagrangian datasets in the full code

   ! Eulerian grid data
   array_dims(1) = Ny
   array_dims(2) = Nx

   chunk_dims(1) = ny_
   chunk_dims(2) = nx_

   stride(1) = 1
   stride(2) = 1
   count(1) = 1
   count(2) = 1
   block(1) = chunk_dims(1)
   block(2) = chunk_dims(2)

   offset(1) = jmin_ - nover - 1
   offset(2) = imin_ - nover - 1

   ! U
     call h5screate_simple_f(2, array_dims, filespace, error)
     call h5screate_simple_f(2, chunk_dims, memspace, error)
     call h5pcreate_f(H5P_DATASET_CREATE_F, plist_id, error)
     call h5pset_chunk_f(plist_id, 2, chunk_dims, error)
     call h5dcreate_f(file_id, "U", H5T_NATIVE_REAL, filespace, &
                       dset_id, error, plist_id)
     call h5sclose_f(filespace, error)
     call h5dget_space_f(dset_id, filespace, error)
     call h5sselect_hyperslab_f (filespace, H5S_SELECT_SET_F, offset, count, error, &
                                  stride, block)
     call h5pcreate_f(H5P_DATASET_XFER_F,plist_id,error)
     call h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_INDEPENDENT_F,error)
      buffer3(:,:) = transpose(Ui(imin_:imax_,jmin_:jmax_,kmin_))
     call h5dwrite_f(dset_id, H5T_NATIVE_REAL, buffer3(:,:), array_dims, error, &
                      file_space_id = filespace, mem_space_id = memspace, xfer_prp = plist_id)
     call h5sclose_f(filespace,error)
     call h5sclose_f(memspace,error)
     call h5dclose_f(dset_id,error)
     call h5pclose_f(plist_id,error)

   !!!! There are (potentially) 9 more Eulerian datasets in the full code

   ! close the file and the Fortran interface
   call h5fclose_f(file_id, error)
   call h5close_f(error)

   ==========================================

Hi Daniel,

I don't see anything "wrong" with your HDF5 usage in the code you sent, but since it's not the real thing, here are a few things you can check to get a better idea of what is going on:
1) Make sure that collective operations are called collectively, and in the same order, on all procs: http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html
2) Are you calling MPI_Finalize() before closing the HDF5 file? (You shouldn't be.)
3) Add an MPI_Barrier() before the h5fclose_f call and check whether all processes hit the barrier and enter h5fclose_f; see the sketch below. If some do not, review the collective requirements from 1) again.
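
For 3), a minimal sketch of what I mean (it reuses comm, irank, file_id and error from your snippet; ierr is just a plain integer you would declare):

   write(*,*) 'rank ', irank, ': before barrier'
   call MPI_Barrier(comm, ierr)
   write(*,*) 'rank ', irank, ': entering h5fclose_f'
   call h5fclose_f(file_id, error)
   call h5close_f(error)
   ! only after every HDF5 object is closed:
   call MPI_Finalize(ierr)

If some ranks never print the second line, a collective call earlier in the routine was skipped or reordered on those ranks.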

I would also suggest upgrading to the latest HDF5 release.
What MPI implementation and version are you using?

It would also be really helpful if you could send us a working program that replicates the problem, but I understand that putting one together is not always a simple task.
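
For reference, here is roughly the shape such a reproducer could take: a minimal, untested sketch (the file name 'repro.h5', dataset name and sizes are made up) that follows the same pattern as your code, one contiguous hyperslab per rank with default independent transfers:

   program repro
      use mpi
      use hdf5
      implicit none
      integer :: mpierr, error, rank, nprocs
      integer(HID_T) :: plist_id, file_id, filespace, memspace, dset_id
      integer(HSIZE_T) :: dims(1), cdims(1), offset(1), count(1)
      real :: buf(10)

      call MPI_Init(mpierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, mpierr)
      call h5open_f(error)

      ! collective file create with the MPI-IO driver
      call h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, error)
      call h5pset_fapl_mpio_f(plist_id, MPI_COMM_WORLD, MPI_INFO_NULL, error)
      call h5fcreate_f('repro.h5', H5F_ACC_TRUNC_F, file_id, error, &
                       access_prp=plist_id)
      call h5pclose_f(plist_id, error)

      ! one contiguous hyperslab of 10 reals per rank
      dims(1)   = 10*nprocs
      cdims(1)  = 10
      offset(1) = 10*rank
      count(1)  = 10
      buf       = real(rank)

      call h5screate_simple_f(1, dims, filespace, error)
      call h5screate_simple_f(1, cdims, memspace, error)
      call h5dcreate_f(file_id, 'data', H5T_NATIVE_REAL, filespace, dset_id, error)
      call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, error)
      ! no xfer_prp -> independent transfers, as in your code
      call h5dwrite_f(dset_id, H5T_NATIVE_REAL, buf, cdims, error, &
                      file_space_id=filespace, mem_space_id=memspace)

      call h5sclose_f(memspace, error)
      call h5sclose_f(filespace, error)
      call h5dclose_f(dset_id, error)
      call h5fclose_f(file_id, error)
      call h5close_f(error)
      call MPI_Finalize(mpierr)
   end program repro

If that skeleton hangs at h5fclose_f on your 32-proc configuration too, it points at the system's MPI/file-system stack rather than at your subroutine.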

Thanks,
Mohamad
