Best way to write zone ordered data in a collective way (Fortran)

Hi everyone,
I hope I can ask for some hints here, since I am new to this topic. I would like to implement the most efficient (collective) way to write a set of ordered data in an MPI Fortran CFD code by means of parallel HDF5. In each MPI process I have NZ zones, and in each of them there are some 3D variables stored in a structure
VAR(Z)%R(NI(Z),NJ(Z),NK(Z))
where R is one variable (there are others with the same dimensions) and NI(Z), NJ(Z), NK(Z) are the numbers of cells along dimensions 1, 2, 3.
Thank you in advance.

Have you looked into using CGNS (cgns.org)?
It works with Fortran and is tuned to use HDF5 as the backend when doing parallel I/O.

I have considered CGNS, but I would like to focus on performance. Since CGNS relies on HDF5, I think there is a bit of overhead, or am I wrong? Thank you.

Sorry, I just saw this.

Yes, any standard will incur an overhead, just as using HDF5 is going to have an overhead compared to raw IO.

Most of the overhead in using standards is associated with the metadata. Still, if you are smart about it, you can minimize this overhead by having only a limited number of the ranks involved in the metadata stage.

In the case of, for example, cgp_field_write_data_f, that is, for the most part, just calling H5Dwrite. I would be surprised if there would be much performance difference between the two APIs. However, the key is the tuning (via H5P APIs) baked into CGNS’s H5Dwrite. So you will need to spend time tuning your implementation (rarely does H5Dwrite with H5P_DEFAULT properties perform optimally in parallel) to mitigate the overhead of using a standard such as CGNS.
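For concreteness, here is a minimal sketch of the kind of tuning I mean (assuming HDF5 1.10 or later; the file name and the property choices are only illustrative, not what CGNS actually sets internally):

PROGRAM tuned_fapl_sketch

  USE HDF5
  USE MPI

  IMPLICIT NONE

  INTEGER(HID_T) :: fapl_id,dxpl_id,file_id
  INTEGER :: hdferr,mpierr

  CALL mpi_init(mpierr)
  CALL h5open_f(hdferr)

  ! File access: MPI-IO driver plus collective metadata operations, so that
  ! metadata traffic is aggregated instead of being issued rank by rank.
  CALL h5pcreate_f(H5P_FILE_ACCESS_F,fapl_id,hdferr)
  CALL h5pset_fapl_mpio_f(fapl_id,MPI_COMM_WORLD,MPI_INFO_NULL,hdferr)
  CALL h5pset_coll_metadata_write_f(fapl_id,.TRUE.,hdferr)   ! HDF5 >= 1.10
  CALL h5pset_all_coll_metadata_ops_f(fapl_id,.TRUE.,hdferr) ! HDF5 >= 1.10

  ! Dataset transfer: request collective I/O instead of the independent
  ! default; pass dxpl_id as xfer_prp to h5dwrite_f.
  CALL h5pcreate_f(H5P_DATASET_XFER_F,dxpl_id,hdferr)
  CALL h5pset_dxpl_mpio_f(dxpl_id,H5FD_MPIO_COLLECTIVE_F,hdferr)

  CALL h5fcreate_f('tuned.h5',H5F_ACC_TRUNC_F,file_id,hdferr,access_prp=fapl_id)

  CALL h5fclose_f(file_id,hdferr)
  CALL h5pclose_f(dxpl_id,hdferr)
  CALL h5pclose_f(fapl_id,hdferr)
  CALL h5close_f(hdferr)
  CALL mpi_finalize(mpierr)

ENDPROGRAM tuned_fapl_sketch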

Hi. I have become aware that, by pure chance, we are already interacting by email (the mail subject is the same as the topic title here). If you allow, I will report the conversation here in order to help other people and let anyone who is interested join the discussion. Of course, thank you again for the (double) support.

By the way, the CGNS call cgp_field_write_data_f is designed to write a single shared array scattered across MPI processes, with each process specifying its own offset and count, whereas I need to write, in a concurrent (and efficient!) way, different and independent arrays in the various processes, as you perhaps know.
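To make the contrast explicit, that model corresponds to every process selecting its own offset/count slice of one shared dataset, roughly as in the following plain-HDF5 sketch (a simplification I put together; the file name, the dataset name and the per-rank size NLOC are made up):

PROGRAM shared_array_sketch

  USE HDF5
  USE MPI

  IMPLICIT NONE

  INTEGER,PARAMETER :: NLOC = 4            ! elements owned by each rank (made up)
  INTEGER(HID_T) :: fapl,dxpl,file_id,dset_id,fspace,mspace
  INTEGER(HSIZE_T) :: dims(1),cnt(1),off(1)
  DOUBLE PRECISION :: buf(NLOC)
  INTEGER :: hdferr,mpierr,me,np

  CALL mpi_init(mpierr)
  CALL mpi_comm_rank(MPI_COMM_WORLD,me,mpierr)
  CALL mpi_comm_size(MPI_COMM_WORLD,np,mpierr)
  buf = me                                 ! trivial data

  CALL h5open_f(hdferr)
  CALL h5pcreate_f(H5P_FILE_ACCESS_F,fapl,hdferr)
  CALL h5pset_fapl_mpio_f(fapl,MPI_COMM_WORLD,MPI_INFO_NULL,hdferr)
  CALL h5fcreate_f('shared.h5',H5F_ACC_TRUNC_F,file_id,hdferr,access_prp=fapl)

  dims(1) = NLOC*np                        ! one dataset shared by all ranks
  CALL h5screate_simple_f(1,dims,fspace,hdferr)
  CALL h5dcreate_f(file_id,'field',H5T_NATIVE_DOUBLE,fspace,dset_id,hdferr)

  cnt(1) = NLOC                            ! this rank's count ...
  off(1) = NLOC*me                         ! ... and offset in the shared dataset
  CALL h5sselect_hyperslab_f(fspace,H5S_SELECT_SET_F,off,cnt,hdferr)
  CALL h5screate_simple_f(1,cnt,mspace,hdferr)

  CALL h5pcreate_f(H5P_DATASET_XFER_F,dxpl,hdferr)
  CALL h5pset_dxpl_mpio_f(dxpl,H5FD_MPIO_COLLECTIVE_F,hdferr)
  CALL h5dwrite_f(dset_id,H5T_NATIVE_DOUBLE,buf,cnt,hdferr, &
                  file_space_id=fspace,mem_space_id=mspace,xfer_prp=dxpl)

  CALL h5pclose_f(dxpl,hdferr)
  CALL h5sclose_f(mspace,hdferr)
  CALL h5sclose_f(fspace,hdferr)
  CALL h5dclose_f(dset_id,hdferr)
  CALL h5pclose_f(fapl,hdferr)
  CALL h5fclose_f(file_id,hdferr)
  CALL h5close_f(hdferr)
  CALL mpi_finalize(mpierr)

ENDPROGRAM shared_array_sketch

My data, instead, are not slices of one global array.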

I am returning to this topic. I have developed a code that correctly writes the data on a single process by creating a compound datatype (compound.t5.f90), albeit with some trickery with the memory locations, since running under MPI scrambles the memory locations of the structure's components. But it works.

module var_mod

  type vt
    sequence
    double precision,dimension(:,:,:),allocatable :: r,p
  endtype vt
  type(vt),target,dimension(:),allocatable :: var

endmodule var_mod

PROGRAM main

  USE hdf5
  USE ISO_C_BINDING
  USE MPI
  USE var_mod

  IMPLICIT NONE

! FILES

  CHARACTER(LEN=20) :: H5FILE_NAME
  CHARACTER(LEN=*),PARAMETER :: DATASETNAME = "CompoundOfArrays"
  CHARACTER(LEN=10) :: varname

  INTEGER,PARAMETER :: NB = 4
  INTEGER,PARAMETER :: RANK = 3

  !----------------------------------------------------------------

  INTEGER :: b,i,j,k,me,np,mpierr,bmin,bmint,bmax,bmaxt
  INTEGER(hid_t) :: file,dataset,space
  INTEGER(hsize_t) :: DIM(1) = (/1/)   ! Dataspace dimensions
  INTEGER(SIZE_T) :: maxloc,minloc,offset,type_size,type_sized  ! Size of the double precision datatype
  INTEGER :: hdferr
  TYPE(C_PTR) :: min_ptr
  INTEGER(hid_t) :: vwrite     ! File datatype identifier
  INTEGER(hid_t),dimension(nb) :: tidr,tidp ! /* Nested Array Datatype ID        */
  INTEGER(HSIZE_T),DIMENSION(3) :: tdims

  integer,dimension(:),allocatable :: ni,nj,nk

!----------------------------------------------------------------

  me = 0

  call mpi_init(mpierr)
  call mpi_comm_size(mpi_comm_world,np,mpierr)
  call mpi_comm_rank(mpi_comm_world,me,mpierr)

  allocate(ni(nb))
  allocate(nj(nb))
  allocate(nk(nb))

  allocate(var(nb))

  do b = 1,NB
    ni(b) = 10*(b/3+1)
    nj(b) = 10*(b)
    nk(b) = 10*(b/5+1)
    allocate(var(b)%r(ni(b),nj(b),nk(b)))
    allocate(var(b)%p(ni(b),nj(b),nk(b)))
  ENDDO

  !
  ! Initialize FORTRAN interface.
  !

  CALL h5open_f(hdferr)

  !
  ! Initialize the data
  !
  do b = 1,nb
    DO k = 1,nk(b)
      DO j = 1,nj(b)
        DO i = 1,ni(b)
          var(b)%r(i,j,k) = 10000*me+1000*b+i*100+j*10+k
          var(b)%p(i,j,k) = -var(b)%r(i,j,k)
        ENDDO
      ENDDO
    ENDDO
  enddo

  !
  ! Create the file.
  !

  write(H5FILE_NAME,12) me
  12  format('mpivars_p',I0,'.h5')

  CALL H5Fcreate_f(H5FILE_NAME,H5F_ACC_TRUNC_F,file,hdferr)

  CALL h5tget_size_f(H5T_NATIVE_DOUBLE,type_sized,hdferr)
!
!    find minimum and maximum address memory
!
  minloc = loc(var) ; bmin = 1 ; bmint = 1
  maxloc = loc(var) ; bmax = nb ; bmaxt = 3
  do b = 1,NB
    write(10+me,*) b,loc(var(b)),loc(var(b)%r),loc(var(b)%p),minloc
    if(minloc-loc(var(b))>0) then
      minloc = loc(var(b))
      bmin = b
      bmint = 1
    endif
    if(minloc-loc(var(b)%r)>0) then
      minloc = loc(var(b)%r)
      bmin = b
      bmint = 2
    endif
    if(minloc-loc(var(b)%p)>0) then
      minloc = loc(var(b)%p)
      bmin = b
      bmint = 3
    endif
    if(maxloc-loc(var(b))<0) then
      maxloc = loc(var(b))
      bmax = b
      bmaxt = 1
    endif
    if(maxloc-loc(var(b)%r)<0) then
      maxloc = loc(var(b)%r)
      bmax = b
      bmaxt = 2
    endif
    if(maxloc-loc(var(b)%p)<0) then
      maxloc = loc(var(b)%p)
      bmax = b
      bmaxt = 3
    endif
  enddo

!  write(*,*) ' lowest memory address:',minloc
!  write(*,*) 'highest memory address:',maxloc,bmax

  select case(bmint)
  case(1)
    min_ptr = c_loc(var(bmin))
  case(2)
    min_ptr = c_loc(var(bmin)%r)
  case(3)
    min_ptr = c_loc(var(bmin)%p)
  endselect

  type_size = 0
  do i = 1,nb
    type_size = type_size + 2*ni(i)*nj(i)*nk(i)*8
  enddo
  write(*,*) 'min theoretical type size:',type_size

  type_size = maxloc-minloc+ni(bmax)*nj(bmax)*nk(bmax)*type_sized*2
  !
  ! Create the memory data type.
  !
  write(*,*) 'actual type size:',type_size
!  write(*,*) 'cloc:',LOC(var)
!  do i = 1,nb
!    write(*,*) 'ni*nj*nk*8:',ni(i)*nj(i)*nk(i)*8
!    write(*,*) 'cloc:',i,LOC(var(i))
!    write(*,*) 'cloc r:',i,LOC(var(i)%r)
!    write(*,*) 'cloc p:',i,LOC(var(i)%p)
!  enddo
!
!  do i = 1,nb
!    if(i>1) write(*,*) 'offset:',i,H5OFFSETOF(C_LOC(var(i-1)),C_LOC(var(i)))
!    write(*,*) 'offset r:',i,H5OFFSETOF(C_LOC(var(i)),C_LOC(var(i)%r))
!    write(*,*) 'offset p:',i,H5OFFSETOF(C_LOC(var(i)%r),C_LOC(var(i)%p))
!  ENDDO

  CALL H5Tcreate_f(H5T_COMPOUND_F,type_size,vwrite,hdferr)

  do b = 1,nb
    tdims = (/ni(b),nj(b),nk(b)/)
    CALL h5tarray_create_f(H5T_NATIVE_DOUBLE,RANK,tdims,tidr(b),hdferr)
    CALL h5tarray_create_f(H5T_NATIVE_DOUBLE,RANK,tdims,tidp(b),hdferr)
  enddo

  offset = 0
  do b = 1,nb
    write(varname,10) b
10  format('var_r_b',I0)
    offset = H5OFFSETOF(min_ptr,C_LOC(var(b)%r))
    CALL h5tinsert_f(vwrite,varname,offset,tidr(b),hdferr)
    if(hdferr>0) write(*,*) b,'insert r:',hdferr
!    offset = offset+type_sized*ni(b)*nj(b)*nk(b)
    write(varname,11) b
11  format('var_p_b',I0)
    offset = H5OFFSETOF(min_ptr,C_LOC(var(b)%p))
    CALL h5tinsert_f(vwrite,varname,offset,tidp(b),hdferr)
    if(hdferr>0) write(*,*) b,'insert p:',hdferr
!    offset = offset+type_sized*ni(b)*nj(b)*nk(b)
  enddo


  !
  ! Create the data space.
  !
  !
  CALL H5Screate_simple_f(1,dim,space,hdferr)

  !
  ! Create the dataset.
  !
  CALL H5Dcreate_f(file,DATASETNAME,vwrite,space,dataset,hdferr)

!  CALL H5Tpack_f(vwrite,hdferr)


  !
  ! Write data to the dataset
  !

  CALL H5Dwrite_f(dataset,vwrite,min_ptr,hdferr)

  !
  ! Release resources
  !
  CALL H5Tclose_f(vwrite,hdferr)
  CALL H5Sclose_f(space,hdferr)
  CALL H5Dclose_f(dataset,hdferr)
  CALL H5Fclose_f(file,hdferr)

  do b = 1,nb
    deallocate(var(b)%r,var(b)%p)
  ENDDO
  deallocate(var)
  deallocate(ni,nj,nk)

!  end if

  call mpi_finalize(mpierr)

ENDPROGRAM main

Then I tried to extend this approach to the parallel case, using an array of compound datatypes to be written in collective mode via hyperslabs (final_test_hdf5). The code works correctly on a single process, but it hangs in an infinite loop during the closing calls when two or more processes are used. Could you help me? Thanks in advance.

module var_mod

  type vt
    sequence
    double precision,dimension(:,:,:),allocatable :: r,p
  end type vt
  type(vt),target,dimension(:),allocatable :: var

end module var_mod

PROGRAM FINALH5TEST

  USE HDF5 ! This module contains all necessary modules
  USE ISO_C_BINDING
  USE MPI_f08
  USE var_mod

  IMPLICIT NONE
  !
  ! HDF5 definitions
  !
  character(len=20) :: varname
  CHARACTER(LEN=10),PARAMETER :: filename = "final.h5"  ! File name
  CHARACTER(LEN=10),PARAMETER :: dsetname = "CompArray" ! Dataset name
  INTEGER(HID_T) :: file_id       ! File identifier
  INTEGER(HID_T) :: dset_id,vwrite! Dataset identifier
  INTEGER(HID_T) :: filespace     ! Dataspace identifier in file
  INTEGER(HID_T) :: memspace      ! Dataspace identifier in memory
  INTEGER(HID_T) :: plist_id      ! Property list identifier
  INTEGER(hid_t),dimension(:),allocatable :: tidr,tidp ! /* Nested Array Datatype ID */
  INTEGER(HSIZE_T),DIMENSION(1) :: dimsf ! Dataset dimensions.
  INTEGER(HSIZE_T),DIMENSION(1) :: count
  INTEGER(HSIZE_T),DIMENSION(3) :: tdims
  INTEGER(HSSIZE_T),DIMENSION(1) :: offset
!  INTEGER,ALLOCATABLE :: data(:)  ! Data to write
  INTEGER :: rank = 1  ! Dataset rank
  INTEGER :: rank3 = 3 ! Arrays rank
  INTEGER :: hdferr  ! hdferr flags
  !
  ! MPI definitions
  !
  INTEGER :: mpierror       ! MPI hdferr flag
  INTEGER :: mpi_size,mpi_rank,mpicomm,mpiinfo
  type(mpi_comm) :: comm
  type(mpi_info) :: info
  !
  ! program definitions
  !
  integer :: nb,b,bg,i,j,k,bmin,bmint,bmax,bmaxt
  integer,dimension(:),allocatable :: ni,nj,nk
  TYPE(C_PTR) :: min_ptr
  INTEGER(SIZE_T) :: maxloc,minloc,memoff,type_size,type_sized

  !----------------------------------------------------------------

  comm = MPI_COMM_WORLD
  info = MPI_INFO_NULL
  mpicomm = comm%mpi_val
  mpiinfo = info%mpi_val

  CALL MPI_INIT(mpierror)
  CALL MPI_COMM_SIZE(comm,mpi_size,mpierror)
  CALL MPI_COMM_RANK(comm,mpi_rank,mpierror)
!
!    allocate and initialize data structures

  dimsf = (/ mpi_size /)
  nb = mpi_rank+1
  allocate (ni(nb))
  allocate (nj(nb))
  allocate (nk(nb))

  allocate (var(nb))

  do b = 1,NB
    ni(b) = 1*(b/3+1)
    nj(b) = 1*(b)
    nk(b) = 1*(b/5+1)
    allocate (var(b)%r(ni(b),nj(b),nk(b)))
    allocate (var(b)%p(ni(b),nj(b),nk(b)))
  END DO
!
! Initialize the data
!
  do b = 1,nb
    DO k = 1,nk(b)
      DO j = 1,nj(b)
        DO i = 1,ni(b)
          var(b)%r(i,j,k) = 10000*mpi_rank+1000*b+i*100+j*10+k
          var(b)%p(i,j,k) = -var(b)%r(i,j,k)
        END DO
      END DO
    END DO
  end do
!
!    find minimum and maximum address memory
!
  minloc = loc(var) ; bmin = 1 ; bmint = 1
  maxloc = loc(var) ; bmax = nb ; bmaxt = 3
  do b = 1,NB
    if(minloc-loc(var(b))>0) then
      minloc = loc(var(b))
      bmin = b
      bmint = 1
    endif
    if(minloc-loc(var(b)%r)>0) then
      minloc = loc(var(b)%r)
      bmin = b
      bmint = 2
    endif
    if(minloc-loc(var(b)%p)>0) then
      minloc = loc(var(b)%p)
      bmin = b
      bmint = 3
    endif
    if(maxloc-loc(var(b))<0) then
      maxloc = loc(var(b))
      bmax = b
      bmaxt = 1
    endif
    if(maxloc-loc(var(b)%r)<0) then
      maxloc = loc(var(b)%r)
      bmax = b
      bmaxt = 2
    endif
    if(maxloc-loc(var(b)%p)<0) then
      maxloc = loc(var(b)%p)
      bmax = b
      bmaxt = 3
    endif
  enddo

  select case(bmint)
  case(1)
    min_ptr = c_loc(var(bmin))
  case(2)
    min_ptr = c_loc(var(bmin)%r)
  case(3)
    min_ptr = c_loc(var(bmin)%p)
  endselect
  !
  ! Initialize FORTRAN predefined datatypes
  !
  CALL h5open_f(hdferr)
!
! get double precision size in hdf5 and evaluate data size dimension in memory
!
  CALL h5tget_size_f(H5T_NATIVE_DOUBLE,type_sized,hdferr)

  type_size = maxloc-minloc+ni(bmax)*nj(bmax)*nk(bmax)*type_sized*2
!
! create compound datatype
!
  allocate(tidp(nb))
  allocate(tidr(nb))

  CALL H5Tcreate_f(H5T_COMPOUND_F,type_size,vwrite,hdferr)

  do b = 1,nb
    tdims = (/ni(b),nj(b),nk(b)/)
    CALL h5tarray_create_f(H5T_NATIVE_DOUBLE,RANK3,tdims,tidr(b),hdferr)
    CALL h5tarray_create_f(H5T_NATIVE_DOUBLE,RANK3,tdims,tidp(b),hdferr)
  enddo

  memoff = 0
  do b = 1,nb
    bg = b+nb*mpi_rank
    write(varname,10) bg
10  format('var_r_b',I0)
    memoff = H5OFFSETOF(min_ptr,C_LOC(var(b)%r))
    CALL h5tinsert_f(vwrite,varname,memoff,tidr(b),hdferr)
    if(hdferr>0) write(*,*) b,'insert r:',hdferr
    write(varname,11) bg
11  format('var_p_b',I0)
    memoff = H5OFFSETOF(min_ptr,C_LOC(var(b)%p))
    CALL h5tinsert_f(vwrite,varname,memoff,tidp(b),hdferr)
    if(hdferr>0) write(*,*) b,'insert p:',hdferr
  enddo

  write(*,*) mpi_rank,'min_ptr,size:',type_size

  !
  ! Setup file access property list with parallel I/O access.
  !
  CALL h5pcreate_f(H5P_FILE_ACCESS_F,plist_id,hdferr)
  CALL h5pset_fapl_mpio_f(plist_id,mpicomm,mpiinfo,hdferr)

  !
  ! Create the file collectively.
  !
  CALL h5fcreate_f(filename,H5F_ACC_TRUNC_F,file_id,hdferr,access_prp=plist_id)
  CALL h5pclose_f(plist_id,hdferr)
  !
  ! Create the data space for the  dataset.
  !
  CALL h5screate_simple_f(rank,dimsf,filespace,hdferr)
  !
  ! Create the dataset with default properties.
  !
!  CALL h5dcreate_f(file_id,dsetname,H5T_NATIVE_INTEGER,filespace,dset_id,hdferr)
  CALL H5Dcreate_f(file_id,dsetname,vwrite,filespace,dset_id,hdferr)
  CALL h5sclose_f(filespace,hdferr)
  !
  ! Each process defines dataset in memory and writes it to the hyperslab
  ! in the file.
  !
  count(1) = 1
  offset(1) = mpi_rank
  CALL h5screate_simple_f(rank,count,memspace,hdferr)
  !
  ! Select hyperslab in the file.
  !
  CALL h5dget_space_f(dset_id,filespace,hdferr)
  CALL h5sselect_hyperslab_f(filespace,H5S_SELECT_SET_F,offset,count,hdferr)
  !
  ! Initialize data buffer with trivial data.
  !
!  ALLOCATE (data(count(1)))
!  data = mpi_rank+10
  !
  ! Create property list for collective dataset write
  !
  CALL h5pcreate_f(H5P_DATASET_XFER_F,plist_id,hdferr)
  CALL h5pset_dxpl_mpio_f(plist_id,H5FD_MPIO_COLLECTIVE_F,hdferr)

  !
  ! Write the dataset collectively.
  !
!  CALL h5dwrite_f(dset_id,H5T_NATIVE_INTEGER,data,dimsf,hdferr, &
  CALL h5dwrite_f(dset_id,vwrite,min_ptr,hdferr, &
              file_space_id=filespace,mem_space_id=memspace,xfer_prp=plist_id)
  !
  ! Write the dataset independently.
  !
!    CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, data, dimsfi, hdferr, &
!    CALL h5dwrite_f(dset_id, vwrite, min_ptr, hdferr, &
!                     file_space_id = filespace, mem_space_id = memspace)
  !
  ! Deallocate data buffer.
  !
!  DEALLOCATE (data)

  !
  ! Close dataspaces.
  !
  CALL h5sclose_f(filespace,hdferr)
  CALL h5sclose_f(memspace,hdferr)

  !
  ! Close the dataset and property list.
  !
  CALL h5dclose_f(dset_id,hdferr)
  CALL h5pclose_f(plist_id,hdferr)
  CALL H5Tclose_f(vwrite,hdferr)

  !
  ! Close the file.
  !
  CALL h5fclose_f(file_id,hdferr)

  !
  ! Close FORTRAN predefined datatypes.
  !
  CALL h5close_f(hdferr)
!
! deallocates data
!
  do b = 1,nb
    deallocate (var(b)%r,var(b)%p)
  END DO
  deallocate (var)
  deallocate (ni,nj,nk)
  deallocate(tidr,tidp)
  !
  ! terminate MPI environment
  !
  CALL MPI_FINALIZE(mpierror)

END PROGRAM FINALH5TEST

The issue is that each rank creates a different compound datatype, i.e., in H5Tcreate_f the type_size for vwrite differs from rank to rank.

The main problem is the call to H5Dcreate_f because each rank passes a different vwrite datatype definition for the dataset. HDF5 does not know how to resolve the datatype conflict between ranks when it tries to create the dataset. H5Dcreate is a collective operation, meaning all the ranks must be consistent with the parameters they pass.
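One way to satisfy that requirement is sketched below, under my own assumptions (this is not code from the thread; the one-zone-per-rank layout, the dataset names and the file name zones.h5 are just for illustration): exchange the per-rank zone dimensions first, so that every rank can issue identical, collective h5dcreate_f calls for every zone, and then let each rank fill only the datasets it owns.

PROGRAM consistent_create_sketch

  USE HDF5
  USE MPI

  IMPLICIT NONE

  INTEGER(HID_T) :: fapl,file_id,dset_id,fspace
  INTEGER(HSIZE_T) :: fdims(3)
  INTEGER :: dims_loc(3)
  INTEGER,DIMENSION(:,:),ALLOCATABLE :: dims_all
  DOUBLE PRECISION,DIMENSION(:,:,:),ALLOCATABLE :: r
  CHARACTER(LEN=32) :: dname
  INTEGER :: hdferr,mpierr,me,np,rr

  CALL mpi_init(mpierr)
  CALL mpi_comm_rank(MPI_COMM_WORLD,me,mpierr)
  CALL mpi_comm_size(MPI_COMM_WORLD,np,mpierr)

  ! Each rank owns one zone of a different size (stand-in for VAR(Z)%R).
  dims_loc = (/2+me,3,4/)
  ALLOCATE(r(dims_loc(1),dims_loc(2),dims_loc(3)))
  r = me

  ! Every rank learns every zone's dimensions, so all ranks can make the
  ! same collective metadata calls below.
  ALLOCATE(dims_all(3,0:np-1))
  CALL mpi_allgather(dims_loc,3,MPI_INTEGER,dims_all,3,MPI_INTEGER, &
                     MPI_COMM_WORLD,mpierr)

  CALL h5open_f(hdferr)
  CALL h5pcreate_f(H5P_FILE_ACCESS_F,fapl,hdferr)
  CALL h5pset_fapl_mpio_f(fapl,MPI_COMM_WORLD,MPI_INFO_NULL,hdferr)
  CALL h5fcreate_f('zones.h5',H5F_ACC_TRUNC_F,file_id,hdferr,access_prp=fapl)

  DO rr = 0,np-1
    WRITE(dname,'(A,I0)') 'zone_r_rank',rr
    fdims = dims_all(:,rr)
    CALL h5screate_simple_f(3,fdims,fspace,hdferr)
    ! Collective call: every rank passes the same name, datatype and dataspace.
    CALL h5dcreate_f(file_id,TRIM(dname),H5T_NATIVE_DOUBLE,fspace,dset_id,hdferr)
    ! Only the owner writes (independent transfer; a collective write would
    ! require all ranks to call h5dwrite_f, with empty selections on non-owners).
    IF(rr==me) CALL h5dwrite_f(dset_id,H5T_NATIVE_DOUBLE,r,fdims,hdferr)
    CALL h5dclose_f(dset_id,hdferr)
    CALL h5sclose_f(fspace,hdferr)
  ENDDO

  CALL h5pclose_f(fapl,hdferr)
  CALL h5fclose_f(file_id,hdferr)
  CALL h5close_f(hdferr)
  CALL mpi_finalize(mpierr)

ENDPROGRAM consistent_create_sketch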