Parallel writes with compression (HDF5 via netCDF-4)

The parallel I/O is done with netCDF-4. Since netCDF-4 uses HDF5 underneath, I hope the question is relevant here and that someone might have some input.

Software:
netcdf-fortran-4.5.3, built after installing netcdf-c-4.7.4 with parallel I/O enabled.
HDF5 1.10.6 with parallel enabled.

Background:
We are using parallel I/O with compression to read in the many compressed netCDF-4 result files of the (ocean) model sub-domains and write them into one compressed netCDF-4 result file covering the full domain.
Since the sub-domains are not all the same size, we cannot make the chunk size equal the sub-domain size.
Instead, each process is assigned a region of the full domain and reads in the corresponding sub-domain files. It then writes its data into the result file of the full domain.
We choose the chunk sizing so that each process writes into complete chunks.
The exception is the processes that cover the eastern and northern edges of our domain: they sometimes only partially fill their chunks, if a dimension is not divisible by the chosen chunk size. (1)
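For illustration, here is a minimal sketch of how such chunking and compression could be declared when the full-domain output file is first created (in define mode). The names xi_glob, eta_glob, np_xi, np_eta, n_levels, chunks and varid_out are placeholders for this sketch, not our actual code:

! Hypothetical sketch (not our actual code): chunk sizes equal the
! full-domain extent divided by the number of sub-domains in each
! horizontal direction, the full depth, and one timestep, e.g.
! 1840/10 x 960/10 x 100 x 1 = 184 x 96 x 100 x 1.
chunks(1) = xi_glob /np_xi
chunks(2) = eta_glob/np_eta
chunks(3) = n_levels
chunks(4) = 1
ierr=nf90_def_var_chunking (nctarg, varid_out, NF90_CHUNKED, chunks)
ierr=nf90_def_var_deflate (nctarg, varid_out,
& shuffle=1, deflate=1, deflate_level=1)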

Results:
Full domain of 1840 x 960 x 100 grid points and 10 timesteps, with 4 x 2D vars and 4 x 3D vars.
Chunk sizes depend on the sub-domain decomposition; with 10x10 sub-domains the chunking is 184 x 96 x 100 x 1 (timestep chunking = 1).
deflate_level=1 and shuffle=on:
Program            sub-domains   cores   time (s)   speed-up
ncjoin (serial)        -             1      597         -
ncjoin_mpi            4x3           12      135        4.4
ncjoin_mpi            5x5           25       73        8.2
ncjoin_mpi           10x5           50       48       12.4
ncjoin_mpi          10x10          100       37       16.1

Overall, we achieve a speed-up, but it does not scale as well as we had hoped.
Reading the sub-domains scales well as the number of processes increases; the writing, however, does not scale adequately.

netCDF parallel calls:
The partial (sub-domain) files are opened for reading with:
ierr=nf90_open (ncname(node),
& IOR(NF90_NOWRITE, NF90_MPIIO), ncid(node),
& comm = MPI_COMM_WORLD, info = MPI_INFO_NULL)
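
For context, the subsequent per-process read of a variable from one sub-domain file looks roughly like the sketch below; the variable name 'temp' and the local bookkeeping (i_var, local_var, xi_loc, eta_loc, n_levels, irec) are placeholders for illustration, not the actual code:

! Hypothetical sketch: read one record of one variable of one
! sub-domain file into this process's local buffer.
ierr=nf90_inq_varid (ncid(node), 'temp', i_var)
ierr=nf90_get_var (ncid(node), i_var, local_var,
& start=(/1, 1, 1, irec/),
& count=(/xi_loc, eta_loc, n_levels, 1/))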

The single (full-domain) output file for writes is opened with:
ierr=nf90_open (nctargname,
& IOR(NF90_WRITE, NF90_MPIIO), nctarg,
& comm = MPI_COMM_WORLD, info = MPI_INFO_NULL)

and each variable in the output file is given collective access:
do i=1,nvars
  ierr=nf90_var_par_access(nctarg, varid(i), NF90_COLLECTIVE)
enddo
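
Each process then writes its region of the full domain with an explicit start/count, roughly as in the sketch below; i0 and j0 (the offsets of this process's region in the full domain), xi_loc, eta_loc, n_levels, buf and irec are placeholders, not the actual ncjoin_mpi code:

! Hypothetical sketch: a collective write of this process's block of
! the full domain; with the chunking described above, interior
! processes touch only whole chunks.
ierr=nf90_put_var (nctarg, varid(i), buf,
& start=(/i0, j0, 1, irec/),
& count=(/xi_loc, eta_loc, n_levels, 1/))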

Questions:

  1. In (1) above we assume that chunks start from the ‘SW corner’ of the data (i=j=k=1), and that if the chunk sizes do not evenly divide the total dimension sizes, then the end chunks on the E and N edges will just be partially filled. Is this correct?

  2. I was only able to run the code in collective mode. If I set the parallel access to nf90_independent, I got an ‘hdf error’ error code. I had assumed that independent mode would give the best performance, since a process can write whenever it wants, and since each process (if my code is correct) writes into its own chunks there should be no contention on the file.
    However, I found resources online suggesting that collective mode gives the best performance. Can you confirm which approach is best, and perhaps why I was unable to run in independent mode?

  3. As mentioned, the reading scales well but the writing does not. Is this to be expected? Have I misused the netCDF file-open calls?

  4. Any suggestions to improve our results would be welcome.

Many thanks, Devin