I got into trouble while reading an HDF5 file in parallel. The file contains a single dataset of size 196608 x 98304 with double-precision entries. The data is a matrix that should be read into a block-cyclic distribution so that it is compatible with ScaLAPACK. The file is read by 512 processes in parallel, and MPI-IO is set up with collective operations. The processes are arranged in a 32 x 16 grid, where each process is identified by its grid position (MYROW, MYCOL). The matrix is distributed in MB x NB blocks (MB = NB = 192), so each process owns an LM x LN part of the matrix (6144 x 6144). The dataset is stored in chunks of 4032 x 4032.
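For reference, the local extents LM and LN follow from the block-cyclic layout; a minimal sketch of how they are obtained with the standard ScaLAPACK/BLACS routines (M, N, MB, NB and CONTEXT are set up elsewhere):

! Derive the process grid position and the local matrix extents.
INTEGER :: CONTEXT, NPROW, NPCOL, MYROW, MYCOL
INTEGER :: M, N, MB, NB, LM, LN
INTEGER, EXTERNAL :: NUMROC

CALL BLACS_GRIDINFO(CONTEXT, NPROW, NPCOL, MYROW, MYCOL)
LM = NUMROC(M, MB, MYROW, 0, NPROW)   ! local rows:    196608/32 = 6144
LN = NUMROC(N, NB, MYCOL, 0, NPCOL)   ! local columns:  98304/16 = 6144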
The main part of the read function looks like this:
! Select the block-cyclic pattern in the file space:
! DIMS counts the blocks per process, DIM_BLOCK is the block extent,
! and STRIDE is the distance between consecutive blocks of one process.
DIMS(1) = LM / MB
DIMS(2) = LN / NB
OFFSET(1) = MYROW * MB
OFFSET(2) = MYCOL * NB
DIM_BLOCK(1) = MB
DIM_BLOCK(2) = NB
STRIDE(1) = MB * NPROW
STRIDE(2) = NB * NPCOL
CALL H5SSELECT_HYPERSLAB_F(FILESPACE, H5S_SELECT_SET_F, OFFSET, DIMS, HERROR, STRIDE, DIM_BLOCK)

! Select the memory space matching the filespace selection:
! the local buffer is contiguous, so the whole LM x LN region is selected.
DIMS(1) = LM
DIMS(2) = LN
OFFSET(1) = 0
OFFSET(2) = 0
CALL H5SSELECT_HYPERSLAB_F(MEMSPACE, H5S_SELECT_SET_F, OFFSET, DIMS, HERROR)

! Read collectively via MPI-IO.
CALL H5PCREATE_F(H5P_DATASET_XFER_F, PLIST, HERROR)
CALL H5PSET_DXPL_MPIO_F(PLIST, H5FD_MPIO_COLLECTIVE_F, HERROR)
CALL H5DREAD_F(DSET_ID, H5T_NATIVE_DOUBLE, A(1), DIM_LOCAL, HERROR, FILE_SPACE_ID=FILESPACE, MEM_SPACE_ID=MEMSPACE, XFER_PRP=PLIST)
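For completeness, FILESPACE, MEMSPACE and DSET_ID come from a setup along these lines (a sketch; FAPL, FILENAME and DSET_NAME are placeholder names):

! Open the file with the MPI-IO driver and derive the dataspaces.
CALL H5PCREATE_F(H5P_FILE_ACCESS_F, FAPL, HERROR)
CALL H5PSET_FAPL_MPIO_F(FAPL, MPI_COMM_WORLD, MPI_INFO_NULL, HERROR)
CALL H5FOPEN_F(FILENAME, H5F_ACC_RDONLY_F, FILE_ID, HERROR, ACCESS_PRP=FAPL)
CALL H5DOPEN_F(FILE_ID, DSET_NAME, DSET_ID, HERROR)
CALL H5DGET_SPACE_F(DSET_ID, FILESPACE, HERROR)
CALL H5SCREATE_SIMPLE_F(2, DIM_LOCAL, MEMSPACE, HERROR)   ! DIM_LOCAL = (/LM, LN/)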
The problem is that the read performance is very slow compared to the capabilities of the filesystem and the performance of the write operation.
With the 196608 x 98304 matrix I get a write performance of 927 MB/s but can only read at 320 MB/s. If chunking is turned off, I get a write speed of 1153 MB/s and a read speed of 870 MB/s.
With a smaller problem of 98304 x 49152 (half in each dimension) I obtain the following values:
- with chunking: write 1032 MB/s, read 1490 MB/s
- without chunking: write 1200 MB/s, read 2011 MB/s
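To rule out that the collective read is silently broken up, the transfer property list can be queried after H5DREAD_F; a sketch, assuming the Fortran wrapper of H5Pget_mpio_actual_io_mode is available in 1.10.5:

! Query the I/O mode that was actually used for the last transfer.
INTEGER :: ACTUAL_IO_MODE
CALL H5PGET_MPIO_ACTUAL_IO_MODE_F(PLIST, ACTUAL_IO_MODE, HERROR)
IF (ACTUAL_IO_MODE .EQ. H5D_MPIO_NO_COLLECTIVE_F) THEN
   PRINT *, 'Collective read was not performed collectively'
END IF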
I already tried different chunk cache and buffer settings, but nothing helped to recover the performance for the large datasets. Furthermore, I noticed that for large datasets it takes a long time until the actual writing starts. I reduced this delay by setting H5D_ALLOC_TIME_LATE_F and H5D_FILL_TIME_NEVER_F, but it does not vanish completely.
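For reference, the properties were set along these lines (a sketch; the chunk cache numbers are placeholders, not the exact values I tried):

! Dataset creation properties: late allocation, no fill values.
CALL H5PCREATE_F(H5P_DATASET_CREATE_F, DCPL, HERROR)
CALL H5PSET_ALLOC_TIME_F(DCPL, H5D_ALLOC_TIME_LATE_F, HERROR)
CALL H5PSET_FILL_TIME_F(DCPL, H5D_FILL_TIME_NEVER_F, HERROR)

! Dataset access properties: enlarged chunk cache
! (521 slots, 128 MB, evict fully read/written chunks first).
CALL H5PCREATE_F(H5P_DATASET_ACCESS_F, DAPL, HERROR)
CALL H5PSET_CHUNK_CACHE_F(DAPL, 521_SIZE_T, 134217728_SIZE_T, 1.0, HERROR)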
We need the chunking because we plan to compress the datasets: in the final application a single matrix is up to 8 TB (1000000 x 1000000), and there we need fast reading and writing plus compression to be able to store at least some of these matrices.
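The planned setup would look roughly like this (a sketch; the chunk extent and the deflate level are assumptions, and as far as I know writing compressed data collectively in parallel requires HDF5 >= 1.10.2, which we have):

! Create a chunked dataset with the deflate filter enabled.
INTEGER(HSIZE_T) :: CHUNK_DIMS(2)
CHUNK_DIMS(1) = 4032
CHUNK_DIMS(2) = 4032
CALL H5PCREATE_F(H5P_DATASET_CREATE_F, DCPL, HERROR)
CALL H5PSET_CHUNK_F(DCPL, 2, CHUNK_DIMS, HERROR)
CALL H5PSET_DEFLATE_F(DCPL, 6, HERROR)
CALL H5DCREATE_F(FILE_ID, DSET_NAME, H5T_NATIVE_DOUBLE, FILESPACE, DSET_ID, HERROR, DCPL_ID=DCPL)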
System information:
- Filesystem: BeeGFS, 2.5 GB/s peak read/write
- HDF5 1.10.5
- OpenMPI 4.0.1
- Omni-Path network between the nodes and the filesystem