Slow parallel read performance


#1

I ran into trouble reading an HDF5 file in parallel. The file contains a single dataset of size 196608 x 98304 with double-precision entries. The data is a matrix that should be read into a block-cyclic distribution compatible with ScaLAPACK. The file is read by 512 processes in parallel, and MPI-IO is set up with collective operations. The processes are arranged in a 32 x 16 grid, where each process is identified by its grid position (MYROW, MYCOL). The matrix is distributed in MB x NB blocks (MB = NB = 192), and each process owns an LM x LN part of the matrix (6144 x 6144). The matrix is stored in chunks of 4032 x 4032.

The main part of the read function looks like this:

        ! File space: number of blocks per dimension and position of the first block
        DIMS(1) = LM / MB
        DIMS(2) = LN / NB
        OFFSET(1) = MYROW*MB
        OFFSET(2) = MYCOL*NB

        ! Block size and distance between consecutive blocks of this process
        DIM_BLOCK(1) = MB
        DIM_BLOCK(2) = NB
        STRIDE(1) = MB*NPROW
        STRIDE(2) = NB*NPCOL

        CALL H5SSELECT_HYPERSLAB_F(FILESPACE, H5S_SELECT_SET_F, OFFSET, DIMS, HERROR, STRIDE, DIM_BLOCK )
        ! Select the memory space matching the filespace
        DIMS(1) = LM
        DIMS(2) = LN
        OFFSET(1) = 0
        OFFSET(2) = 0
        CALL H5SSELECT_HYPERSLAB_F(MEMSPACE, H5S_SELECT_SET_F, OFFSET, DIMS, HERROR)

        ! Collective MPI-IO transfer
        CALL H5PCREATE_F(H5P_DATASET_XFER_F, PLIST, HERROR)
        CALL H5PSET_DXPL_MPIO_F(PLIST, H5FD_MPIO_COLLECTIVE_F, HERROR)
        CALL H5DREAD_F(DSET_ID, H5T_NATIVE_DOUBLE, A(1), DIM_LOCAL, HERROR, FILE_SPACE_ID=FILESPACE, MEM_SPACE_ID=MEMSPACE, XFER_PRP = PLIST)
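
To make the file-space selection above easier to follow, here is a minimal Python sketch (not part of the original program) that lists the global row indices one process row selects with the same start, stride, count and block values. The function name `selected_indices` is made up for illustration; the sizes are the ones given in the post.

```python
# Parameters from the post: 196608 x 98304 matrix, 32 x 16 grid, 192 x 192 blocks.
M, N = 196608, 98304
NPROW, NPCOL = 32, 16
MB, NB = 192, 192


def selected_indices(n_global, n_procs, block, coord):
    """Global indices selected along one dimension by the hyperslab.

    Mirrors the Fortran code: start = coord*block, stride = block*n_procs,
    count = local_size/block, block = block.
    """
    local = n_global // n_procs   # LM or LN (6144 here)
    count = local // block        # DIMS(i) in the file-space selection
    start = coord * block         # OFFSET(i)
    stride = block * n_procs      # STRIDE(i)
    return [start + k * stride + j for k in range(count) for j in range(block)]


# Rows selected by each of the 32 process rows.
rows_by_proc = [selected_indices(M, NPROW, MB, r) for r in range(NPROW)]

# Every global row should be selected by exactly one process row.
all_rows = sorted(i for rows in rows_by_proc for i in rows)
covers_all_rows = all_rows == list(range(M))

print(len(rows_by_proc[0]), covers_all_rows)
```

The same function applies to the column dimension with `N`, `NPCOL` and `NB`.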

The problem is that the read performance is very slow compared to the capabilities of the filesystem and the performance of the write operation.

With the 196608 x 98304 matrix I get a write performance of 927 MB/s but can only read at 320 MB/s. If chunking is turned off, I get a write speed of 1153 MB/s and a read speed of 870 MB/s.

If I use a smaller problem of 98304 x 49152 (half the size in each dimension), I obtain the following values. With chunking: write 1032 MB/s, read 1490 MB/s. Without chunking: write 1200 MB/s, read 2011 MB/s.

I have already tried different chunk cache and buffer settings, but nothing helped to recover the performance for the large datasets. Furthermore, I noticed that for large datasets it takes a long time until the actual writing starts. I reduced this delay by setting H5D_ALLOC_TIME_LATE_F and H5D_FILL_TIME_NEVER_F, but it does not vanish completely.

We need chunking because we plan to compress the dataset: in the final application a single matrix is up to 8 TB (1000000 x 1000000), so we need fast reading and writing as well as compression to be able to store at least some of them.

System Information :

  • Filesystem BeeGFS. 2.5 GB/s read/write peak
  • HDF5 1.10.5
  • OpenMPI 4.0.1
  • OmniPATH Network between the nodes and the filesystem.
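
For orientation, a small Python sketch of the raw sizes involved and the shortest possible transfer time at the quoted peak rate. It assumes that "2.5 GB/s" means 2.5e9 bytes per second for the whole filesystem, which the post does not state; the helper names are illustrative only.

```python
BYTES_PER_DOUBLE = 8
PEAK_BYTES_PER_S = 2.5e9  # assumption: 2.5 GB/s total, decimal units


def dataset_bytes(rows, cols):
    """Uncompressed size of a rows x cols matrix of doubles, in bytes."""
    return rows * cols * BYTES_PER_DOUBLE


def min_transfer_seconds(nbytes, rate=PEAK_BYTES_PER_S):
    """Lower bound on the transfer time if the peak rate were sustained."""
    return nbytes / rate


for rows, cols in [(98304, 49152), (196608, 98304), (1_000_000, 1_000_000)]:
    nbytes = dataset_bytes(rows, cols)
    seconds = min_transfer_seconds(nbytes)
    print(f"{rows} x {cols}: {nbytes / 1e9:.1f} GB, at least {seconds:.0f} s at peak")
```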

#2

Think of the I/O function in a mathematical sense: varying one parameter gives you the partial derivative with respect to that parameter. My suggestion is to focus on the chunk size, since that is the feature you are interested in.

You already have a result without chunking, so you know that the underlying platform is capable of at least ~2 GB/s. You also know that if you split the data into infinitesimally small units, the ratio of overhead to actual work is high. That is the worst case, so there must be an optimum in between.

By taking some spot measurements you can estimate the gradient and some of the optimal parameters with respect to your objective.
Expect fluctuations at multiples of the underlying buffer size, leading to a non-convex surface. If you have the computational power available (and you do, since you have a cluster), use a robust method such as Differential Evolution to find the right parameters for you.

Hope this black-box approach helps your case.
steve


#3

My problem is not finding the optimal parameters. The more interesting question is where the performance loss comes from. Especially once the size of the dataset exceeds 100 GB, writing and reading get slow. The filesystem monitoring shows some strange things. During the first half of the write procedure almost nothing happens on the I/O side, and then it writes at 2.4 GB/s, so on average this gives the ~1 GB/s. When reading the dataset, it starts reading immediately at a relatively constant rate, but the rate decreases as the dataset size increases. All of this is with a fixed number of processes. So I think this is more of a general problem than one of finding the optimal parameters.

Some more tests with Intel MPI 2018 changed the values a bit. The 196608 x 98304 example (chunked) writes at 2 GB/s and reads at 890 MB/s. So it in fact doubles the performance.


#4

I am not familiar with either OmniPath or BeeGFS. Did you try contacting the system integrator for comment on this?
Is the 2.5 GB/s per I/O server or in total? (I take it that "and reads with 890. GB/s" is a typo?)

Do you want to run an IOR test? Here is an example run of a 350 GB transfer with 600 tasks using collective I/O, clocked at ~22.2 GB/s write / ~28.6 GB/s read on AWS EC2: 60 nodes, 25 Gb/s Ethernet interconnect, with the most recent parallel HDF5 against the same OMPI you have.

Began               : Thu Jul 18 17:23:56 2019
Command line        : /usr/local/bin/ior -a MPIIO -b 10MB -t 10MB -s 60 -c -C
Machine             : Linux master
TestID              : 0
StartTime           : Thu Jul 18 17:23:56 2019
Path                : /home/steven/.local/share/mpi-hello-world
FS                  : 94.5 TiB   Used FS: 0.0%   Inodes: 2932031007402.7 Mi   Used Inodes: 0.0%

Options: 
api                 : MPIIO
apiVersion          : (3.1)
test filename       : testFile
access              : single-shared-file
type                : collective
segments            : 60
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 600
clients per node    : 60
repetitions         : 1
xfersize            : 10 MiB
blocksize           : 10 MiB
aggregate filesize  : 351.56 GiB

Results: 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     21193      10240      10240      1.82       14.90      4.33       16.99      0   
read      27351      10240      10240      0.265332   12.63      5.08       13.16      0   
remove    -          -          -          -          -          -          0.009275   0   
Max Write: 21192.94 MiB/sec (22222.41 MB/sec)
Max Read:  27351.10 MiB/sec (28679.71 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       21192.94   21192.94   21192.94       0.00    2119.29    2119.29    2119.29       0.00   16.98679     0    600  60    1   0     1        1         0    0     60 10485760 10485760  360000.0 MPIIO      0
read        27351.10   27351.10   27351.10       0.00    2735.11    2735.11    2735.11       0.00   13.16217     0    600  60    1   0     1        1         0    0     60 10485760 10485760  360000.0 MPIIO      0
Finished            : Thu Jul 18 17:24:26 2019

#5

I think your chunk size is mismatched with your read pattern. You said the selected chunk size is 4032 x 4032, but each read process accesses a subsection of 6144 x 6144. This is a non-integral fit. This means that many chunks must be read and decompressed by more than one process. Your timing reports are in the right ballpark for this kind of impact.

The remedy is to select a different chunk strategy such that chunks do not cross read subsections. Each chunk should be either a reader subsection, or divide evenly into a subsection. Specifically for this case, chunk edges should be 6144, or an integral fraction of 6144.

Also, with larger chunk size, be careful to not exceed the chunk cache size. A conservative start would be chunks of 1024 x 1024 to check for improved performance, then tune further if desired.
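
One way to compare candidate chunk sizes is to count how many processes touch a single chunk. The Python sketch below does this for the block-cyclic selection described in post #1 (192 x 192 blocks on a 32 x 16 grid), which is an assumption about the access pattern; for a contiguous 6144 x 6144 subsection per process the counts would differ. The function names are illustrative, and partial chunks at the dataset edge are ignored.

```python
# Block-cyclic layout from post #1 (assumed access pattern for this sketch).
NPROW, NPCOL = 32, 16
MB, NB = 192, 192


def owners_along_dim(chunk_index, chunk_edge, block, n_procs):
    """Number of distinct process coordinates that own part of one chunk
    along a single dimension. Block b belongs to process (b mod n_procs)."""
    first_block = (chunk_index * chunk_edge) // block
    last_block = ((chunk_index + 1) * chunk_edge - 1) // block
    # Consecutive block indices map to consecutive owners, wrapping around.
    return min(last_block - first_block + 1, n_procs)


def procs_touching_chunk(ci, cj, chunk_edge):
    """Number of processes whose selection intersects chunk (ci, cj)."""
    return (owners_along_dim(ci, chunk_edge, MB, NPROW)
            * owners_along_dim(cj, chunk_edge, NB, NPCOL))


for edge in (192, 1024, 4032, 6144):
    print(edge, procs_touching_chunk(0, 0, edge))
```

Under these assumptions, a chunk is read by a single process only when its edges line up with the 192 x 192 distribution blocks, so it may be worth including multiples and divisors of 192 in the chunk sizes that are tested.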


#6

An IOR run shows that even without HDF5 the MPI-IO read performance drops dramatically when the dataset gets much larger than the memory of our storage servers. (Running IOR with the settings from the BeeGFS wiki, this point is not reached, and thus the values look better.) I will ask the administrators whether they can fix the performance problem by adding RAM etc.

But two problems are still open. First, when we write to the datasets, the time between the call of H5Dwrite_f and the moment the actual writing starts is quite large; in the above setup it takes up to several minutes, although writing fill values is turned off. Second, for datasets larger than 2 GB per process (overall problem size 393216 x 393216 -> 1.2 TB dataset), the program crashes with the following error:

Reading from remote process' memory failed. Disabling CMA support
Assertion failure at /builddir/build/BUILD/libpsm2-10.3.8/ptl_am/ptl.c:152: nbytes == req->recv_msglen
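
A quick size check as a Python sketch: assuming the same 512 processes as in the setup above and 8-byte doubles, the per-process share of the 393216 x 393216 matrix exceeds 2^31 - 1 bytes, the largest value of a signed 32-bit integer. Whether that limit is what actually triggers the assertion above is not established here; the snippet only does the arithmetic.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer


def bytes_per_process(rows, cols, n_procs, bytes_per_elem=8):
    """Share of a rows x cols matrix held by each process, in bytes,
    assuming the matrix divides evenly among the processes."""
    return rows * cols * bytes_per_elem // n_procs


n_procs = 512  # assumption: same process count as in the first post

for rows, cols in [(196608, 98304), (393216, 393216)]:
    per_proc = bytes_per_process(rows, cols, n_procs)
    print(rows, cols, per_proc, per_proc > INT32_MAX)
```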

@dave.allured I adjusted the chunk sizes so that they fit the data layout better, which helped a bit from the performance point of view.

Edit: Another test with Intel MPI instead of OpenMPI shows neither the startup delay before the write begins nor the crash with the large problem.