Parallel writes to single H5 slower than single-process writes (h5py+MPI, uncompressed)

I’ve been trying to get parallel writes to a single HDF5 file to work. As a test, all processes generate 50 random numpy arrays with a shared seed and create 50 datasets to store them; the writing workload is then divided between the available processes. I find that as I increase the number of processes, both the total time to write the data and the time per process increase substantially. The md5sum of the output test.h5 file is identical regardless of the number of processes (613067f62b4dfbba9365cba3bce41a49 on my machine).

Any information to help me better understand why parallel writing is harming rather than helping performance would be much appreciated!

Here is the minimal example code.

from mpi4py import MPI
import h5py
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

with h5py.File("test.h5", "w", driver='mpio', comm=comm) as file:
    dataset_count = 50
    shape = (1000000,)
    np.random.seed(42)
    datasets = []
    
    # Dataset creation modifies file metadata and is collective in parallel HDF5,
    # so every rank creates all 50 datasets; the shared seed means every rank
    # also generates identical data.
    for i in range(dataset_count):
        file.create_dataset(f"/{i}", shape=shape, dtype=np.int64)
        datasets.append(np.random.randint(0, 4200000, shape[0]))

    accum = 0
    write_count = 0
    written_to = []
    # Round-robin the 50 dataset writes over the available ranks, timing each write.
    for i in range(dataset_count):
        r = i % size
        if rank == r:
            start = time.time()
            write_to = f"/{i}"
            written_to.append(write_to)
            file[write_to][:] = datasets[i]
            end = time.time()
            accum += end-start
            write_count += 1

print(f"Rank {rank} data write to {written_to} runtime: {accum}s with {size} processes, {write_count} writes, average {accum/write_count:.3f}s/write")

Output for 1 process:

mpiexec -n 1 python test.py

Rank 0 data write to ['/0', '/1', '/2', '/3', '/4', '/5', '/6', '/7', '/8', '/9', '/10', '/11', '/12', '/13', '/14', '/15', '/16', '/17', '/18', '/19', '/20', '/21', '/22', '/23', '/24', '/25', '/26', '/27', '/28', '/29', '/30', '/31', '/32', '/33', '/34', '/35', '/36', '/37', '/38', '/39', '/40', '/41', '/42', '/43', '/44', '/45', '/46', '/47', '/48', '/49'] runtime: 0.13713431358337402s with 1 processes, 50 writes, average 0.003s/write

Output for 5 processes:

mpiexec -n 5 python test.py

Rank 2 data write to ['/2', '/7', '/12', '/17', '/22', '/27', '/32', '/37', '/42', '/47'] runtime: 0.16770410537719727s with 5 processes, 10 writes, average 0.017s/write
Rank 3 data write to ['/3', '/8', '/13', '/18', '/23', '/28', '/33', '/38', '/43', '/48'] runtime: 0.18259263038635254s with 5 processes, 10 writes, average 0.018s/write
Rank 4 data write to ['/4', '/9', '/14', '/19', '/24', '/29', '/34', '/39', '/44', '/49'] runtime: 0.1777360439300537s with 5 processes, 10 writes, average 0.018s/write
Rank 0 data write to ['/0', '/5', '/10', '/15', '/20', '/25', '/30', '/35', '/40', '/45'] runtime: 0.1710968017578125s with 5 processes, 10 writes, average 0.017s/write
Rank 1 data write to ['/1', '/6', '/11', '/16', '/21', '/26', '/31', '/36', '/41', '/46'] runtime: 0.15236258506774902s with 5 processes, 10 writes, average 0.015s/write

Hi, @skubi!

Would you please write a C version for us so we can determine the root cause?
In general, for any performance-related issue, it’s better for us to eliminate the Python stack.

From a cursory inspection of the program, I think that most of the increase in recorded time may be due to repeated initialization/dataset-opening costs. The per-rank total runtimes in the 1-rank and 5-rank cases are fairly similar (0.14 s vs ~0.17 s), and the ~5x increase in average time per write looks mostly like an artifact of each rank doing one fifth as many writes.
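To make the arithmetic explicit from your own output: with one rank, 0.137 s is spread over 50 writes (≈ 0.003 s/write), while with five ranks roughly the same per-rank total (~0.15–0.18 s) is spread over only 10 writes (≈ 0.017 s/write), which accounts for almost all of the apparent per-write slowdown.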

Increasing the amount of data written so that the time spent writing dominates the initialization time should give you a more accurate measure of the overhead of parallelization, if any.
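For example, here is a rough sketch of such a measurement (the larger shape, the shared payload, and the barrier/MPI.Wtime timing are my own illustrative choices, not taken from your script):

from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

dataset_count = 50
shape = (10000000,)  # ~80 MB per dataset, so write time dominates setup cost

rng = np.random.default_rng(42)
data = rng.integers(0, 4200000, shape[0])  # same payload for every dataset; enough for timing

with h5py.File("test_big.h5", "w", driver='mpio', comm=comm) as f:
    # Dataset creation is collective, so all ranks create all datasets.
    dsets = [f.create_dataset(f"/{i}", shape=shape, dtype=np.int64)
             for i in range(dataset_count)]

    comm.Barrier()              # start the clock only once every rank is ready
    t0 = MPI.Wtime()
    for i in range(rank, dataset_count, size):
        dsets[i][:] = data
    comm.Barrier()              # stop only after the slowest rank has finished
    t1 = MPI.Wtime()

if rank == 0:
    print(f"{size} processes: {t1 - t0:.2f} s wall time for all writes")

Comparing that single wall-time number across process counts avoids dividing by a different number of writes per rank.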