Need help with appending to multiple HDF5 datasets in parallel (with MPI)

So I have an HDF5 file to which I’d like to write a few datasets in parallel. Since I am a noob with mpi4py and barely understand how it works, I asked an AI for help, and together we sketched this minimal example Python script:

from mpi4py import MPI
import h5py
import numpy as np
import time
import os

# Monkey-patch h5py.Dataset with the append method
def append_h5py_dataset(self, value):
    value = np.array(value)
    if value.shape != self.shape[1:]:
        raise ValueError(
            f"value has shape {value.shape} while every line's shape in the "
            f"data is {self.shape}."
        )
    self.resize(self.shape[0] + 1, axis=0)
    self[-1] = value
h5py.Dataset.append = append_h5py_dataset

# Initialize MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Abort if not run with mpirun
if size == 1:
    raise RuntimeError("This script must be run with mpirun or mpiexec.")

# List of integers between 10 and 20, size 6
integers = [12, 14, 16, 18, 11, 13]

# File path
file_path = "parallel_hdf5_example.h5"

# Remove the file if it exists (only rank 0 does this to avoid race conditions)
if rank == 0:
    if os.path.exists(file_path):
        os.remove(file_path)
comm.Barrier()  # Synchronize all processes after file removal

# Function to process an integer
def process_integer(i):
    rng = np.random.default_rng(i)
    try:
        with h5py.File(file_path, mode="a", driver='mpio', comm=comm) as h5file:
            # Create dataset if it doesn't exist
            if f"/{i}" not in h5file:
                # Create dataset with a single float
                h5file.create_dataset(f"/{i}", data=[rng.uniform(0, 1)], maxshape=(None,))
            
            # Append data in a loop
            dataset = h5file[f"/{i}"]
            for _ in range(i):  # Use `i` as the loop range
                time.sleep(0.1)  # Simulate intensive computation
                new_data = rng.uniform(0, 1)  # Single float
                print(f"integer {i} gained new data {new_data}")
                dataset.append(new_data)  # Append directly (no list wrapping)
            # Print completion message
            print(f"Process {rank} finished processing integer {i}")
    except Exception as e:
       print(f"Process {rank} encountered an error while processing integer {i}: {e}")

# Distribute integers across processes
for idx, i in enumerate(integers):
    if idx % size == rank:
        print(f"Process {rank} processing integer {i}")
        process_integer(i)

# Synchronize all processes
comm.Barrier()

if rank == 0:
    print("All processes have finished.")

When running it, only the rank 0 process finishes processing its integer… If I add an h5file.flush() call right after the for _ in range(i) loop, it hangs there and never prints the “finished processing” line for any rank… Also, the except Exception block doesn’t print anything…

Any idea why that is? Maybe it is due to the way I append to the datasets? Is it a known issue that resizing datasets in parallel with MPI can cause problems?

Installed versions info:

  • h5py==3.12.1 (I know that 3.13.0 was just released, but I haven’t tested it yet).
  • mpi4py==4.0.1 (I know that 4.0.3 was also released recently, but I haven’t tested it either).

Doron, how are you?

(MPI-)Parallel HDF5 works if and only if your application abides by a certain etiquette (collective calling requirements). This etiquette ensures that each MPI rank in a communicator has the same view of the underlying HDF5 file. In particular, when changes are made to the file (group/dataset/attribute creation, etc.), each rank must “know” that fact.

If I read your example right, only the rank that processes a particular integer creates the underlying dataset. (Correct?) If that’s the case, then that’s a violation of this etiquette because other ranks aren’t aware of that new dataset. In other words, all ranks must call H5Dcreate for a dataset regardless of whether they ever write to or read from that dataset. The same applies to dataset extension (or shrinking), i.e., all ranks must witness a change in dataset shape. Otherwise, the corresponding metadata seen by individual ranks would be inconsistent. OK?
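To make that concrete, here is a minimal sketch of the pattern in h5py (the file name, dataset names, and the lockstep loop are placeholders, not your script): every rank creates every dataset and participates in every resize, while the actual raw-data writes remain independent.

from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

with h5py.File("collective_demo.h5", "w", driver="mpio", comm=comm) as f:
    # Collective: every rank creates every dataset, even the ones it never writes to.
    dsets = [
        f.create_dataset(f"/rank{r}", shape=(0,), maxshape=(None,), chunks=(64,), dtype="f8")
        for r in range(size)
    ]

    rng = np.random.default_rng(rank)
    for _ in range(5):
        # Collective: every rank resizes every dataset by the same amount,
        # so all ranks keep an identical view of the file's metadata.
        for d in dsets:
            d.resize(d.shape[0] + 1, axis=0)
        # Independent: only the owning rank writes the new element.
        dsets[rank][-1] = rng.uniform(0, 1)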

Here are a few references on parallel HDF5 that you might find helpful:

Best regards,
Gerd


Thanks @gheber for the very nice reply! Indeed, I also read in h5py’s docs about collective vs. independent operations (here), and although it is not mentioned there explicitly, creating a dataset does seem to be a collective operation, and hence the datasets need to be created no matter what the MPI rank is…

Correct :).

That’s kind of surprising, because in my case there is no interaction between the datasets. But I can imagine that managing the file layout (where each dataset and group starts and ends) gets hard when multiple processes want to create datasets. I just hoped that h5py.File would somehow magically take care of that.
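For completeness, here is roughly how I now plan to restructure my example (just a sketch, not heavily tested): since in my case each dataset’s final length is known up front (i + 1 values), every rank creates every dataset with its final size collectively, and then each rank fills only its own datasets independently, so no resizing is needed at all.

from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

integers = [12, 14, 16, 18, 11, 13]
file_path = "parallel_hdf5_example.h5"

with h5py.File(file_path, "w", driver="mpio", comm=comm) as h5file:
    # Collective: all ranks create all datasets, with their final sizes,
    # so every rank sees the same file metadata.
    datasets = {
        i: h5file.create_dataset(f"/{i}", shape=(i + 1,), dtype="f8")
        for i in integers
    }

    # Independent: each rank fills only the datasets assigned to it.
    for idx, i in enumerate(integers):
        if idx % size == rank:
            rng = np.random.default_rng(i)
            datasets[i][:] = rng.uniform(0, 1, size=i + 1)

comm.Barrier()
if rank == 0:
    print("All processes have finished.")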

Thanks again @gheber for the helpful reply.