Advice on storing many datasets in parallel?

Hello Folks,

I have an MPI cluster workflow that generates millions of small datasets in parallel. So far I have it programmed so that all the worker ranks send these small datasets to rank 0, where they are written to an HDF5 file serially. The bottleneck is storing the data. I have read the HDF5 documentation extensively and see many possible solutions, so I just wanted to get some advice on the best approach and avoid wasting time on a bad one.

There is the MPI file driver; however, the only examples I have seen of it involve many ranks writing to the same dataset, so it seems that driver is meant for large distributed datasets. Does it work for writing many small datasets from multiple processes? One concern I have is that file-system I/O becomes very slow on our cluster when you have access from different nodes. If you can keep all the I/O on a single node it’s better (not great, but better).

Given the I/O limitations of our cluster … I was also looking for solutions where each rank/node could have its own local HDF5 file. I see that there is a multi-file driver, for example. However, I guess that you would still have to open the root file, and then internally the library opens all the other multi-file parts as needed. So it wasn’t clear how I could control the multi-file layout so that each rank has exclusive control of one file. Maybe there is something I am missing?

The solution that seems most suitable, if not the most elegant … is the mount functionality. Each rank can create its own HDF5 file, and the mount functionality can then be used to read the data later in a transparent way. It’s not as elegant … but it’s at least a solution where I know for certain that I can have parallel writes. I can see that you can define linkages built into a file, so these other files can be mounted automatically. That way, subsequent data processing doesn’t need detailed information about the multi-file structure.

My final option is similar to the above: separate files per rank, but then use h5copy afterwards to combine them into one file. Is h5copy especially efficient? Does it just append data to the file, or does it still get stuck having to unpack and decompress everything, just to rebuild it all over again?

Thank you all for your helpful advice!
Mike

I would first try to avoid creating so many datasets in the first place. That said, if you can create all the datasets and the general HDF5 layout on one rank, close the file, and then reopen it from all ranks to write the raw data, that might save you some time.

The MPI-IO Driver is probably not a good solution.

A. Parallel HDF5 is optimized for the opposite workload: many ranks writing to the same large dataset with collective I/O aggregating small writes into large ones.

For millions of small datasets, you hit two fundamental problems:

  1. Metadata serialization: Dataset creation operations require metadata updates that effectively serialize across ranks. Even with independent I/O, the metadata operations become a brutal bottleneck.

  2. Small-write pathology: MPI-IO and parallel file systems are optimized for large, contiguous writes. Millions of small datasets mean millions of small, scattered writes—exactly what these systems handle poorly.

Given your cluster’s node-locality constraints, parallel HDF5 would likely make things worse.

B. Multi-file/Split/Family Drivers — Wrong Tool

The split and family drivers only change how a single logical file is laid out on disk (metadata vs. raw data, or fixed-size pieces); all access still goes through one file handle in one process, so they don’t give you independent parallel writers.

C. Definitely look into subfiling, as it is designed for your scenario:

  1. Automatically creates multiple subfiles (typically one per node or per I/O concentrator (IOC))
  2. Presents a single logical HDF5 file to your application
  3. Keeps I/O node-local when configured properly
  4. Uses dedicated I/O concentrator threads

However, this most likely will not address the metadata bottleneck caused by so many datasets.

D. File-Per-Process/Node with External Links — Your Best Bet

Your mount/external-link approach is the proven HPC pattern for this workload.

  1. Write Phase (fully parallel, no contention)

// Each rank creates its own file
// Write millions of small datasets - no coordination needed

  2. Create Master File with External Links (Single Rank, Post-Processing)

// Rank 0 creates the “virtual” unified view

E. For VDS, you will have to work around the fact that it is not meant for parallel I/O, and it only makes sense if your millions of datasets are part of a larger logical whole.

For your use case, external links are more efficient than h5copy because they involve no data movement.
Also, h5copy is NOT just appending bytes. The best case is a direct chunk copy from the source to the destination, with identical chunk layout and compression. Worst case, if chunk layouts differ or you’re changing compression settings, each chunk is decompressed, copied to memory, and recompressed. If you need to restructure, h5repack would give you more control.

@hyoklee definitely wins the prize in the artistry category, but let’s first ground this discussion a bit.

  1. What’s your definition of ‘small’? 1 KB, 256 KB, 1 MB, 8 MB, 100 MB?
  2. What kinds of datatypes are involved? Are they all different, or are there a handful of common datatypes?
  3. Ditto for shapes (rank and extent).
  4. Do you know these sizes a priori (based on input parameters), or are they dynamic and only known at runtime?
  5. Do you have node-local storage for buffering purposes, and how much?

G.


Thank you very much for responding to each of my ideas in turn like that. It greatly clears up where I should prioritize my work … so my original suspicions were correct. So, thank you!

@brtnfld and @hyoklee are both very helpful … they are both giving me some good directions to look deeper. I am thinking the file per rank, then links within a master is the way to go … but maybe you have some additional insights?

Let me answer your questions. The rank (2), number of columns (6), and the data type (double) are known a priori. The number of rows is only known at run-time. The size of a single dataset is typically (99%) up to tens of KB; the maximum size is 3 MB in rare (0.000001%) cases. The calculation produces approximately 10^7 datasets. Node-local storage is created with a RAM disk. I can use approximately ~GB of local RAM buffering per rank without choking the rest of the system. The calculation is not long-running, so if buffered content in RAM is lost, it is not a major loss.

I sense maybe two additional insights from you, gheber. Are there options to configure buffering in HDF5 objects? Another possibility is to pack all the datasets into one table … then use metadata to index the rows where the different parts begin and end …

In case it’s relevant, the file system is CephFS … it runs off a separate file server that gets overloaded easily.

@mimc, after writing the data, how do you access/analyze it?

I would distinguish between how you acquire data and how you present it to users/applications. Sometimes they are the same, but not by necessity.

Another approach to address the problem of many small datasets is to create one or a few large datasets and then derive the small datasets as subsets/selections from the “mother dataset(s)” via external links, VDS, etc.

Does that make sense? Since you are dealing with Nx6 floating-point datasets, just have one (or one per rank) [*]x6 and simultaneously assemble a “descriptor dataset” that keeps track of the row offset where a new small dataset begins. You can keep that descriptor dataset in memory (or store it), but you only need it to later create the selections/VDS for the small (sub-)datasets. That way, you are storing one or a few large datasets in parallel and, eventually, creating metadata for what appear to be small datasets.

Does that make sense?

G.

This HDF5 development is part of a larger effort to clean up and organize a workflow. In the past, everything was stored as a complicated series of CSV files; the system is not bad at writing CSV files in parallel, but it was a mess keeping track of them afterwards. So the aim is to have an efficient API to manage the data, so things are easier for the user. The data is used for different purposes, so there isn’t one consumption pattern that I can optimize for.

I think I have some good ideas now … I will have a small master HDF5 file that keeps track of the overall structure, and then a sub-folder of daughter datasets. I like the idea you present of keeping multiple sets in the same large dataset. This gives me the option to merge and split daughter files as needed for different workflows.

Note: a few months ago, I was experimenting with SQL databases … there are some nice features there … but the scaling was terrible for larger parallel calculations. So it’s been quite a journey learning about all the different options. HDF5 seems to have more flexibility …

Anyway, thank you all for the tips …