Extremely slow read speeds with multiple processes

Hi all,

I am running an MPI-enabled C++ program on 4 processes, where each process writes to its own h5 file. The files are each about 10GB in size, with 10,000 datasets of roughly similar sizes (so each is about 1MB). The datasets are written using H5LTmake_dataset_char and read using H5LTread_dataset_char (HDF5 1.8.20) within the same execution. I make about 50,000 calls to each of these functions in every process, and the code spends a total of about 1 minute in each (which is consistent with HDD read/write speeds).
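
For reference, the per-process access pattern is essentially the following simplified sketch (the file and dataset names and the loop structure are made up for illustration, not my actual code):

#include <hdf5.h>
#include <hdf5_hl.h>
#include <string>
#include <vector>

int main() {
    /* one file per MPI rank; the name is illustrative */
    hid_t fid = H5Fcreate("rank_0.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    const hsize_t dims[1] = {1u << 20};              /* ~1MB per dataset */
    std::vector<char> buf(dims[0], 0);

    for (int i = 0; i < 10000; ++i) {                /* one dataset per event */
        std::string name = "event_" + std::to_string(i);
        H5LTmake_dataset_char(fid, name.c_str(), 1, dims, buf.data());
        H5LTread_dataset_char(fid, name.c_str(), buf.data());
    }
    H5Fclose(fid);
    return 0;
}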

The problem arises when I run multiple instances of this program at the same time. For example, I ran 3 instances (so 3x4 = 12 processes, on a 24-core machine with 256GB of RAM) performing identical operations, except that they write/read their h5 files in different directories; no two processes use the same h5 file. In this case the writing time increases only modestly, from about 1 minute to perhaps 2 minutes of wallclock time. The wallclock time for reading, however, jumps to 2 hours!

The usual advice for slow read speeds (e.g. tuning the chunk size or number of datasets) seems to be inapplicable because the behavior of a single instance of this program is completely sensible. Does anyone have any insight into what could be causing the massive slowdown in this case?

Thanks for your help!

From the information you posted there is no way to tell why your software runs slowly. The only way to find out is by profiling it, which will show you the hot spots. Posting the profiling information will lead to analysis rather than a guessing game.

Why the 10,000 datasets? Did you know that when looking up or creating a dataset you are doing operations on an internal data structure? If that is where the time goes, it will show up in your profile (see links).

Chunk size, and accessing datasets at chunk boundaries, does matter. Since you are using C++, you may be interested in H5CPP, an easy-to-use template library with meticulously profiled, optimised calls.
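
If you do move to chunked storage, the relevant C API knobs look roughly like the sketch below; the chunk and cache sizes are placeholders to show where the settings go, not recommendations:

#include <hdf5.h>

/* create a ~1MB chunked dataset and open it with an enlarged chunk cache;
   all sizes here are placeholders, not tuned values */
hid_t make_chunked(hid_t fid)
{
    const hsize_t dims[1]  = {1u << 20};
    const hsize_t chunk[1] = {1u << 16};                   /* 64KB chunks */

    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);

    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl, 12421, 16*1024*1024, 0.75);   /* slots, bytes, w0 */

    hid_t dset = H5Dcreate2(fid, "event_0", H5T_NATIVE_CHAR, space,
                            H5P_DEFAULT, dcpl, dapl);
    H5Sclose(space); H5Pclose(dcpl); H5Pclose(dapl);
    return dset;
}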

Here is an example of a good profile from H5CPP: armadillo and compound datatype. As you can see, most of the work is strictly I/O related, with little to no waste. This is what you will get with the H5CPP operators.

Here is another example, a not-so-good one saved from some older profiling sessions. Notice that the I/O calls don't even show up; instead, most of the work is spent in indexing and related calls. (That poorly performing code path never made it into H5CPP; instead, I added a custom-built high-throughput filtering pipeline.)

best wishes: steven@vargaconsulting.ca

Hi Steven, thank you for the information. I have profiled the user code, which is how I determined that the call to H5LTread_dataset_char is the overwhelmingly dominant expense when multiple instances of the executable are running. I have not profiled the HDF5 library itself, which I was hoping I wouldn't have to do, but if that is what it takes, then so be it. :slight_smile:

I was also hoping there was some institutional intuition in the HDF5 group about the following curiosities:

  • How does running multiple instances of the executable degrade read performance by almost 2 orders of magnitude?
  • Why does it only affect reading, while writing is completely unaffected?
  • If the default chunk size is a poor choice, wouldn’t that also affect running just a single instance of the executable?

These seem like questions that could be answered by someone familiar with the gory details of the implementation in the HDF5 library. If not, well, then I will report back with profiling results!

The 10,000 datasets are there because this file keeps track of events, and there are 10,000 events. The way HDF5 is being used is nearly trivial, though (a single call to H5LT{make,read}_dataset_char per dataset), so if this is an inherent problem with HDF5, perhaps it is not the appropriate file format for this data?

In addition to profiling HDF5 itself, I’m happy to provide any additional information if it would help provide some context for this issue.

Thanks again!

I don’t speak for the HDF Group; as far as I know, they do provide support services under contract.

How does running multiple instances of the executable degrade read performance by almost 2 orders of magnitude?

Improper use is the most common cause. The HDF5 C API is a complex library that requires an understanding of its inner mechanisms, and making calls to the API without profiling them can lead to suboptimal results.
On AWS EC2 with H5CLUSTER, a supercomputer-class rental cluster I am currently working on, the HDF5 system can attain hundreds of GByte/sec of throughput with near-linear scaling.

When I designed H5CPP, the complexity of the I/O calls was taken into consideration. Did you know you can use any of the templates as a drop-in replacement for HDF5 C API calls? Since hid_t is binary compatible with h5::ds_t, you can simply try it and see whether it works for you. Here is an example:

char* buffer = static_cast<char*>(malloc(...));
hid_t loc_id = H5Dopen(...);
// HDF5 Lite equivalent: herr_t H5LTread_dataset_char( hid_t loc_id, const char *dset_name, char *buffer )
h5::read(loc_id, buffer, h5::count{1024});

Of course, since h5::read is not based on the HDF5 Lite API, you should see a significant improvement on certain calls.

Here is another example:

#include <armadillo>
#include <h5cpp/all>
...
auto fd = h5::open("some_file.h5", H5F_ACC_RDWR);
/* the returned (RVO) arma::Mat<double> object will have size 10x5, filled from the selection */
try {
    /* extents of unit dimension are dropped, so a 2D object is returned */
    auto M = h5::read<arma::mat>(fd, "path/to/object",
            h5::offset{3,4,1}, h5::count{10,1,5}, h5::stride{3,1,1}, h5::block{2,1,1});
} catch (const std::runtime_error& ex) {
    ...
}

Hope it helps!
P.S.: if you want performance, either roll your own plain API calls or use H5CPP.


Daniel,

It is hard to say what is going on without seeing the application and profiling the HDF5 library. Please make sure any identifiers opened by the application are closed as soon as the object is no longer needed.
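
As a quick sanity check, you can ask the library how many identifiers are still open against a file just before closing it. A minimal sketch (the function and its reporting are only illustrative):

#include <hdf5.h>
#include <cstdio>

/* call just before H5Fclose(fid); anything beyond the file identifier
   itself usually points to a leaked dataset, group, or attribute handle */
void report_open_ids(hid_t fid)
{
    ssize_t all  = H5Fget_obj_count(fid, H5F_OBJ_ALL);
    ssize_t dset = H5Fget_obj_count(fid, H5F_OBJ_DATASET);
    std::printf("open identifiers: %zd (datasets: %zd)\n", all, dset);
}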

Before diving into HDF5 library profiling, I would first check what is going on with the application's memory and then with the system itself.

If your application is not using all of the system memory, then maybe you can try using 12 plain 10GB binary files and reading from them the way you read the HDF5 files (i.e., with 3 MPI C programs, each using 4 processes, and each process reading 1MB at a time from its corresponding file)?
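
A per-process read loop for that comparison could look roughly like this (a sketch only; the path and sizes are placeholders):

#include <fstream>
#include <vector>

/* mirror the HDF5 access pattern against a plain binary file:
   read a ~10GB file in 1MB pieces; the path is a placeholder */
void read_plain(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 20);                        /* 1MB per read */
    for (int i = 0; i < 10000 && in; ++i)                  /* ~10GB total  */
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
}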

Both LT functions the application uses are a very thin layer on top of HDF5. It is hard to believe that they cause any problems, especially since the datasets are contiguous. (Is that correct?)
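
If you want to confirm the layout, it can be read back from the dataset creation property list; a minimal sketch with a placeholder dataset name:

#include <hdf5.h>
#include <cstdio>

/* report whether a dataset is stored contiguously or in chunks */
void print_layout(hid_t fid, const char* name)
{
    hid_t dset = H5Dopen2(fid, name, H5P_DEFAULT);
    hid_t dcpl = H5Dget_create_plist(dset);
    H5D_layout_t layout = H5Pget_layout(dcpl);
    std::printf("%s: %s\n", name,
                layout == H5D_CONTIGUOUS ? "contiguous" :
                layout == H5D_CHUNKED    ? "chunked"    : "other");
    H5Pclose(dcpl);
    H5Dclose(dset);
}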

Thank you!

Elena