HDF5 1.10, 1.12 dramatic drop of performance vs 1.8

m.deij · June 22, 2022, 1:07pm

We’re writing small HDF5 files (a few megabytes) with results from simulations. This is done incrementally with chunked datasets. Files are written to an NFS filesystem without compression.

I find that the performance of HDF5 1.8.18 is good: a simulation completes in about 1 minute with 2000 writes to about 30 datasets.

When I use the exact same code with HDF5 1.10.8 or 1.12.2, the same simulation takes more than 30 minutes to complete, with almost all the time spent writing to the HDF5 file.

Is there something I can do to regain the 1.8.18 performance?

One thing I already checked: writing to a non-NFS filesystem does not help.

gheber · June 22, 2022, 1:25pm

Yes. Let’s work together, reproduce, and understand what’s going on!
I think there are at least two approaches we can take:

1. You profile both runs (w/ 1.8.18 and 1.10.8) and we look at the output for clues.
2. We develop a simple reproducer to investigate.

I think we should do both, but 2. will take more time. Are you familiar with the Google Performance Tools (gperftools)? You can use pprof --callgrind to generate profiles which can be visualized in KCachegrind. Do you think you can create those and post them here?

Thanks, G.

m.deij · June 22, 2022, 2:09pm

Thanks for your quick answer. I will try to see where I get with profiling.

steven · June 22, 2022, 3:05pm

Can you please clarify the spec, so others can verify your results:
30 datasets, 2000 write requests into each file (30x2000)? What datatype and dimensions?

example for small matrices
example for profiling
benchmark

best: steve

m.deij · June 22, 2022, 3:53pm

I have the files, but can’t upload them (new user status - can that be removed? I promise I’ll behave)

I have put the callgrind files here:
https://nextcloud.marin.nl/index.php/s/r5GMFswRQAfmmKP

The refresco_new.callgrind file is the one with the bad performance. It took 84 seconds to run this, whereas refresco.callgrind (with HDF5 1.8.18) took about 9 seconds.

I notice that as the simulation progresses, it gets slower and slower.

When looking with qcachegrind I see marked differences. 85% of the time is spent in H5Fget_obj_count.

It is most likely related to this topic, as the code calls h5open_f/h5close_f a lot: once for every write.

I’ll take a look at the solution from https://github.com/HDFGroup/hdf5/pull/1657

m.deij · June 22, 2022, 3:53pm

OK, while my previous answer was pending I rebuilt the HDF5 library with the patch applied and it worked: performance is back at the same level as 1.8. Thanks for pointing me in the right direction.

m.deij · June 23, 2022, 12:58pm

The layout of the file is this:

54 datasets of 200 reals and 38 datasets of 2000 reals. Each real is written on its own, using an extendable dataset with a chunk size of 1.

This amounts to 14160 writes and 14160 calls to h5open_f/h5close_f.

gheber · June 23, 2022, 1:03pm

Can you elaborate on that? What’s the reasoning behind that?

Could you run h5stat on one of your files and post the output?

Thanks, G.

m.deij · June 24, 2022, 6:23am

So the idea is that at each time step and/or each outer iteration of the non-linear solver, a number of results gets written to the file, including residual levels, forces and moments, etc.

As the number of outer iterations is not fixed - upon convergence to a certain residual level the solver continues the next timestep - the dataset is extended with a chunk size of 1 each time data needs to be written.

Ultimately, the file may also be read while the simulation is running to monitor the progress, but we’re not doing that yet.

One of the reasons for choosing HDF5 is that we can write to an agreed-upon format that is understood by our data processing pipelines. This format is also used to write results from our experimental facilities and allows us to use the same data processing on e.g. forces and moments from both simulation results and experimental results.

If there is a better way of doing this, please let us know

m.deij · June 24, 2022, 6:26am

Output from h5stat:

File information
        # of unique groups: 9
        # of unique datasets: 75
        # of unique named datatypes: 0
        # of unique links: 0
        # of unique other: 0
        Max. # of links to object: 1
        Max. # of objects in group: 11
File space information for file metadata (in bytes):
        Superblock: 96
        Superblock extension: 0
        User block: 0
        Object headers: (total/unused)
                Groups: 10608/0
                Datasets(exclude compact data): 82192/864
                Datatypes: 0/0
        Groups:
                B-tree/List: 9488
                Heap: 2752
        Attributes:
                B-tree/List: 0
                Heap: 0
        Chunked datasets:
                Index: 1096208
        Datasets:
                Heap: 0
        Shared Messages:
                Header: 0
                B-tree/List: 0
                Heap: 0
        Free-space managers:
                Header: 0
                Amount of free space: 0
Small groups (with 0 to 9 links):
        # of groups with 6 link(s): 2
        # of groups with 8 link(s): 2
        Total # of small groups: 4
Group bins:
        # of groups with 1 - 9 links: 4
        # of groups with 10 - 99 links: 5
        Total # of groups: 9
Dataset dimension information:
        Max. rank of datasets: 1
        Dataset ranks:
                # of dataset with rank 1: 75
1-D Dataset information:
        Max. dimension size of 1-D datasets: 873
        Small 1-D datasets (with dimension sizes 0 to 9):
                # of datasets with dimension sizes 1: 11
                Total # of small datasets: 11
        1-D Dataset dimension bins:
                # of datasets with dimension size 1 - 9: 11
                # of datasets with dimension size 10 - 99: 36
                # of datasets with dimension size 100 - 999: 28
                Total # of datasets: 75
Dataset storage information:
        Total raw data size: 99260
        Total external raw data size: 0
Dataset layout information:
        Dataset layout counts[COMPACT]: 0
        Dataset layout counts[CONTIG]: 0
        Dataset layout counts[CHUNKED]: 75
        Dataset layout counts[VIRTUAL]: 0
        Number of external files : 0
Dataset filters information:
        Number of datasets with:
                NO filter: 75
                GZIP filter: 0
                SHUFFLE filter: 0
                FLETCHER32 filter: 0
                SZIP filter: 0
                NBIT filter: 0
                SCALEOFFSET filter: 0
                USER-DEFINED filter: 0
Dataset datatype information:
        # of unique datatypes used by datasets: 1
        Dataset datatype #0:
                Count (total/named) = (75/0)
                Size (desc./elmt) = (22/4)
        Total dataset datatype count: 75
Small # of attributes (objects with 1 to 10 attributes):
        # of objects with 10 attributes: 8
        Total # of objects with small # of attributes: 8
Attribute bins:
        # of objects with 10 - 99 attributes: 84
        Total # of objects with attributes: 84
        Max. # of attributes to objects: 16
Free-space persist: FALSE
Free-space section threshold: 1 bytes
Small size free-space sections (< 10 bytes):
        Total # of small size sections: 0
Free-space section bins:
        Total # of sections: 0
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
  File metadata: 1201344 bytes
  Raw data: 99260 bytes
  Amount/Percent of tracked free space: 0 bytes/0.0%
  Unaccounted space: 17392 bytes
Total space: 1317996 bytes

steven · June 24, 2022, 2:01pm

This pattern is common in sensor networks, high-frequency trading (HFT), and so on. For C++, H5CPP offers the h5::append operator, which internally buffers packets up to the chunk size, sends each full chunk through its own compression pipeline (based on BLAS level 3 blocking), and finally makes an H5Dwrite_chunk call. Unfortunately C++17 is not an option for everyone, but you could compile a subroutine, export it with “C” linkage, and link it to Fortran, C, etc.

The mechanism can saturate the I/O bandwidth of a commodity computer using a single core, which serves as your I/O thread.
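
To make the buffering idea concrete, here is a minimal sketch in plain C++. It is not H5CPP's implementation: `ChunkBuffer` and its `sink` callback are hypothetical names, and the sink stands in for whatever actually writes a block to the file. It assumes `chunk_elems` is at least 1.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper (not part of HDF5 or H5CPP): collects single values and
// hands them to `sink` in blocks of `chunk_elems`, so the expensive write
// happens once per block instead of once per value.
template <typename T>
class ChunkBuffer {
public:
    using Sink = std::function<void(const std::vector<T>&)>;

    // Assumes chunk_elems >= 1.
    ChunkBuffer(std::size_t chunk_elems, Sink sink)
        : chunk_elems_(chunk_elems), sink_(std::move(sink)) {
        buf_.reserve(chunk_elems_);
    }

    // Store one value; forward a full block to the sink when the buffer fills.
    void append(const T& value) {
        buf_.push_back(value);
        if (buf_.size() == chunk_elems_) flush();
    }

    // Forward whatever is buffered (possibly a partial block), e.g. at shutdown.
    void flush() {
        if (buf_.empty()) return;
        sink_(buf_);
        buf_.clear();
    }

private:
    std::size_t chunk_elems_;
    Sink sink_;
    std::vector<T> buf_;
};
```

The simulation would call `append` once per value and `flush` once before closing the file; the sink is then invoked once per full block plus at most once for the remainder.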

To solve the entire problem, interprocess communication (IPC) can help. The simplest data structure is a queue; I will get back to this later. Middleware such as ZeroMQ, Kafka, or RabbitMQ is a good alternative.
ZeroMQ is popular where both latency and throughput matter and you want some wiggle room when it comes to deployment: [inter thread | interprocess | UDP | TCP | multicast: pgm, epgm, etc…]. In fact, it is like a Swiss army knife. ZeroMQ is so useful for solving these sorts of problems that it deserves an article of its own.

Let’s get to the hand-rolled queues. A tough one indeed; this is not something you would normally write on your own, but if you are up for it, here is a threaded version with mutex/lock. There are other varieties too: lock-free and wait-free queues.
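
For readers who want to see the shape of the mutex/lock variety, here is a minimal sketch of a blocking queue using only the C++ standard library. It is an illustration, not the version referred to above; `BlockingQueue`, `push`, `pop` and `close` are names chosen for this example.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

// Illustrative unbounded queue guarded by a mutex and a condition variable.
// Producers call push(); a consumer (e.g. the I/O thread) loops on pop().
template <typename T>
class BlockingQueue {
public:
    // Add one item and wake a waiting consumer.
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push_back(std::move(value));
        }
        cv_.notify_one();
    }

    // Signal that no more items will arrive; wakes all waiting consumers.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until an item is available or the queue is closed.
    // Returns false only when the queue is closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> q_;
    bool closed_ = false;
};
```

In this layout the solver thread would only ever call `push`, while a single writer thread runs `while (queue.pop(v)) { /* write v to the file */ }` and exits after `close()` has been called and the remaining items are drained.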

All of these solutions decouple software components, from single-process multithreaded designs to robust multi-computer, multi-process layouts. Where does HDF5 come into the picture? In the disk I/O thread(s), of course. Since a single thread can saturate the I/O bandwidth on commodity hardware, this approach gives you a robust event recorder that can scale from a single process to a multi-node system.

hope it helps: steve
(diagram is shamelessly stolen from the internet)

gheber · June 27, 2022, 5:56pm

I think that’s rather troublesome. Your metadata to data ratio is rather unusual. For each byte of data you have over 10 bytes of metadata. That’s the price you pay for chunk size 1 (chunk index overhead). Assuming 4 bytes per element, just make the chunk size something like 131,072 or 262,144 and turn on compression. You can still (logically) extend the dataset one element at a time, but the (chunk) allocation cost will be amortized and the metadata to data ratio will be less than 0.01. OK?
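
As a rough check of those numbers, here is a minimal C++ sketch of the arithmetic, using values from the h5stat output earlier in the thread. The function names are made up for illustration, and the chunk count assumes every dataset really uses chunk size 1 with 4-byte elements, so that each element occupies its own chunk.

```cpp
#include <cstdint>

// Metadata bytes per byte of raw data, from the h5stat summary figures.
double metadata_to_data_ratio(std::uint64_t metadata_bytes,
                              std::uint64_t raw_bytes) {
    return static_cast<double>(metadata_bytes) / static_cast<double>(raw_bytes);
}

// Average chunk-index bytes per chunk. Assumption: chunk size 1, so the
// number of chunks equals the number of elements (raw_bytes / elem_bytes).
double index_bytes_per_chunk(std::uint64_t index_bytes,
                             std::uint64_t raw_bytes,
                             std::uint64_t elem_bytes) {
    const std::uint64_t n_chunks = raw_bytes / elem_bytes;
    return static_cast<double>(index_bytes) / static_cast<double>(n_chunks);
}
```

With 1201344 bytes of file metadata, 99260 bytes of raw data and 1096208 bytes of chunk index, this gives roughly 12.1 bytes of metadata per byte of data and, under the stated assumption, about 44 bytes of index per one-element chunk.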

G.

m.deij · June 28, 2022, 1:04pm

@gheber thank you for taking a look, I agree that this is “rather unusual”
I’ll take a closer look at the code and see if it can be modified in such a way as to use a larger chunk size.

m.deij · June 29, 2022, 8:42am

I have created a small example (see attached), where a dataset is extended with one entry at each write. Depending on the chunk size, the metadata size is as follows:

chunk size (elements)   data (bytes)   metadata (bytes)
1                       120000         1128952
50                      120000         26456
2048                    122880         3400
131072                  524288         3400

This will help in setting a sensible default for the chunk size depending on the expected size of the data being written.
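
The data column is consistent with whole chunks being allocated. Here is a minimal sketch of that arithmetic, assuming the example wrote 30000 four-byte reals (inferred from the 120000 bytes, not stated explicitly) and no compression filter. The function names are hypothetical.

```cpp
#include <cstdint>

// Number of chunks needed to hold n_elems elements (rounded up).
std::uint64_t chunk_count(std::uint64_t n_elems, std::uint64_t chunk_elems) {
    return (n_elems + chunk_elems - 1) / chunk_elems;
}

// Bytes allocated for raw data when every chunk is stored at full size,
// i.e. without a compression filter shrinking partially filled chunks.
std::uint64_t allocated_bytes(std::uint64_t n_elems,
                              std::uint64_t chunk_elems,
                              std::uint64_t elem_bytes) {
    return chunk_count(n_elems, chunk_elems) * chunk_elems * elem_bytes;
}
```

For 30000 elements this yields 120000, 120000, 122880 and 524288 bytes for chunk sizes 1, 50, 2048 and 131072, matching the table. The last row shows the trade-off: a very large chunk removes the index overhead but pads a small dataset to a full chunk, which is presumably why compression was suggested alongside it.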

chunk_example.f90 (7.1 KB)

gheber · June 29, 2022, 12:08pm

Very interesting & thanks for sharing. Assuming the data is reasonably compressible, after enabling compression, the data sizes should be even smaller. G.

steven · July 4, 2022, 1:18pm

Based on Richard D. Snyder's FZMQ Fortran package, here I post a Fortran and H5CPP event processor, decoupled with ZeroMQ middleware. One advantage of this setup is that the framework allows various levels of robustness, as well as multiprocessing and multithreading. Another is that the flexible interconnect allows prototyping in Python/Julia/R/MATLAB before final implementation, or plugging HDF5 into existing software.

The events are 8-byte integers, a tiny fragment indeed, and the performance is 8'829'670 events/sec on a Lenovo X1 (11th Gen Intel®):

program main
    use, intrinsic  :: iso_c_binding
    use             :: zmq
    implicit none
    type(c_ptr)     :: ctx, sock
    integer(c_int)  :: res, rc
    integer(c_size_t), target :: i, size=8   ! each event is one 8-byte integer

    ! push - pull pattern
    ctx = zmq_ctx_new()
    sock = zmq_socket(ctx, ZMQ_PUSH)
    rc = zmq_connect(sock, 'tcp://localhost:5555')
    
    do i=1, 10**8
        res = zmq_send(sock, c_loc(i), size, 0)        
    end do
    ! in our simple data exchange `0x0` represents end of stream
    ! send close signal to `recv`
    i = 0
    res = zmq_send(sock, c_loc(i), size, 0)
    
    rc = zmq_close(sock)
    rc = zmq_ctx_term(ctx)
end program main

and the receiver side:


#include <cstdint>
#include <zmq.h>
#include <h5cpp/all>

int main() {
  h5::fd_t fd = h5::create("collected-data.h5", H5F_ACC_TRUNC);
  void *ctx = zmq_ctx_new();
  void *sock = zmq_socket(ctx, ZMQ_PULL);
  int rc = zmq_bind(sock, "tcp://*:5555");
  if (rc != 0) return 1; // could not bind the PULL socket

  // we're roughly controlling I/O caches with the chunk size; technically you want it to be about
  // the underlying buffer size: 1MB for jumbo ethernet frames, 64KB for low latency interconnects
  h5::pt_t pt = h5::create<int64_t>(fd, "some channel xyz",
     h5::max_dims{H5S_UNLIMITED}, h5::chunk{1024});
  int64_t buffer = 1;
  do
     if (zmq_recv(sock, &buffer, sizeof(int64_t), 0) >= 0)
        h5::append(pt, buffer);
  while (buffer); // `0x0` terminates transmission

  // release the socket and the ZeroMQ context before exiting
  zmq_close(sock);
  zmq_ctx_term(ctx);
}

The project may be downloaded from this GitHub link.

m.deij · July 5, 2022, 6:20am

Cool, thanks for sharing.

