HDF5 write perf - Regular API vs Direct Chunk on uncompressed dataset

First some background information. Our use case is tomography. Frames (2D images) are stored in an uncompressed 3D dataset, with the projections stacked along the slowest dimension: [nprojs, nrows, ncols].

The write pattern, performed by the DAQ, writes one image at a time: [i, :, :].

The read pattern, for processing, reads one line of the image across all projections: [:, i, :].

So far, from the consumer perspective, the best performance/complexity compromise (without repacking the input file) is to use a dataset without chunking.

Now to the question concerning write performance: why is there a non-negligible penalty when using the regular API (with a "trivial" hyperslab [i, :, :]) vs using the direct chunk API (with a dataset chunked as [1, nrows, ncols])?

I am comparing the following loops:

Regular HDF5 API

def benchmark():
    for i in np.arange(nrows):
        dset.write_direct(frm, dest_sel=np.s_[i,:,:])

and Direct Chunk API

def benchmark():
    for i in np.arange(nrows):
        dset.id.write_direct_chunk([i, 0, 0], frm)

I measured that the former is about 50% slower with h5py 3.6.0 and HDF5 1.12.1.
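(For context, a minimal sketch of the setup the two loops assume; file names, shapes, and the frm buffer here are placeholders, and I am assuming the loops actually iterate over the projection axis. The only real difference between the two runs is whether the dataset is contiguous or chunked per frame:)

import h5py
import numpy as np

nprojs, nrows, ncols = 1000, 2560, 2560
frm = np.zeros((nrows, ncols), dtype=np.int16)   # one detector frame

f = h5py.File("input_no_chunk.h5", "w")

# contiguous layout, used with dset.write_direct(frm, dest_sel=np.s_[i, :, :])
dset = f.create_dataset("data", shape=(nprojs, nrows, ncols), dtype=np.int16)

# chunked layout (one full frame per chunk), required for write_direct_chunk:
# dset = f.create_dataset("data", shape=(nprojs, nrows, ncols), dtype=np.int16,
#                         chunks=(1, nrows, ncols))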

I believe the naming is a little misleading here. Some directs are more direct than others :wink: I think write_direct ends up calling H5Dwrite while write_direct_chunk is calling H5Dwrite_chunk. What's 'direct' about write_direct is that the buffer that goes to H5Dwrite is shared with the NumPy array, i.e., there is no copy. Despite there being no compression or filters, or datatype conversion, etc., H5Dwrite still routes the bytes via a slower code path through the library. On the other hand, H5Dwrite_chunk pretty much avoids that detour and goes straight to pwrite.

G.

Thank you for the feedback!

I used write_direct to get a fair comparison with write_direct_chunk and be as close as possible to the C API.

H5Dwrite still routes the bytes via a slower code path through the library. On the other hand, H5Dwrite_chunk pretty much avoids that detour and goes straight to pwrite.

That's what I am really interested in understanding. In this particular use case, I would expect H5Dwrite to go pretty much straight to pwrite too. Any internal buffers / copies? Writing data linearly into an uncompressed dataset is such a common use case that it would be worth avoiding any detour :thinking:…

How about an H5Dwrite_direct_simple_hyperslab :wink:?

H5Dwrite_chunk() bypasses the filter pipeline. It was originally developed to support a detector that performed hardware compression, so it was optimal to write the already-compressed data directly to the file while still marking it as compressed. It also bypasses things like type conversion and dataspace operations. As the documentation for the function notes, it's a low-level function and requires care in its use: the application, rather than the HDF5 library, now has to get datatypes, compression, and chunk boundaries right. It's pretty easy to write garbage that will be unreadable later.
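To make that "the application takes over from the library" point concrete, here is a hedged h5py sketch (not from any of the codes discussed here; file and dataset names are made up) of feeding a pre-compressed chunk to write_direct_chunk. The application applies the deflate filter itself with zlib and must get the chunk offset and filter_mask right, or the file will read back as garbage:

import zlib
import h5py
import numpy as np

nrows, ncols = 2560, 2560
frm = np.zeros((nrows, ncols), dtype=np.int16)

with h5py.File("precompressed.h5", "w") as f:
    dset = f.create_dataset("data", shape=(100, nrows, ncols), dtype=np.int16,
                            chunks=(1, nrows, ncols), compression="gzip")
    # the application, not the HDF5 filter pipeline, compresses the chunk
    compressed = zlib.compress(frm.tobytes(), 4)
    # offset must lie on a chunk boundary; filter_mask=0 marks all filters as applied
    dset.id.write_direct_chunk((0, 0, 0), compressed, filter_mask=0)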

The h5py read/write_direct calls only exist in h5py. I'm not as familiar with them, but it looks like they exist so you can avoid copying data in/out of NumPy arrays. I don't know how that compares to what write_direct_chunk does under the hood.
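For reference, the difference on the h5py side looks roughly like this (a sketch with made-up shapes and file name; both paths end up in H5Dwrite, write_direct just writes straight from the C-contiguous NumPy buffer instead of going through the slicing machinery):

import h5py
import numpy as np

with h5py.File("demo.h5", "w") as f:
    dset = f.create_dataset("data", shape=(10, 64, 64), dtype=np.int16)
    frm = np.ones((64, 64), dtype=np.int16)

    # plain slicing assignment: convenient, but may involve an intermediate copy
    dset[0, :, :] = frm

    # write_direct: same H5Dwrite underneath, no intermediate copy
    dset.write_direct(frm, dest_sel=np.s_[1, :, :])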

My main application for HDF5 is also tomography. We use the C++ EPICS area detector framework for detectors, with an HDF5 plugin that uses HDF5 direct chunk write. It achieves write speeds > 500 MB/s to a network file system. For reconstruction we use Python, and can read an entire 8 GB dataset [1800,1200,1920] in 4 seconds, or about 2 GB/s. You have mentioned relative speeds, but I am also curious about the absolute speeds you are obtaining.


@rivers Hi Mark, thanks for the link! From the implementation in NDFileHDF5Dataset.cpp I notice that you were an early user of H5DOwrite_chunk(..). The second thing I spotted is that you don't break up the NDarray *pArray into chunk-sized pieces; can I ask the reason behind this?

best wishes: steven

Thank you all for the feedback!

@rivers We use Lima at ESRF, which also uses HDF5 direct chunk write, to parallelize compression and reuse the compressed buffers from the detector, as @derobins mentioned. I could give you absolute performance numbers, but I am not sure they would say much. I am doing my tests on an IBM Power9 with an NVMe SSD.

For reconstruction we use Python, and can read an entire 8 GB dataset [1800,1200,1920] in 4 seconds, or about 2 GB/s.

I am forwarding your numbers to our scientists; they are the ones complaining about the read speed of the files Lima writes. Unfortunately, I don't think that reading the full dataset into memory is an option for us (a typical dataset is 10000x2560x2560), hence we read n projections at a time.

@derobins I understand what H5Dwrite_chunk bypasses. What I don't understand is the performance penalty of H5Dwrite vs H5Dwrite_chunk when there is no compression, no type conversion, and the write of a hyperslab is just an "append" operation to the dataset (no transformation whatsoever). In this case I would expect performance close to H5Dwrite_chunk.

The rationale for generating a non-chunked dataset is that chunks slow down reading with the typical tomo read pattern explained in my original post.
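(To spell that read pattern out, a sketch using the file/dataset names from the chunked test file further down; the row index is arbitrary. With (1, nrows, ncols) chunks, a single sinogram selection [:, i, :] intersects every one of the nprojs chunks, which is what makes chunking unattractive for this access pattern.)

import h5py

# typical tomo processing read: one detector row across all projections (a sinogram)
with h5py.File("input_with_chunk.h5", "r") as f:
    dset = f["data"]             # shape (nprojs, nrows, ncols), chunks (1, nrows, ncols)
    sinogram = dset[:, 1234, :]  # this selection touches every chunk in the dataset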

Is it possible that this is due to fill mode being enabled?
In that case chunks are written twice.

Not that I am aware of, but that could be a good clue. The dataset creation is "by default": I am not setting any fill_value. I couldn't find much info on "fill mode"; is that something enabled at creation time with a specific dcpl (dataset creation property list)?

EDIT: According to https://portal.hdfgroup.org/display/HDF5/H5P_SET_FILL_TIME, the default policy is "Write fill values to the dataset when storage space is allocated only if there is a user-defined fill value", so I should not be affected by the fill (and double writing).
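(For completeness, if one wanted to rule fill writes out explicitly, the fill time can be forced to NEVER on the dcpl; the C call would be H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER). In h5py that means dropping to the low-level API; a minimal sketch, with made-up file name and shapes:)

import h5py
import numpy as np
from h5py import h5d, h5p, h5s, h5t

nprojs, nrows, ncols = 1000, 2560, 2560

with h5py.File("never_fill.h5", "w") as f:
    dcpl = h5p.create(h5p.DATASET_CREATE)       # dataset creation property list
    dcpl.set_chunk((1, nrows, ncols))
    dcpl.set_fill_time(h5d.FILL_TIME_NEVER)     # never write fill values

    space = h5s.create_simple((nprojs, nrows, ncols))
    dsid = h5d.create(f.id, b"data", h5t.NATIVE_INT16, space, dcpl=dcpl)
    dset = h5py.Dataset(dsid)                   # wrap for the high-level API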

What could slow down H5Dwrite (vs H5Dwrite_chunk) when there is no data conversion, no data scattering (contiguous layout), no chunking, and no filters?

Can you tell us the approximate dimensions of your images? I propose we write a simple C program to compare the two. Unless I'm not understanding your example, my expectation is that the performance of a sequence of H5Dwrite ops of 2D slices of a contiguously laid out 3D dataset (with time/projection being the slowest dimension) should be comparable or slightly faster than an equivalent set of H5Dwrite_chunk calls (with a not too small chunk size). (HDF5 I/O test does something similar but it's easier to write a simple example.)

G.

Our images are 2560x2560 with about 10000 projections (a typical dataset is 10000x2560x2560). I can write a benchmark in C (if the Python version in the OP is not sufficient).
I am glad to hear that the expected performance should be comparable or slightly faster; probably a mistake on my side then!
I remember a presentation that introduced a web page with performance regression tests (probably running the HDF5 I/O test suite), but I can not find it anymore. Could you remind me of the URL?

I have just found out that the performance comparison depends on the size of the dataset in the slowest dimension.

TLDR: H5Dwrite_chunk is more than two times faster with n=100, but is on par with H5Dwrite with n=10000.

(attached plot: hdf5-perf)

I also noticed something suspicious when looking at the data with h5ls:

$ h5ls -v input_with_chunk.h5
Opened "input_with_chunk.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 26214400000 allocated bytes, 50.00% utilization
    Type:      native short

$ h5ls -v input_no_chunk.h5
Opened "input_no_chunk.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

Why does the chunked dataset have "50.00% utilization" of Storage?

generate.py (551 Bytes)
generate_direct_chunk.py (568 Bytes)

This is not what I would expect. Attached is a simple C program that, I believe, can be configured (see the macros near the top) to mimic your setup. Would you mind trying it and reporting back?

(I did not include any timers, because I'm not sure what exactly we are timing.)

Thanks, G.

conti_v_chunky.c (3.1 KB)

Thanks for the C program, trying it right away. Any idea why the chunked output file shows Storage: 13107200000 logical bytes, 26214400000 allocated bytes, 50.00% utilization with h5ls?

That looks fishy. It means that we are using only half of the allocated space, i.e., we are allocating space that we end up not using or abandoning. If you build and run my example with CHUNKED and WRITE_CHUNK enabled, the output looks like this:

gerd@penguin:~$ ~/packages/bin/h5ls -v foo.h5 
Opened "foo.h5" with sec2 driver.
data                     Dataset {100/100, 800/800, 600/600}
    Location:  1:800
    Links:     1
    Chunks:    {1, 800, 600} 960000 bytes
    Storage:   96000000 logical bytes, 96000000 allocated bytes, 100.00% utilization
    Type:      native short

G.

Here are my measurements (using clock_gettime(CLOCK_PROCESS_CPUTIME_ID) around the write loop) with a 1000x2560x2560 dataset:

+----------------------------+--------------+-----------------+----------------------+
|             -              | direct chunk | hyperslab chunk | hyperslab contiguous |
+----------------------------+--------------+-----------------+----------------------+
| regular                    |     1.631005 |        3.368258 |             2.069583 |
| never fill                 |     1.632858 |        1.641345 |             2.069205 |
| latest fmt                 |     1.643421 |        3.233817 |             2.068250 |
| never fill + latest format |     1.633859 |        1.611976 |             2.029573 |
+----------------------------+--------------+-----------------+----------------------+

Here is the log:

# direct chunk
gcc conti_v_chunky.c -DCHUNKED -DWRITE_CHUNK -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.631005 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab chunk
gcc conti_v_chunky.c -DCHUNKED -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 3.368258 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab contiguous
gcc conti_v_chunky.c -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 2.069583 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

## NEVER_FILL
# direct chunk
gcc conti_v_chunky.c -DNEVER_FILL -DCHUNKED -DWRITE_CHUNK -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.632858 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab chunk
gcc conti_v_chunky.c -DNEVER_FILL -DCHUNKED -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.641345 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab contiguous
gcc conti_v_chunky.c -DNEVER_FILL -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 2.069205 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

## LATEST_FMT
# direct chunk
gcc conti_v_chunky.c -DLATEST_FMT -DCHUNKED -DWRITE_CHUNK -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.643421 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:14 CEST
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab chunk
gcc conti_v_chunky.c -DLATEST_FMT -DCHUNKED -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 3.233817 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:16 CEST
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab contiguous
gcc conti_v_chunky.c -DLATEST_FMT -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 2.068250 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:20 CEST
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

## NEVER_FILL and LATEST_FMT
# direct chunk
gcc conti_v_chunky.c -DNEVER_FILL -DLATEST_FMT -DCHUNKED -DWRITE_CHUNK -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.633859 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:22 CEST
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab chunk
gcc conti_v_chunky.c -DNEVER_FILL -DLATEST_FMT -DCHUNKED -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.611976 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:24 CEST
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab contiguous
gcc conti_v_chunky.c -DNEVER_FILL -DLATEST_FMT -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 2.029573 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:26 CEST
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

Thank you for running this. Very interesting. No surprises in the direct chunk camp. Under hyperslab chunk we see the effect of (avoiding) double writing, as @wkliao pointed out. The file format makes little difference in this straightforward case. I'm a little surprised that hyperslab contiguous is consistently about 25% slower than direct chunk, and I don't have an explanation. Let me reflect on this! :wink:

Do you see similar dependence on the number of projections as in your earlier graph?

Thanks, G.

Here are the measurements for different numbers of projections.

with 100x2560x2560:

+----------------------------+--------------+-----------------+----------------------+
|             -              | direct chunk | hyperslab chunk | hyperslab contiguous |
+----------------------------+--------------+-----------------+----------------------+
| regular                    |     0.164985 |        0.320678 |             0.207087 |
| never fill                 |     0.163030 |        0.165883 |             0.207121 |
| latest fmt                 |     0.206406 |        0.360151 |             0.207332 |
| never fill + latest format |     0.206511 |        0.211770 |             0.206871 |
+----------------------------+--------------+-----------------+----------------------+

with 10000x2560x2560:

+----------------------------+--------------+-----------------+----------------------+
|             -              | direct chunk | hyperslab chunk | hyperslab contiguous |
+----------------------------+--------------+-----------------+----------------------+
| regular                    |    21.050749 |       35.889814 |            23.982538 |
| never fill                 |    18.917586 |       20.929941 |            23.291541 |
| latest fmt                 |    19.921759 |       35.472971 |            23.935622 |
| never fill + latest format |    19.646342 |       19.800098 |            24.406450 |
+----------------------------+--------------+-----------------+----------------------+

hyperslab contiguous is consistently slower regardless of the number of projections. Any clue on your side?

I am a bit puzzled by the performance of h5py: the loop in the OP that writes hyperslab contiguous takes 76.942379 s. That's 3x its C equivalent…
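(For comparison, a minimal sketch of timing the same h5py loop; file name and shapes are placeholders. time.process_time() uses CLOCK_PROCESS_CPUTIME_ID on Linux, so it matches what the C program measures:)

import time
import h5py
import numpy as np

nprojs, nrows, ncols = 1000, 2560, 2560
frm = np.zeros((nrows, ncols), dtype=np.int16)

with h5py.File("timing_no_chunk.h5", "w") as f:
    dset = f.create_dataset("data", shape=(nprojs, nrows, ncols), dtype=np.int16)

    # process CPU time, comparable to the clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
    # measurement in conti_v_chunky.c; use time.perf_counter() for wall-clock time
    t0 = time.process_time()
    for i in range(nprojs):
        dset.write_direct(frm, dest_sel=np.s_[i, :, :])
    dt = time.process_time() - t0
    print(f"write took {dt:.6f} s")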