HDF5 write perf - Regular API vs Direct Chunk on uncompressed dataset

Not that I am aware of, but that could be a good clue. The dataset is created with default properties; I am not setting any fill_value. I couldn’t find much info on “fill mode”. Is that something enabled at creation time with a specific dcpl (dataset creation property list)?

EDIT: According to https://portal.hdfgroup.org/display/HDF5/H5P_SET_FILL_TIME, the default policy is “Write fill values to the dataset when storage space is allocated only if there is a user-defined fill value”, so I should not be affected by the fill (and double writing).
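For the record, this is how I understand the fill policy can be made explicit on the dcpl (a minimal sketch using the C API; the file name, dataset name, and dimensions are just placeholders):

#include <hdf5.h>

int main(void)
{
    hid_t file = H5Fcreate("fill_demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);

    /* Default is H5D_FILL_TIME_IFSET: fill values are written at allocation
       time only if a user-defined fill value was set.  H5D_FILL_TIME_NEVER
       skips fill writing entirely (safe only if every element gets written). */
    H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);

    hsize_t dims[3] = {100, 2560, 2560};
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_SHORT, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Sclose(space);
    H5Pclose(dcpl);
    H5Fclose(file);
    return 0;
}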

What could slow down H5Dwrite (vs H5Dwrite_chunk) when having no data conversion, no data scattering (contiguous layout), no chunking, no filters?

Can you tell us the approximate dimensions of your images? I propose we write a simple C program to compare the two. Unless I’m misunderstanding your example, my expectation is that the performance of a sequence of H5Dwrite calls writing 2D slices of a contiguously laid out 3D dataset (with time/projection being the slowest dimension) should be comparable to, or slightly faster than, an equivalent sequence of H5Dwrite_chunk calls (with a not-too-small chunk size). (The HDF5 I/O test does something similar, but it’s easier to write a simple example.)
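To make that concrete, here is a rough sketch of the H5Dwrite side I have in mind (placeholder dimensions, no error checking; the H5Dwrite_chunk variant would replace the selection and H5Dwrite with a single direct chunk write per slice):

#include <hdf5.h>
#include <stdlib.h>

#define NT 100      /* projections (slowest dimension), placeholder */
#define NY 2560
#define NX 2560

int main(void)
{
    hsize_t dims[3]  = {NT, NY, NX};
    hsize_t slice[3] = {1, NY, NX};
    short  *buf      = calloc((size_t)NY * NX, sizeof(short));

    hid_t file = H5Fcreate("conti.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t fspc = H5Screate_simple(3, dims, NULL);
    hid_t mspc = H5Screate_simple(3, slice, NULL);
    /* default dcpl => contiguous layout, no chunking, no filters */
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_SHORT, fspc,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    for (hsize_t i = 0; i < NT; ++i) {
        hsize_t start[3] = {i, 0, 0};
        /* select the i-th 2D slice in the file dataspace and write it */
        H5Sselect_hyperslab(fspc, H5S_SELECT_SET, start, NULL, slice, NULL);
        H5Dwrite(dset, H5T_NATIVE_SHORT, mspc, fspc, H5P_DEFAULT, buf);
    }

    H5Dclose(dset); H5Sclose(mspc); H5Sclose(fspc); H5Fclose(file);
    free(buf);
    return 0;
}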

G.

Our images are 2560x2560 with about 10000 projections (a typical dataset is 10000x2560x2560). I can write a benchmark in C (if the Python version in the OP is not sufficient).
I am glad to hear that the expected performance should be comparable or slightly faster; probably a mistake on my side then!
I remember a presentation that introduced a web page with performance regression tests (probably running the HDF5 I/O test suite), but I cannot find it anymore. Could you remind me of the URL?

I have just found out that the performance comparison depends on the size of the dataset in the slowest dimension.

TLDR: H5Dwrite_chunk is more than two times faster with n=100, but is on par with H5Dwrite with n=10000.

[plot: hdf5-perf]

I also noticed something suspicious when looking at the data with h5ls:

$ h5ls -v input_with_chunk.h5
Opened "input_with_chunk.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 26214400000 allocated bytes, 50.00% utilization
    Type:      native short

$ h5ls -v input_no_chunk.h5
Opened "input_no_chunk.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

Why does the chunked dataset have “50.00% utilization” of Storage?

generate.py (551 Bytes)
generate_direct_chunk.py (568 Bytes)

This is not what I would expect. Attached is a simple C program that, I believe, can be configured (see the macros near the top) to mimic your setup. Would you mind trying that and reporting back?

(I did not include any timers, because I’m not sure what exactly we are timing.)
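If you want to time just the write loop, a fragment like the following would do (needs <time.h> and <stdio.h>; note that CLOCK_PROCESS_CPUTIME_ID measures CPU time, which for I/O-heavy code can differ from the elapsed time reported by CLOCK_MONOTONIC):

struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);   /* or CLOCK_PROCESS_CPUTIME_ID for CPU time */

/* ... the H5Dwrite / H5Dwrite_chunk loop ... */

clock_gettime(CLOCK_MONOTONIC, &t1);
double secs = (double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
printf("write took %f s\n", secs);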

Thanks, G.

conti_v_chunky.c (3.1 KB)

Thanks for the C program; I’m trying it right away. Any idea why the chunked output file shows “Storage: 13107200000 logical bytes, 26214400000 allocated bytes, 50.00% utilization” in h5ls?

That looks fishy. It means that we are using only half of the allocated space, i.e., we are allocating space that we end up not using or abandoning. If you build and run my example with CHUNKED and WRITE_CHUNK enabled, the output looks like this:

gerd@penguin:~$ ~/packages/bin/h5ls -v foo.h5 
Opened "foo.h5" with sec2 driver.
data                     Dataset {100/100, 800/800, 600/600}
    Location:  1:800
    Links:     1
    Chunks:    {1, 800, 600} 960000 bytes
    Storage:   96000000 logical bytes, 96000000 allocated bytes, 100.00% utilization
    Type:      native short

G.

Here are my measurements (using clock_gettime(CLOCK_PROCESS_CPUTIME_ID) around the write loop) with a 1000x2560x2560 dataset; all times are in seconds:

+----------------------------+--------------+-----------------+----------------------+
|             -              | direct chunk | hyperslab chunk | hyperslab contiguous |
+----------------------------+--------------+-----------------+----------------------+
| regular                    |     1.631005 |        3.368258 |             2.069583 |
| never fill                 |     1.632858 |        1.641345 |             2.069205 |
| latest fmt                 |     1.643421 |        3.233817 |             2.068250 |
| never fill + latest format |     1.633859 |        1.611976 |             2.029573 |
+----------------------------+--------------+-----------------+----------------------+

Here is the log:

# direct chunk
gcc conti_v_chunky.c -DCHUNKED -DWRITE_CHUNK -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.631005 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab chunk
gcc conti_v_chunky.c -DCHUNKED -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 3.368258 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab contiguous
gcc conti_v_chunky.c -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 2.069583 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

## NEVER_FILL
# direct chunk
gcc conti_v_chunky.c -DNEVER_FILL -DCHUNKED -DWRITE_CHUNK -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.632858 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab chunk
gcc conti_v_chunky.c -DNEVER_FILL -DCHUNKED -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.641345 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab contiguous
gcc conti_v_chunky.c -DNEVER_FILL -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 2.069205 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:800
    Links:     1
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

## LATEST_FMT
# direct chunk
gcc conti_v_chunky.c -DLATEST_FMT -DCHUNKED -DWRITE_CHUNK -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.643421 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:14 CEST
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab chunk
gcc conti_v_chunky.c -DLATEST_FMT -DCHUNKED -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 3.233817 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:16 CEST
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab contiguous
gcc conti_v_chunky.c -DLATEST_FMT -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 2.068250 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:20 CEST
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

## NEVER_FILL and LATEST_FMT
# direct chunk
gcc conti_v_chunky.c -DNEVER_FILL -DLATEST_FMT -DCHUNKED -DWRITE_CHUNK -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.633859 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:22 CEST
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab chunk
gcc conti_v_chunky.c -DNEVER_FILL -DLATEST_FMT -DCHUNKED -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 1.611976 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:24 CEST
    Chunks:    {1, 2560, 2560} 13107200 bytes
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short
# hyperslab contiguous
gcc conti_v_chunky.c -DNEVER_FILL -DLATEST_FMT -o conti_v_chunky -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -lhdf5 && LD_LIBRARY_PATH=$CONDA_PREFIX/lib ./conti_v_chunky && h5ls -v foo.h5
write took 2.029573 s
Opened "foo.h5" with sec2 driver.
data                     Dataset {1000/1000, 2560/2560, 2560/2560}
    Location:  1:195
    Links:     1
    Modified:  2022-05-12 17:55:26 CEST
    Storage:   13107200000 logical bytes, 13107200000 allocated bytes, 100.00% utilization
    Type:      native short

Thank you for running this. Very interesting. No surprises in the direct chunk camp. Under hyperslab chunk we see the effect of (avoiding) double writing as @wkliao pointed out. The file format makes little difference in this straightforward case. I’m a little surprised that hyperslab contiguous is consistently about 25% slower than direct chunk, and I don’t have an explanation. Let me reflect on this! :wink:

Do you see similar dependence on the number of projections as in your earlier graph?

Thanks, G.

Here are the measurements for different numbers of projections (times in seconds).

with 100x2560x2560:

+----------------------------+--------------+-----------------+----------------------+
|             -              | direct chunk | hyperslab chunk | hyperslab contiguous |
+----------------------------+--------------+-----------------+----------------------+
| regular                    |     0.164985 |        0.320678 |             0.207087 |
| never fill                 |     0.163030 |        0.165883 |             0.207121 |
| latest fmt                 |     0.206406 |        0.360151 |             0.207332 |
| never fill + latest format |     0.206511 |        0.211770 |             0.206871 |
+----------------------------+--------------+-----------------+----------------------+

with 10000x2560x2560:

+----------------------------+--------------+-----------------+----------------------+
|             -              | direct chunk | hyperslab chunk | hyperslab contiguous |
+----------------------------+--------------+-----------------+----------------------+
| regular                    |    21.050749 |       35.889814 |            23.982538 |
| never fill                 |    18.917586 |       20.929941 |            23.291541 |
| latest fmt                 |    19.921759 |       35.472971 |            23.935622 |
| never fill + latest format |    19.646342 |       19.800098 |            24.406450 |
+----------------------------+--------------+-----------------+----------------------+

hyperslab contiguous is consistently slower regardless of the number of projections. Any clue on your side?

I am a bit puzzled by the performance of h5py: the loop in the OP that writes hyperslab contiguous takes 76.942379 s. That’s 3x its C equivalent…

H5Dread/write_chunk() reads or writes entire chunks. There are no dataspace operations in this case so it’ll be faster than a normal read or write that does perform dataspace checks and operations.
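For illustration, the direct chunk call takes a logical chunk offset and a raw byte count, with no memory or file dataspaces involved (a fragment; i, dset, buf, NY and NX are placeholders from the earlier sketches, and <stdint.h> is needed for uint32_t):

hsize_t  offset[3]   = {i, 0, 0};                        /* logical offset of the i-th chunk */
size_t   nbytes      = (size_t)NY * NX * sizeof(short);  /* one uncompressed chunk */
uint32_t filter_mask = 0;                                /* no filters applied to this chunk */
H5Dwrite_chunk(dset, H5P_DEFAULT, filter_mask, offset, nbytes, buf);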

Thanks @derobins for your explanation; I note that it somewhat contradicts @gheber’s intuition. Writing contiguously to a file should be faster (and eventually compensate for the dataspace checks). What are the dataspace operations you are talking about? If the penalty is really unavoidable, then I was only half joking when I suggested an H5Dwrite_direct_continuous for “simple” dataspaces (which should result in a straight pwrite append).
Again, I would like to avoid the penalty of chunking when reading a dataset in a way “orthogonal” to how it was written; a contiguous dataset performs much better there.
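To illustrate what I mean by a straight pwrite: for a contiguous dataset whose storage is already allocated, I imagine one could do something like this today by going around the library (a fragile sketch for illustration only; H5Dget_offset returns HADDR_UNDEF until storage is allocated, and mixing raw pwrite with library writes needs care):

/* fragment: append 2D slices to a contiguous dataset with plain pwrite()
   (needs <unistd.h>, <fcntl.h>; dset, buf, NT, NY, NX are placeholders) */
haddr_t base = H5Dget_offset(dset);          /* file offset of the contiguous dataset */
if (base != HADDR_UNDEF) {
    int    fd    = open("input_no_chunk.h5", O_WRONLY);
    size_t bytes = (size_t)NY * NX * sizeof(short);
    for (size_t i = 0; i < NT; ++i)
        pwrite(fd, buf, bytes, (off_t)(base + i * bytes));
    close(fd);
}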

I hope to see you all soon at the European HDF5 User meeting!

There are also dataspace checks and operations in the hyperslab chunk case, yet its performance is on par with direct chunk (in the never-fill configurations) even for the smaller leading dimension. Looking at gperftools / kcachegrind output might give us a few clues about what’s really going on here. For comparison, it would also be interesting to get the numbers for HDF5 1.8.22. @samuel.debionne, would you mind getting us numbers with 1.8.22?

Thanks, G.

With 1.8.22 (compiled from source with default options).

with 10000x2560x2560:

+----------------------------+--------------+-----------------+----------------------+
|             -              | direct chunk | hyperslab chunk | hyperslab contiguous |
+----------------------------+--------------+-----------------+----------------------+
| regular                    |    19.060877 |       38.609770 |            23.500788 |
| never fill                 |    19.578496 |       19.223011 |            24.316146 |
| latest fmt                 |    19.697542 |       35.719180 |            24.503973 |
| never fill + latest format |    19.084984 |       18.970671 |            24.817945 |
+----------------------------+--------------+-----------------+----------------------+

So the performance difference between chunked and contiguous also seems to be present in the 1.8 branch.

What should I look at with gperftools / kcachegrind?

Thank you for trying this. It’s good to see that there is no performance regression for this use case compared to 1.8.x. :v:

The trace from gperftools (pprof), visualized with [q,k]cachegrind, would look something like this:
[screenshot: KcgShot3]

We could then drill down into the call stack and get clues about where we are spending (wasting?) time.
We would also see the different code paths for the different APIs (direct chunk, hyperslab chunk, hyperslab contiguous).
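Roughly, the workflow I have in mind is the following (assuming gperftools is installed; adapt the paths and the include/library flags as in your earlier build lines):

# build against the gperftools CPU profiler and record a profile
gcc conti_v_chunky.c -DCHUNKED -o conti_v_chunky -lhdf5 -lprofiler
CPUPROFILE=chunky.prof ./conti_v_chunky
# convert to callgrind format (pprof is sometimes installed as google-pprof)
pprof --callgrind ./conti_v_chunky chunky.prof > chunky.callgrind
kcachegrind chunky.callgrind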

OK? G.

That would be interesting of course but it’s also a lot of work to profile a library like HDF5… Maybe I should wait for the “HDF5 Test and Tune” session…
Can you confirm these numbers with your benchmarks just to make sure that there is not something fishy with my specific setup?

I am not sure whether this will be useful, but since I have them, here are the profiles (gprof) for writing a 1000x2560x2560 dataset using direct chunk, hyperslab chunk, and hyperslab contiguous.

profile_hslab_contig.txt (481.2 KB)
profile_hslab_chunk.txt (485.5 KB)
profile_direct_chunk.txt (475.2 KB)


Thanks a ton! Did you run pprof with the --callgrind option? Otherwise, I can’t load the profiles into kcachegrind. Would you mind rerunning it?

Thanks, G.