HDF5 write perf - Regular API vs Direct Chunk on uncompressed dataset

H5Dread/write_chunk() reads or writes entire chunks. There are no dataspace operations in this case so it’ll be faster than a normal read or write that does perform dataspace checks and operations.
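For concreteness, a minimal sketch of that direct chunk path (not code from this thread; it assumes, for illustration, one uncompressed 2560x2560 chunk of 16-bit pixels per frame):

```c
#include "hdf5.h"
#include <stdint.h>

/* Write one full, uncompressed chunk with H5Dwrite_chunk(). The offset is
 * given in dataset element coordinates and must lie on a chunk boundary;
 * no dataspace selection is involved. Shapes and the uint16_t element type
 * are assumptions for illustration. */
herr_t write_frame_direct(hid_t dset, const uint16_t *frame, hsize_t frame_index)
{
    hsize_t offset[3] = { frame_index, 0, 0 };      /* one chunk per frame */
    size_t  nbytes    = 2560 * 2560 * sizeof(uint16_t);

    /* Filter mask is 0 since no compression/filters are applied. */
    return H5Dwrite_chunk(dset, H5P_DEFAULT, 0, offset, nbytes, frame);
}
```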

Thanks @derobins for your explanation. I note that it somewhat contradicts the intuition of @gheber: writing contiguously in a file should be faster (and should eventually compensate for the cost of the dataspace checks). What are the dataspace operations you are talking about? If the penalty is really unavoidable, then I was only half joking when I suggested an H5Dwrite_direct_continuous for “simple” dataspaces (which should boil down to a straight pwrite append).
Again, I would like to avoid the penalty of chunking when reading a dataset in a way “orthogonal” to how it was written; a contiguous dataset performs much better there.
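For contrast with the direct chunk sketch above, here is the regular-API path being discussed (again only a sketch, same illustrative assumptions): the same frame written with H5Dwrite() through a hyperslab selection, which is where the dataspace checks and operations come in:

```c
#include "hdf5.h"
#include <stdint.h>

/* Write one frame via a hyperslab selection. */
herr_t write_frame_hyperslab(hid_t dset, const uint16_t *frame, hsize_t frame_index)
{
    hsize_t start[3] = { frame_index, 0, 0 };
    hsize_t count[3] = { 1, 2560, 2560 };

    /* Select the destination slab in the file dataspace... */
    hid_t file_space = H5Dget_space(dset);
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);

    /* ...and describe the in-memory buffer with a matching dataspace. */
    hid_t mem_space = H5Screate_simple(3, count, NULL);

    herr_t status = H5Dwrite(dset, H5T_NATIVE_UINT16, mem_space, file_space,
                             H5P_DEFAULT, frame);

    H5Sclose(mem_space);
    H5Sclose(file_space);
    return status;
}
```

The same function covers both benchmark cases: with a contiguous layout it is the “hyperslab contiguous” case, with a chunked layout the “hyperslab chunk” case; only the dataset creation differs.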

I hope to see you all soon at the European HDF5 User meeting!

There are also dataspace checks and operations in the hyperslab chunk case, yet its performance is on par with direct chunk, even for the smaller leading dimension. Looking at gperftools / kcachegrind output might give us a few clues about what’s really going on here. For comparison, it would also be interesting to get the numbers for HDF5 1.8.22. @samuel.debionne, would you mind getting us numbers with 1.8.22?

Thanks, G.

Here are the numbers with 1.8.22 (compiled from source with default options).

With a 10000x2560x2560 dataset:

+----------------------------+--------------+-----------------+----------------------+
|             -              | direct chunk | hyperslab chunk | hyperslab contiguous |
+----------------------------+--------------+-----------------+----------------------+
| regular                    |    19.060877 |       38.609770 |            23.500788 |
| never fill                 |    19.578496 |       19.223011 |            24.316146 |
| latest fmt                 |    19.697542 |       35.719180 |            24.503973 |
| never fill + latest format |    19.084984 |       18.970671 |            24.817945 |
+----------------------------+--------------+-----------------+----------------------+
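For reference, taking “never fill” and “latest fmt” in their usual sense, the rows correspond to property-list settings along these lines (a sketch, not the exact benchmark code, which is not shown here):

```c
#include "hdf5.h"

/* Dataset-creation and file-access settings behind the table rows
 * (assumed meaning of the labels). */
void make_plists(hid_t *dcpl_out, hid_t *fapl_out)
{
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    /* "never fill": do not write fill values when chunks are allocated. */
    H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* "latest fmt": use the latest file-format features. */
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

    *dcpl_out = dcpl;
    *fapl_out = fapl;
}
```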

So the performance difference between chunked and contiguous writes also seems to exist in the 1.8 branch.

What should I look at with gperftools / kcachegrind?

Thank you for trying this. It’s good to see that there is no performance regression for this use case compared to 1.8.x. :v:

The trace from gperftools (pprof), visualized w/ [q,k]cachegrind would look something like this:
[screenshot: KcgShot3]

We could then drill down into the call stack and get clues about where we are spending (wasting?) time.
We would also see the different code paths for the different APIs (direct chunk, hyperslab chunk, hyperslab contiguous).

OK? G.

That would be interesting, of course, but it’s also a lot of work to profile a library like HDF5… Maybe I should wait for the “HDF5 Test and Tune” session…
Can you confirm these numbers with your benchmarks, just to make sure there isn’t something fishy with my specific setup?

I am not sure if this will be useful, but since I have them, here are the gprof profiles for writing a 1000x2560x2560 dataset using direct chunk, hyperslab chunk, and hyperslab contiguous.

profile_hslab_contig.txt (481.2 KB)
profile_hslab_chunk.txt (485.5 KB)
profile_direct_chunk.txt (475.2 KB)


Thanks a ton! Did you run pprof with the --callgrind option? Otherwise, I can’t load the profiles into kcachegrind. Would you mind rerunning it?

Thanks, G.