h5dump of a large 2D chunk with GZIP filter is very slow


#1

I have an HDF5 1.8.19 environment and an .h5 file that contains a two-dimensional dataset with the following specification:

h5ls -av myChunkedFile.h5/List/event
Opened "myChunkedFile.h5" with sec2 driver.
event Dataset {61751770/Inf, 5/5}

Location:  1:1832
Links:     1
Chunks:    {8000000, 5} 80000000 bytes
Storage:   617517700 logical bytes, 9243240 allocated bytes, 6680.75% utilization
Filter-0:  deflate-1 OPT {6}
Type:      native short
Address: 2560
       Flags    Bytes     Address          Logical Offset
    ========== ======== ========== ==============================
    0x00000000  1273973       5176 [0, 0, 0]
    0x00000000  1193401    1279149 [8000000, 0, 0]
    0x00000000  1193287    2472550 [16000000, 0, 0]
    0x00000000  1193159    3665837 [24000000, 0, 0]
    0x00000000  1191715    4858996 [32000000, 0, 0]
    0x00000000  1192051    6050711 [40000000, 0, 0]
    0x00000000  1193409    7242762 [48000000, 0, 0]
    0x00000000   812245    8436171 [56000000, 0, 0]

I want to h5dump the data associated with one chunk, for example, to dump the 2nd chunk:

h5dump -d /List/event -s "8000000,0" -c "8000000,5" myChunkedFile.h5

The single-chunk h5dump takes an extremely long time (e.g., hours), whereas an h5dump of the entire dataset takes only minutes.

It appears to me that an h5dump with a "-s" start offset on a dataset with a GZIP filter results in the chunk being decompressed anew for each data element in the chunk. As you can see, my dataset chunk size is 80 MB, and I have read that the default chunk cache size is 1 MB.

There seems to be no way to specify the chunk cache size when running h5dump.
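As a workaround, if the chunk can be read programmatically instead of through h5dump, the HDF5 C API lets you enlarge the chunk cache on a dataset access property list via H5Pset_chunk_cache. A minimal sketch (file and dataset names taken from this thread; the cache parameters are illustrative and error checking is omitted):

```c
#include "hdf5.h"

int main(void)
{
    /* Open the file read-only. */
    hid_t file = H5Fopen("myChunkedFile.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    /* Dataset access property list with a chunk cache large enough to
       hold one 80 MB compressed chunk (the default cache is only 1 MB). */
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl,
                       12421,                    /* rdcc_nslots: a prime, ~100x the
                                                    number of chunks the cache holds */
                       (size_t)128 * 1024 * 1024, /* rdcc_nbytes: 128 MB */
                       1.0);                     /* rdcc_w0: evict fully read
                                                    chunks first */

    hid_t dset = H5Dopen2(file, "/List/event", dapl);

    /* ... select a hyperslab covering one chunk and H5Dread it ... */

    H5Dclose(dset);
    H5Pclose(dapl);
    H5Fclose(file);
    return 0;
}
```

With a cache larger than one chunk, each chunk should be decompressed only once per pass rather than once per access.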

Any suggestions?
Can anyone confirm that an h5dump of a chunk performs a full chunk decompression for each data element in the chunk?

Mike


#2

Hi Mike,

I don’t see a way with h5dump to specify a larger chunk cache size. (I’ll check on that.)

However, I think it should help to switch your chunk size to 5,8000000.
When using C (which h5dump is written in), data elements are stored in row-major order, meaning the elements in a row are contiguous. Only one read access is needed to read a row, whereas reading a column requires multiple read accesses.

Here is a document with images (Figure 3 and Figure 4 under "Dataset Storage Order") that describes the issue:

https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5

Chunking is a dataset creation property, so to change it you have to re-create the dataset with a different chunk size. That can be done with the h5repack utility included in the HDF5 binary distribution.
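For example, assuming the same dataset path as in this thread, h5repack can rewrite the dataset with a new chunk shape and re-apply the deflate filter in one pass (the chunk dimensions and output file name here are illustrative; pick a chunk shape that matches how you read the data):

```shell
# Rewrite /List/event with a new chunk shape, keeping GZIP level 6.
h5repack -l /List/event:CHUNK=1000000x5 \
         -f /List/event:GZIP=6 \
         myChunkedFile.h5 myRepackedFile.h5
```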

-Barbara


#3

Hi Mike,

We discussed this and need to look at the issue further. I entered bug HDFFV-10620 for the issue.

Thanks!
-Barbara