H5repack GZIP=1 slow

Greetings!

I’ve got ~50 TB of data that I would like to compress. Experiments indicate that GZIP=1 gives about 3x compression, which is nice. However, running h5repack -f SHUF -f GZIP=1 on those files is very slow. “top” shows 100% CPU usage most of the time, so I don’t think the process is I/O bound.

To recap the results shown below: gzip -1 takes 11 seconds to compress the file, whereas h5repack takes 68 seconds on the same file. Why the 6x discrepancy? Is there anything I can do to improve repack performance? 20 TB will take a long time in any case.

  • h5repack -f SHUF -f GZIP=1 in.h5 out.h5

real 1m8.134s
user 1m1.479s
sys 0m0.673s

  • h5repack -f SHUF in.h5 out-shuf-only.h5

real 0m52.370s
user 0m51.495s
sys 0m0.368s

  • h5repack -f GZIP=1 in.h5 out-gz=1-only.h5

real 1m1.611s
user 1m1.274s
sys 0m0.336s

  • gzip -1 -k in.h5

real 0m11.272s
user 0m11.075s
sys 0m0.196s

  • ls -lh
    -rw-rw-r-- 1 paul paul 957M Apr 11 10:09 in.h5
    -rw-rw-r-- 1 paul paul 440M Apr 11 10:09 in.h5.gz
    -rw-rw-r-- 1 paul paul 450M Apr 11 11:19 out-gz=1-only.h5
    -rw-rw-r-- 1 paul paul 426M Apr 11 11:17 out.h5
    -rw-rw-r-- 1 paul paul 956M Apr 11 11:17 out-shuf-only.h5

Version info:
λ uname -a
Linux licetop-3 4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
λ h5repack -V
h5repack: Version 1.8.16
λ which h5repack
/usr/bin/h5repack

I am curious: what chunk size did you use when creating the datasets in these files? I can understand gzip on the whole file going faster than HDF5, because of HDF5’s obligation to obey chunking. But if the source dataset chunk size, in bytes, is on the order of 4-16 kilobytes, then I wouldn’t expect a 6x difference.

Also, can you alter the chunk size in the re-packed file? If so, maybe try a larger chunk size there, assuming doing so won’t negatively impact any partial I/O workflows you have downstream.

Another question is what is going on with the metadata in the re-packed file. In theory, the HDF5 library ought to be smart enough to gather most of the metadata together and write it in large pages to just a few places in the re-packed file. However, if it winds up sprinkling metadata throughout the re-packed file, that could also reduce I/O performance a lot. But you said you didn’t think it’s spending much time in I/O anyway, so maybe these comments are all off target.
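On the re-chunking point: if you go that route, h5repack can change the chunking during the repack itself with its -l option. A sketch, with a made-up dataset name and chunk shape that you would replace with your own (I haven’t timed this myself):

  • h5repack -l some_dataset:CHUNK=1024x1024 -f SHUF -f GZIP=1 in.h5 out.h5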

Interesting thoughts. The biggest dataset (a hyperspectral datacube) has the properties shown in the h5dump output below. Could the FLETCHER32 filter be slowing it down? Note that the first dimension is H5S_UNLIMITED and its length depends on the length of the recording. The chunks are essentially one frame. Recordings are performed over time and stored in H5S_UNLIMITED arrays, so I imagine that as the file grows, chunks from different datasets are interspersed throughout the file. Could that be the culprit?

We use the hyperspectral images differently in different analyses, so we haven’t established which chunk size is ideal for all purposes - suggestions are welcome (there is a small timing sketch at the end of this post). For many purposes, though, it makes sense to keep the last dimension (wavelength) in a single chunk, so that reading one spectrum doesn’t require reading multiple chunks.

         DATASET "dataCube" {
            DATATYPE  H5T_STD_U16LE
            DATASPACE  SIMPLE { ( 2078, 968, 223 ) / ( H5S_UNLIMITED, 968, 223 ) }
            STORAGE_LAYOUT {
               CHUNKED ( 1, 166, 223 )
               SIZE 923130720 (0.972:1 COMPRESSION)
            }
            FILTERS {
               CHECKSUM FLETCHER32
            }
            FILLVALUE {
               FILL_TIME H5D_FILL_TIME_IFSET
               VALUE  0
            }
            ALLOCATION_TIME {
               H5D_ALLOC_TIME_INCR
            }
         }

I also noticed something I hadn’t seen before: there are 1D datasets with a chunk size of 1. Could this generate enough overhead to cause such a slowdown?

         DATASET "timestamp" {
            DATATYPE  H5T_IEEE_F64LE
            DATASPACE  SIMPLE { ( 2078 ) / ( H5S_UNLIMITED ) }
            STORAGE_LAYOUT {
               CHUNKED ( 1 )
               SIZE 24936 (0.667:1 COMPRESSION)
            }
            FILTERS {
               CHECKSUM FLETCHER32
            }
            FILLVALUE {
               FILL_TIME H5D_FILL_TIME_IFSET
               VALUE  0
            }
            ALLOCATION_TIME {
               H5D_ALLOC_TIME_INCR
            }
         }
      }

Thoughts?
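In case it helps, the experiment I have in mind for the chunk-shape question is to time h5repack with a few candidate chunk shapes for dataCube, along these lines (untested as written; the -l option may need the dataset’s full path, and the shapes are only guesses, not recommendations):

    for c in 1x968x223 4x968x223 16x968x223; do
        time h5repack -l dataCube:CHUNK=$c -f SHUF -f GZIP=1 in.h5 out-chunk-$c.h5
    done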

In fact, if the timestamp is used as a dimension of another variable, that will slow things down too. So it would be better not to compress dimension data or metadata.

Not sure what you mean. The timestamp is indeed a “dimension” of the data, but not in a sense that HDF5 knows about; they are independent datasets. Does HDF5 support explicitly setting one dataset as a dimension of another, or is this a netCDF4 feature?

So, I was hoping someone from THG might chime in. But there are a few things in the output you provided above that might be cause for concern.

First, “dataCube” appears to be getting slightly expanded (not compressed) in size; the compression ratio is a tad below 1 (0.972:1).

Next, the chunk size of 1x166x223: 166 does not divide evenly into 968, so you are getting partial chunks in that dimension, which seems potentially wasteful. The ‘1’ for the chunk size in the unlimited dimension seems fine, because a) it’s the unlimited dimension and maybe you only get one record at a time and/or don’t want to buffer multiple records, and b) 166x223 (x2 bytes) is a decently sized I/O request of ~72 KB.

Next, that “CHUNKED ( 1 )” for “timestamp” does indeed seem like a bad idea. That dataset is getting expanded even more than “dataCube”: its compression ratio is 0.667:1, so it is about 50% larger than it would be if you did NOT compress it. On the other hand, “timestamp” is a really small dataset compared to “dataCube”, so compressing it is probably irrelevant to total file size.

Finally, “timestamp” is floating-point data, and you can’t really compress floats well with gzip anyway; that is even more true if you do NOT also use HDF5’s shuffle filter in concert with the gzip compressor. Note that shuffle by itself does nothing to compress. It simply re-orders bytes in memory in the hope of making the resulting byte stream easier for something like GZIP to compress than it would otherwise be. This is because GZIP is a byte-level compressor: it doesn’t know about things like shorts or ints or doubles.
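For concreteness, the byte counts behind those statements work out roughly as follows (U16 = 2 bytes per element, F64 = 8 bytes):

    dataCube chunk:  1 x 166 x 223 x 2 bytes =  74,036 bytes  (~72 KB)
    full frame:          968 x 223 x 2 bytes = 431,728 bytes  (~422 KB)
    timestamp raw:            2078 x 8 bytes =  16,624 bytes  (24,936 bytes stored, i.e. 0.667:1)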

Circling back to your original post and data: I see you tried h5repack three different ways. One of those was shuffle only, which will only re-order bytes in the datasets and nothing else. So that particular h5repack is not useful, except maybe to probe HDF5 library performance, which I guess was kind of your point, wasn’t it :wink:

So, looking at this more closely, and at your datasets, I don’t see much that can explain HDF5’s slower performance on your data. Your out.h5 and out-gz=1-only.h5 file sizes are within about 5% of each other, and within about 5% of gzip’ing the file directly with gzip. So I’ve got to believe that most of the compression HDF5 is achieving is essentially the same as what gzip’ing the whole file achieves. The only difference is that HDF5 is doing it as many individual (but nicely sized) chunks for “dataCube” (2078 frames x 6 chunks per frame, given the partial chunks) plus another 2078 tiny chunks (of 8 bytes each) for “timestamp”. Can you maybe try re-packing just the dataCube part of the file and see whether leaving “timestamp” out of the operation makes it go a lot faster?
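If I remember h5repack’s per-object filter syntax correctly, something along these lines would apply shuffle+gzip to dataCube only and strip the filters from timestamp entirely (untested; the dataset names may need their full paths within the file):

  • h5repack -f dataCube:SHUF -f dataCube:GZIP=1 -f timestamp:NONE in.h5 out-datacube-only.h5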

Just curious, but was the root cause ever discovered here? Was a solution found?

Not yet, just haven’t had the time!

Someone from THG is chiming in :grinning:

Definitely, Fletcher32 adds to the slowness, as does the shuffle filter. The original chunks in the file are not big; making them bigger will reduce the metadata overhead of chunked storage and will help with I/O.

As Mark already mentioned, “timestamp” has a chunk size of 1. Please never use chunk sizes that small!

I would also suggest trying a more recent version of h5repack.

Maybe you can post h5dump -pH output for the original and compressed files? Based on the posted information it is hard to say why h5repack is slow. Do you know how much time it takes to create the original file? Comparing with gzip alone is not quite fair, because gzip just chops the file into big pieces, compresses them, and writes them out (or something like that); that is not the same as using h5repack (i.e., the HDF5 library) to write a portable, self-describing HDF5 file :slight_smile:
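Something like this, run on both the original and re-packed files, would be enough (the redirect file names are just placeholders):

  • h5dump -p -H in.h5 > in-header.txt
  • h5dump -p -H out.h5 > out-header.txt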

Thank you!

Elena

h5dump output with -pH can still be huge. If the objects have attributes, try -A 0 to suppress them, but maybe a better option would be to look at the source and re-packed files with the h5stat tool to get an idea of what is in each file in the first place.
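For example (from memory, so please double-check the exact options against h5dump --help and h5stat --help on your system):

  • h5dump -p -H -A 0 in.h5
  • h5stat in.h5
  • h5stat out.h5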

Elena