Is there a way to run h5repack in parallel

peter.jensen · June 11, 2020, 3:35pm

Hello.

Running h5repack -f GZIP=6 file1 file2 only uses one thread. Is there a way to speed up the repacking?

Best regards.
Peter

bljones · June 11, 2020, 3:58pm

Hi Peter,

The h5repack reference page has information on how you can improve performance with h5repack. See:

https://portal.hdfgroup.org/display/HDF5/h5repack

Basically, you can set an environment variable (H5TOOLS_BUFSIZE) to change the hyperslab selection buffer size, which can help with performance.
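As a rough sketch of what that could look like when driving h5repack from a script: the helper names and the 32 MiB value below are purely illustrative, and treating the value as a byte count is an assumption to check against the reference page.

```python
import os
import subprocess


def build_repack(src, dst, bufsize_bytes, gzip_level=6):
    """Return (argv, env) for one h5repack run with H5TOOLS_BUFSIZE set.

    Hypothetical helper: only the h5repack command line and the
    H5TOOLS_BUFSIZE variable name come from the thread.
    """
    env = dict(os.environ)  # copy, so the parent environment is untouched
    # Assumption: the buffer size is given as a plain number of bytes.
    env["H5TOOLS_BUFSIZE"] = str(bufsize_bytes)
    argv = ["h5repack", "-f", f"GZIP={gzip_level}", src, dst]
    return argv, env


def run_repack(src, dst, bufsize_bytes):
    """Run h5repack once; it is still a serial, single-threaded run."""
    argv, env = build_repack(src, dst, bufsize_bytes)
    subprocess.run(argv, env=env, check=True)
```

Calling run_repack("file1", "file2", 32 * 1024 * 1024) would start a single h5repack process with the larger buffer. This only changes the buffer size and does not make the tool parallel.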

I hope that helps!
-Barbara

peter.jensen · June 12, 2020, 2:34pm

Hi Barbara.

Thank you very much for your suggestion.

Unfortunately, changing the buffer size does not seem to help.

I found this comment from 2008: http://hdf-forum.184993.n3.nabble.com/hdf-forum-speeding-up-h5repack-td193619.html (first reply from George), where it is hinted that it should be possible.

Running h5dump -p -H shows multiple different chunk sizes, since the file contains a lot of groups with different sizes.
If I run h5pcc -showconfig I get:

Features:
    Parallel HDF5: yes
    Parallel Filtered Dataset Writes: yes
    Large Parallel I/O: yes
    High-level library: yes
    Build HDF5 Tests: yes
    Build HDF5 Tools: yes
    Threadsafety: no
    Default API mapping: v110
    With deprecated public symbols: yes
    I/O filters (external): deflate(zlib)
    MPE:
    Direct VFD: no
    (Read-Only) S3 VFD: no
    (Read-Only) HDFS VFD: no
    dmalloc: no
    Packages w/ extra debug output: none
    API tracing: no
    Using memory checker: no
    Memory allocation sanity checks: no
    Function stack tracing: no
    Strict file format checks: no
    Optimization instrumentation: no

Still, the repacking only uses one thread on my computer.

bljones · June 15, 2020, 6:45pm

Hi Peter,

The h5repack utility is not a parallel tool.

Can you send us your file to examine?
I will contact you through the helpdesk on how you can do that.

Thanks!
-Barbara

pedro.vicente · June 15, 2020, 7:02pm

> The h5repack utility is not a parallel tool.

Correct.
The original use case of h5repack was to "regenerate" a file, like "first aid" if you will.

Typically one does this rarely, so a wait of some minutes is surely acceptable.

peter.jensen · June 16, 2020, 7:09am

Hi Barbara and Pedro.

I can send the file, no problem, I just don’t think it would solve anything.

Let me try to rephrase my question:
In your FAQ: https://support.hdfgroup.org/HDF5/hdf5-quest.html#p5comp
you state that a file cannot be saved in parallel with compression (which in my case would be optimal, but I understand the difficulties).
However, if I already have an uncompressed HDF5 file (which can be written fast), I can then rewrite it to its compressed version, but that only uses one thread. Is it possible to do this in parallel?

If neither is the case, then I'm not sure I understand the “Parallel Filtered Dataset Writes: yes” flag.

It looks like you wrote something about it here, Barbara: Parallel Compression Support Detection. But to me it is not clear whether the user wants to write or read an HDF5 file in parallel.

Best regards

bljones · June 16, 2020, 7:20pm

Hi Peter,

Parallel compression support was added in HDF5-1.10.2. As of this release, HDF5 parallel applications can both create and write to compressed datasets (or datasets with other filters applied, such as Fletcher32).

With HDF5-1.8 and earlier releases, you could read compressed data in parallel, but not write.
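For illustration only, a parallel write to a compressed dataset might look roughly like the sketch below. It is written in Python and assumes h5py built against a parallel HDF5 (1.10.2 or later) plus mpi4py. The function names, file name, dataset name, and sizes are all hypothetical, and this is not how h5repack works internally.

```python
def rows_for_rank(n_rows, rank, size):
    """Split n_rows into `size` contiguous, near-equal [start, stop) ranges."""
    base, extra = divmod(n_rows, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def parallel_compressed_write(path, n_rows=4096, n_cols=1024, chunk_rows=64):
    """Each MPI rank writes its own row range into one gzip-compressed dataset.

    Assumes n_rows >= number of ranks, so every rank has data to write.
    """
    # Imported here so the partitioning helper can be used without MPI.
    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    start, stop = rows_for_rank(n_rows, comm.rank, comm.size)

    # Needs an h5py build with MPI support (the "mpio" driver).
    with h5py.File(path, "w", driver="mpio", comm=comm) as f:
        # Dataset creation is collective: every rank makes the same call.
        dset = f.create_dataset(
            "data", shape=(n_rows, n_cols), dtype="f8",
            chunks=(chunk_rows, n_cols),
            compression="gzip", compression_opts=6,
        )
        block = np.full((stop - start, n_cols), comm.rank, dtype="f8")
        # Writes to filtered datasets in parallel must be collective.
        with dset.collective:
            dset[start:stop, :] = block

# To try it, add a call such as parallel_compressed_write("out.h5") and
# launch the script under MPI, for example: mpiexec -n 4 python script.py
```

The same pattern could in principle be used to copy a dataset from an uncompressed file into a compressed one, with each rank reading and writing its own row range, but that is an untested assumption rather than a documented recipe.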

The “Parallel Filtered Dataset Writes” flag was added to the libhdf5.settings file to indicate whether or not parallel compression is supported in that specific HDF5 library.
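A small sketch of checking that flag from a script, assuming the "Name: value" layout seen in the h5pcc -showconfig output earlier in this thread. The function names are made up for illustration, and the text would have to be captured from h5pcc -showconfig or read from libhdf5.settings first.

```python
def parse_settings(text):
    """Parse 'Name: value' lines from libhdf5.settings style text into a dict."""
    features = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip():
            features[key.strip()] = value.strip()
    return features


def supports_parallel_filtered_writes(text):
    """True if the build reports both parallel HDF5 and parallel filtered writes."""
    features = parse_settings(text)
    return (
        features.get("Parallel HDF5") == "yes"
        and features.get("Parallel Filtered Dataset Writes") == "yes"
    )
```

Note that the flag describes what the library can do for parallel applications; it does not change how the serial command-line tools behave.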

Please note that the https://support.hdfgroup.org/ web site is no longer supported. The same FAQ can be found on the Support Portal here, with updated information:

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)
