Queuing chunks for compression and writing

Peter_Majer · March 16, 2015, 4:53pm

Dear All
We have been suffering from the fact that writing compressed files with HDF5 is significantly slower than writing uncompressed ones. I have been asking myself for a while whether there is a simple remedy. Would it be possible to have two queues of chunks when writing a file, one for compression and one for the actual writing, to achieve the following:

1) I enqueue N chunks for CompressionAndWriting. They initially enter CompressQueue.

2) The chunks from CompressQueue are concurrently compressed by multiple compression threads and subsequently enqueued in a WriteQueue.

3) A WriteThread sequentially writes all compressed chunks from WriteQueue to the file system.

This should keep the WriteThread constantly busy, and it should allow compressed writing to be faster than uncompressed writing by a factor roughly equal to the compression ratio.

Interface-wise, it would be nice to have "StartWrite" and "FinishWrite" methods, where StartWrite simply copies the data into the CompressQueue and returns immediately, while FinishWrite blocks until the write operation for the corresponding chunk has actually completed.
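
The proposal above can be sketched in a few lines of Python. This is only an illustration of the two-queue idea, not anything the HDF5 library provides: the class name ChunkWriter, the method names, and the write_chunk callback (a stand-in for whatever finally puts a compressed chunk into the file) are all hypothetical. It assumes zlib for the compression step and uses one pool of compression threads plus a single writer thread.

```python
import zlib
from concurrent.futures import Future, ThreadPoolExecutor


class ChunkWriter:
    """Hypothetical two-stage pipeline: parallel compression, one sequential writer."""

    def __init__(self, write_chunk, compress_threads=4, level=6):
        # write_chunk(index, payload) is an assumed callback that stores one
        # compressed chunk; it is only ever called from the single writer thread.
        self._write_chunk = write_chunk
        self._level = level
        self._compress_pool = ThreadPoolExecutor(max_workers=compress_threads)
        self._write_pool = ThreadPoolExecutor(max_workers=1)

    def start_write(self, index, chunk):
        """Queue a chunk for compression and return immediately with a handle."""
        done = Future()
        data = bytes(chunk)  # snapshot, so the caller may reuse a mutable buffer
        self._compress_pool.submit(self._compress, index, data, done)
        return done

    def finish_write(self, handle):
        """Block until the chunk behind this handle is written; return its compressed size."""
        return handle.result()

    def _compress(self, index, data, done):
        try:
            payload = zlib.compress(data, self._level)
            # Hand the compressed chunk over to the writer queue.
            self._write_pool.submit(self._write, index, payload, done)
        except Exception as exc:
            done.set_exception(exc)

    def _write(self, index, payload, done):
        try:
            self._write_chunk(index, payload)
            done.set_result(len(payload))
        except Exception as exc:
            done.set_exception(exc)

    def close(self):
        # Drain the compression stage first, so no write is submitted afterwards.
        self._compress_pool.shutdown(wait=True)
        self._write_pool.shutdown(wait=True)
```

Chunks reach the writer in the order in which compression finishes, not the order in which they were queued, so the callback receives the chunk index and has to place each chunk by position. Whether the compression threads really run in parallel depends on the compressor releasing the interpreter lock while it works, which CPython's zlib module is expected to do.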

Would this be possible?
Would it be feasible?
Would it be easy?

Thanks, Peter

Dr. Peter Majer
Image Analysis Scientist and Software Architect
Bitplane AG
www.bitplane.com

gheber · March 16, 2015, 6:20pm

Peter, there's an API call that lets you write chunks directly
into the file including chunks which you have compressed outside
the HDF5 filter pipeline. Have a look at:

http://www.hdfgroup.org/HDF5/doc/HL/RM_HDF5Optimized.html#H5DOwrite_chunk

See how fast you can write with H5DOwrite_chunk and then do
a back-of-the-envelope calculation to see how elaborate
a queueing mechanism you want.
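
One way to do that back-of-the-envelope calculation is to treat the pipeline as two stages and take the slower one as the bottleneck. The sketch below is only a rough model: the function names are made up for illustration, and it assumes that compression scales linearly with the number of threads, which real machines only approximate.

```python
import math


def pipeline_throughput(disk_mb_s, compress_mb_s_per_thread, threads, ratio):
    """Estimated rate of uncompressed data (MB/s) that the pipeline can absorb.

    disk_mb_s                 raw sequential write rate of the file system
    compress_mb_s_per_thread  uncompressed MB/s one compression thread consumes
    threads                   number of compression threads
    ratio                     compression ratio (uncompressed size / compressed size)
    """
    compress_stage = compress_mb_s_per_thread * threads
    # Every MB that reaches the disk stands for `ratio` MB of uncompressed data.
    write_stage = disk_mb_s * ratio
    return min(compress_stage, write_stage)


def threads_to_saturate_disk(disk_mb_s, compress_mb_s_per_thread, ratio):
    """Smallest thread count at which the writer, not the compressor, is the bottleneck."""
    return math.ceil(disk_mb_s * ratio / compress_mb_s_per_thread)
```

With purely hypothetical numbers (a 200 MB/s disk, 25 MB/s per gzip thread, a 3:1 ratio), the model gives 25 MB/s with one thread, 200 MB/s with eight threads (on par with uncompressed writing), and 600 MB/s from 24 threads onward, where the disk becomes the limit. Measured rates for H5DOwrite_chunk and for the compressor on real data should replace these figures.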

G.

paramon · March 17, 2015, 7:18am

Hello Peter!

16.03.2015 19:53, Peter Majer wrote:

Dear All

We have been experiencing and suffering from the fact that writing
compressed files with hdf is significantly slower than writing
uncompressed.

My experience suggests that this depends greatly on the compression algorithm of choice. For example, for my type of data (mass spec), gzip compression typically slowed data writes by a factor of 10 to 20, while lz4 compression was even a bit faster than writing uncompressed data.

I have been asking myself for a while whether there is a
simple remedy.

It seems so. Please try the lz4 compression plugin. For mass spec data, applying lz4 compression delivered all of the following at once:

1) Smaller file size (5x-10x compression).
2) Slightly faster data write.
3) Slightly faster data read.

The only problem with lz4 is that it is not available in a default HDF5 installation. However, I found using the dynamically loaded lz4 plugin reasonably easy.

Best wishes,
Andrey Paramonov

Dimitris_Servis · March 16, 2015, 6:38pm

Hi Peter,

Why do you do the compression on the HDF5 side? This is something you can do more efficiently on the business-logic side.

Best

Dimitris

Peter_Majer · March 16, 2015, 8:02pm

Dear Gerd
Thanks for pointing me to this. I have a few questions regarding it:

1) It is very important to me to create a "standard" HDF5 file that can be read (!) by the standard HDF5 library without any add-ons for decompression. I need this because there are many old versions of our product in the market that rely on the standard features of HDF5 for opening files, i.e. they can open gzip-compressed chunks because that is part of HDF5's functionality. They would not be able to open chunks that I have compressed with my own compression algorithm. (FYI: I cannot patch these old versions with a new decompression filter.)

2) My guess is that I could use gzip for compression (running it outside of the library so that it can run in parallel), write the chunks into the file using H5DOwrite_chunk, and in the HDF5 file set the filter mask to that for gzip. Then I should be able to read the file with a standard HDF5 library, which will do the decompression by itself?

3) I have come to love HDF5 for its extremely forgiving implementation. Over the years we have fiddled with chunk sizes. We never had to communicate a file format change to our customers because the library covered our back. That was really nice. What will happen if I write my own compressed chunks? Will I need to deliver a decompressor? Will I be able to change chunk sizes without breaking backward compatibility?

Thanks for your comments and help.
Cheers, Peter

gheber · March 16, 2015, 8:45pm

Peter,

1) It is very important to me to create a "standard" HDF5 file that can be read (!) by the standard HDF5 library without any add-ons for decompression. I need this because there are many old versions of our product in the market that rely on the standard features of HDF5 for opening files, i.e. they can open gzip-compressed chunks because that is part of HDF5's functionality. They would not be able to open chunks that I have compressed with my own compression algorithm. (FYI: I cannot patch these old versions with a new decompression filter.)

Correct. If you used a compression algorithm which is not available to your clients,
they'd be in trouble when attempting to read those datasets. Newer versions of the
library (1.8.11+) support the dynamic loading of filters; however, this would
require a minimum library version on the client side.

2) My guess is that I could use gzip for compression (running it outside of the library so that it can run in parallel), write the chunks into the file using H5DOwrite_chunk, and in the HDF5 file set the filter mask to that for gzip. Then I should be able to read the file with a standard HDF5 library, which will do the decompression by itself?

Correct.
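
As an illustration of this answer, here is a minimal sketch using h5py, whose low-level write_direct_chunk method wraps the direct chunk write call. It is not taken from the thread: the function name, the dataset name "image", and the restriction to 2-D arrays made of whole chunks are assumptions chosen to keep the example short. The dataset is created with the standard gzip filter, each chunk is compressed with zlib outside the filter pipeline, and a filter mask of 0 marks every filter in the pipeline as applied.

```python
import zlib

import h5py
import numpy as np


def write_precompressed(path, data, chunk_rows, level=6):
    """Write a 2-D array as gzip chunks that were compressed outside HDF5."""
    rows, cols = data.shape
    if rows % chunk_rows:
        raise ValueError("this sketch only handles whole chunks")
    with h5py.File(path, "w") as f:
        # Declare the standard gzip (deflate) filter so any reader knows
        # how to decode the chunks.
        dset = f.create_dataset(
            "image",  # hypothetical dataset name
            shape=data.shape,
            dtype=data.dtype,
            chunks=(chunk_rows, cols),
            compression="gzip",
            compression_opts=level,
        )
        for row in range(0, rows, chunk_rows):
            block = np.ascontiguousarray(data[row:row + chunk_rows])
            # This is the step that could run in parallel worker threads.
            payload = zlib.compress(block.tobytes(), level)
            # filter_mask=0: no filter was skipped for this chunk.
            dset.id.write_direct_chunk((row, 0), payload, filter_mask=0)
```

Reading the file back needs nothing special: an ordinary read such as f["image"][...] goes through the regular gzip filter. If the dataset also declared other filters (shuffle, for example), the external code would have to apply them in the same order before compressing.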

3) I have come to love HDF5 for its extremely forgiving implementation. Over the years we have fiddled with chunk sizes. We never had to communicate a file format change to our customers because the library covered our back. That was really nice. What will happen if I write my own compressed chunks? Will I need to deliver a decompressor? Will I be able to change chunk sizes without breaking backward compatibility?

Yes, you will need to provide a decompressor, either by compiling it into the version
of the HDF5 library you distribute with your application, or as a plugin (shared library)
to be loaded at runtime.

I'm not sure I understand what you mean by "breaking backward compatibility."
At the API level, H5Dread/write won't see a difference.
A change in chunk size might have an adverse effect on performance, for example,
if you've hard-tuned your application's dataset chunk cache sizes.

Best, G.
