File size

Hi,

I have a scenario in which my compressed h5 file needs to be updated with new data that is coming in every, say, 5 seconds.

Approach #1: keep the file opened and just write data as they come, or write a buffer at once.
Approach #2: open the file (RDWR), write the data (or a buffer) and then close the file.

Approach #1 is not desirable for my case because if there's any problem (an outage, etc.), the h5 file will likely get corrupted. Also, if I want to have a look at the file, I can't, because it's still being written to (still open).

Approach #2 addresses the issue above, BUT I noticed that if I open/write/close the file every 5 seconds, the compression gets really bad and the file size grows dramatically. Approach #1 doesn't suffer from this problem.

So, my question is: is there an "Approach #3" that gives me the best of both worlds? Less likely to leave me with a corrupted h5 file and, at the same time, a good compression ratio?

Thanks,
Carlos R.

Hello Carlos,

Why not write a program that collects data for a given amount of time, say 5
minutes, and stores it in a temporary text file? Then, at the end of the 5
minutes, store that data in HDF5, purge the temporary file, and continue
collecting data. If an outage happens, you should still have the data
available in your temporary file, which can be recovered. A rough sketch of
this idea is below.
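Something along these lines (a rough sketch only; the file name "spool.tmp" and the helper names are placeholders, and the actual HDF5 write is left out):

#include <stdio.h>

/* Append each sample to a plain-text spool file as it arrives. */
void spool_sample(double value)
{
    FILE *f = fopen("spool.tmp", "a");
    if (f) {
        fprintf(f, "%.17g\n", value);
        fclose(f);
    }
}

/* Every few minutes: read the spool, write it into the HDF5 file in one
 * batch (not shown here), then truncate the spool so nothing is stored twice. */
void commit_and_purge(void)
{
    /* ... open spool.tmp, parse the values, append them to the HDF5 dataset ... */
    FILE *f = fopen("spool.tmp", "w");   /* opening with "w" truncates the file */
    if (f)
        fclose(f);
}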

Regards,
Landon Clipp


Hi,

I think these may be separate issues: if the compression is bad, it may be because your chunk size is too small. Try writing every 10 seconds or so.
Then there is an approach #3, which does not close the file but just flushes it. Flushing also ensures that the on-disk structure is intact, so you are
safe against a crashing program. The call would be H5Fflush().
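For illustration, a minimal sketch of that approach #3, assuming an already open file and a 1-D extendible dataset (the names are placeholders):

#include "hdf5.h"

/* Append n values to an extendible 1-D dataset, then flush the file so the
 * on-disk structures stay consistent even if the program dies afterwards. */
void append_and_flush(hid_t file, hid_t dset, const double *buf, hsize_t n)
{
    hsize_t cur[1], newsize[1], start[1], count[1] = { n };

    /* current extent of the dataset */
    hid_t fspace = H5Dget_space(dset);
    H5Sget_simple_extent_dims(fspace, cur, NULL);
    H5Sclose(fspace);

    /* grow the dataset and select the newly added region */
    newsize[0] = cur[0] + n;
    H5Dset_extent(dset, newsize);
    fspace = H5Dget_space(dset);
    start[0] = cur[0];
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);
    H5Sclose(mspace);
    H5Sclose(fspace);

    /* flush metadata and raw data without closing the file */
    H5Fflush(file, H5F_SCOPE_LOCAL);
}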

Regards,

      Mark Könnecke


Hi Carlos,

  use HDF5 1.10. It provides the ability to write to a file while the file always remains readable by another process, and it ensures the file will never be corrupted. That feature is called SWMR (single writer, multiple reader) and was introduced with 1.10.

Also, you may consider using the LZ4 filter for compression instead of the internal deflate filter. LZ4 does not compress as strongly as deflate, but it is faster by an order of magnitude, almost like uncompressed read/write, so it may be worth it, especially for time-constrained data I/O. You may also want to optimize the chunked layout of the dataset according to your data updates, since each chunk is compressed on its own.
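As a rough sketch (the path and the dataset setup are placeholders), opening a file as a SWMR writer with HDF5 1.10 looks roughly like this; readers open the same file with H5F_ACC_RDONLY | H5F_ACC_SWMR_READ:

#include "hdf5.h"

hid_t open_swmr_writer(const char *path)
{
    /* SWMR needs the 1.10 file format, so pin the library version bounds */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    /* ... create the chunked / compressed datasets here, then ... */

    /* switch into SWMR writer mode; from now on other processes can read
     * the file while this one keeps appending */
    H5Fstart_swmr_write(file);
    return file;
}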

Cheers,

              Werner


--
___________________________________________________________________________
Dr. Werner Benger, Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809, Fax: +1 225 578 5362

Thanks Landon, Koennecke, Werner and Patrick for the feedback.

I'm suffering exactly the problem described in section 3 in this link:
https://support.hdfgroup.org/HDF5/doc/H5.user/Performance.html

I like the suggestion to use the new SWMR feature, but I'm not that confident about what happens if there's, say, a shutdown exactly during a write operation.

I'll proceed by keeping a temporary UNcompressed H5 file and then, from time to time, running the h5repack tool on it to compress it into another, permanent file. In my tests, I could open/write/close an uncompressed H5 file every x seconds without suffering from the problem described in the link above; the issue really only happens with compressed datasets. In another test, I ran h5repack on a bloated file and the size really did go down to what it should be.
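Something like the following, where the file names and the compression level are just examples:

  h5repack -f GZIP=6 temporary_uncompressed.h5 permanent_compressed.h5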

As a side note, I don't think my chunk size is too small for the data I have. But it's not as big as the chunk cache (1MB).

Thanks and Regards,
Carlos


This sounds like a problem I encountered before. Here's my post about the issue and resolution:

http://hdf-forum.184993.n3.nabble.com/Deflate-and-partial-chunk-writes-td4028713.html

Basically, my solution was to locally buffer data until I'd filled up an entire chunk before writing to disk. Otherwise, there are some inefficiencies in the compression that will cause your files to be oversized. A rough sketch of that buffering is below.
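As a rough sketch, assuming a 1-D dataset whose chunk size is CHUNK elements (all names are placeholders):

#include "hdf5.h"

#define CHUNK 4096            /* must match the dataset's chunk size */

static double  pending[CHUNK];
static hsize_t npending = 0;
static hsize_t written  = 0;  /* elements already in the dataset */

/* Buffer samples in memory; only extend and write the dataset in
 * whole-chunk units so each compressed chunk is written exactly once. */
void add_sample(hid_t dset, double v)
{
    pending[npending++] = v;
    if (npending < CHUNK)
        return;                           /* keep buffering */

    hsize_t newsize[1] = { written + CHUNK };
    hsize_t start[1]   = { written };
    hsize_t count[1]   = { CHUNK };

    H5Dset_extent(dset, newsize);
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, pending);
    H5Sclose(mspace);
    H5Sclose(fspace);

    written  += CHUNK;
    npending  = 0;
}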

--Patrick
