Schlumberger-Private
Thanks Landon, Koennecke, Werner and Patrick for the feedback.
I'm suffering exactly the problem described in section 3 in this link:
https://support.hdfgroup.org/HDF5/doc/H5.user/Performance.html
I like the suggestion to use the new SWMR feature, but I'm not that confident it would survive, say, a shutdown that happens exactly during a write operation.
I'll proceed by writing to a temporary UNcompressed H5 file and then, from time to time, running the h5repack tool on that file to compress it into another, permanent file. In my tests, I could open/write/close an uncompressed H5 file every x seconds without suffering from the problem described in the link above; the issue really only happens with compressed datasets. Another test I did was to run h5repack on a bloated file, and its size really does go down to what it should be.
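The periodic repack step described above can be sketched like this. It builds the standard h5repack invocation (apply GZIP to all datasets while copying into a freshly laid-out file); the file names and compression level are placeholders, and the tool is only actually invoked if it is found on PATH:

```python
import shutil
import subprocess

def repack_cmd(src, dst, level=6):
    # h5repack -f GZIP=<level> <src> <dst>
    # copies src into dst with a fresh layout, compressing every dataset.
    return ["h5repack", "-f", f"GZIP={level}", src, dst]

cmd = repack_cmd("live_uncompressed.h5", "archive_compressed.h5")
print(" ".join(cmd))

# Only run the tool if it is actually installed.
if shutil.which("h5repack"):
    subprocess.run(cmd, check=True)
```

Scheduling this from cron (or a timer in the acquisition process) keeps the live file fast and uncompressed while the permanent copy stays small.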
As a side note, I don't think my chunk size is too small for the data I have, but it is smaller than the chunk cache (1 MB).
Thanks and Regards,
Carlos
···
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Werner Benger
Sent: Wednesday, October 05, 2016 3:43 AM
To: hdf-forum@lists.hdfgroup.org
Subject: [Ext] Re: [Hdf-forum] File size
Hi Carlos,
Use HDF5 1.10. It provides the ability to write to a file while it remains readable by another process, and it ensures the file will never be corrupted. That feature is called SWMR (single writer, multiple readers) and was introduced with 1.10.
Also, you may consider using the LZ4 filter for compression instead of the internal deflate filter. LZ4 does not compress as strongly as deflate, but it's faster by an order of magnitude, nearly as fast as uncompressed read/write, so it may be worth it, especially for time-constrained data I/O. You may also want to optimize the chunked layout of the dataset according to your data updates, since each chunk is compressed on its own.
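One way to line chunks up with the update pattern, sketched with h5py (the dataset shape and row width of 5000 are placeholders): make the chunk shape match one append, so each update compresses exactly one chunk and never rewrites an old one. GZIP stands in here for simplicity; the LZ4 filter is available through the third-party hdf5plugin package (e.g. `**hdf5plugin.LZ4()` as the compression arguments):

```python
import h5py

# In-memory file (core driver, nothing hits disk) just to show the layout.
f = h5py.File("layout_demo.h5", "w", driver="core", backing_store=False)

# One chunk per appended row: each update compresses one whole chunk.
dset = f.create_dataset("signal", shape=(0, 5000), maxshape=(None, 5000),
                        dtype="f4", chunks=(1, 5000),
                        compression="gzip", compression_opts=4)
chunks = dset.chunks
print(chunks)  # (1, 5000)
f.close()
```

If the chunk were much larger than one update, every append would decompress, modify, and recompress a partially filled chunk, which is one way repeated open/write/close cycles bloat a compressed file.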
Cheers,
Werner
On 05.10.2016 02:08, Carlos Penedo Rocha wrote:
Hi,
I have a scenario in which my compressed h5 file needs to be updated with new data that is coming in every, say, 5 seconds.
Approach #1: keep the file opened and just write data as they come, or write a buffer at once.
Approach #2: open the file (RDWR), write the data (or a buffer) and then close the file.
Approach #1 is not desirable in my case because if there's any problem (an outage, etc.), the h5 file will likely get corrupted. Also, if I want to have a look at the file, I can't, because it's still open for writing.
Approach #2 addresses the issues above, BUT I noticed that if I open/write/close the file every 5 seconds, the compression gets really bad and the file size grows dramatically. Approach #1 doesn't suffer from this problem.
So, my question is: is there an "Approach #3" that gives me the best of both worlds, i.e. less likely to leave me with a corrupted h5 file and, at the same time, a good compression ratio?
Thanks,
Carlos R.
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
--
___________________________________________________________________________
Dr. Werner Benger, Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362