We observe for several hours (or days) and are thinking of storing the
observed data in HDF5 files. In case the process or system crashes, we
do not want to lose data after several hours of observing.
At the beginning of an observation the HDF5 file and all its groups,
attributes, and datasets are created (and flushed). Thereafter the
datasets get extended during the observation. Now I wonder how much data
can still be read in case of a crash and what can be done to avoid
loss of data.
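To make the setup concrete, here is a rough sketch of the kind of code I
have in mind (the group and dataset names, the data type, and the chunk
size are just examples):

/* Create the file, a group and an extendible (chunked) dataset up
   front, then flush once so the initial layout is on disk. */
hid_t file = H5Fcreate("obs.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
hid_t grp  = H5Gcreate2(file, "/observation", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

hsize_t dims[1]    = {0};               /* start empty          */
hsize_t maxdims[1] = {H5S_UNLIMITED};   /* extendible dataset   */
hsize_t chunk[1]   = {4096};            /* example chunk size   */

hid_t space = H5Screate_simple(1, dims, maxdims);
hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 1, chunk);

hid_t dset = H5Dcreate2(grp, "samples", H5T_NATIVE_DOUBLE, space,
                        H5P_DEFAULT, dcpl, H5P_DEFAULT);

H5Fflush(file, H5F_SCOPE_GLOBAL);       /* everything created; flush once */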
- Can all data be read, or only the data written up to the latest flush?
- Can it happen that in case of a crash the file gets corrupted and
nothing can be read, even if regular flushes were done? If so, is there
anything that can be done to be 100% sure the file does not get
corrupted? In particular, can the file be corrupted if the crash happens
during a flush?
- If regular flushes need to be done, is there a scheme that minimizes
IO? E.g. I can imagine it would be good to flush when a dataset chunk is
full. Maybe there are other considerations.
- What is the overhead of a flush? I assume that only the data chunks
that were changed get written and probably some index pages. How many
index pages? One per data set? Are data chunks written before index
pages to reduce the risk of file corruption?
I guess there are other issues I did not think of.
Cheers,
Ger
Hi Ger,

On Feb 9, 2010, at 7:07 AM, Ger van Diepen wrote:
> We observe for several hours (or days) and are thinking of storing the observed data in HDF5 files. In case the process or system crashes, we do not want to lose data after several hours of observing.
> At the beginning of an observation the HDF5 file and all its groups, attributes, and datasets are created (and flushed). Thereafter the datasets get extended during the observation. Now I wonder how much data can still be read in case of a crash and what can be done to avoid loss of data.
> - Can all data be read, or only the data written up to the latest flush?
Unless you've turned off metadata cache evictions (which we call "corking the cache" :-), some metadata may have been evicted from the cache and written to the file since the last flush operation, which could leave the file in an "unstable" state.
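In case it helps, here's a rough (untested) sketch of corking the cache
via the metadata cache configuration on the file access property list;
note that when evictions are disabled, the automatic cache resizing has
to be switched off as well:

/* Create a file with metadata cache evictions disabled ("corked"
   cache); error checking omitted. */
hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

H5AC_cache_config_t config;
config.version = H5AC__CURR_CACHE_CONFIG_VERSION;
H5Pget_mdc_config(fapl, &config);        /* start from the current settings */

config.evictions_enabled = 0;            /* cork the cache */
config.incr_mode         = H5C_incr__off;       /* required when  */
config.flash_incr_mode   = H5C_flash_incr__off; /* evictions are  */
config.decr_mode         = H5C_decr__off;       /* disabled       */

H5Pset_mdc_config(fapl, &config);

hid_t file = H5Fcreate("obs.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
H5Pclose(fapl);

/* Note: with evictions off, the metadata cache will keep growing for
   the duration of the run, so memory use goes up over time. */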
> - Can it happen that in case of a crash the file gets corrupted and nothing can be read, even if regular flushes were done? If so, is there anything that can be done to be 100% sure the file does not get corrupted? In particular, can the file be corrupted if the crash happens during a flush?
Yes, on both counts.
> - If regular flushes need to be done, is there a scheme that minimizes IO? E.g. I can imagine it would be good to flush when a dataset chunk is full. Maybe there are other considerations.
> - What is the overhead of a flush? I assume that only the data chunks that were changed get written and probably some index pages. How many index pages? One per data set? Are data chunks written before index pages to reduce the risk of file corruption?
All the dirty metadata in the metadata cache will be flushed out to the file, along with any cached raw data (chunked or otherwise).
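For what it's worth, here's a rough sketch of the "flush when a dataset
chunk is full" idea from your question (the rank, type, and sizes are
only illustrative):

/* Append one chunk's worth of samples to an unlimited 1-D dataset
   and flush afterwards, so the flush falls on a chunk boundary. */
void append_chunk(hid_t file, hid_t dset, const double *buf,
                  hsize_t chunk_len, hsize_t *cur_len)
{
    hsize_t new_len = *cur_len + chunk_len;
    H5Dset_extent(dset, &new_len);                 /* grow by one chunk */

    hid_t fspace = H5Dget_space(dset);             /* select the new region */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, cur_len, NULL,
                        &chunk_len, NULL);
    hid_t mspace = H5Screate_simple(1, &chunk_len, NULL);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    *cur_len = new_len;

    /* Push dirty metadata and any cached raw data out to the file. */
    H5Fflush(file, H5F_SCOPE_GLOBAL);
}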
Probably the "right" solution will be to use the metadata journaling feature that will be available in the 1.10.0 release.
Quincey