Preventing file corruption from power loss

Hi,

I am using HDF5 very happily except for occasional issues with file
corruption. I would like to be as robust as possible to power loss at
arbitrary times. I don't mind losing the last several seconds or even
minutes of data, but I don't want to corrupt the file in some way that means
I lose access to older data I have already written out. After working on
the issue for a bit I have several ideas, and would like some feedback from
the community on which to pursue.

1. Maybe this will just go away with the metadata journaling feature in
1.10? Or if it is not completely gone, I can at least run a tool to repair
the metadata when the file is not properly closed. Does anyone have any
experience with the current state of this feature? Is there anything
outside of the metadata that won't be handled by this journaling?

2. Maybe the behavior of H5FD_STDIO would be better than H5FD_SEC2. The
corrupt files return "Invalid file size or file size less than superblock
eoa. Validation stopped." when h5check is run on them. I found a reference
to what I think is this particular issue here:
http://www.hdfgroup.org/HDF5/doc/TechNotes/VFL.html#SEC21
Alternatively, maybe I can just repair my files by writing a new EOF marker
or changing the EOA marker. But then again, this may be just the first
problem h5check finds and not the only problem with the file.

3. Use H5FD_SPLIT, and make periodic backups of the metadata portion of the
file. I started experimenting with this option (roughly as sketched below)
but I got some odd results.
Before I spend too much more time on this I'd like to know that this
actually does make sense given what gets stored in which file. My datasets
are only expanding, so I'm hoping that an older metadata file would still
provide correct information for accessing objects in a data file that has
some additional data (possibly partially) written to it. What I saw though
was that while I could still open the file, some datasets seemed to be
missing. Does the layout of the data portion change over time if I never
delete data? I do overwrite data, so maybe chunks get shuffled around
whenever they are actually stored. Also, is there a way to use h5repack or a
similar utility to put split files back into a single file that can be
opened with the SEC2 VFD?
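
For reference, my split-driver setup is roughly the following (the base name
"capture" and the "-m.h5"/"-r.h5" extensions are just placeholders for what I
actually use):

  hid_t fapl, file;

  fapl = H5Pcreate(H5P_FILE_ACCESS);

  /* Metadata ends up in "capture-m.h5", raw data in "capture-r.h5" */
  H5Pset_fapl_split(fapl, "-m.h5", H5P_DEFAULT, "-r.h5", H5P_DEFAULT);

  file = H5Fcreate("capture", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);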

Other suggestions are of course welcome. Thanks for all the great work on
HDF5.

-Ethan

Hi Ethan,

I am using HDF5 very happily except for occasional issues with file corruption. I would like to be as robust as possible to power loss at arbitrary times. I don't mind losing the last several seconds or even minutes of data, but I don't want to corrupt the file in some way that means I lose access to older data I have already written out. After working on the issue for a bit I have several ideas, and would like some feedback from the community on which to pursue.

1. Maybe this will just go away with the metadata journaling feature in 1.10? Or if it is not completely gone, I can at least run a tool to repair the metadata when the file is not properly closed. Does anyone have any experience with the current state of this feature? Is there anything outside of the metadata that won't be handled by this journaling?

  Yes, metadata journaling should address these file corruption issues, at least rolling the file back to the state of the last completed API operation before the application aborted. It will not help with updates to raw data (i.e. H5Dwrite calls) that haven't hit the disk yet, though.

2. Maybe the behavior of H5FD_STDIO would be better than H5FD_SEC2. The corrupt files return "Invalid file size or file size less than superblock eoa. Validation stopped." when h5check is run on them. I found a reference to what I think is this particular issue here: http://www.hdfgroup.org/HDF5/doc/TechNotes/VFL.html#SEC21 Alternatively maybe I can just repair my files by writing a new EOF marker or changing the EOA marker. But then again, this may be the first problem h5check finds but not the only problem with the file.

3. Use H5FD_SPLIT, and make periodic backups of the metadata portion of the file. I started experimenting with this option but I got some odd results. Before I spend too much more time on this I'd like to know that this actually does make sense given what gets stored in which file. My datasets are only expanding, so I'm hoping that an older metadata file would still provide correct information for accessing objects in a data file that has some additional data (possibly partially) written to it. What I saw though was that while I could still open the file, some datasets seemed to be missing. Does the layout of the data portion change over time if I never delete data? I do overwrite data, so maybe chunks get shuffled around whenever they are actually stored. Also is there a way to use h5repack or a similar utility to put split files back into a single file that can be opened with the SEC2 VFD?

  Using another file driver probably won't help, since the state of the metadata structures on disk could still be inconsistent.

Other suggestions are of course welcome. Thanks for all the great work on HDF5.

  If you have memory to spare, you could "cork the cache" until you reach a suitable point to update the metadata in the file, call H5Fflush(), and then continue with your application. Here's a code snippet to cork the cache:

  H5AC_cache_config_t mdc_config;
  hid_t fapl;

  /* Start from a default file-access property list */
  fapl = H5Pcreate(H5P_FILE_ACCESS);

  /* Retrieve the current metadata cache configuration */
  mdc_config.version = H5AC__CURR_CACHE_CONFIG_VERSION;
  H5Pget_mdc_config(fapl, &mdc_config);

  /* "Cork" the cache: no evictions, no automatic cache resizing */
  mdc_config.evictions_enabled = FALSE;
  mdc_config.incr_mode = H5C_incr__off;
  mdc_config.decr_mode = H5C_decr__off;

  H5Pset_mdc_config(fapl, &mdc_config);

  <other calls to modify the fapl>

  <H5Fopen or H5Fcreate with this fapl>

  But it is possible that the application could fail in the middle of flushing the cache to the file, so this may not help in every case. Generally speaking, journaling will solve the problem entirely, but it's not quite here yet.
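
  As a rough illustration only, a checkpointing loop using a fapl corked as above might look something like this; the file name, the keep_running flag, and the flush interval are placeholders, not part of the HDF5 API:

  hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  int   i = 0;

  while (keep_running) {            /* placeholder for your acquisition loop */
      /* ... H5Dwrite() calls for the newest samples ... */

      if (++i % 1000 == 0)          /* placeholder checkpoint interval */
          H5Fflush(file, H5F_SCOPE_GLOBAL);   /* push metadata and data to disk */
  }

  H5Fclose(file);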

  Quincey

On Jun 7, 2010, at 8:42 PM, Ethan Dreyfuss wrote: