To preface with some questions:

What are the consequences of doing this?

What exact situations can cause this to happen in the wild? The comment
about the one case where this happens does not make much sense to me.

Is there a smarter way of fixing this that works naturally with our
environment? Something along the lines of "if on gluster then skip the
file-length check, else check it"? (A sketch of one such approach
appears near the end of this message.)
We are currently running some HDF5 files on a gluster distributed file
system, using flock locking to coordinate access to a single file from
multiple compute nodes. All gluster performance tuning parameters, such
as write-behind and flush-behind, are turned off.
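For reference, the coordination pattern looks roughly like the sketch
below; the lock-file and data-file paths are illustrative, not our
actual paths::

    /* Minimal sketch of the flock-based coordination; the lock-file and
     * data-file paths are illustrative only. */
    #include <sys/file.h>   /* flock() */
    #include <fcntl.h>      /* open() */
    #include <unistd.h>     /* close() */
    #include <stdio.h>
    #include "hdf5.h"

    int main(void)
    {
        /* Each compute node takes an exclusive lock on a sidecar lock
         * file before touching the shared HDF5 file. */
        int lockfd = open("/gluster/data/run.h5.lock", O_CREAT | O_RDWR, 0644);
        if(lockfd < 0) { perror("open lock file"); return 1; }
        if(flock(lockfd, LOCK_EX) < 0) { perror("flock"); close(lockfd); return 1; }

        hid_t file = H5Fopen("/gluster/data/run.h5", H5F_ACC_RDWR, H5P_DEFAULT);
        if(file >= 0) {
            /* ... read/write datasets ... */
            H5Fclose(file);
        }

        flock(lockfd, LOCK_UN);
        close(lockfd);
        return 0;
    }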
Occasionally HDF5 will fail with the following traceback when opening a
file::
    HDF5ExtError: HDF5 error back trace

      File "../../../src/H5F.c", line 1514, in H5Fopen
        unable to open file
      File "../../../src/H5F.c", line 1309, in H5F_open
        unable to read superblock
      File "../../../src/H5Fsuper.c", line 322, in H5F_super_read
        unable to load superblock
      File "../../../src/H5AC.c", line 1831, in H5AC_protect
        H5C_protect() failed.
      File "../../../src/H5C.c", line 6160, in H5C_protect
        can't load entry
      File "../../../src/H5C.c", line 10990, in H5C_load_entry
        unable to load entry
      File "../../../src/H5Fsuper_cache.c", line 467, in H5F_sblock_load
        truncated file

    End of HDF5 error back trace
Through extensive testing, I was able to determine that even though the
file metadata cache is disabled in gluster/fuse, the FUSE driver will
occasionally report a stale size for the file when performing the
"FSEEK_END->FTELL" or fstat(file) operations, even though the file on
disk is actually the correct size. When the reported size comes back
smaller, it is still possible to seek beyond that point and read/write
valid data from/to the file.
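A probe along the following lines reproduces the behaviour (the file
path is illustrative; on a healthy local file system the read past the
reported end would simply return 0 bytes)::

    /* Compare the size reported by fstat()/lseek(SEEK_END) with what
     * pread() can actually return past that offset.  The file path is
     * illustrative only. */
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("/gluster/data/run.h5", O_RDONLY);
        if(fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);
        off_t end = lseek(fd, 0, SEEK_END);
        printf("fstat size = %lld, SEEK_END = %lld\n",
               (long long)st.st_size, (long long)end);

        /* On gluster, a read past the reported end of file can still
         * return valid data when the reported size is stale. */
        char buf[64];
        ssize_t n = pread(fd, buf, sizeof(buf), end);
        printf("pread returned %zd bytes past the reported end\n", n);

        close(fd);
        return 0;
    }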
So I altered H5Fsuper_cache.c, commenting out the truncation check in
the block at lines 461-474::
    /*
     * Make sure that the data is not truncated. One case where this is
     * possible is if the first file of a family of files was opened
     * individually.
     */
    if(HADDR_UNDEF == (eof = H5FD_get_eof(lf)))
        HGOTO_ERROR(H5E_FILE, H5E_CANTOPENFILE, NULL, "unable to determine file size")

    /* (Account for the stored EOA being absolute offset -QAK) */
    // if((eof + sblock->base_addr) < stored_eoa)
    //     HGOTO_ERROR(H5E_FILE, H5E_TRUNCATED, NULL,
    //         "truncated file: eof = %llu, sblock->base_addr = %llu, stored_eoa = %llu",
    //         (unsigned long long)eof, (unsigned long long)sblock->base_addr,
    //         (unsigned long long)stored_eoa)
I also removed the unit tests that assert that this exception is
raised.
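On the question from the preface: rather than deleting the check
outright, one could gate it at run time inside H5F_sblock_load. The
sketch below does this with an environment variable;
HDF5_SKIP_TRUNCATION_CHECK is purely hypothetical and not an existing
HDF5 option::

    /* Hypothetical run-time gate for the truncation check.  The
     * HDF5_SKIP_TRUNCATION_CHECK variable is invented for illustration
     * and is not part of HDF5. */
    if(NULL == getenv("HDF5_SKIP_TRUNCATION_CHECK")) {
        /* (Account for the stored EOA being absolute offset -QAK) */
        if((eof + sblock->base_addr) < stored_eoa)
            HGOTO_ERROR(H5E_FILE, H5E_TRUNCATED, NULL, "truncated file")
    }

A file access property would probably be cleaner long term, but an
environment variable keeps the change local to the superblock load
path.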
We have been running HDF5 files this way for a couple of weeks now and
everything is fine: all the data is there, and we have yet to see a
corrupted file.
Thank you for your time.
--
Thadeus