HDF5 on a distributed file system such as Gluster

To preface with a few questions:

What are the consequences for doing this?

What exact situations can cause this to happen in the wild? The comment in
the source about the one case where this happens does not make much sense
to me.

Is there a smarter way of fixing this so that it works naturally with our
environment? Something along the lines of "if on gluster, then don't check
the file length; otherwise, check it"?

We are currently running some HDF5 files on a gluster distributed file
system, using flock locking to coordinate access to a single file from
multiple compute nodes. All gluster performance tuning parameters, such as
write-behind and flush-behind, are turned off.
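
For context, the per-node coordination looks roughly like the following.
This is only a minimal sketch of the pattern (the lock-file and HDF5 file
paths are made up for the example), not our exact wrapper code::

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <unistd.h>
    #include "hdf5.h"

    /* Minimal sketch of the coordination pattern: take an exclusive
     * flock() on a sidecar lock file before touching the shared HDF5
     * file, release it afterwards.  File names are illustrative only. */
    int main(void)
    {
        int lockfd = open("/gluster/data/results.h5.lock", O_CREAT | O_RDWR, 0644);
        if(lockfd < 0 || flock(lockfd, LOCK_EX) < 0) {
            perror("flock");
            return 1;
        }

        hid_t file = H5Fopen("/gluster/data/results.h5", H5F_ACC_RDWR, H5P_DEFAULT);
        if(file >= 0) {
            /* ... read/write datasets while holding the lock ... */
            H5Fclose(file);
        }

        flock(lockfd, LOCK_UN);   /* release so the next compute node can proceed */
        close(lockfd);
        return 0;
    }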

Occasionally HDF5 will fail with the following traceback when opening a
file::

HDF5ExtError: HDF5 error back trace

  File "../../../src/H5F.c", line 1514, in H5Fopen
    unable to open file
  File "../../../src/H5F.c", line 1309, in H5F_open
    unable to read superblock
  File "../../../src/H5Fsuper.c", line 322, in H5F_super_read
    unable to load superblock
  File "../../../src/H5AC.c", line 1831, in H5AC_protect
    H5C_protect() failed.
  File "../../../src/H5C.c", line 6160, in H5C_protect
    can't load entry
  File "../../../src/H5C.c", line 10990, in H5C_load_entry
    unable to load entry
  File "../../../src/H5Fsuper_cache.c", line 467, in H5F_sblock_load
    truncated file

End of HDF5 error back trace

Through a lot of testing I was able to determine that, even though the file
metadata cache is disabled in gluster/fuse, the fuse driver will
occasionally report a stale (smaller) size for the file during the
"FSEEK_END->FTELL" or fstat(file) operations, even though the file on disk
is actually the correct size. When the reported size comes back smaller, it
is still possible to seek beyond that point and read/write valid data
from/to the file.
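
To illustrate the symptom, the following is a sketch of the kind of check I
was doing by hand (path and buffer size are illustrative only): the size
reported by fstat()/lseek(SEEK_END) comes back too small, yet a read past
that reported end still returns valid bytes rather than hitting EOF::

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/gluster/data/results.h5";
        struct stat st;
        int fd = open(path, O_RDONLY);
        if(fd < 0 || fstat(fd, &st) < 0)
            return 1;

        /* What the FSEEK_END->FTELL / fstat style size queries see. */
        off_t reported = lseek(fd, 0, SEEK_END);
        printf("fstat size = %lld, lseek(SEEK_END) = %lld\n",
               (long long)st.st_size, (long long)reported);

        /* Seek beyond the reported end and try to read; on the affected
         * gluster mounts this still returns real data instead of EOF. */
        char buf[512];
        ssize_t n = pread(fd, buf, sizeof(buf), reported);
        printf("read %zd bytes past the reported EOF\n", n);

        close(fd);
        return 0;
    }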

So I altered H5Fsuper_cache.c and commented out the following lines::

461-474 H5Fsuper_cache.c
    /*
     * Make sure that the data is not truncated. One case where this is
     * possible is if the first file of a family of files was opened
     * individually.
     */
    if(HADDR_UNDEF == (eof = H5FD_get_eof(lf)))
        HGOTO_ERROR(H5E_FILE, H5E_CANTOPENFILE, NULL, "unable to determine file size")

    /* (Account for the stored EOA being absolute offset -QAK) */
    // if((eof + sblock->base_addr) < stored_eoa)
    //     HGOTO_ERROR(H5E_FILE, H5E_TRUNCATED, NULL, "truncated file: eof = %llu, sblock->base_addr = %llu, stored_eoa = %llu", (unsigned long long)eof, (unsigned long long)sblock->base_addr, (unsigned long long)stored_eoa)

I also removed some of the unit tests that assert that this error is
raised.
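
A less drastic variant, along the lines of the "if on gluster, skip the
check" idea above, would be to keep the comparison but gate it behind a
switch. The following is only a sketch of what that could look like inside
H5F_sblock_load; the HDF5_SKIP_TRUNC_CHECK environment variable is made up
for the example and is not something HDF5 provides::

    /* Sketch only: skip the truncation check when a (hypothetical)
     * HDF5_SKIP_TRUNC_CHECK environment variable is set, e.g. on gluster
     * mounts where the reported EOF can lag behind the real file size. */
    if(NULL == HDgetenv("HDF5_SKIP_TRUNC_CHECK")) {
        /* (Account for the stored EOA being absolute offset -QAK) */
        if((eof + sblock->base_addr) < stored_eoa)
            HGOTO_ERROR(H5E_FILE, H5E_TRUNCATED, NULL, "truncated file")
    }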

We have been running HDF5 files this way for a couple of weeks now and
everything is working fine: all data is there and we have yet to see a
corrupted file.

Thank you for your time.


--
Thadeus