HDF5 file grew to 15 TB during filesystem failure

We had a filesystem failure not long ago, and after cleaning things up we discovered an HDF5 file listed as 15 TB even though it contained only 27 GB of actual data (as reported by h5stat -S). The file was being created by a program that 'translates' data into an HDF5 schema. The filesystem is Lustre 2.4, unstriped, on Linux (Red Hat 7). The app uses HDF5 1.8.15, which we built with parallel MPI support; while the app is an MPI program, it does not use the parallel HDF5 interface.
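In case it helps, here is a small sketch of how the same mismatch could be checked programmatically; the file name is a placeholder, and summing dataset storage is only an approximation of what h5stat -S reports (the relevant calls are H5Fget_filesize, H5Ovisit and H5Dget_storage_size):

/* Sketch: compare the file's allocated size with the storage actually
 * used by its datasets. "translated.h5" is a placeholder name. */
#include <hdf5.h>
#include <stdio.h>

static herr_t sum_storage(hid_t obj, const char *name,
                          const H5O_info_t *info, void *op_data)
{
    if (info->type == H5O_TYPE_DATASET) {
        hid_t dset = H5Dopen2(obj, name, H5P_DEFAULT);
        if (dset >= 0) {
            *(hsize_t *)op_data += H5Dget_storage_size(dset);
            H5Dclose(dset);
        }
    }
    return 0;
}

int main(void)
{
    hsize_t eof = 0, used = 0;
    hid_t file = H5Fopen("translated.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) return 1;

    H5Fget_filesize(file, &eof);                       /* allocated file size */
    H5Ovisit(file, H5_INDEX_NAME, H5_ITER_NATIVE,
             sum_storage, &used);                      /* bytes held by datasets */
    printf("file size: %llu bytes, dataset storage: %llu bytes\n",
           (unsigned long long)eof, (unsigned long long)used);

    H5Fclose(file);
    return 0;
}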

Our current theory is that, due to the filesystem failure, the program could allocate space for the file but not write to it. I'm no expert on filesystem issues like this, but I understand it is possible to allocate more space than one physically has on disk.

Does anyone know HDF5's behavior in this regard? If a write fails, will it keep making new allocations, which would explain why the file size grew so large? Or is this a bug in HDF5, and it should error out on the write?
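Independent of the answer, one cheap defensive step on my side would be to check every write and flush explicitly and abort as soon as one fails. This is only a sketch, not our real translator code: all handle and buffer names are placeholders, and flushing after every write is obviously heavy-handed; the point is just to surface a write error as soon as it happens.

#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

/* Fail fast if a write or flush reports an error instead of letting the
 * file keep growing. 'dset', 'memspace', 'filespace', 'buf' and 'file'
 * are placeholders for whatever the translator is holding. */
static void checked_write(hid_t dset, hid_t memspace, hid_t filespace,
                          const double *buf, hid_t file)
{
    if (H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                 H5P_DEFAULT, buf) < 0) {
        fprintf(stderr, "H5Dwrite failed, aborting translation\n");
        H5Eprint2(H5E_DEFAULT, stderr);      /* dump the HDF5 error stack */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    if (H5Fflush(file, H5F_SCOPE_LOCAL) < 0) {   /* force data to the file */
        fprintf(stderr, "H5Fflush failed, aborting translation\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}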

Then there is the question of what I can do to make the translating app more robust. One option is to upgrade to 1.8.17; the other is that I have been looking at the document

and wondering if I should use a non-default file space management strategy. Currently we just use the default. The translating app never deletes HDF5 objects from the output; it creates hundreds of chunked datasets, some small, with the larger ones having chunks capped at 100 MB. The document suggests there may be a better file space management property than the default when you never remove objects.
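For concreteness, this is the kind of thing I imagine the document means by a non-default strategy. As far as I can tell the H5Pset_file_space_strategy call below comes from the 1.10 series rather than 1.8.x, so treat it purely as an illustration of the knob, not something I expect to drop into our 1.8 build:

/* Illustration only: setting a non-default file space strategy on the
 * file creation property list (API from newer HDF5 releases). */
#include <hdf5.h>

int main(void)
{
    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);

    /* Aggregator-only strategy, no persistent free-space tracking; a
     * guess at what fits an app that only adds objects and never deletes. */
    H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_AGGR,
                               0 /* persist free space */,
                               1 /* free-space section threshold */);

    hid_t file = H5Fcreate("translated.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);
    H5Pclose(fcpl);
    H5Fclose(file);
    return 0;
}

The choice of the aggregator-only strategy without persisting free space is just my reading of the never-delete case; I'd welcome a correction.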

best,

David