I am using HDF5 1.10.2 in a C++ project and I am trying to solve an issue that occurs when the user selects a location with insufficient disk space. There is no easy way to predict this, so I tried to simply catch errors (exceptions) during the various operations and handle them. The problem is that when this happens, HDF5 crashes somewhere in the destructor of H5::H5File. I have tried 1.12.2, but the issue is still there. Am I doing something wrong? Code to replicate the error follows:
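In simplified form it does roughly the following (a sketch of the structure only; in the real code the file is held in a class member m_file, and the names and sizes here are illustrative):

#include "H5Cpp.h"
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>

constexpr size_t SIZE = 1024 * 1024 * 128;   // ~128 MiB, more than the small target drive can hold

void work(const std::string& path)
{
    auto data = std::make_unique<uint8_t[]>(SIZE);
    for (size_t i = 0; i < SIZE; ++i)
        data[i] = static_cast<uint8_t>(i % 256);

    std::unique_ptr<H5::H5File> m_file;
    try {
        m_file = std::make_unique<H5::H5File>(path, H5F_ACC_TRUNC);
        H5::Group group = m_file->createGroup("H5::Group");
        hsize_t dims[1] = { SIZE };
        H5::DataSpace fspace(1, dims);
        H5::DataSet dataset = group.createDataSet("H5::Dataset",
                                                  H5::PredType::NATIVE_UINT8, fspace);
        // On the full drive this throws; the exception is caught and reported below.
        dataset.write(static_cast<void*>(data.get()), H5::PredType::NATIVE_UINT8);
    } catch (const H5::Exception& e) {
        std::cerr << "HDF5 error: " << e.getDetailMsg() << '\n';
    }
    if (m_file) {
        m_file->close();   // explicit close before dropping the pointer
        m_file = nullptr;
    }
}

int main()
{
    work("O:/a.hdf5");   // O:/ has only ~100 MB free, so the write fails
    work("D:/b.hdf5");   // plenty of free space here
    return 0;            // the crash happens after main(), in the HDF5 atexit handler
}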
You may want to inspect the value stored in errno for an insufficient disk space error (ENOSPC) and have your logic behave accordingly. This approach has worked well for us in HDFql.
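For illustration, a minimal sketch of that idea (the helper name is made up; note that errno can be overwritten by later library calls, so read it as close to the failing operation as possible):

#include "H5Cpp.h"
#include <cerrno>     // errno, ENOSPC
#include <iostream>

// Sketch only: capture errno right after the failing write and branch on ENOSPC.
void write_or_report(H5::DataSet& dataset, const void* buf)
{
    try {
        dataset.write(buf, H5::PredType::NATIVE_UINT8);
    } catch (const H5::Exception& e) {
        const int err = errno;               // save it before anything else runs
        if (err == ENOSPC)
            std::cerr << "Write failed: no space left on device\n";
        else
            std::cerr << "Write failed: " << e.getDetailMsg() << '\n';
    }
}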
Thanks. But that does not solve the issue of the crash somewhere inside the HDF5 library when H5::H5File is destroyed, does it? I do not check errno, but I do try to handle exceptions. The first one occurs on the line
dataset->write(static_cast<void*>(data.get()), H5::PredType::NATIVE_UINT8);
and the only things that happen afterwards are m_file->close(); and m_file = nullptr;. Should I skip that explicit close() call and just forget m_file?
I think the deeper problem is that you are trying to solve a problem that falls into the realm of undefined HDF5 library behavior, and you are putting too much faith in second-hand/distorted information handed down from H5Cpp exceptions.
It's not just disk space. What about storing compressed data streamed from some device, where I do not know in advance how big it is going to be? Should I check after writing every block of data? How do I calculate the necessary amount of free bytes when I have no clue what kind of overhead HDF5 really has? Even if I do something like that, I will have code that merely delays the underlying issue of HDF5 not recovering from failures properly. And what about storing data in some remote location? If the network fails, the write fails too, and that is not an issue of storage space. What then?
But if this is officially undefined HDF5 library behavior, then I guess I will have to move away from it until it becomes defined.
That depends on your use case, which you haven't really described.
That depends on the choices you make about how to represent your data in HDF5. If each byte is a separate dataset, the overhead is different from when you lump them together into a single dataset.
While I don't see that you've nearly exhausted the possibilities to stay out of trouble, I can tell you that fault tolerance was never a design goal for HDF5.
My use case: I am developing a PC app that handles large volumetric data (for example, medical data from CT) and a lot of stuff around it. The user makes analyses and notes, segments the data into regions, and prepares various surgical guides. This is not work he can replicate in a short period of time, so saving all of it as some kind of project file is a must. Technically, it also means that a lot of different data is stored (various internal types, lengths, etc.). And as I have already written, he can save this project to various locations: a hard drive, a flash drive, or a remote location such as a network drive, cloud, whatever.
While I could save the project to a local destination and try to copy it somewhere else afterwards, I will still encounter problems in some cases. And yes, these problems were reported by users.
While I can accept the fact that a failure during write operations means the file is corrupted beyond repair, I would love to be able to give the user an option to save to a different location. But those failures corrupt something inside the HDF5 library (maybe just the C++ part), and the app ultimately crashes in an atexit handler inside the HDF5 library no matter what I try. And I am really unsure whether I can rely on the data written into another file and location after the first write failure occurred.
Fault tolerance may be achieved by replication or by external journals, meaning you keep track of I/O operations before executing them. Journals provide local solutions and no protection when the entire node goes bye-bye. The latter class of problems is solved with replication.
Data replication can happen at the frame level (Ethernet frames, IP multicast) or by using a suitable robust protocol such as SCTP, reliable UDP, etc. Rolling your own solution is laborious and error prone. ZeroMQ is a message-passing library with options at the transport layer; it is a popular choice in high-frequency trading frameworks and may be applicable to your field as well.
One possible solution is to decouple the data producer (and consumer) from storage and provide options for robustness:
cloud: amortised cost, high reliability and accessibility
intranet: collection of data recorders
…
"And I am really unsure whether I can rely on the data written into another file and location after the first write failure occurred." Checksums are a method of answering this sort of question; depending on the implementation and kind, you are walking along the trade-off curve between performance and certainty.
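For instance, HDF5's built-in Fletcher32 filter stores a checksum with every chunk of a dataset, so damaged data shows up as a read error later. A sketch (assuming a chunked uint8 dataset; the name and chunk size are illustrative):

#include "H5Cpp.h"
#include <algorithm>

// Sketch: enable the Fletcher32 checksum filter on a chunked dataset so that
// corrupted chunks are detected when they are read back.
H5::DataSet make_checksummed_dataset(H5::H5File& file, hsize_t n)
{
    hsize_t dims[1]  = { n };
    hsize_t chunk[1] = { std::min<hsize_t>(n, 1024 * 1024) };  // ~1 MiB chunks, capped at the extent

    H5::DataSpace space(1, dims);
    H5::DSetCreatPropList dcpl;
    dcpl.setChunk(1, chunk);        // the filter requires a chunked layout
    dcpl.setFletcher32();           // per-chunk checksum, verified on read

    return file.createDataSet("data", H5::PredType::NATIVE_UINT8, space, dcpl);
}

Reading back a chunk whose checksum no longer matches fails, which at least answers the "can I rely on it" question for that piece of data.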
All in all, it boils down to the budget and the skill of the software writer.
Thanks for reminding me of software engineering basics. My point is that the HDF5 library does not survive its own failure to write, even if I do all the error checks, replication, and external journals and/or use a more reliable network transfer. All in all, it boils down to HDF5 corrupting itself and taking everything down with it. I still think this is neither incompetence on my side nor a bad design decision on HDF's side, but a rare and annoying bug.
That's not how I would describe it. The state of an HDF5 file that's being modified has two parts: a part in (user) memory that may not yet have been persisted, and a persisted portion in non-volatile storage. Let's say the library finds itself at a point where it can no longer write (meta)data into the file. Usually there is NO corruption at this point, just a set of failed write or flush calls. What is the state of the file, and how should the library recover it? If you start thinking about this question, it's actually rather difficult. The intended file state can't be persisted, so what's the next best state? That depends on your definition of "best". If you think "the state before the error-generating operation", then that may not be the answer, because there is no guarantee that that state was or could be persisted either. (With caching, even read calls can trigger write calls...) This regression is not infinite, but the destination (recover to which state?) is less clear than it seems. Combine that with in-place modifications, and the state before the file was opened may no longer exist. The Onion VFD introduces the concept of file versions, but it is not intended as a recovery solution.
This forum is not a place where we accuse each other or personalize issues. We are all here to learn and find solutions where possible.
I have a virtual hard drive mapped as O:/, and it has a capacity of 100 MB. As expected, the first call to work(...) throws an exception because the write operation fails. That is OK, as I can react to it. The second call writes to a different location where available disk space is not an issue. That (probably) works. But when the main() function ends, the app crashes somewhere in the HDF5 library. That is the problem I am facing now. The write operation to a.hdf5 not only leaves a.hdf5 broken (that is expected), but the app also crashes afterwards (that is not expected), and that is the reason I have doubts about the status of b.hdf5.
Thank you for the example. I would like to create a C example to see if that can reproduce the behavior. The assertion fails because the library loses track of the open object count in a file. H5Cpp.h doesn't implement proper RAII, and I don't want to be fooled by that.
Thanks. I have adjusted the main() function. It now looks like this:
int main()
{
int retval = EXIT_SUCCESS;
retval |= work("O:/foo.h5"); // limited available space to force failure
retval |= work("D:/foo.h5"); // lots of free space
return retval;
}
It behaves similarly to the C++ version: it asserts after main() finishes. Pictures follow:
Thanks for trying. This is a good data point. Would you mind trying this slightly modified version? All that's changed is the H5F_CLOSE_STRONG file close degree. Also, would you mind trying this code with HDF5 1.10.8? I'm not sure whether the reference counting logic has changed with the introduction of VOL in 1.12, but let's just rule out that possibility!
Thanks, G.
#include "hdf5.h"
#include <stdint.h>
#include <stdio.h>  /* for printf */
#include <stdlib.h>
#define SIZE (1024 * 1024 * 128)  /* 128 MiB of uint8 data */
int work(const char* path)
{
int retval = EXIT_SUCCESS;
uint8_t* data = (uint8_t*) malloc(sizeof(uint8_t)*SIZE);
if (data == NULL) {
return EXIT_FAILURE;
}
for (size_t i = 0; i < SIZE; ++i) {
*(data+i) = i % 256;
}
hid_t fapl = H5I_INVALID_HID;
if ((fapl = H5Pcreate(H5P_FILE_ACCESS))
== H5I_INVALID_HID) {
retval = EXIT_FAILURE;
goto fail_fapl;
}
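/* H5F_CLOSE_STRONG: when the file is closed, close any objects still open in it. */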
if (H5Pset_fclose_degree(fapl, H5F_CLOSE_STRONG) < 0) {
retval = EXIT_FAILURE;
goto fail_file;
}
hid_t file = H5I_INVALID_HID;
if ((file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl))
== H5I_INVALID_HID) {
retval = EXIT_FAILURE;
goto fail_file;
}
hid_t group = H5I_INVALID_HID;
if ((group = H5Gcreate(file, "H5::Group", H5P_DEFAULT, H5P_DEFAULT,
H5P_DEFAULT)) == H5I_INVALID_HID) {
retval = EXIT_FAILURE;
goto fail_group;
}
hid_t fspace = H5I_INVALID_HID;
if ((fspace = H5Screate_simple(1, (hsize_t[]) {(hsize_t) SIZE}, NULL))
== H5I_INVALID_HID) {
retval = EXIT_FAILURE;
goto fail_fspace;
}
hid_t dset = H5I_INVALID_HID;
if ((dset = H5Dcreate(group, "H5::Dataset", H5T_NATIVE_UINT8, fspace,
H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT))
== H5I_INVALID_HID) {
retval = EXIT_FAILURE;
goto fail_dset;
}
if (H5Dwrite(dset, H5T_NATIVE_UINT8, fspace, fspace, H5P_DEFAULT, data) < 0) {
retval = EXIT_FAILURE;
goto fail_write;
}
printf("Write succeeded.\n");
if (H5Fflush(file, H5F_SCOPE_GLOBAL) < 0) {
retval = EXIT_FAILURE;
goto fail_flush;
}
printf("Flush succeeded.\n");
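/* Cleanup ladder: each label below closes the handles created before the corresponding failure point, in reverse order of creation. */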
fail_flush:
fail_write:
if (H5Dclose(dset) < 0) {
printf("H5Dclose failed.\n");
}
fail_dset:
if (H5Sclose(fspace) < 0) {
printf("H5Sclose failed.\n");
}
fail_fspace:
if (H5Gclose(group) < 0) {
printf("H5Gclose failed.\n");
}
fail_group:
if (H5Fclose(file) < 0) {
printf("H5Fclose failed.\n");
}
fail_file:
if (H5Pclose(fapl) < 0) {
printf("H5Pclose failed.\n");
}
fail_fapl:
free(data);
return retval;
}
int main()
{
int retval = EXIT_SUCCESS;
retval |= work("O:/foo.h5"); // limited available space to force failure
retval |= work("D:/foo.h5"); // lots of free space
return retval;
}
Unfortunately, there is no change with H5F_CLOSE_STRONG. Both 1.10.8 and 1.12.2 are unable to successfully close the file after the write failure: H5Fclose(file) returns -1, and the app crashes when it exits. The following screenshots are from 1.10.8.
I'd like to help with this, but I'm not seeing the error on my end (yet). I've tried 1.10.8, the hdf5_1_10 branch (which will be 1.10.9 at the end of the month), and the develop branch (which will be 1.13.2 at the end of June). Every single one exits normally with a write failure error stack dump when the disk fills up. I'm building HDF5 from source in debug mode via the Autotools + gcc 11.1.0 on Ubuntu 20.04 LTS.
gdb confirms the normal exit:
[Inferior 1 (process 304971) exited normally]
To force the write failure, Iām using a very small ramdisk.
Before:
Filesystem Size Used Avail Use% Mounted on
tmpfs 8.0M 0 8.0M 0% /mnt/ramdisk
After:
Filesystem Size Used Avail Use% Mounted on
tmpfs 8.0M 8.0M 0 100% /mnt/ramdisk
Maybe this only rears its head on Windows? Has anyone seen this fail on a non-Windows system? I'll try the test program out on Windows when I get some free time (maybe later in the week or over the weekend). I have VS2022 on a Win 10 Pro box. Let me know if you are doing anything unusual when you build.
Here's the stack dump with the 1.10.8 branch (ignore the "develop" in the source path - this really is 1.10.8, as you can see from the second line):
Writing to /mnt/ramdisk/foo.h5
HDF5-DIAG: Error detected in HDF5 (1.10.8) thread 0:
#000: ../../develop/src/H5Dio.c line 317 in H5Dwrite(): can't write data
major: Dataset
minor: Write failed
#001: ../../develop/src/H5Dio.c line 801 in H5D__write(): can't write data
major: Dataset
minor: Write failed
#002: ../../develop/src/H5Dcontig.c line 628 in H5D__contig_write(): contiguous write failed
major: Dataset
minor: Write failed
#003: ../../develop/src/H5Dselect.c line 311 in H5D__select_write(): write error
major: Dataspace
minor: Write failed
#004: ../../develop/src/H5Dselect.c line 224 in H5D__select_io(): write error
major: Dataspace
minor: Write failed
#005: ../../develop/src/H5Dcontig.c line 1255 in H5D__contig_writevv(): can't perform vectorized sieve buffer write
major: Dataset
minor: Can't operate on object
#006: ../../develop/src/H5VM.c line 1410 in H5VM_opvv(): can't perform operation
major: Internal error (too specific to document in detail)
minor: Can't operate on object
#007: ../../develop/src/H5Dcontig.c line 999 in H5D__contig_writevv_sieve_cb(): block write failed
major: Dataset
minor: Write failed
#008: ../../develop/src/H5Fio.c line 150 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#009: ../../develop/src/H5PB.c line 1021 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#010: ../../develop/src/H5Faccum.c line 831 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#011: ../../develop/src/H5FDint.c line 240 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#012: ../../develop/src/H5FDsec2.c line 864 in H5FD__sec2_write(): file write failed: time = Wed May 11 20:35:50 2022
, filename = '/mnt/ramdisk/foo.h5', file descriptor = 3, errno = 28, error message = 'No space left on device', buf = 0x7fb7868a6690, total write size = 125831552, bytes this sub-write = 125831552, bytes actually written = 18446744073709551615, offset = 0
major: Low-level I/O
minor: Write failed