Unable to close HDF5 file


#1

Hello,

in our software we use the HDF5 format to save the results. We use it from C++ code with HDF5 version 1.12.0 (the same issue occurred in 1.8.16, which we used previously).

We run a lot of calculations, each in a separate process, without any connection between them. The issue is that in some of the calculations the HDF5 file is not closed correctly, which results in an exception. I think the issue might be connected with our use of quite slow network drives with long response times.

The error written by HDF5 is:
HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 0:
  #000: H5F.c line 886 in H5Fclose(): decrementing file ID failed
    major: Object atom
    minor: Unable to close file
  #001: H5I.c line 1422 in H5I_dec_app_ref(): can’t decrement ID ref count
    major: Object atom
    minor: Unable to decrement reference count
  #002: H5F.c line 243 in H5F__close_cb(): unable to close file
    major: File accessibility
    minor: Unable to close file
  #003: H5VLcallback.c line 3977 in H5VL_file_close(): file close failed
    major: Virtual Object Layer
    minor: Unable to close file
  #004: H5VLcallback.c line 3945 in H5VL__file_close(): file close failed
    major: Virtual Object Layer
    minor: Unable to close file
  #005: H5VLnative_file.c line 884 in H5VL__native_file_close(): can’t close file
    major: File accessibility
    minor: Unable to decrement reference count
  #006: H5Fint.c line 2057 in H5F__close(): can’t close file
    major: File accessibility
    minor: Unable to close file
  #007: H5Fint.c line 2230 in H5F_try_close(): problems closing file
    major: File accessibility
    minor: Unable to close file
  #008: H5Fint.c line 1387 in H5F__dest(): unable to close file
    major: File accessibility
    minor: Unable to close file
  #009: H5FD.c line 850 in H5FD_close(): close failed
    major: Virtual File Layer
    minor: Unable to close file
  #010: H5FDsec2.c line 439 in H5FD_sec2_close(): unable to close file, errno = 5, error message = ‘Input/output error’
    major: Low-level I/O
    minor: Unable to close file
dave+: H5VLnative_file.c:862: H5VL__native_file_close: Assertion `(H5F_file_id_exists(f))’ failed.

(not sure if the last line belongs to the error or is another one).

The HDF5 file itself seems OK: it can be opened and contains the data. So I just need to get rid of the error, which breaks any further calculation done by our software.

The issue happens when calling the H5File close() function. I tried removing the close() call and leaving the closing to the implicit functionality; that did not help. I tried catching exceptions, but I only achieved a SEGV instead of an ABRT error. I tried using the H5F_CLOSE_STRONG property to avoid the reference counting, but that did not help either. Now I am pretty much out of ideas.

Does anyone have any experience with such behavior? I tried googling, but did not really find anything.

Thanks in advance.


#2

David, how are you? Just a few questions for clarification:

Your application is using the C-API though, right?

Can you describe the setup a bit? What’s the OS? Is it an NFS mount, SMB, …? What’s the connectivity?

Can you correlate the issues with losses in network connectivity?

Have you created some kind of performance envelope with, say, fio (https://fio.readthedocs.io/en/latest/index.html), to see how far you can push your setup?
If so, how does that compare to what your app is trying to do?

Best, G.


#3

Hello,

I believe we use mostly the C++ API, but I am not sure whether we use it exclusively.

The setup description is:

· SUSE Linux Enterprise Desktop 12 SP2

· The file server is an NFS mount on a "virtual" file server with a gigabit connection.

I do not know of any losses of network connectivity; the network just does not seem 100% robust.

The issues with network:

· Sometimes (1% of calculations), when a file is created in one script and opened in another, the second one fails with an error that the file does not exist. Adding a sleep of a few seconds alleviates this issue.

· Rarely (0.1%), a file saved with a C++ "ofstream" is missing its last few rows.
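
The fixed sleep from the first bullet could be replaced by a bounded retry loop. The following is only a minimal sketch under assumptions: the helper name `wait_for_file`, the attempt count, and the delay are made up for illustration, and it addresses only the delayed visibility of a new file on the network drive, not the truncated output from the second bullet.

```cpp
#include <chrono>
#include <fstream>
#include <string>
#include <thread>

// Hypothetical helper (not part of any library): poll until the file can be
// opened for reading, or give up after max_attempts tries.
// Returns true as soon as one open attempt succeeds.
bool wait_for_file(const std::string& path,
                   int max_attempts = 10,
                   std::chrono::milliseconds delay = std::chrono::milliseconds(500)) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        std::ifstream probe(path);
        if (probe.is_open()) {
            return true;  // the file is visible to this client now
        }
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);  // wait before the next try
        }
    }
    return false;  // still not visible after all attempts
}
```

With the default arguments this waits at most about 4.5 seconds in total, and it returns immediately in the common case where the file is already there.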

In our tests I run the same calculation over and over again; sometimes it runs without any issues, and sometimes (2% of calculations) it fails, e.g. with the HDF5 close issue.

We did not do performance envelopes using fio or any other tools.

Thanks.

Best regards

David Šerý


#4

David, thanks for the description. The error message comes from a failed NULL pointer check in H5F_file_id_exists (see H5Fquery.c). That means that the library has no H5F_t structure representing the file that you want to close. Normally, these things don’t just disappear, and it seems likely that the file was already closed in another part of your code.

One way to debug the issue would be to look at the open HDF5 file handles in different parts of your code. For example, you can use H5Iis_valid (https://portal.hdfgroup.org/display/HDF5/H5I_IS_VALID) to check if a given file handle is still valid (positive return value). Another option is to use H5Fget_obj_count (https://portal.hdfgroup.org/display/HDF5/H5F_GET_OBJ_COUNT) to look at the number of open handles for the given file (the second argument is H5F_OBJ_FILE). At some point, there might be multiple handles for the given file.

Finally, H5F_CLOSE_STRONG is a rather blunt instrument and will more likely compound the issue than help clarify the situation. The default (H5F_CLOSE_SEMI) is a better option for now.

Best, G.


#5

Hello and thank you for the answer.
There is indeed something wrong with the obj_count counter.

In my code I basically have:
// simplified pseudocode; thread_number is a placeholder for the thread's index
cout << "Start of HDF5 Ausgabe function:Thread " << thread_number << endl;
pthread_mutex_lock(&filelock); // the code runs on 2 threads; avoid overlapping saves
remove(filename);
H5File H5Datei(filename, H5F_ACC_TRUNC);
hid_t h5_id = H5Datei.getId();
// diagnostic line: thread, file id, H5Iis_valid result, H5Fget_obj_count result
cout << "Thread " << thread_number << " File id " << h5_id << " " << H5Iis_valid(h5_id) << " " << H5Fget_obj_count(h5_id, H5F_OBJ_FILE) << endl;
// ... some saving ...
// (the same diagnostic cout)
// ... rest of saving ...
pthread_mutex_unlock(&filelock);
// (the same diagnostic cout)
H5Datei.close();
cout << "End of HDF5 Ausgabe function:Thread " << thread_number << endl;

The filename is unique for each of the threads.

The calculation is executed in several independent processes, each using 2 threads. Each thread of each process creates one h5 file, and all of them have a unique combination of folder + filename.

Some of the calculations end without issues; others produce an error that looks, for example, like this:
Start of HDF5 Ausgabe function:Thread 0
Start of HDF5 Ausgabe function:Thread 1
Thread 0 File id 16777216 1 1
Thread 0 File id 16777216 1 1
Thread 0 File id 16777216 1 1
Signal ABRT during closing of Thread 0 - signal is caught and the saving for thread 0 is called again.
Thread 1 File id 16777217 1 2
Thread 1 File id 16777217 1 2
Thread 1 File id 16777217 1 2
End of HDF5 Ausgabe function:Thread 1
Start of HDF5 Ausgabe function:Thread 0
Thread 0 File id 16777217 1 2
Thread 0 File id 16777217 1 2
Thread 0 File id 16777217 1 2

So as you can see, one line before the close call the file ID was valid and the obj_count was 1. I still received the ABRT signal.

This time the HDF5 description of the problem is, I believe, the following (HDF5 1.8.16 used):
  #000: …/…/src/H5F.c line 795 in H5Fclose(): decrementing file ID failed
    major: Object atom
    minor: Unable to close file
  #001: …/…/src/H5I.c line 1491 in H5I_dec_app_ref(): can’t decrement ID ref count
    major: Object atom
    minor: Unable to decrement reference count
  #002: …/…/src/H5Fint.c line 1270 in H5F_close(): can’t close file
    major: File accessibilty
    minor: Unable to close file
  #003: …/…/src/H5Fint.c line 1432 in H5F_try_close(): problems closing file
    major: File accessibilty
    minor: Unable to close file
  #004: …/…/src/H5Fint.c line 869 in H5F_dest(): unable to close file
    major: File accessibilty
    minor: Unable to close file
  #005: …/…/src/H5FD.c line 1104 in H5FD_close(): close failed
    major: Virtual File Layer
    minor: Unable to close file
  #006: …/…/src/H5FDsec2.c line 432 in H5FD_sec2_close(): unable to close file, errno = 5, error message = ‘Input/output error’
    major: Low-level I/O
    minor: Unable to close file
terminate called after throwing an instance of ‘H5::FileIException’
HDF5-DIAG: Error detected in HDF5 (1.8.16) thread 47271460120320:

I don’t understand how it is possible that, at the moment Thread 1 opens the file, the counter is already 2. Thread 0 has a different ID prior to this. In the end Thread 0 gets the same ID, but that is after Thread 1 has already closed the file, so the counter should be OK here as well. Is it possible that the counter is influenced by other, independently executed processes? By independent processes I mean something like running

script &
script &

Thanks for the help.


#6

Are you using a thread-safe build of the HDF5 C library? (In other words, was it built with --enable-threadsafe?) In that case you don’t need any locking, unless your application requires it for its own purposes.

Which HDF5 C++ interface are you using? This interface (https://portal.hdfgroup.org/display/HDF5/HDF5+CPP+Reference+Manuals) and its predecessors are NOT thread-safe.

I have to think a little more about your pseudo-code/example. You might get away with a non-thread-safe build of the library, but only if exactly one of your threads calls HDF5 library functions.
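
One way to arrange that exactly one thread ever calls into the library is to funnel all file work through a dedicated worker thread. The following is a minimal sketch using only the C++ standard library; the class name `IoThread` and its interface are hypothetical, and the submitted tasks are where the actual HDF5 calls would go.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: runs every submitted task on one dedicated thread,
// so a non-thread-safe library is only ever entered from that thread.
class IoThread {
public:
    IoThread() : worker_([this] { run(); }) {}

    ~IoThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        worker_.join();  // the worker drains the queue before it exits
    }

    // May be called from any thread; the task itself runs on the worker.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ is set and nothing is left
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // e.g. create, write, and close one HDF5 file
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist first
};
```

The calculation threads then only call `submit(...)`, and the mutex protects the task queue rather than the library calls themselves.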

Best, G.


#7

Hello,

Based on https://support.hdfgroup.org/HDF5/hdf5-quest.html#tsafe I thought the library cannot really be thread-safe for the C++ API, so we do not use the thread-safe build.

I do not really know which C++ API we are using, nor how to find out.

In my example there are 2 threads calling the HDF5 library functions, but I believe the calls are not concurrent, except for the small overlap caused by pthread_mutex_unlock coming before the close() call. This ordering is there so that the mutex is unlocked even when the close() function fails. I tried wrapping the function in a try-catch block and unlocking after catching H5::FileIException, but this did not work either, because there was a SIGSEGV during deallocation of the H5File variable.
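
The locking concern by itself (releasing the mutex even if close() fails) can be handled with a scope-based guard, which would let close() stay inside the critical section. The following is a minimal sketch under assumptions: `FakeFile` is a hypothetical stand-in for H5File so that the example compiles without HDF5, and `std::mutex` stands in for the pthread mutex. It says nothing about the SIGSEGV during deallocation, which is a separate problem.

```cpp
#include <mutex>
#include <stdexcept>
#include <string>

std::mutex file_lock;  // stands in for the pthread mutex `filelock`

// Hypothetical stand-in for H5::H5File; close() can be made to throw.
struct FakeFile {
    bool fail_on_close;
    bool open = true;
    explicit FakeFile(bool fail) : fail_on_close(fail) {}
    void close() {
        open = false;
        if (fail_on_close) throw std::runtime_error("close failed");
    }
};

// Returns an empty string on success, or the error text if close() threw.
std::string save_results(bool fail_on_close) {
    try {
        // The guard unlocks file_lock on every exit path, including exceptions.
        std::lock_guard<std::mutex> guard(file_lock);
        FakeFile file(fail_on_close);
        // ... write datasets here ...
        file.close();  // still inside the critical section
    } catch (const std::exception& e) {
        return e.what();  // the mutex has already been released here
    }
    return "";
}
```

Because the guard is destroyed during stack unwinding, a failed save can be retried by simply calling `save_results` again without a separate unlock step.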


#8

The more I look at your pseudo code, the less I understand it. For example, why is H5Datei.close() outside the lock? What is g5_id? I don’t see the obj_count being 0 anywhere.

Can you give us either a clear description of what you are trying to achieve or a minimum working example that lets us reproduce the issue?

Best, G.


#9

Hi David,

Thread safety is a function of how you use the internal data structures of the HDF5 system. One example is a data structure called a skip list, where object references are associated with integer handles.

Take a handle for instance:
hid_t id = H5... ; and notice that when maintaining the reference count on id with H5Iinc_ref or H5Idec_ref, the system needs atomic access to the data structure in order to obtain the object reference and carry out the associated chores, including closing the resource when the reference count hits zero.

To guarantee atomicity one could pepper the code base with locking primitives, but you can’t do that without picking up the associated cost: added code complexity and reduced runtime performance.
When accessing system calls from a dedicated thread, both can be avoided; the other solution is to incorporate thread control in the client code.
When I designed H5CPP I followed the latter principle: you are responsible for protecting any system calls when operating from multiple threads. Passing an object by reference is atomic and thread safe:
my_call( const h5::fd_t& fd, ... ) however my_call( h5::fd_t fd) is not, since copying a handle is equivalent to incrementing the reference count of the underlying hid_t id.
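
The difference between the two signatures can be illustrated with a toy handle. This is only an analogy, not H5CPP or HDF5 code: `std::shared_ptr` plays the role of the reference-counted handle, and the names `Handle`, `count_by_reference`, and `count_by_value` are made up. Note that `std::shared_ptr` updates its count atomically, so the sketch shows only when the count changes, not the race that an unprotected counter would have.

```cpp
#include <memory>

// Toy stand-in for a reference-counted handle.
using Handle = std::shared_ptr<int>;

// Passing by const reference does not copy the handle,
// so the reference count seen inside the call is unchanged.
long count_by_reference(const Handle& h) { return h.use_count(); }

// Passing by value copies the handle, which increments the reference
// count for the duration of the call and decrements it on return.
long count_by_value(Handle h) { return h.use_count(); }
```

With a single owner, `count_by_reference` reports 1 while `count_by_value` reports 2, and the count is back to 1 once the by-value call has returned.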

If you have any interest in an alternative C++ HDF5 stack, there are slides from the latest 5-minute H5CPP presentation and material from ISC’19. The CRUD-like function calls are as simple as using Python, but devilishly fast; you can also interchange calls with any existing C HDF5 code, thanks to the following properties:

  • H5CPP handles are binary compatible with CAPI handles
  • conversion CTOR/ operators provide seamless conversion between CAPI and H5CPP handles

As for the actual task, as Gerd suggested, it is a good idea to outline what your intention is in a broad sense; then we can suggest directions. When posting code, please provide a minimal compilable example; putting it on GitHub (or attaching a tarball here) allows others to take a look and actually help you.

best wishes: steven


#10

I am sorry for the mistakes, the g5_id is a typo, it should have been h5_id. And the obj_count was 1, not 0. One line before the close() call both the validation flag and obj_count were 1.

Unfortunately I cannot provide you with an example, because the software is not my personal property and I would not be allowed to share it. I also do not have a similar example that works without the software.


#11

And about the close() outside the lock: I tried to explain the reasoning in the post I sent yesterday. If close() were inside the lock and failed, the lock would never be released and the saving function could not be rerun.

I tried wrapping the saving function in a try-catch block, releasing the lock after the catch or after successful execution. But that did not work either: there was a SEGV during deallocation of the HDF5 variables, so the catch block could not even catch the H5::FileIException.


#12

I agree with Steven (surprise!). Unless we can get a reproducer, I’d recommend to 1) drop the C++ API and switch to the C-API (thread-safe build) or 2) consider alternatives such as H5CPP. Other than revisiting the “handle discipline,” which appears to be the sticking point at the moment, targeting the C-API is not going to be much different from your current C++ setup, which perhaps only obscures matters. (And it doesn’t have to be an all-out conversion. Just focus on trouble spots.) G.


#13

Thanks for the support and suggested solutions. The task was put on hold for now, we might return to it with rewriting the API usage. I will try to update the ticket if that happens.