Elena et al.,
Further to previous emails regarding problems using nested named data types (HDFFV-9174, HDFFV-9176), I can now report that my colleague and I have tracked down the crash that is occurring in our application and have been able to patch the problem locally. Can I request that a new ticket be raised in JIRA, and the following information attached? Thanks.
The problem requires use of (a) named data types (not necessarily nested); (b) soft links; and (c) data type conversions (which may be implicitly generated by dataset read or write functions). All references in the following explanation refer to source files and lines as per the 1.8.14 release.
When reading or writing dataset values, data type conversions are often required. When a data type conversion structure is initialised, it takes a copy of the source and destination data types [H5Tconv.c:1963 and others]. If either data type is a named data type, its H5T_t oloc.file record is non-NULL, but (in the cases we tested) its oloc.holding_file record is FALSE. These fields are copied as part of the data type copy, so the data types held by the converter share this property. Furthermore, data type conversions are registered in a global conversion path cache that outlives any H5F_t file handle.
Because oloc.holding_file is FALSE, it is possible to close the file that holds the named data types being referred to. THIS IS THE BUG: there are now registered global data type identifiers, held by global data type conversion structures, with a dangling reference (the oloc.file record) to a file that is no longer open; the record has been deallocated. However, until some other operation attempts to dereference this dangling pointer, there is no failure in the HDF library.
We found, however, that deleting a soft link will trigger such a failure. Having opened a new file (a different file from the one containing the original named data types) that contains a soft link, calling H5Ldelete() on the soft link executes code that searches all currently held identifiers of all types (datasets, groups and data types) and attempts to rename them to a NULL or empty string [H5Gname.c:815, callback switch statement]. Because the data types with dangling file references report as named data types [at H5Gname.c:828], the oloc is retrieved and the file pointer is dereferenced when the code tries to locate the parent of the file record [H5Gname.c:859]. This causes the crash.
Reproducing this, especially in the HDF test harness and without any source code changes, is surprisingly difficult. The reason is that when the file is closed, creating the dangling reference, the memory occupied by the H5F_t structure is freed onto the internal HDF heap. This structure still looks valid (enough), especially if the "parent" field is NULL. If you re-open the same file, the allocator picks the same block back up from the free list, effectively re-hooking the file reference, and everything continues to work. If you open a different file, you often still can't induce a crash. In our application, we would sometimes have to open and close a cooperating pair of files 6 times before we would get the crash. Our new test placed into the HDF test framework uses two files to help ensure that the dangling file reference refers to a file that is closed.
Because of this, a (temporary) source code modification is required to reliably trip the test that we have submitted. This introduces debug-heap-style behaviour by writing a 0xCDCDCDCD pattern over the memory when it is returned to the free list. See below for more information on the patches involved.
Our fix works, in that the HDF regression tests all appear to pass (on Windows, Visual Studio 2008), our system tests pass, and the application no longer crashes. However, it is probably not the most elegant fix. What I have done is, where the data type objects are copied into the conversion structure, reach in and free the entire oloc field using internal API calls. As the oloc.holding_file field is FALSE, this has no real reference counting side-effects, and even if this field were TRUE, this is a copy of a data type, so the original would still keep the file alive. Even though this sets oloc.file to NULL, the copy still reports as a named data type when the soft link renaming callback occurs; therefore I also had to patch this callback to perform a secondary check for a NULL file pointer before proceeding.
We tried COPY_TRANSIENT instead of COPY_ALL, but this still copies the oloc structure and file reference, and so was not a solution.
We believe that holding_file must be FALSE because the data type conversion structures are in the global conversion path cache and reference counting should not be performed in this context.
I wonder if a more elegant solution may be possible as part of fixing HDFFV-9174. If, when you create a dataset with a named data type, you take a copy of the data type to associate with the dataset that correctly reports as *not* named, then you could use the same technique when taking copies of data types for the conversion structures. Then the existing check in H5Gname.c should prevent dereferencing a dangling file reference if one exists. However, if you consider the existence of the dangling reference bad practice, consider our method of freeing the oloc structure as part of the final solution.
Furthermore, we are not sure whether the "shared" field in the H5T_t structure should also be freed; it may also contain a dangling reference.
[*] = We applied this patch to the standard build of 1.8.14 for the purpose of our production application build.
dtypes.c.patch = A patch for test/dtypes.c that adds two tests: one that shows how H5Tcommitted() gives inconsistent results for committed data types depending on whether you have reopened the file (HDFFV-9174), and another that shows how dangling file references from committed data types can be dereferenced when deleting a soft link (note: this requires H5Fint.c.memory.patch to be applied as well).
H5Fint.c.memory.patch = A patch for src/H5Fint.c that adds one line of code to set the content of the H5F_t memory freed on the HDF internal heap (and returned to the free list) to 0xCDCDCDCD so that we reliably trip the dangling file reference problem. Without this, the memory is untouched and with a non-null parent pointer the H5Ldelete() operation typically succeeds, even though it is dereferencing "deallocated" memory.
[*] H5Gname.c.patch = A patch that detects explicitly set NULL file pointers (as set in the H5Tconv.c patch, next) in copies of named data types generated for the internal data conversion paths. Detecting these allows us to avoid dereferencing the dangling file reference that would otherwise be there. Possibly, if the HDF Group fix their H5Tcommitted() behaviour, then the check for not-a-named-datatype just above where this patch is applied would prevent the problem in a more robust fashion.
[*] H5Tconv.c.patch = A patch that explicitly frees the "oloc" reference (including its internal file pointer, which is the pointer that becomes a dangling reference) when data types are copied for the purposes of placing them in the global cache of converters (for various conversion paths). These outlive the files that would otherwise be referred to. A transient copy of these data types was attempted, but this has no effect on "oloc". We're uncertain about the shared location information, but testing seems to suggest it is not being dereferenced even if it might also hold a dangling file reference.
All the best!
IMPORTANT: This email remains the property of the Department of Defence and is subject to the jurisdiction of section 70 of the Crimes Act 1914. If you have received this email in error, you are requested to contact the sender and delete the email.
dtypes.c.patch (10.2 KB)
H5Fint.c.memory.patch (434 Bytes)
H5Gname.c.patch (737 Bytes)
H5Tconv.patch (7.25 KB)