Segfault at exit with HDF5/netCDF4

j.j.green · November 22, 2022, 9:24pm

I have a mature application (processing oceanographic data), it reads input data files (several) opens an HDF5 file, writes to it, closes it, closes the input files, reopens the output file, does some post-processing, writing the results back into the output file, closes it.

I recently added a new input file format, netCDF4, which is of course HDF5 under the hood. On processing these files I now get a segfault at exit.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055b4a1b46318 in ?? ()
(gdb) bt
#0  0x000055b4a1b46318 in ?? ()
#1  0x00007f2d73f14755 in H5E_dump_api_stack ()
   from /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.103
#2  0x00007f2d73f11dd0 in H5Eget_auto2 ()
   from /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.103
#3  0x00007f2d73e7f4b4 in H5_term_library ()
   from /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.103
#4  0x00007f2d73c9a8a7 in __run_exit_handlers (status=0, 
   listp=0x7f2d73e40718 <__exit_funcs>, 
   run_list_atexit=run_list_atexit@entry=true, 
   run_dtors=run_dtors@entry=true)
   at exit.c:108
#5  0x00007f2d73c9aa60 in __GI_exit (status=<optimised out>) 
   at exit.c:139
#6  0x00007f2d73c7808a in __libc_start_main (main=0x55b49fa2e4d0 <main>, 
   argc=8, argv=0x7ffd917950b8, 
   init=<optimised out>, 
   fini=<optimised out>, 
   rtld_fini=<optimised out>, stack_end=0x7ffd917950a8)
   at ../csu/libc-start.c:342
#7  0x000055b49fa2e8be in _start ()

Processing any other supported input file format does not generate a segfault. The HDF5 output file, which is explicitly closed, is not corrupted as HDF5, and looks to have the expected (correct) data inside it.

I did find this report of a similar issue, but the workaround suggested there is not good for me, the input format is 3rd party.

After some searching around the HDF5 docs, I tried adding H5dont_atexit() at the very start of the program, and this seems to fix the segfaults (and I’ll take that), but leaves me a bit nervous, I’d like the atexit hooks to run, and maybe this points to some deeper problem in my code?

Any ideas?

If it makes a difference, this is Linux Mint Una (an Ubuntu variant), and ldd tells that the program has shared libraries

libhdf5_serial.so.103 
libhdf5_serial_hl.so.100 
libnetcdf.so.15

Thanks in advance

Jim

hyoklee · November 22, 2022, 11:17pm

HI, Jim!

Thanks for sharing an interesting real-world problem.

I’d like to know more details.
In your work flow,

where does the segmentation fault occur? netCDF-4 input to HDF5 output step or HDF5 post-processing step? Or both?
what’s the version of HDF5 library being used?
do you use netCDF-API to process the new netCDF-4 input? If so, what’s the netCDF version?

Finally, I would appreciate if you can provide a minimally reproducible GitHub Action YML using the closest Ubuntu version that matches Mint.

Additionally, I would appreciate if your app runs without error using 1) the develop branch of HDF5/netCDF-4 and 2) static build.

j.j.green · November 22, 2022, 11:42pm

Thanks for the response:

The segfault is right at the end of processing, my program closes the HDF5 file, reports success (says “done.”) and returns EXIT_SUCCESS, then segfaults
Version is 1.10.4+repack-11ubuntu1 from Ubuntu deb libhdf5-dev
The netCDF is read with the netCFD API, version 1:4.7.3-1 from Ubuntu deb libnetcdf-dev

I’ll have a go at creating a minimal example, but may take a few days. Static build will need a local compile of netCDF since the Ubuntu -dev package no longer includes the static library for reasons that are not entirely clear to me.

Cheers, Jim

hyoklee · November 23, 2022, 2:18am

Thanks for the details!

Using deb packages will make testing easier.

Reading netCDF with netCDF API makes this problem very interesting.

Is there any particular reason that you don’t use netCDF API for writing netCDF-4/HDF5 in your workflow?

j.j.green · November 23, 2022, 9:57am

Apologies, I was perhaps not clear: I am using the netCDF C API to read a 3rd party netCDF-4 file, this is the new code added.

 #include <netcdf.h>
   : 
 nc_open(path, NC_NOWRITE, &(state->id.file);

The existing application uses the HDF5 C API to create, write, close, then open, read, write close a HDF5 file.

hyoklee · November 24, 2022, 4:17am

Hi,

Thank you for more details.
You may want to call netCDF API for file close as well.

I started preparing a GitHub Action [1] for your test case.
I found out that Ubuntu 20.04 matches the versions on your Mint system [2].

By the way, do you know NCO tool?
I’m asking because I think your workflow may benefit a lot from it.
A few lines of shell script with NCO tools may achieve what your application does.

NCO is very well written and robust tool so
you may not worry about the error that you’re seeing now.
I wrote a quick tutorial for most common use case.

Regards,

[1] https://github.com/hyoklee/actions/blob/main/.github/workflows/n4h5.yml
[2] https://github.com/hyoklee/actions

j.j.green · November 24, 2022, 7:45pm

I think I misread your earlier message,

Is there any particular reason that you don’t use netCDF API for writing netCDF-4/HDF5 in your workflow?

This code started out around 2003 which I think predated netCDF-4 by several tears, if we were to start again we probably would use netCDF-4, such an easy API!

So, to business: I have spent some time and created a small program which has the same pattern of netCDF-4 and HDF5 create/open/close as the problem program, but I could not reproduce the segfault. So I think it must be something in what is done inside those file operations, rather than just their order.

It is not feasible for me to bisect the codebase for this, it is big (100kloc+ split over several libraries), and also commercial, so I think the only way forward is for me to try to isolate the actual cause of the segfault via gdb, then perhaps you could look at the corresponding code in HDF5, sound reasonable?

As a start in this direction. I previously indicated this was occuring after the return in main.c, so we then drop into glibc’s exit, and stepping through that in gdb I get the line-numbers:

Breakpoint 2, __GI_exit (status=0) at exit.c:138
138	exit.c: No such file or directory.
(gdb) s
139	in exit.c
(gdb)
__run_exit_handlers (status=0, listp=0x7ffff7940718 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:40
40	in exit.c
(gdb)
45	in exit.c
(gdb)
46	in exit.c
(gdb)
__GI___call_tls_dtors () at cxa_thread_atexit_impl.c:145
145	cxa_thread_atexit_impl.c: No such file or directory.
(gdb)
146	in cxa_thread_atexit_impl.c
(gdb)
__run_exit_handlers (status=0, listp=0x7ffff7940718 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:56
56	exit.c: No such file or directory.
(gdb)
__lll_cas_lock (futex=<optimised out>)
    at ../sysdeps/unix/sysv/linux/x86/lowlevellock.h:47
47	../sysdeps/unix/sysv/linux/x86/lowlevellock.h: No such file or directory.
(gdb)
__run_exit_handlers (status=0, listp=0x7ffff7940718 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:59
59	exit.c: No such file or directory.
(gdb)
61	in exit.c
(gdb)
70	in exit.c
(gdb)
72	in exit.c
(gdb)
76	in exit.c
(gdb)
77	in exit.c
(gdb)
103	in exit.c
(gdb)
106	in exit.c
(gdb)
108	in exit.c
(gdb)

Now looking at the sources for exit.c in glibc 2.31 these line numbers look to be this:

77	  switch (f->flavor)
:
	    case ef_cxa:
	      /* To avoid dlclose/exit race calling cxafct twice (BZ 22180),
             we must mark this function as ef_free.  */
103	      f->flavor = ef_free;
	      cxafct = f->func.cxa.fn;
#ifdef PTR_DEMANGLE
106	      PTR_DEMANGLE (cxafct);
#endif
108	      cxafct (f->func.cxa.arg, status);

with the final line being 108, so it is f->func.cxa.fn which is segfaulting (presumably, line 106 is optimised away).

So next, I’ll compile a local libhdf5 with debugging symbols, so we should be able to see what that function call is (unfortunately there is no libhfd5-dbg available in Debian). That will probably be in several days due to other commitments.

Speak soon.

j.j.green · December 11, 2022, 9:19pm

A bit more detail on this: I’ve now compiled the latest HDF5 (1.12.2) and NetCDF-4 (4.9.0) with debugging symbols enabled, I still get a segfault but now with a more informative backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x0000555556a14a88 in ?? ()
(gdb) bt
#0  0x0000555556a14a88 in ?? ()
#1  0x00007ffff7721921 in H5E_dump_api_stack (is_api=is_api@entry=true)
    at H5Eint.c:945
#2  0x00007ffff771d804 in H5Eget_auto2 (estack_id=estack_id@entry=0, 
    func=func@entry=0x7fffffffd9e8, client_data=client_data@entry=0x0)
    at H5E.c:1604
#3  0x00007ffff7615f6b in H5_term_library () at H5.c:306
#4  0x00007ffff74108a7 in __run_exit_handlers (status=0, 
    listp=0x7ffff75b6718 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:108
#5  0x00007ffff7410a60 in __GI_exit (status=<optimised out>) at exit.c:139
#6  0x00007ffff73ee08a in __libc_start_main (main=0x5555555584d0 <main>, argc=8, 
    argv=0x7fffffffdf88, init=<optimised out>, fini=<optimised out>, 
    rtld_fini=<optimised out>, stack_end=0x7fffffffdf78) at ../csu/libc-start.c:342
#7  0x00005555555588de in _start ()

Now that null client_data looked a bit suspicious to me, but apparently that is permitted according to the documentation on H5Eget_auto2, but that same document warns of the perils of mixing set_auto1 and get_auto2. In my code I have one occurrence of

  if (H5Eset_auto(NULL, NULL) < 0)
    warn("cant suppress HDF5 error reporting (not fatal)\n");

and the program is compiled with -DH5_USE_16_API, could that be anything to do with it?

j.j.green · December 11, 2022, 10:01pm

And a little bit more, setting directory to my copy of the HDF5 sources and setting a break on H5_term_library

Breakpoint 1, H5_term_library () at H5.c:283
283	{
(gdb) s
296	    if (!(H5_INIT_GLOBAL))
(gdb) 
300	    H5_TERM_GLOBAL = TRUE;
(gdb) 
303	    H5CX_push_special();
(gdb) 
H5CX_push_special () at H5CX.c:836
836	{
(gdb) 
839	    FUNC_ENTER_NOAPI_NOINIT_NOERR
(gdb) 
__ctype_b_loc () at ../include/ctype.h:40
40	../include/ctype.h: No such file or directory.
(gdb) 
41	in ../include/ctype.h
(gdb) 
H5CX_push_special () at H5CX.c:842
842	    cnode = (H5CX_node_t *)HDcalloc(1, sizeof(H5CX_node_t));
(gdb) 

Program received signal SIGSEGV, Segmentation fault.
0x0000555556a14a88 in ?? ()
(gdb) print sizeof(H5CX_node_t)
$1 = 456

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

Segfault at exit with HDF5/netCDF4