Segmentation fault in H5Dopen2 call after several runs


#1

What would cause a core dump in the H5Dopen2 call? This is the call stack. It only happens after several dataset runs through the source HDF5 file.

Process terminating with default action of signal 11 (SIGSEGV)
==1423169== General Protection Fault
==1423169== at 0x625C92D: getenv (in /usr/lib64/libc-2.28.so)
==1423169== by 0x774333: H5D_build_extfile_prefix.isra.2 (in …/bin/resample)
==1423169== by 0x7768C6: H5D_open (in …/bin/resample)
==1423169== by 0x7610EE: H5Dopen2 (in …/bin/resample)


#2

Which version of the library are you using? How was it built? Debug? Production? Do you have a complete stack dump? Can you offer more information about your program and what it’s doing?


#3

It fails in both debug and production builds. The HDF5 library being used is 1.8.22.

The program reads HDF5 files, which are basically satellite data. In the failing test case it is an ATL21 dataset:

https://search.earthdata.nasa.gov/search/granules?p=C2153577500-NSIDC_CPRD&pg[0][v]=f&pg[0][gsk]=-start_date&q=ATL21&tl=1668445274.916!3!!

The program reads the file and, for each dataset, transforms it based on whether a spatial subset or reprojection is requested, then writes the output in HDF5 format.

There are about 310 datasets (runs) to process. The failure occurs on run number 293.

full stack dump:
==339321==
==339321== Process terminating with default action of signal 11 (SIGSEGV)
==339321== General Protection Fault
==339321== at 0x625C92D: getenv (in /usr/lib64/libc-2.28.so)
==339321== by 0x774313: H5D_build_extfile_prefix.isra.2 (in …/bin/resample)
==339321== by 0x7768A6: H5D_open (in …/bin/resample)
==339321== by 0x7610CE: H5Dopen2 (in …/bin/resample)
==339321== by 0x4A0DCA: scan_a_dataset (hdf5getinfo.c:1141)
==339321== by 0x4A1239: process_dataset (hdf5getinfo.c:1115)
==339321== by 0x4A139B: scan_for_datasets (hdf5getinfo.c:955)
==339321== by 0x43E024: GetHdf5Field (hdf_oc.c:2188)
==339321== by 0x484E93: ResampleImage (resample.c:245)
==339321== by 0x441899: main_ext (main_ext.c:1106)
==339321== by 0x411EFE: main (main.c:161)


#4

Hello Sudha,

For your reference, I created a SUPPORT ticket for this issue, SUPPORT-1861. Please tell us the severity of this issue for your work so we can prioritize it appropriately. Also, for any important issues in the future, please note that sending email to help@hdfgroup.org is a better way to contact us for quicker responses.

Thank you!
Binh-Minh

Service Desk: help.hdfgroup.org
Email Address: help@hdfgroup.org


#5

Thank you for the reply. We are trying to see if we can do a workaround. I think the severity level is major: the request for this product does cause a core dump if they select all datasets. How do we track the support ticket to see the progress?
Thanks
-Sudha


#6

The SUPPORT ticket was just for your reference when contacting us about the issue before a bug report is available. In this case it didn't take long, but usually there is detailed discussion and investigation to verify the problem. I just entered a bug report, https://jira.hdfgroup.org/browse/HDFFV-11351, where you will be able to track the progress; it will also be used to prioritize the work. I entered the severity as Major. From now on, please use the bug report as the reference when contacting us regarding this issue.
Binh-Minh


#7

Hi @sudha.murthy,

can you tell me a bit more about the system you're using, how HDF5 is installed, and how it's being used (serial/parallel, threads/no threads, …)? It's very strange to see a protection fault from getenv, because on this path HDF5 only ever passes it the hard-coded string "HDF5_EXTFILE_PREFIX" (https://github.com/HDFGroup/hdf5/blob/hdf5-1_8_22/src/H5Dint.c#L914), so this makes me think some internal libc state might be corrupted. Are you possibly using multiple threads that could be trampling state between them?


#8

Yeah, that is what is confusing. It is not multithreaded; we just statically link against the HDF5 libraries.
But this is legacy code, so I was trying to find out what we could be doing wrong in our calls that could corrupt that state.