A Random Crash I Can't Explain

Hi. I’m using HDF5-cpp to store results of a computation on a cluster (large matrices among other things).

Occasionally, the code crashes with the error I have included at the end of this message.

The key point (probably) is that the crash only occurs when I submit the binary that performs the calculations (and eventually writes .h5 files with the results) to be run via Slurm on a compute node within the same cluster. Note that the crash doesn't always happen, which is why it has been difficult to debug. The same binary runs without crashing, however, if I simply run it directly.

Any help or hints would be greatly appreciated. Thanks!

Here is the detailed error message:

HDF5-DIAG: Error detected in HDF5 (1.14.1-2) thread 0:
  #000: H5D.c line 480 in H5Dclose(): can't decrement count on dataset ID
    major: Dataset
    minor: Unable to decrement reference count
  #001: H5Iint.c line 1263 in H5I_dec_app_ref_always_close(): can't decrement ID ref count
    major: Object ID
    minor: Unable to decrement reference count
  #002: H5Iint.c line 1233 in H5I__dec_app_ref_always_close(): can't decrement ID ref count
    major: Object ID
    minor: Unable to decrement reference count
  #003: H5Iint.c line 1107 in H5I__dec_app_ref(): can't decrement ID ref count
    major: Object ID
    minor: Unable to decrement reference count
  #004: H5Dint.c line 297 in H5D__close_cb(): unable to close dataset
    major: Dataset
    minor: Close failed
  #005: H5VLcallback.c line 2808 in H5VL_dataset_close(): dataset close failed
    major: Virtual Object Layer
    minor: Can't close object
  #006: H5VLcallback.c line 2771 in H5VL__dataset_close(): dataset close failed
    major: Virtual Object Layer
    minor: Can't close object
  #007: H5VLnative_dataset.c line 813 in H5VL__native_dataset_close(): can't close dataset
    major: Dataset
    minor: Unable to decrement reference count
  #008: H5Dint.c line 1886 in H5D_close(): unable to flush cached dataset info
    major: Dataset
    minor: Write failed
  #009: H5Dint.c line 3223 in H5D__flush_real(): unable to flush raw data
    major: Dataset
    minor: Unable to flush data from cache
  #010: H5Dcontig.c line 1602 in H5D__contig_flush(): unable to flush sieve buffer
    major: Dataset
    minor: Unable to flush data from cache
  #011: H5Dint.c line 3189 in H5D__flush_sieve_buf(): block write failed
    major: Low-level I/O
    minor: Write failed
  #012: H5Fio.c line 190 in H5F_shared_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #013: H5PB.c line 1017 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #014: H5Faccum.c line 832 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #015: H5FDint.c line 306 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #016: H5FDsec2.c line 856 in H5FD__sec2_write(): file write failed: time = Mon Jul  3 22:35:54 2023
, filename = './dat__U2_V0p5_TP-0p2_Mu-0p8_KDIM16_FineMoms6400_RefdMoms256_FFShellCount1p5_BETA5_C4_T_START5_Square Hubbard_OMFL_VANHOVE_ALLSYMM/1.h5', file descriptor = 3, errno = 5, error message = 'Input/output error', buf = 0x92119438, total write size = 65536, bytes this sub-write = 65536, bytes actually written = 18446744073709551615, offset = 0
    major: Low-level I/O
    minor: Write failed
terminate called after throwing an instance of 'H5::DataSetIException'
/var/spool/slurmd/job49109/slurm_script: line 30:  1018 Aborted                 $PROG

The write fails with a return value of -1 (note "bytes actually written = 18446744073709551615", i.e. (size_t)-1, and errno = 5, 'Input/output error'), which could be due to a permission or filesystem issue.
What does the relative path ./dat__U2_V0p5_TP-0p2_Mu-0p8_KDIM16_FineMoms6400_RefdMoms256_FFShellCount1p5_BETA5_C4_T_START5_Square Hubbard_OMFL_VANHOVE_ALLSYMM/1.h5 translate to on your cluster?
Where do you think you are writing? What’s the working directory of the executable when running in batch mode?

G.

Thank you!

That directory is relative to where I launch the job.

I printed the program's working directory from within the program by putting the following into the main function:

    // Requires <iostream>, <cstdio> (FILENAME_MAX), <unistd.h> (getcwd),
    // and <sys/stat.h> (stat, S_IRUSR, ...) included at the top of the file.
    char currentPath[FILENAME_MAX];
    if (getcwd(currentPath, sizeof(currentPath)) != nullptr) {
        std::cout << "Current working directory: " << currentPath << std::endl;
    } else {
        std::cerr << "Failed to get current working directory." << std::endl;
    }

    struct stat fileStat;
    if (stat(currentPath, &fileStat) == 0) {
        mode_t permissions = fileStat.st_mode;

        // Check the owner (user) permission bits of the directory
        if (permissions & S_IRUSR) {
            std::cout << "User has read permission." << std::endl;
        }

        if (permissions & S_IWUSR) {
            std::cout << "User has write permission." << std::endl;
        }

        if (permissions & S_IXUSR) {
            std::cout << "User has execute permission." << std::endl;
        }

        // Similarly, you can check permissions for group and others
        // using S_IRGRP, S_IWGRP, S_IXGRP, S_IROTH, S_IWOTH, S_IXOTH

    } else {
        std::cerr << "Failed to get file/directory stat." << std::endl;
    }

Which outputs:

Current working directory: /home/users/aleryani/frg_new2
User has read permission.
User has write permission.
User has execute permission.

i.e., it is the same directory from which I launch the job. I expect the file to be created in /home/users/aleryani/frg_new2/dat__U2_..._ALLSYMM/ (and sometimes a corrupt file is indeed created there).
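
For good measure, here is a minimal sketch of a further check one could add to main (the "./output_dir" literal is only a placeholder for the real output subdirectory): ask the kernel directly via access(2) whether this process may write there, instead of inspecting only the owner bits in st_mode:

    // Requires <unistd.h> (access, W_OK, X_OK), <iostream>, and <cstdio> (std::perror).
    // Placeholder path: substitute the actual output subdirectory here.
    const char* outDir = "./output_dir";

    // access() evaluates permissions for this process's real UID/GID, so it
    // also reflects ownership and the group/other bits, unlike the S_IWUSR
    // check above, which only looks at the owner bits of st_mode.
    if (access(outDir, W_OK | X_OK) == 0) {
        std::cout << "Output directory exists and is writable." << std::endl;
    } else {
        std::perror("access");  // e.g. ENOENT or EACCES
    }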

I forgot to mention that the error messages can vary. For example, here is another error message from a few minutes ago, when I submitted the job again and it crashed:

load lib/fftw/3.3.8 (FFTW_CFLAGS, FFTW_FFLAGS, FFTW_LDFLAGS, FFTW_INCL, FFTW_LIBS, LD_LIBRARY_PATH, C_INCLUDE_PATH)
load lib/hdf5/1.14.1-2 (PATH, LD_LIBRARY_PATH, LIBRARY_PATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH, CMAKE_PREFIX_PATH, HDF5_DIR)
GNU compiler collection (gcc,g++,gfortran) version 7.4.0
load compiler/gcc/7.4.0 (PATH, LD_LIBRARY_PATH, CPATH, MANPATH, PYTHONPATH)
HDF5-DIAG: Error detected in HDF5 (1.14.1-2) thread 0:
  #000: H5F.c line 660 in H5Fcreate(): unable to synchronously create file
    major: File accessibility
    minor: Unable to create file
  #001: H5F.c line 614 in H5F__create_api_common(): unable to create file
    major: File accessibility
    minor: Unable to open file
  #002: H5VLcallback.c line 3605 in H5VL_file_create(): file create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: H5VLcallback.c line 3571 in H5VL__file_create(): file create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #004: H5VLnative_file.c line 94 in H5VL__native_file_create(): unable to create file
    major: File accessibility
    minor: Unable to open file
  #005: H5Fint.c line 1903 in H5F_open(): unable to lock the file
    major: File accessibility
    minor: Unable to lock file
  #006: H5FD.c line 2026 in H5FD_lock(): driver lock request failed
    major: Virtual File Layer
    minor: Unable to lock file
  #007: H5FDsec2.c line 988 in H5FD__sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
    major: Virtual File Layer
    minor: Unable to lock file
terminate called after throwing an instance of 'H5::FileIException'
/var/spool/slurmd/job49111/slurm_script: line 30: 32216 Aborted                 $PROG

real    0m33.363s
user    7m44.713s
sys     0m0.355s

What’s the file system?

Try setting the HDF5_USE_FILE_LOCKING environment variable to FALSE, e.g.,

export HDF5_USE_FILE_LOCKING="FALSE"
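
If you would rather not rely on the environment variable, locking can also be disabled per file from code through the file access property list. A minimal sketch (the helper name createUnlockedFile is only illustrative; it assumes a build that provides H5Pset_file_locking(), i.e. HDF5 1.10.7/1.12.1 or newer, which your 1.14.1-2 is):

    #include <H5Cpp.h>
    #include <hdf5.h>       // H5Pset_file_locking (C API)
    #include <stdexcept>
    #include <string>

    // Create an HDF5 file whose file access property list has locking disabled.
    H5::H5File createUnlockedFile(const std::string& path) {
        H5::FileAccPropList fapl;  // a fresh file-access property list
        // 2nd arg: do not use file locking; 3rd arg: ignore locking failures
        // where locking is disabled on the file system.
        if (H5Pset_file_locking(fapl.getId(), false, true) < 0) {
            throw std::runtime_error("H5Pset_file_locking failed");
        }
        return H5::H5File(path, H5F_ACC_TRUNC,
                          H5::FileCreatPropList::DEFAULT, fapl);
    }

Either way, if you go with the environment variable, make sure the export happens inside the Slurm batch script before $PROG is launched, so that it is set on the compute node and not only in your interactive shell.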

Thank you very much, gheber. The filesystem is "xfs". I didn't want to reply until I was sure the problem had been fixed, but it does indeed seem to be gone after setting that environment variable.

Thanks a lot!
