H5Dwrite fails for larger data and/or smaller chunks

Working algorithm fails only for larger data

After using HDF5 successfully for more than a year (with an implementation based on the discussion "Parallel HDF5 write with irregular size in one dimension"), I recently stress-tested that implementation with larger data. With larger data I get the error appended below, while smaller data works fine.

Reproducing the error with a scaled-down minimal example

I have reimplemented our algorithm as a minimal example: HDF_Forum_H5Dwrite_fails · Tobias Meisel / Minimal Examples · GitLab.
With the same data size and options as in the actual implementation, I was able to reproduce exactly the error seen in the actual application.

The error can be triggered either by lowering c (the chunk size) or by increasing s (the amount of data each MPI process writes). The full documentation of the parameters is here: hdf5_minimal/README.md · main · Tobias Meisel / Minimal Examples · GitLab

This call is similar to my actual use case (running on HPC):

srun -n 24 hdf5_minimal -f x.h5 -l 50 -c 1000000 -i 0 -s 24000000

This is the scaled-down version (running on a desktop PC):

hdf5_minimal -f x.h5 -l 50 -c 100 -i 0 -s 240000

The error occurs at H5Dwrite
In the minimal example it is this line:
H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, io_transfer, dataext);

Possible reasons?

Similar problems (address overflow? inappropriate chunk size?) have been reported here in the forum, but the answers did not help me.

Error log

HDF5-DIAG: Error detected in HDF5 (1.14.2) MPI-process 0:
  #000: H5B.c line 969 in H5B__insert_helper(): can't insert subtree
    major: B-Tree node
    minor: Unable to insert object
  #001: H5B.c line 1072 in H5B__insert_helper(): unable to unprotect child
    major: B-Tree node
    minor: Unable to unprotect metadata
  #002: H5AC.c line 1569 in H5AC_unprotect(): Can't run sync point
    major: Object cache
    minor: Unable to flush data from cache
  #003: H5ACmpio.c line 2065 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush_to_min_clean() failed.
    major: Object cache
    minor: Can't get value
  #004: H5ACmpio.c line 1748 in H5AC__rsp__dist_md_write__flush_to_min_clean(): Can't propagate and apply candidate list.
    major: Object cache
    minor: Unable to flush data from cache
  #005: H5ACmpio.c line 1217 in H5AC__propagate_and_apply_candidate_list(): Can't apply candidate list.
    major: Object cache
    minor: Internal error detected
  #006: H5Cmpio.c line 368 in H5C_apply_candidate_list(): flush candidates failed
    major: Object cache
    minor: Unable to flush data from cache
  #007: H5Cmpio.c line 1076 in H5C__flush_candidate_entries(): flush candidates in ring failed
    major: Object cache
    minor: Unable to flush data from cache
  #008: H5Cmpio.c line 1247 in H5C__flush_candidates_in_ring(): can't flush entry
    major: Object cache
    minor: Unable to flush data from cache
  #009: H5Centry.c line 609 in H5C__flush_single_entry(): Can't write image to file
    major: Object cache
    minor: Unable to flush data from cache
  #010: H5Fio.c line 220 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #011: H5PB.c line 992 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #012: H5Faccum.c line 821 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #013: H5FDint.c line 309 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #014: H5FDmpio.c line 1559 in H5FD__mpio_write(): MPI_File_set_view failed: MPI error string is 'MPI_ERR_TYPE: invalid datatype'
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #015: H5B.c line 969 in H5B__insert_helper(): can't insert subtree
    major: B-Tree node
    minor: Unable to insert object
  #016: H5B.c line 1029 in H5B__insert_helper(): unable to split node
    major: B-Tree node
    minor: Unable to split node
  #017: H5B.c line 449 in H5B__split(): unable to create B-tree
    major: B-Tree node
    minor: Unable to initialize object
  #018: H5B.c line 238 in H5B_create(): can't add B-tree root node to cache
    major: B-Tree node
    minor: Unable to initialize object
  #019: H5AC.c line 747 in H5AC_insert_entry(): Can't run sync point
    major: Object cache
    minor: Unable to flush data from cache
  #020: H5ACmpio.c line 2065 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush_to_min_clean() failed.
    major: Object cache
    minor: Can't get value
  #021: H5ACmpio.c line 1748 in H5AC__rsp__dist_md_write__flush_to_min_clean(): Can't propagate and apply candidate list.
    major: Object cache
    minor: Unable to flush data from cache
  #022: H5ACmpio.c line 1217 in H5AC__propagate_and_apply_candidate_list(): Can't apply candidate list.
    major: Object cache
    minor: Internal error detected
  #023: H5Cmpio.c line 368 in H5C_apply_candidate_list(): flush candidates failed
    major: Object cache
    minor: Unable to flush data from cache
  #024: H5Cmpio.c line 1076 in H5C__flush_candidate_entries(): flush candidates in ring failed
    major: Object cache
    minor: Unable to flush data from cache
  #025: H5Cmpio.c line 1247 in H5C__flush_candidates_in_ring(): can't flush entry
    major: Object cache
    minor: Unable to flush data from cache
  #026: H5Centry.c line 609 in H5C__flush_single_entry(): Can't write image to file
    major: Object cache
    minor: Unable to flush data from cache
  #027: H5Fio.c line 220 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #028: H5PB.c line 992 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #029: H5Faccum.c line 821 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #030: H5FDint.c line 309 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #031: H5FDmpio.c line 1559 in H5FD__mpio_write(): MPI_File_set_view failed: MPI error string is 'MPI_ERR_TYPE: invalid datatype'
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
5
HDF5-DIAG: Error detected in HDF5 (1.14.2) MPI-process 0:
  #000: H5AC.c line 463 in H5AC_dest(): can't enable skip list
    major: Object cache
    minor: Internal error detected
  #001: H5C.c line 1091 in H5C_set_slist_enabled(): slist already enabled?
    major: Object cache
    minor: Internal error detected
  #002: H5Fint.c line 1400 in H5F__dest(): unable to flush cached data (phase 2)
    major: File accessibility
    minor: Unable to flush data from cache
  #003: H5Fint.c line 2254 in H5F__flush_phase2(): secure from MDC flush failed
    major: Object cache
    minor: Unable to flush data from cache
  #004: H5AC.c line 1172 in H5AC_secure_from_file_flush(): can't disable skip list
    major: Object cache
    minor: Internal error detected
  #005: H5C.c line 1132 in H5C_set_slist_enabled(): slist not empty?
    major: Object cache
    minor: Internal error detected
  #006: H5Fint.c line 2243 in H5F__flush_phase2(): unable to flush metadata cache
    major: Object cache
    minor: Unable to flush data from cache
  #007: H5AC.c line 614 in H5AC_flush(): Can't flush
    major: Object cache
    minor: Unable to flush data from cache
  #008: H5ACmpio.c line 2206 in H5AC__flush_entries(): Can't run sync point.
    major: Object cache
    minor: Unable to flush data from cache
  #009: H5ACmpio.c line 2071 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush() failed.
    major: Object cache
    minor: Can't get value
  #010: H5ACmpio.c line 1627 in H5AC__rsp__dist_md_write__flush(): Can't apply candidate list.
    major: Object cache
    minor: Internal error detected
  #011: H5Cmpio.c line 368 in H5C_apply_candidate_list(): flush candidates failed
    major: Object cache
    minor: Unable to flush data from cache
  #012: H5Cmpio.c line 1076 in H5C__flush_candidate_entries(): flush candidates in ring failed
    major: Object cache
    minor: Unable to flush data from cache
  #013: H5Cmpio.c line 1247 in H5C__flush_candidates_in_ring(): can't flush entry
    major: Object cache
    minor: Unable to flush data from cache
  #014: H5Centry.c line 609 in H5C__flush_single_entry(): Can't write image to file
    major: Object cache
    minor: Unable to flush data from cache
  #015: H5Fio.c line 220 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #016: H5PB.c line 992 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #017: H5Faccum.c line 821 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #018: H5FDint.c line 303 in H5FD_write(): addr overflow, addr = 4145934, size=2096, eoa=4145934
    major: Invalid arguments to routine
    minor: Address overflowed
  #019: H5Fint.c line 2222 in H5F__flush_phase2(): unable to flush metadata cache
    major: Object cache
    minor: Unable to flush data from cache
  #020: H5AC.c line 614 in H5AC_flush(): Can't flush
    major: Object cache
    minor: Unable to flush data from cache
  #021: H5ACmpio.c line 2206 in H5AC__flush_entries(): Can't run sync point.
    major: Object cache
    minor: Unable to flush data from cache
  #022: H5ACmpio.c line 2071 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush() failed.
    major: Object cache
    minor: Can't get value
  #023: H5ACmpio.c line 1627 in H5AC__rsp__dist_md_write__flush(): Can't apply candidate list.
    major: Object cache
    minor: Internal error detected
  #024: H5Cmpio.c line 368 in H5C_apply_candidate_list(): flush candidates failed
    major: Object cache
    minor: Unable to flush data from cache
  #025: H5Cmpio.c line 1076 in H5C__flush_candidate_entries(): flush candidates in ring failed
    major: Object cache
    minor: Unable to flush data from cache
  #026: H5Cmpio.c line 1247 in H5C__flush_candidates_in_ring(): can't flush entry
    major: Object cache
    minor: Unable to flush data from cache
  #027: H5Centry.c line 609 in H5C__flush_single_entry(): Can't write image to file
    major: Object cache
    minor: Unable to flush data from cache
  #028: H5Fio.c line 220 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #029: H5PB.c line 992 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #030: H5Faccum.c line 821 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #031: H5FDint.c line 303 in H5FD_write(): addr overflow, addr = 4145934, size=2096, eoa=4145934
    major: Invalid arguments to routine
    minor: Address overflowed
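
The last frame of the second stack reports addr = 4145934 and eoa = 4145934, so the 2096-byte write starts exactly at the end of the allocated address space. The following is a minimal sketch of that kind of bounds check, written for illustration only; it is not HDF5's source code, and the function name write_overflows_eoa is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only (not HDF5 code).
 * Returns true when a write of `size` bytes starting at `addr` would
 * extend past `eoa`, the end of the allocated address space. */
bool write_overflows_eoa(uint64_t addr, uint64_t size, uint64_t eoa)
{
    /* A start address beyond eoa is always out of range. */
    if (addr > eoa)
        return true;
    /* Compare against the remaining space instead of computing
     * addr + size, so the check itself cannot wrap around. */
    return size > eoa - addr;
}
```

With the values from the log, write_overflows_eoa(4145934, 2096, 4145934) is true: no space remains at eoa, so any non-empty write is rejected.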

Question

Could you please advise me on how to further track down or solve the problem?

Hi @tobias.meisel,

the error stack given here looks very similar to an issue that was fixed for the HDF5 1.14.3 release, in PR #3688. Would it be possible to try your same example with that version?
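
To confirm which library version a binary actually runs against, the three version numbers can be compared with 1.14.3. This is a small sketch under the assumption that the numbers are obtained at run time, for example via H5get_libversion(&maj, &min, &rel); the helper name version_at_least is hypothetical and the comparison itself does not depend on HDF5.

```c
#include <stdbool.h>

/* Hypothetical helper: true if version maj.min.rel is at least
 * want_maj.want_min.want_rel. In a real program the first three values
 * would come from H5get_libversion(). */
bool version_at_least(unsigned maj, unsigned min, unsigned rel,
                      unsigned want_maj, unsigned want_min, unsigned want_rel)
{
    /* Compare the most significant component first. */
    if (maj != want_maj)
        return maj > want_maj;
    if (min != want_min)
        return min > want_min;
    return rel >= want_rel;
}
```

For the version shown in the error log, version_at_least(1, 14, 2, 1, 14, 3) is false, so that build would not yet contain the fix.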

Note that enabling collective metadata writes by calling H5Pset_coll_metadata_write on your File Access Property List that you pass to H5Fcreate should also fix the issue if this is the case. I’d recommend having that on anyway if you’re going to be writing a large amount of data/metadata, but it would be a good test to try before trying the 1.14.3 release.

With H5Pset_coll_metadata_write I still got the same error, although I needed to increase the amount of data to trigger it.

The problem is solved with the newer HDF5 version 1.14.3.

@jhenderson Thank you for your help.