Working algorithm fails only for larger data
After using HDF5 successfully for more than a year (with an implementation based on the discussion "Parallel HDF5 write with irregular size in one dimension"), I recently stressed that implementation with larger data. With larger data I get the error appended below, while smaller data is written without problems.
Reproducing the error with a scaled-down minimal example
I have reimplemented our algorithm as this minimal example: HDF_Forum_H5Dwrite_fails · Tobias Meisel / Minimal Examples · GitLab.
With the same data size and options as in the actual implementation, I was able to reproduce exactly the error seen in the actual application.
The error can be triggered by either lowering c (chunk size) or the amount of data each MPI process writes (s). The full documentation of the parameters is here: hdf5_minimal/README.md · main · Tobias Meisel / Minimal Examples · GitLab
This call is similar to my actual use case (running on HPC):
srun -n 24 hdf5_minimal -f x.h5 -l 50 -c 1000000 -i 0 -s 24000000
This is the scaled-down version (running on a desktop PC):
hdf5_minimal -f x.h5 -l 50 -c 100 -i 0 -s 240000
The error occurs at H5Dwrite
In the minimal example it is this line:
H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, io_transfer, dataext);
Possible reasons?
Similar problems (address overflow? inappropriate chunk size?) have been reported here in the forum, but the answers did not help me.
Error log
HDF5-DIAG: Error detected in HDF5 (1.14.2) MPI-process 0:
#000: H5B.c line 969 in H5B__insert_helper(): can't insert subtree
major: B-Tree node
minor: Unable to insert object
#001: H5B.c line 1072 in H5B__insert_helper(): unable to unprotect child
major: B-Tree node
minor: Unable to unprotect metadata
#002: H5AC.c line 1569 in H5AC_unprotect(): Can't run sync point
major: Object cache
minor: Unable to flush data from cache
#003: H5ACmpio.c line 2065 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush_to_min_clean() failed.
major: Object cache
minor: Can't get value
#004: H5ACmpio.c line 1748 in H5AC__rsp__dist_md_write__flush_to_min_clean(): Can't propagate and apply candidate list.
major: Object cache
minor: Unable to flush data from cache
#005: H5ACmpio.c line 1217 in H5AC__propagate_and_apply_candidate_list(): Can't apply candidate list.
major: Object cache
minor: Internal error detected
#006: H5Cmpio.c line 368 in H5C_apply_candidate_list(): flush candidates failed
major: Object cache
minor: Unable to flush data from cache
#007: H5Cmpio.c line 1076 in H5C__flush_candidate_entries(): flush candidates in ring failed
major: Object cache
minor: Unable to flush data from cache
#008: H5Cmpio.c line 1247 in H5C__flush_candidates_in_ring(): can't flush entry
major: Object cache
minor: Unable to flush data from cache
#009: H5Centry.c line 609 in H5C__flush_single_entry(): Can't write image to file
major: Object cache
minor: Unable to flush data from cache
#010: H5Fio.c line 220 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#011: H5PB.c line 992 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#012: H5Faccum.c line 821 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#013: H5FDint.c line 309 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#014: H5FDmpio.c line 1559 in H5FD__mpio_write(): MPI_File_set_view failed: MPI error string is 'MPI_ERR_TYPE: invalid datatype'
major: Internal error (too specific to document in detail)
minor: Some MPI function failed
#015: H5B.c line 969 in H5B__insert_helper(): can't insert subtree
major: B-Tree node
minor: Unable to insert object
#016: H5B.c line 1029 in H5B__insert_helper(): unable to split node
major: B-Tree node
minor: Unable to split node
#017: H5B.c line 449 in H5B__split(): unable to create B-tree
major: B-Tree node
minor: Unable to initialize object
#018: H5B.c line 238 in H5B_create(): can't add B-tree root node to cache
major: B-Tree node
minor: Unable to initialize object
#019: H5AC.c line 747 in H5AC_insert_entry(): Can't run sync point
major: Object cache
minor: Unable to flush data from cache
#020: H5ACmpio.c line 2065 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush_to_min_clean() failed.
major: Object cache
minor: Can't get value
#021: H5ACmpio.c line 1748 in H5AC__rsp__dist_md_write__flush_to_min_clean(): Can't propagate and apply candidate list.
major: Object cache
minor: Unable to flush data from cache
#022: H5ACmpio.c line 1217 in H5AC__propagate_and_apply_candidate_list(): Can't apply candidate list.
major: Object cache
minor: Internal error detected
#023: H5Cmpio.c line 368 in H5C_apply_candidate_list(): flush candidates failed
major: Object cache
minor: Unable to flush data from cache
#024: H5Cmpio.c line 1076 in H5C__flush_candidate_entries(): flush candidates in ring failed
major: Object cache
minor: Unable to flush data from cache
#025: H5Cmpio.c line 1247 in H5C__flush_candidates_in_ring(): can't flush entry
major: Object cache
minor: Unable to flush data from cache
#026: H5Centry.c line 609 in H5C__flush_single_entry(): Can't write image to file
major: Object cache
minor: Unable to flush data from cache
#027: H5Fio.c line 220 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#028: H5PB.c line 992 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#029: H5Faccum.c line 821 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#030: H5FDint.c line 309 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#031: H5FDmpio.c line 1559 in H5FD__mpio_write(): MPI_File_set_view failed: MPI error string is 'MPI_ERR_TYPE: invalid datatype'
major: Internal error (too specific to document in detail)
minor: Some MPI function failed
5
HDF5-DIAG: Error detected in HDF5 (1.14.2) MPI-process 0:
#000: H5AC.c line 463 in H5AC_dest(): can't enable skip list
major: Object cache
minor: Internal error detected
#001: H5C.c line 1091 in H5C_set_slist_enabled(): slist already enabled?
major: Object cache
minor: Internal error detected
#002: H5Fint.c line 1400 in H5F__dest(): unable to flush cached data (phase 2)
major: File accessibility
minor: Unable to flush data from cache
#003: H5Fint.c line 2254 in H5F__flush_phase2(): secure from MDC flush failed
major: Object cache
minor: Unable to flush data from cache
#004: H5AC.c line 1172 in H5AC_secure_from_file_flush(): can't disable skip list
major: Object cache
minor: Internal error detected
#005: H5C.c line 1132 in H5C_set_slist_enabled(): slist not empty?
major: Object cache
minor: Internal error detected
#006: H5Fint.c line 2243 in H5F__flush_phase2(): unable to flush metadata cache
major: Object cache
minor: Unable to flush data from cache
#007: H5AC.c line 614 in H5AC_flush(): Can't flush
major: Object cache
minor: Unable to flush data from cache
#008: H5ACmpio.c line 2206 in H5AC__flush_entries(): Can't run sync point.
major: Object cache
minor: Unable to flush data from cache
#009: H5ACmpio.c line 2071 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush() failed.
major: Object cache
minor: Can't get value
#010: H5ACmpio.c line 1627 in H5AC__rsp__dist_md_write__flush(): Can't apply candidate list.
major: Object cache
minor: Internal error detected
#011: H5Cmpio.c line 368 in H5C_apply_candidate_list(): flush candidates failed
major: Object cache
minor: Unable to flush data from cache
#012: H5Cmpio.c line 1076 in H5C__flush_candidate_entries(): flush candidates in ring failed
major: Object cache
minor: Unable to flush data from cache
#013: H5Cmpio.c line 1247 in H5C__flush_candidates_in_ring(): can't flush entry
major: Object cache
minor: Unable to flush data from cache
#014: H5Centry.c line 609 in H5C__flush_single_entry(): Can't write image to file
major: Object cache
minor: Unable to flush data from cache
#015: H5Fio.c line 220 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#016: H5PB.c line 992 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#017: H5Faccum.c line 821 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#018: H5FDint.c line 303 in H5FD_write(): addr overflow, addr = 4145934, size=2096, eoa=4145934
major: Invalid arguments to routine
minor: Address overflowed
#019: H5Fint.c line 2222 in H5F__flush_phase2(): unable to flush metadata cache
major: Object cache
minor: Unable to flush data from cache
#020: H5AC.c line 614 in H5AC_flush(): Can't flush
major: Object cache
minor: Unable to flush data from cache
#021: H5ACmpio.c line 2206 in H5AC__flush_entries(): Can't run sync point.
major: Object cache
minor: Unable to flush data from cache
#022: H5ACmpio.c line 2071 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush() failed.
major: Object cache
minor: Can't get value
#023: H5ACmpio.c line 1627 in H5AC__rsp__dist_md_write__flush(): Can't apply candidate list.
major: Object cache
minor: Internal error detected
#024: H5Cmpio.c line 368 in H5C_apply_candidate_list(): flush candidates failed
major: Object cache
minor: Unable to flush data from cache
#025: H5Cmpio.c line 1076 in H5C__flush_candidate_entries(): flush candidates in ring failed
major: Object cache
minor: Unable to flush data from cache
#026: H5Cmpio.c line 1247 in H5C__flush_candidates_in_ring(): can't flush entry
major: Object cache
minor: Unable to flush data from cache
#027: H5Centry.c line 609 in H5C__flush_single_entry(): Can't write image to file
major: Object cache
minor: Unable to flush data from cache
#028: H5Fio.c line 220 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#029: H5PB.c line 992 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#030: H5Faccum.c line 821 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#031: H5FDint.c line 303 in H5FD_write(): addr overflow, addr = 4145934, size=2096, eoa=4145934
major: Invalid arguments to routine
minor: Address overflowed
Question
Could you please advise me on how to further track down or solve the problem?