Dear community,
I am reaching out because I am stuck on an error I cannot get my head around. The traceback is printed at the end of this message. I am using HDF5 1.10.11 under the hood of netCDF 4.9.2 for parallel data output with Intel MPI 2021.4. At the time of the crash, the file is about 790 MB in size.
Below is the valgrind output for one of the 32 tasks.
Could you be so kind as to take a look? Please ask if you need additional information. Thank you very much!
==1949873== Invalid read of size 16
==1949873== at 0x53A8870: __I_MPI___intel_avx_rep_memcpy (in /opt/intel/oneapi/mpi/2021.4.0/lib/libmpifort.so.12.0.0)
==1949873== by 0x572AC78: memcpy (string3.h:52)
==1949873== by 0x572AC78: ADIOI_Fill_send_buffer (ad_write_coll.c:897)
==1949873== by 0x572AC78: ??? (ad_write_coll.c:701)
==1949873== by 0x5727F9F: ADIOI_Exch_and_write (ad_write_coll.c:466)
==1949873== by 0x5727F9F: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:189)
==1949873== by 0x61CFBF3: MPIOI_File_write_all (write_all.c:114)
==1949873== by 0x61CFEC1: PMPI_File_write_at_all (write_atall.c:58)
==1949873== by 0x82F496D: ??? (in /home/akprog/lib/i21/hdf5_1.10.11/lib/libhdf5.so.103.4.1)
==1949873== by 0x8074056: H5FD_write (in /home/akprog/lib/i21/hdf5_1.10.11/lib/libhdf5.so.103.4.1)
==1949873== by 0x804B49B: H5F__accum_write (in /home/akprog/lib/i21/hdf5_1.10.11/lib/libhdf5.so.103.4.1)
==1949873== by 0x818D739: H5PB_write (in /home/akprog/lib/i21/hdf5_1.10.11/lib/libhdf5.so.103.4.1)
==1949873== by 0x8059008: H5F_block_write (in /home/akprog/lib/i21/hdf5_1.10.11/lib/libhdf5.so.103.4.1)
==1949873== by 0x7FBDFFB: H5C__flush_single_entry (in /home/akprog/lib/i21/hdf5_1.10.11/lib/libhdf5.so.103.4.1)
==1949873== by 0x82DCBBE: H5C_apply_candidate_list (in /home/akprog/lib/i21/hdf5_1.10.11/lib/libhdf5.so.103.4.1)
==1949873== Address 0xfdbd540 is 688 bytes inside an unallocated block of size 1,360 in arena "client"
Traceback:
[tnode085:1949873:0:1949873] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xfee4020)
==== backtrace (tid:1949873) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x00000000000c4880 __I_MPI___intel_avx_rep_memcpy() ???:0
2 0x0000000000092c79 memcpy() /usr/include/bits/string3.h:52
3 0x0000000000092c79 ADIOI_W_Exchange_data() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/adio/common/ad_write_coll.c:701
4 0x000000000008ffa0 ADIOI_Exch_and_write() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/adio/common/ad_write_coll.c:466
5 0x000000000008ffa0 ADIOI_GEN_WriteStridedColl() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/adio/common/ad_write_coll.c:189
6 0x0000000000b37bf4 MPIOI_File_write_all() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/mpi-io/write_all.c:114
7 0x0000000000b37ec2 PMPI_File_write_at_all() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/mpi-io/write_atall.c:58
8 0x00000000003d796e H5FD_get_mpio_atomicity() ???:0
9 0x0000000000157057 H5FD_write() ???:0
10 0x000000000012e49c H5F__accum_write() ???:0
11 0x000000000027073a H5PB_write() ???:0
12 0x000000000013c009 H5F_block_write() ???:0
13 0x00000000000a0ffc H5C__flush_single_entry() ???:0
14 0x00000000003bfbbf H5C_apply_candidate_list() ???:0
15 0x00000000003bcfec H5AC__log_moved_entry() ???:0
16 0x00000000003bdec1 H5AC__run_sync_point() ???:0
17 0x0000000000074459 H5AC_insert_entry() ???:0
18 0x000000000007ae4c H5B_create() ???:0
19 0x000000000007a96a H5B_insert() ???:0
20 0x0000000000079c96 H5B_insert() ???:0
21 0x000000000007a378 H5B_insert() ???:0
22 0x000000000007a378 H5B_insert() ???:0
23 0x0000000000079007 H5B_insert() ???:0
24 0x00000000000bc9a8 H5Dchunk_iter() ???:0
25 0x00000000003cf578 H5D__contig_collective_write() ???:0
26 0x00000000003cd367 H5D__contig_collective_write() ???:0
27 0x00000000003d2690 H5D__chunk_collective_write() ???:0
28 0x00000000000f52a4 H5D__write() ???:0
29 0x00000000000f472f H5Dwrite() ???:0
30 0x00000000000e9fdb NC4_put_vars() ???:0
31 0x0000000000044be1 nc_put_vars_float() ???:0
32 0x000000000077a6e6 nf_put_vars_real() /home_nfs/appbuilder/netcdf/parallel-intel/netcdf-fortran-4.6.1/fortran/nf_varsio.F90:412
33 0x000000000071a1c9 netcdf_mp_nf90_put_var_1d_fourbytereal() /home_nfs/appbuilder/netcdf/parallel-intel/netcdf-fortran-4.6.1/fortran/./netcdf_expanded.F90:927
34 0x000000000061562d dataset_nf90_put_var_d() /net/themis/system/akprog/fortran/lib/io_dataset/i21/mod_m_dataset_netcdf.f90:4577
