Debugging HDF5 - memory problems with core driver

I am using the core driver to do buffered I/O, growing HDF5 files in memory
until they are "big enough" and then flushing them to disk. I am trying to
debug code I just wrote, which runs fine on one machine but crashes on
another. I suspect I am doing something stupid with memory (a buffer overflow
or some such thing), but thought I'd ask whether any of the debugging output
from HDF5 could shed some light on the problem.

The problem: After a couple of GB of data are written to memory I get a
bunch of these, and then everything crashes and burns:

tcmalloc: large alloc 18446744071939653632 bytes == (nil) @
tcmalloc: large alloc 18446744071940014080 bytes == (nil) @
tcmalloc: large alloc 18446744071940407296 bytes == (nil) @
tcmalloc: large alloc 18446744071940734976 bytes == (nil) @
tcmalloc: large alloc 18446744071941095424 bytes == (nil) @

Note that 18446744071939653632 is 0xffffffff96818000, a really huge number
just under 2^64. As a byte count that is about 16 384 petabytes (16 EiB), FYI.

I compiled HDF5 1.8.9 with --enable-debug=all and set HDF5_DEBUG=trace.

Here is an example of output from a successful buffered write of a 3D
variable (I do dozens of these before it crashes). Note some of the output
is debugging output from my code.

0, 400. : Before h5_write_2d_or_3d
Writing thpert
H5Screate_simple(rank=1, dims=0x7fffffeb90c0 {1}, maxdims=0x7fffffeb8fc0
{1}) = 67108866 (dspace);
H5Tcopy(type=50331924 (dtype)) = 50333762 (dtype);
H5Tcopy(type=50331924 (dtype)) = 50333763 (dtype);
H5Tcopy(type=50331922 (dtype)) = 50333764 (dtype);
H5Tset_size(type=50333762 (dtype), size=34) = SUCCEED;
H5Tset_size(type=50333763 (dtype), size=1) = SUCCEED;
H5Screate_simple(rank=3, dims=0x7fffffeb90c0 {80, 120, 120},
maxdims=0x7fffffeb8fc0 {80, 120, 120}) = 67108867 (dspace);
H5Pcreate(cls=150994953 (genprop class)) = 167773041 (genprop list);
H5Pset_chunk(plist=167773041 (genprop list), ndims=3, dim=0x10004d7ce20
{80, 30, 30}) = SUCCEED;
H5Dcreate2(loc=33554438 (group), name=0x10004e054f0, type=50331922 (dtype),
space=67108867 (dspace), lcpl=H5P_DEFAULT, dcpl=167773041 (genprop list),
dapl=H5P_DEFAULT) = <delayed>
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
H5Dcreate2 = 83886080 (dset);
H5Acreate2(loc=83886080 (dset), attr_name=0x10004e054f0, type=50333762
(dtype), space=67108866 (dspace), acpl=H5P_DEFAULT, aapl=H5P_DEFAULT) =
100663296 (attr);
H5Awrite(attr=100663296 (attr), dtype=50333762 (dtype), buf=0x7fffffebdec0)
= SUCCEED;
H5Acreate2(loc=83886080 (dset), attr_name=0x10004e054f0, type=50333763
(dtype), space=67108866 (dspace), acpl=H5P_DEFAULT, aapl=H5P_DEFAULT) =
100663297 (attr);
H5Awrite(attr=100663297 (attr), dtype=50333763 (dtype), buf=0x7fffffebdd00)
= SUCCEED;
H5Acreate2(loc=83886080 (dset), attr_name=0x10004e05530, type=50333764
(dtype), space=67108866 (dspace), acpl=H5P_DEFAULT, aapl=H5P_DEFAULT) =
100663298 (attr);
H5Awrite(attr=100663298 (attr), dtype=50333764 (dtype), buf=0x7fffffeb9764)
= SUCCEED;
H5Acreate2(loc=83886080 (dset), attr_name=0x10004e05560, type=50333764
(dtype), space=67108866 (dspace), acpl=H5P_DEFAULT, aapl=H5P_DEFAULT) =
100663299 (attr);
H5Awrite(attr=100663299 (attr), dtype=50333764 (dtype), buf=0x7fffffeb975c)
= SUCCEED;
H5Dwrite(dset=83886080 (dset), mem_type=50331922 (dtype),
mem_space=H5P_DEFAULT, file_space=H5P_DEFAULT, dxpl=H5P_DEFAULT,
buf=0x10000cb8cc0) = <delayed>
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
H5Dwrite = SUCCEED;
H5Aclose(attr=100663296 (attr)) = SUCCEED;
H5Aclose(attr=100663297 (attr)) = SUCCEED;
H5Sclose(space=67108866 (dspace)) = SUCCEED;
H5Tclose(type=50333762 (dtype)) = SUCCEED;
H5Tclose(type=50333763 (dtype)) = SUCCEED;
H5Tclose(type=50333764 (dtype)) = SUCCEED;
H5Aclose(attr=100663298 (attr)) = SUCCEED;
H5Aclose(attr=100663299 (attr)) = SUCCEED;
H5Dclose(dset=83886080 (dset)) = SUCCEED;
H5Sclose(space=67108867 (dspace)) = SUCCEED;
H5Pclose(plist=167773041 (genprop list)) = SUCCEED;
0, 400. : After h5_write_2d_or_3d

Here is where everything crashes and burns:

0, 400. : Before h5_write_2d_or_3d
Writing xvort
H5Screate_simple(rank=1, dims=0x7fffffeb90c0 {1}, maxdims=0x7fffffeb8fc0
{1}) = 67108866 (dspace);
H5Tcopy(type=50331924 (dtype)) = 50333834 (dtype);
H5Tcopy(type=50331924 (dtype)) = 50333835 (dtype);
H5Tcopy(type=50331922 (dtype)) = 50333836 (dtype);
H5Tset_size(type=50333834 (dtype), size=29) = SUCCEED;
H5Tset_size(type=50333835 (dtype), size=2) = SUCCEED;
H5Screate_simple(rank=3, dims=0x7fffffeb90c0 {80, 120, 120},
maxdims=0x7fffffeb8fc0 {80, 120, 120}) = 67108867 (dspace);
H5Pcreate(cls=150994953 (genprop class)) = 167773077 (genprop list);
H5Pset_chunk(plist=167773077 (genprop list), ndims=3, dim=0x10004d7dac0
{80, 30, 30}) = SUCCEED;
H5Dcreate2(loc=33554438 (group), name=0x10004e068b0, type=50331922 (dtype),
space=67108867 (dspace), lcpl=H5P_DEFAULT, dcpl=167773077 (genprop list),
dapl=H5P_DEFAULT) = <delayed>
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
H5Dcreate2 = 83886080 (dset);
H5Acreate2(loc=83886080 (dset), attr_name=0x10004e068b0, type=50333834
(dtype), space=67108866 (dspace), acpl=H5P_DEFAULT, aapl=H5P_DEFAULT) =
100663296 (attr);
H5Awrite(attr=100663296 (attr), dtype=50333834 (dtype), buf=0x7fffffebdec0)
= SUCCEED;
H5Acreate2(loc=83886080 (dset), attr_name=0x10004e068b0, type=50333835
(dtype), space=67108866 (dspace), acpl=H5P_DEFAULT, aapl=H5P_DEFAULT) =
100663297 (attr);
H5Awrite(attr=100663297 (attr), dtype=50333835 (dtype), buf=0x7fffffebdd00)
= SUCCEED;
H5Acreate2(loc=83886080 (dset), attr_name=0x10004e06950, type=50333836
(dtype), space=67108866 (dspace), acpl=H5P_DEFAULT, aapl=H5P_DEFAULT) =
100663298 (attr);
H5Awrite(attr=100663298 (attr), dtype=50333836 (dtype), buf=0x7fffffeb9764)
= SUCCEED;
H5Acreate2(loc=83886080 (dset), attr_name=0x10004e06990, type=50333836
(dtype), space=67108866 (dspace), acpl=H5P_DEFAULT, aapl=H5P_DEFAULT) =
100663299 (attr);
H5Awrite(attr=100663299 (attr), dtype=50333836 (dtype), buf=0x7fffffeb975c)
= SUCCEED;
H5Dwrite(dset=83886080 (dset), mem_type=50331922 (dtype),
mem_space=H5P_DEFAULT, file_space=H5P_DEFAULT, dxpl=H5P_DEFAULT,
buf=0x10000cb8cc0) = <delayed>
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
+ H5Iget_type(id=134217735 (file driver)) = H5I_VFL;
tcmalloc: large alloc 18446744071939653632 bytes == (nil) @
tcmalloc: large alloc 18446744071940014080 bytes == (nil) @
tcmalloc: large alloc 18446744071940407296 bytes == (nil) @
tcmalloc: large alloc 18446744071940734976 bytes == (nil) @
tcmalloc: large alloc 18446744071941095424 bytes == (nil) @
tcmalloc: large alloc 18446744071941455872 bytes == (nil) @
tcmalloc: large alloc 18446744071941816320 bytes == (nil) @
tcmalloc: large alloc 18446744071942209536 bytes == (nil) @
tcmalloc: large alloc 18446744071942537216 bytes == (nil) @
tcmalloc: large alloc 18446744071942897664 bytes == (nil) @
tcmalloc: large alloc 18446744071943258112 bytes == (nil) @
H5Dwrite = FAIL;
HDF5-DIAG: Error detected in HDF5 (1.8.9) MPI-process 0:
  #000: H5Dio.c line 266 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 673 in H5D_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dchunk.c line 1861 in H5D_chunk_write(): unable to read raw data chunk
    major: Low-level I/O
    minor: Read failed
  #003: H5Dchunk.c line 2846 in H5D_chunk_lock(): unable to preempt chunk(s) from cache
    major: Low-level I/O
    minor: Unable to initialize object
  #004: H5Dchunk.c line 2632 in H5D_chunk_cache_prune(): unable to preempt one or more raw data cache entry
    major: Low-level I/O
    minor: Unable to flush data from cache
  #005: H5Dchunk.c line 2499 in H5D_chunk_cache_evict(): cannot flush indexed storage buffer
    major: Low-level I/O
    minor: Write failed
  #006: H5Dchunk.c line 2427 in H5D_chunk_flush_entry(): unable to write raw data to file
    major: Dataset
    minor: Write failed
  #007: H5Fio.c line 158 in H5F_block_write(): write through metadata accumulator failed
    major: Low-level I/O
    minor: Write failed
  #008: H5Faccum.c line 808 in H5F_accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #009: H5FDint.c line 185 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #010: H5FDcore.c line 1039 in H5FD_core_write(): unable to allocate memory block of 2023198720 bytes
    major: File accessability
    minor: Can't allocate space
H5Aclose(attr=100663296 (attr)) = SUCCEED;
H5Aclose(attr=100663297 (attr)) = SUCCEED;
H5Sclose(space=67108866 (dspace)) = SUCCEED;
H5Tclose(type=50333834 (dtype)) = SUCCEED;
H5Tclose(type=50333835 (dtype)) = SUCCEED;
H5Tclose(type=50333836 (dtype)) = SUCCEED;
H5Aclose(attr=100663298 (attr)) = SUCCEED;
H5Aclose(attr=100663299 (attr)) = SUCCEED;
H5Dclose(dset=83886080 (dset))tcmalloc: large alloc 18446744071943258112
bytes == (nil) @
tcmalloc: large alloc 18446744071943258112 bytes == (nil) @
tcmalloc: large alloc 18446744071943258112 bytes == (nil) @
tcmalloc: large alloc 18446744071943258112 bytes == (nil) @
= FAIL;

etc. etc. We run out of memory immediately because something is trying to
allocate a ridiculous amount of memory. I have no idea where the tcmalloc
messages are being generated. I am pretty sure malloc is called somewhere
during H5Dwrite and is being passed a corrupted value for the amount of
memory to allocate. Note that immediately before everything crashes I do
have a lot of memory available (on the order of 20 GB).

Should I be concerned that almost all the memory addresses look like
0x7fffffeb90c0, which read as a byte count would be 127.999999 terabytes?
That seems odd.

Any pointers on debugging this? Full output of the code (containing all the
initialization stuff) can be found here:
http://waterspout.cst.cmich.edu/hdf5debug in the file cm1-out.txt.

Leigh


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Earth and Atmospheric Sciences
Central Michigan University
Office phone: 989-774-1923