Greetings folks. I noticed that HDF5 crashes when I/O filters produce more data than the original dataset size.
When a dataset is created, its declared dimensions and data type are naturally honored when the time comes to write the data with H5Dwrite(). The I/O filter interface, however, allows a compressor to return a byte count that is either smaller than the original chunk size (the data was successfully compressed) or somewhat larger (the compressor did a poor job).
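To make that concrete, here is a minimal sketch of a filter callback that expands data on the write path instead of compressing it. The function name, the filter behaviour and the 100x factor are made up for illustration, but the signature is the standard H5Z_func_t one:

/* Hypothetical filter callback that outputs far more bytes than it receives
 * on the forward (write) path. */
#include <hdf5.h>
#include <stdlib.h>
#include <string.h>

static size_t expand_filter(unsigned int flags, size_t cd_nelmts,
                            const unsigned int cd_values[], size_t nbytes,
                            size_t *buf_size, void **buf)
{
    (void)cd_nelmts; (void)cd_values;

    if (flags & H5Z_FLAG_REVERSE) {
        /* Read path: a real filter would undo the expansion here.
         * Left as a pass-through for brevity. */
        return nbytes;
    }

    /* Write path: produce 100x more output than input (made-up factor). */
    size_t out_size = nbytes * 100;
    void *out = malloc(out_size);
    if (!out)
        return 0;                  /* 0 signals filter failure to HDF5 */
    memset(out, 0, out_size);
    memcpy(out, *buf, nbytes);     /* original payload, rest is padding */

    free(*buf);
    *buf = out;
    *buf_size = out_size;
    return out_size;               /* tell HDF5 how many bytes to store */
}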
Now, let’s say we have a really bad compressor whose output needs 100x more room than the original data. What I observe is that HDF5 silently truncates the data, so it cannot be retrieved afterwards. In some cases HDF5 even crashes when the dataset handle is closed (a rough reproducer sketch follows the backtrace):
#1 0x00007ffff72e201b in __GI_abort () at abort.c:79
#2 0x00007ffff733ab98 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff742c2a0 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff7341f0a in malloc_printerr (str=str@entry=0x7ffff742fb50 "munmap_chunk(): invalid pointer") at malloc.c:5332
#4 0x00007ffff73421bc in munmap_chunk (p=<optimized out>) at malloc.c:2830
#5 0x00007ffff7cf69ee in H5MM_xfree (mem=<optimized out>) at /Data/Compile/Sources/hdf5-1.12.0/src/H5MM.c:560
#6 0x00007ffff7c1e2b5 in H5D__chunk_mem_xfree (chk=<optimized out>, _pline=<optimized out>) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:1435
#7 0x00007ffff7c22d44 in H5D__chunk_mem_xfree (_pline=<optimized out>, chk=<optimized out>) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:1433
#8 H5D__chunk_flush_entry (dset=dset@entry=0x4990b0, ent=ent@entry=0x4ac980, reset=reset@entry=true) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:3495
#9 0x00007ffff7c23190 in H5D__chunk_cache_evict (dset=dset@entry=0x4990b0, ent=0x4ac980, flush=flush@entry=true) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:3550
#10 0x00007ffff7c232f7 in H5D__chunk_dest (dset=0x4990b0) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:2918
#11 0x00007ffff7c3a985 in H5D_close (dataset=0x4990b0) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dint.c:2000
#12 0x00007ffff7e468e9 in H5VL__native_dataset_close (dset=<optimized out>, dxpl_id=<optimized out>, req=<optimized out>) at /Data/Compile/Sources/hdf5-1.12.0/src/H5VLnative_dataset.c:634
#13 0x00007ffff7e2f4cf in H5VL__dataset_close (obj=<optimized out>, dxpl_id=dxpl_id@entry=792633534417207304, req=req@entry=0x0, cls=<optimized out>)
at /Data/Compile/Sources/hdf5-1.12.0/src/H5VLcallback.c:2595
#14 0x00007ffff7e35a8c in H5VL_dataset_close (vol_obj=vol_obj@entry=0x496770, dxpl_id=792633534417207304, req=req@entry=0x0) at /Data/Compile/Sources/hdf5-1.12.0/src/H5VLcallback.c:2633
#15 0x00007ffff7c35e69 in H5D__close_cb (dset_vol_obj=0x496770) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dint.c:352
#16 H5D__close_cb (dset_vol_obj=0x496770) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dint.c:342
#17 0x00007ffff7ce2b77 in H5I_dec_ref (id=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5I.c:1376
#18 H5I_dec_ref (id=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5I.c:1341
#19 0x00007ffff7ce2c47 in H5I_dec_app_ref (id=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5I.c:1421
#20 0x00007ffff7ce2e79 in H5I_dec_app_ref_always_close (id=id@entry=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5I.c:1465
#21 0x00007ffff7c17aee in H5Dclose (dset_id=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5D.c:337
#22 0x000000000042c6c4 in main (argc=<optimized out>, argv=0x100000000000006) at main.cpp:586
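For reference, a reproducer along these lines (a sketch, assuming the expand_filter callback above is registered under a hypothetical test filter id of 256) is enough to trigger the behaviour on my side:

#include <hdf5.h>

#define EXPAND_FILTER_ID 256   /* made-up id in the range reserved for testing */

/* expand_filter() is the sketch shown earlier in this post */

int main(void)
{
    const H5Z_class2_t cls = {
        H5Z_CLASS_T_VERS, EXPAND_FILTER_ID, 1, 1, "expand",
        NULL, NULL, expand_filter
    };
    H5Zregister(&cls);

    /* Tiny chunked dataset with the expanding filter attached */
    hsize_t dims[1] = {16}, chunk[1] = {16};
    hid_t file  = H5Fcreate("expand.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_filter(dcpl, EXPAND_FILTER_ID, H5Z_FLAG_MANDATORY, 0, NULL);

    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    int data[16] = {0};
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);   /* the crash above happens around this point */
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}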
The ability to write more data than originally declared for a given dataset is key for HDF5-UDF. Because it takes user-provided source code, compiles it, and stores the resulting bytecode through H5Dwrite(), it behaves like a very inefficient compressor when the dataset dimensions are small.
I remember seeing assumptions in the HDF5 source code that an I/O filter would not grow the data beyond roughly 2x the original size. Still, this looks like a bug to me: the library should definitely not crash, nor truncate data, as that leads to data loss.
Could anyone shed some light here? I have a few spare cycles to work on this topic right now, so I can test and debug things.