HDF5 crashes with inefficient compressors

lucasvr · November 17, 2020, 6:11pm

Greetings folks. I noticed that HDF5 crashes when I/O filters produce more data than the original dataset size.

When a dataset is created, its declared dimensions + data type are naturally honored when it comes the time to write the data with H5Dwrite. The I/O filter interface, however, allows a compressor to either return a number that’s smaller than that (in which case it successfully compressed the data) or slightly larger (in which case the compressor didn’t do a good job).

Now, let’s say we have a really bad compressor which requires 100x more room than necessary. What I observe is that HDF5 seems to truncate the data, so it’s not possible to retrieve it afterwards. In some cases, HDF5 even crashes when the dataset handle is closed:

#1  0x00007ffff72e201b in __GI_abort () at abort.c:79
#2  0x00007ffff733ab98 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff742c2a0 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff7341f0a in malloc_printerr (str=str@entry=0x7ffff742fb50 "munmap_chunk(): invalid pointer") at malloc.c:5332
#4  0x00007ffff73421bc in munmap_chunk (p=<optimized out>) at malloc.c:2830
#5  0x00007ffff7cf69ee in H5MM_xfree (mem=<optimized out>) at /Data/Compile/Sources/hdf5-1.12.0/src/H5MM.c:560
#6  0x00007ffff7c1e2b5 in H5D__chunk_mem_xfree (chk=<optimized out>, _pline=<optimized out>) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:1435
#7  0x00007ffff7c22d44 in H5D__chunk_mem_xfree (_pline=<optimized out>, chk=<optimized out>) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:1433
#8  H5D__chunk_flush_entry (dset=dset@entry=0x4990b0, ent=ent@entry=0x4ac980, reset=reset@entry=true) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:3495
#9  0x00007ffff7c23190 in H5D__chunk_cache_evict (dset=dset@entry=0x4990b0, ent=0x4ac980, flush=flush@entry=true) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:3550
#10 0x00007ffff7c232f7 in H5D__chunk_dest (dset=0x4990b0) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dchunk.c:2918
#11 0x00007ffff7c3a985 in H5D_close (dataset=0x4990b0) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dint.c:2000
#12 0x00007ffff7e468e9 in H5VL__native_dataset_close (dset=<optimized out>, dxpl_id=<optimized out>, req=<optimized out>) at /Data/Compile/Sources/hdf5-1.12.0/src/H5VLnative_dataset.c:634
#13 0x00007ffff7e2f4cf in H5VL__dataset_close (obj=<optimized out>, dxpl_id=dxpl_id@entry=792633534417207304, req=req@entry=0x0, cls=<optimized out>)
    at /Data/Compile/Sources/hdf5-1.12.0/src/H5VLcallback.c:2595
#14 0x00007ffff7e35a8c in H5VL_dataset_close (vol_obj=vol_obj@entry=0x496770, dxpl_id=792633534417207304, req=req@entry=0x0) at /Data/Compile/Sources/hdf5-1.12.0/src/H5VLcallback.c:2633
#15 0x00007ffff7c35e69 in H5D__close_cb (dset_vol_obj=0x496770) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dint.c:352
#16 H5D__close_cb (dset_vol_obj=0x496770) at /Data/Compile/Sources/hdf5-1.12.0/src/H5Dint.c:342
#17 0x00007ffff7ce2b77 in H5I_dec_ref (id=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5I.c:1376
#18 H5I_dec_ref (id=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5I.c:1341
#19 0x00007ffff7ce2c47 in H5I_dec_app_ref (id=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5I.c:1421
#20 0x00007ffff7ce2e79 in H5I_dec_app_ref_always_close (id=id@entry=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5I.c:1465
#21 0x00007ffff7c17aee in H5Dclose (dset_id=360287970189639682) at /Data/Compile/Sources/hdf5-1.12.0/src/H5D.c:337
#22 0x000000000042c6c4 in main (argc=<optimized out>, argv=0x100000000000006) at main.cpp:586

The ability to write more data than originally expected for a given dataset is key for HDF5-UDF. Because it takes a user-provided source code, compiles it and stores the bytecode on H5Dwrite(), it compares to an inefficient compressor when the dataset dimensions are very small.

I remember having seen some assumptions in the HDF5 source code that an I/O filter would not grow larger than ~2x the dataset dimensions (or so). Still, this looks like a bug to me: the code should definitely not crash, nor to truncate data, as that will lead to data loss.

Could anyone shed some light here? I’ve a few spare cycles to work on this topic right now, so I’m able to test-debug things.

derobins · November 18, 2020, 12:22am

What version of HDF5 are you using? Are you building in debug or production? Is this Windows?

lucasvr · November 18, 2020, 2:04am

This is HDF5 1.12.0 built in release mode with debug info (CMAKE_BUILD_TYPE=RelWithDebInfo). This is a Linux-based system.

lucasvr · November 18, 2020, 3:01am

Here’s an example that reproduces the problem.

I/O filter code

// build with 'g++ liberror.cpp -C -o libtestcrash.so -shared -fPIC -Wall -g -ggdb'
#include <hdf5.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>

extern "C" {

size_t callback(unsigned int flags, size_t cd_nelmts, const unsigned int *cd_values, size_t nbytes, size_t *buf_size, void **buf)
{
    if (flags & H5Z_FLAG_REVERSE) {
        return *buf_size;
    } else {
        char *newbuf = (char *) calloc(1000*1000, sizeof(char));
        free(*buf);
        *buf = newbuf;
        *buf_size = 1000*1000;
        return *buf_size;
    }
}

const H5Z_class2_t H5Z_UDF_FILTER[1] = {{
    H5Z_CLASS_T_VERS, 0x2112, 1, 1, "crash_filter", NULL, NULL, callback,
}};

H5PL_type_t H5PLget_plugin_type(void) { return H5PL_TYPE_FILTER; }
const void *H5PLget_plugin_info(void) { return H5Z_UDF_FILTER; }
}

application code

// build with 'g++ mainerror.cpp -o mainerror -g -ggdb -Wall -lhdf5'
// run with 'HDF5_PLUGIN_PATH=$PWD ./mainerror file.h5'
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <hdf5.h>

#define CHECK(hid) if ((hid) < 0) { fprintf(stderr, "failed @line %d\n", __LINE__); exit(1); }

int main(int argc, char **argv)
{
    if (argc != 2) {
        printf("Syntax: %s <file.h5>\n", argv[0]);
        exit(1);
    }
    hsize_t dims[2] = {10, 10};
    hid_t file_id = H5Fopen(argv[1], H5F_ACC_RDWR, H5P_DEFAULT);
    CHECK(file_id);
    hid_t space_id = H5Screate_simple(2, dims, NULL);
    CHECK(space_id);
    hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    CHECK(dcpl_id);
    CHECK(H5Pset_filter(dcpl_id, 0x2112, H5Z_FLAG_MANDATORY, 0, NULL));
    CHECK(H5Pset_chunk(dcpl_id, 2, dims));
    hid_t dset_id = H5Dcreate(file_id, "crash_dataset", H5T_STD_I8LE, space_id, H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
    CHECK(dset_id);
    char *data = (char *) calloc(dims[0] * dims[1], sizeof(char));
    CHECK(H5Dwrite(dset_id, H5T_STD_I8LE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data));
    CHECK(H5Dclose(dset_id));
    CHECK(H5Pclose(dcpl_id));
    CHECK(H5Sclose(space_id));
    CHECK(H5Fclose(file_id));
    free(data);
    return 0;
}

If you change the plugin code so that it allocates 10x10, or even 100x100, the problem won’t kick in.

sludtke · November 18, 2020, 12:53pm

I actually just finished dealing with this issue as well. In my experience, the only time it’s safe to overwrite an existing data set in an HDF file is when it’s uncompressed and has the same data type as the original. Changing the dimensions may be ok, but anything else produced unreliable results. Of course, since there is no mechanism for deleting data sets, only unlinking them, there really is no good workaround for doing this within an existing file. ie- if you unlink and recreate the dataset then the original data remains orphaned in the file, defeating the purpose of compression. In the end the only viable solution I found was to copy everything to a new HDF file. Indeed, you can find this recommendation in the HDF manual. Pretty frustrating to have what is effectively a filesystem in a file, but with no ‘delete’ function, but it is what it is.

lucasvr · November 20, 2020, 3:21pm

@derobins, as a matter of fact, would it be better to report such issues on GitHub so that they don’t get lost on the Forum archives? I’m happy to move this one to GitHub if you agree.

epourmal · November 20, 2020, 4:03pm

All,

I am not sure what the actual problem is and I just entered a bug report HDFFV-11179 to investigate the failure above.

This filter function just allocates a bigger buffer. If there is no memory mismatch as Dana mentioned, it should work. HDF5 should write a chunk into newly allocated place in the file leaving a hole left by the previous smaller chunk. To my knowledge, HDF5 should handle the case of bigger chunks if the filter function doesn’t fail when new “compressed” buffer is bigger than original (unless I am missing something )

Thank you for reporting and please go ahead and create an issue in GitHub. We may get help from the community in debugging it.

Elena

derobins · November 20, 2020, 6:16pm

Yes, I think a github issue is the way to go as it’s public and easier to track. Keep in mind that we just switched to github and still do most of our issue tracking work in JIRA (which isn’t really public yet), so you’ll have to be patient with us while we transition.