HDF5 Bug? H5FD_MPIO_COLLECTIVE + Chunking

We see a strange bug with HDF5 in our simulation code PIConGPU, see: HDF5: Field "Kinks" with FreeFormula · Issue #2841 · ComputationalRadiationPhysics/picongpu · GitHub

Chunking + H5FD_MPIO_COLLECTIVE with 16 MPI ranks writes wrong data (black artifacts in the image).

compile & run:

  • broken: mpicc -g main.cpp -lhdf5 -L$HDF5_ROOT/lib && mpiexec -n 16 ./a.out
  • fix: mpicc -g main.cpp -lhdf5 -L$HDF5_ROOT/lib -DFIX && mpiexec -n 16 ./a.out
#include <mpi.h>
#include <hdf5.h>

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 1872llu
#define Y 1872llu

int write_HDF5(
    MPI_Comm const comm, MPI_Info const info,
    float* data, size_t len, int rank)
{
    // property list
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);

    // MPI-I/O driver
    H5Pset_fapl_mpio(plist_id, comm, info); 

    // file create
    char file_name[100];
    sprintf(file_name, "%zu", len);
    strcat(file_name, ".h5");
    hid_t file_id = H5Fcreate(file_name, H5F_ACC_TRUNC,  
                              H5P_DEFAULT, plist_id); 

    // dataspace
    hsize_t dims[2] = {Y, X};
    hsize_t globalDims[2] = {Y * 4, X * 4};
    hsize_t max_dims[2] = {Y * 4, X * 4};
    hsize_t offset[2] = {rank/4 * Y, rank%4 * X}; // 4x4 grid of MPI ranks

    hid_t srcSize = H5Screate_simple(2, dims, NULL);
    hid_t filespace = H5Screate_simple(2, globalDims, max_dims);

    printf("%i: %llu,%llu %llu,%llu \n", rank,offset[0],offset[1],globalDims[0],globalDims[1]);
    
    // chunking
    hsize_t chunk[2] = {128, 128};
    hid_t datasetCreationProperty = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(datasetCreationProperty, 2, chunk);

    // dataset
    hid_t dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_FLOAT,  
                              filespace, H5P_DEFAULT,
                              datasetCreationProperty, H5P_DEFAULT);
                        
    // write
    hid_t dset_plist_id = H5Pcreate(H5P_DATASET_XFER);

#ifdef FIX
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_INDEPENDENT); // default
#else
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_COLLECTIVE); 
#endif

    hid_t dd = H5Dget_space(dset_id);
    H5Sselect_hyperslab(dd, H5S_SELECT_SET, offset,
                        NULL, dims, NULL);

    herr_t status;
    status = H5Dwrite(dset_id, H5T_NATIVE_FLOAT, 
                      srcSize, dd, dset_plist_id, data); 

    // close all
    status = H5Sclose(srcSize);
    status = H5Sclose(filespace);
    status = H5Sclose(dd);
    status = H5Pclose(datasetCreationProperty);
    status = H5Pclose(plist_id);
    status = H5Pclose(dset_plist_id);
    status = H5Dclose(dset_id);
    status = H5Fclose(file_id);

    return 0;
}

int main(int argc, char* argv[])
{

    MPI_Comm comm = MPI_COMM_WORLD; 
    MPI_Info info = MPI_INFO_NULL;  

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
    size_t lengths[1] = {X*Y};
    for( size_t i = 0; i < 1; ++i )
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);
        float* data = (float*)malloc(len * sizeof(float));
        for( size_t y = 0; y < Y; ++y)
            for( size_t x = 0; x < X; ++x)
                data[y * X + x] = 100.f + y % 1024; // row-major; one horizontal color line per row
    
        write_HDF5(comm, info, data, len, rank);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }
    
    MPI_Finalize();

    return 0;
}
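
For a quick check without a viewer, a small serial read-back program can count mismatching elements. This is only a sketch built on top of the reproducer above: it assumes the file name 3504384.h5 (X*Y with the sizes used here) and the dataset name "dataset1".

#include <hdf5.h>
#include <stdio.h>

#define X 1872llu
#define Y 1872llu

int main(void)
{
    // open the file written by the parallel reproducer (serial access)
    hid_t file_id = H5Fopen("3504384.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset_id = H5Dopen(file_id, "dataset1", H5P_DEFAULT);
    hid_t filespace = H5Dget_space(dset_id);

    // read one global row at a time and compare against the written pattern
    static float row[X * 4];
    hsize_t count[2] = {1, X * 4};
    hid_t memspace = H5Screate_simple(2, count, NULL);

    size_t errors = 0;
    for( hsize_t gy = 0; gy < Y * 4; ++gy )
    {
        hsize_t offset[2] = {gy, 0};
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
        H5Dread(dset_id, H5T_NATIVE_FLOAT, memspace, filespace, H5P_DEFAULT, row);

        float expected = 100.f + (gy % Y) % 1024; // value each rank wrote for this global row
        for( hsize_t gx = 0; gx < X * 4; ++gx )
            if( row[gx] != expected )
                ++errors;
    }
    printf("mismatching elements: %zu\n", errors);

    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset_id);
    H5Fclose(file_id);
    return errors != 0;
}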

Software:

  • gcc (GCC) 5.3.0
  • hdf5-parallel 1.8.2 and 1.10.4
  • openmpi/2.1.2 compiled with CUDA8
  • CUDA 8
  • Ubuntu 14.04.1

Update: I updated the example code to avoid a global domain size that is not a multiple of 4, which resulted in a not fully initialized input array for each MPI rank.

I changed the code a little bit to produce horizontal color lines. As you can see, one column of chunks is somehow shifted by one element.

I also tested the example above with HDF5 1.10.4. The wrong output still shows up. :frowning:

Hi @r.widera,

would it be possible to try building the latest HDF5 develop branch and see if you run into the same issue? I grabbed your program and tried it with that + MPICH 3.2, and the screenshot below shows what I’m getting at the same data region as in your screenshot. Just to be sure, you can also see from the info next to the open window that HDFView sees the dataset as chunked.

Interestingly enough, I also tested the same program with HDF5 1.10.4 and seemed to get the same (presumably correct?) results.
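
As a side note, the chunked layout can also be confirmed programmatically; here is a minimal sketch (assuming an open dset_id as in the reproducer) that queries the dataset creation property list:

#include <hdf5.h>
#include <stdio.h>

// print the chunk shape of an already opened 2D dataset
static void print_layout(hid_t dset_id)
{
    hid_t dcpl = H5Dget_create_plist(dset_id);
    if( H5Pget_layout(dcpl) == H5D_CHUNKED )
    {
        hsize_t chunk_dims[2];
        H5Pget_chunk(dcpl, 2, chunk_dims);
        printf("chunked layout, chunk = %llu x %llu\n",
               (unsigned long long)chunk_dims[0],
               (unsigned long long)chunk_dims[1]);
    }
    H5Pclose(dcpl);
}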

@jhenderson Where can I find the HDF5 development branch? I checked GitHub and the download section of the HDF5 site but cannot find the repository.

@r.widera The development branch of HDF5 (among others) can be found at https://bitbucket.hdfgroup.org/projects. Specifically, https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5/browse.

I tested the current dev branch (commit 703acba51fac02634e0b194cd28287b09c792385). The error is still visible.
Maybe it only shows up with OpenMPI + HDF5.

Hmm, it may be that this issue needs more testing specifically with OpenMPI. It could very well be that the issue resides within OpenMPI (or a specific version thereof), or it could still be in HDF5 itself; it seems unclear at this point.

I’ll enter an issue to keep track of this problem; please let us know if you happen to discover anything else about what might be going on here.

On a Debian 9.6 “stretch” with GCC 6.3 and HDF5 1.10.4 I can confirm the data corruption occurs with:

  • OpenMPI 3.1.3

I can’t see the bug when using a stack built on:

  • MPICH 3.3

Did you already enter a JIRA issue for this?

Hi @a.huebl,

I had not yet entered a JIRA issue for this. If you believe that you’ve narrowed this down to an OpenMPI issue, then I may hold off on entering one. Considering how new OpenMPI 3.1.3 is, I’m wondering if it may be useful to try and come up with a small reproducer. It may even be interesting to see if the problem still exists with OpenMPI 4.0.0…

This problem likely affects all released OpenMPI versions: we see it on old OpenMPI 2.X variants as well as on the latest 3.X release, and from what I can see 4.0.0 did not change much in the I/O layer.

I also cross-linked this upstream to OpenMPI: https://github.com/open-mpi/ompi/issues/6285

Anyway, you might want to track such issues in essential dependencies nonetheless, so you can warn or error at configure or run time when known-buggy MPI combinations are detected. Otherwise people will not know that they have to update their dependencies.
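
For example, a runtime guard could look roughly like the sketch below; the helper name and the warning text are made up here, only MPI_Get_library_version is the real MPI-3 call used to identify the implementation:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

// hypothetical guard: warn before enabling collective chunked HDF5 writes
// when a known-problematic MPI implementation is detected
static void warn_if_buggy_mpi(int rank)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;
    MPI_Get_library_version(version, &len);

    // the combination observed in this thread: OpenMPI + H5FD_MPIO_COLLECTIVE
    // writes of chunked datasets produce corrupted data
    if( rank == 0 && strstr(version, "Open MPI") != NULL )
        fprintf(stderr,
                "Warning: collective writes of chunked HDF5 datasets are known to "
                "corrupt data with some Open MPI releases; consider "
                "H5FD_MPIO_INDEPENDENT or MPICH.\n  (%s)\n", version);
}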