Cannot write more than 512 MB in 1D


#1

Hi,

we are experiencing problems writing “large” 1D datasets on 64-bit Linux systems: writes fail as soon as a single MPI rank’s local contribution to a 1D dataspace exceeds 512 MByte, i.e. 134217728 elements of a 4-byte int/float.

To reproduce, please see the following one-MPI-rank example, run with HDF5 1.10.4 and OpenMPI 3.1.3 on 64-bit Debian Linux 9.5 with 16 GByte of RAM.

#include <mpi.h>
#include <hdf5.h>

#include <stdlib.h>
#include <string.h>
#include <stdio.h>


int write_HDF5(
    MPI_Comm const comm, MPI_Info const info,
    int* data, size_t len)
{
    // property list
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);

    // MPI-I/O driver
    H5Pset_fapl_mpio(plist_id, comm, info); 

    // file create
    char file_name[100];
    snprintf(file_name, sizeof(file_name), "%zu.h5", len);
    hid_t file_id = H5Fcreate(file_name, H5F_ACC_TRUNC,  
                              H5P_DEFAULT, plist_id); 

    // dataspace
    hsize_t dims[1] = {len};
    hsize_t max_dims[1] = {len};
    // hsize_t* max_dims = NULL;
    hid_t filespace = H5Screate_simple(1,
        dims,
        max_dims);
    
    // chunking
    hid_t datasetCreationProperty = H5Pcreate(H5P_DATASET_CREATE);

    // dataset
    hid_t dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_INT,  
                              filespace, H5P_DEFAULT,
                              datasetCreationProperty, H5P_DEFAULT);
                        
    // write
    hid_t dset_plist_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_COLLECTIVE); 
    // H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_INDEPENDENT); // default

    herr_t status;
    status = H5Dwrite(dset_id, H5T_NATIVE_INT, 
                      H5S_ALL, filespace, dset_plist_id, data); 

    // close all
    status = H5Pclose(plist_id);
    status = H5Pclose(dset_plist_id);
    status = H5Sclose(filespace);  // was missing: release the dataspace
    status = H5Dclose(dset_id);
    status = H5Fclose(file_id);
    (void)status;  // silence unused-variable warning

    return 0;
}

int main(int argc, char* argv[])
{

    MPI_Comm comm = MPI_COMM_WORLD; 
    MPI_Info info = MPI_INFO_NULL;  

    MPI_Init(&argc, &argv);
    
    size_t lengths[3] = {134217727u, 134217728u, 134217729u};
    for( size_t i = 0; i < 3; ++i )
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);
        int* data = malloc(len * sizeof(int));
        if( !data )
            MPI_Abort(comm, 1);
        for( size_t k=0; k<len; ++k)
            data[k] = 420;
    
        write_HDF5(comm, info, data, len);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }
    
    MPI_Finalize();

    return 0;
}
$ h5pcc phdf5.c && ./a.out
Writing for len=134217727 ...
Finished write for len=134217727 ...
Writing for len=134217728 ...
Finished write for len=134217728 ...
Writing for len=134217729 ...
HDF5-DIAG: Error detected in HDF5 (1.10.4) MPI-process 0:
  #000: H5Dio.c line 336 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 828 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dmpio.c line 671 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
    major: Low-level I/O
    minor: Write failed
  #003: H5Dmpio.c line 2013 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #004: H5Dmpio.c line 2057 in H5D__final_collective_io(): optimized write failed
    major: Dataset
    minor: Write failed
  #005: H5Dmpio.c line 426 in H5D__mpio_select_write(): can't finish collective parallel write
    major: Low-level I/O
    minor: Write failed
  #006: H5Fio.c line 165 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #007: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #008: H5Faccum.c line 826 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #009: H5FDint.c line 258 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #010: H5FDmpio.c line 1844 in H5FD_mpio_write(): file write failed
    major: Low-level I/O
    minor: Write failed
Finished write for len=134217729 ...

$ du -hs 13421772*                                              
513M	134217727.h5
513M	134217728.h5
4,0K	134217729.h5

Are we missing anything that needs to be set in order to write more than 512 MByte from a single rank into one dataset?


#2

We just found that it is even more confusing. With the lengths

size_t lengths[5] = {134217727u, 134217728u, 134217729u, 134217736u, 134217737u};

only the sizes 134217729 through 134217736 fail to write; both smaller and larger lengths work.


#3

If I (incorrectly) change the data type in H5Dwrite to some other 4-byte type, e.g. H5T_IEEE_F32LE, it does not crash. It does fail with the above error when using H5T_STD_I32LE (which has the same size as my native int on this system).

If I set both H5Dcreate and H5Dwrite to H5T_IEEE_F32LE, it also fails with the above error.


#4

As we debugged further, the error appears at the H5S_MAX_MPI_COUNT boundary, where H5S_mpio_all_type in H5Smpio.c switches from a plain native MPI-IO type to a struct type combining the main data with the leftover bytes. Note that for 4-byte ints this boundary does not fall on an element boundary, i.e. the split cuts through the middle of an int.

The error seems to be thrown when writing with that struct type.


#5

Since I can’t report this in JIRA myself: the reporters are Axel Huebl and René Widera, both affiliated with HZDR. Just to give proper credit, since I filed this under my own name here.


#6

Hi Axel (and René),

I entered bug HDFFV-10638 for this issue.

Thank you!
-Barbara


#7

Hi Barbara,

I am following the Jira issue but have not seen any progress at https://jira.hdfgroup.org/projects/HDFFV/issues/HDFFV-10638

This bug potentially affects all parallel HDF5 writes, leading to failures (up to segmentation faults) for any parallel user who hits the magic size ranges. Do you have an estimate of when this might be fixed? Thanks a lot for your help!


#8

The problem is likely OpenMPI-specific. I can confirm it happens with GCC 6.3 and OpenMPI 3.1.3, but not with MPICH 3.3.


#9

This issue is an OpenMPI issue affecting all known releases up to v4.0.0: