H5Part vs. HDF: couldn't finish shared collective MPI-IO

This problem may be a little bit hard to describe, but I will do my best.

I am working on a Record&Replay tool for HDF5 and have run into a very
strange problem. I am trying to record and replay an H5Part application,
and its trace looks like this:

H5Dcreate2(33554433,x,50331690,67108868,0,0,0) = 83886080 <0.00013>
H5Dwrite(83886080,50331690,67108867,67108869,167772178,33554432) = 0 <0.02262>
H5Dclose(83886080) = 0 <0.00003>
H5Dcreate2(33554433,y,50331690,67108868,0,0,0) = 83886081 <0.00013>
H5Dwrite(83886081,50331690,67108867,67108869,167772178,33554432) = 0 <0.02120>
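
As a reading aid for the trace: the large integers are hid_t handles. Assuming the HDF5 1.8-era internal layout (a 32-bit hid_t with the identifier type stored in the 7 bits above a 24-bit index), they can be decoded with a small helper like the sketch below. That layout is an implementation detail, not a public API, and may differ in other releases. The names trace_id_type, trace_id_index and trace_type_name are hypothetical.

```c
#include <stdio.h>

/* ASSUMPTION: HDF5 1.8-era internal hid_t layout (not a public API):
 * the low 24 bits hold an index, the next 7 bits hold the identifier type. */
#define TRACE_ID_BITS   24
#define TRACE_TYPE_MASK 0x7FL
#define TRACE_ID_MASK   0xFFFFFFL

/* Extract the identifier type number from a recorded handle value. */
static int trace_id_type(long id)
{
    return (int)((id >> TRACE_ID_BITS) & TRACE_TYPE_MASK);
}

/* Extract the per-type index from a recorded handle value. */
static long trace_id_index(long id)
{
    return id & TRACE_ID_MASK;
}

/* Map a type number to a readable name, following the order of the
 * H5I_type_t enum in HDF5 1.8 (assumed unchanged in the traced build). */
static const char *trace_type_name(int type)
{
    switch (type) {
    case 1:  return "file";
    case 2:  return "group";
    case 3:  return "datatype";
    case 4:  return "dataspace";
    case 5:  return "dataset";
    case 6:  return "attribute";
    case 10: return "property list";
    default: return "unknown";
    }
}

/* Print one decoded handle, e.g. for annotating a trace line. */
static void trace_describe(long id)
{
    printf("%ld -> %s #%ld\n", id,
           trace_type_name(trace_id_type(id)), trace_id_index(id));
}
```

Under that assumption, 33554433 decodes to a group, 50331690 to a datatype, 67108867 and 67108869 to dataspaces, 83886080 and 83886081 to datasets, and 167772178 to a property list, which is consistent with the argument order of H5Dcreate2 and H5Dwrite.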

However, when I replay this part by issuing exactly the same calls as shown
above, dataset "x" is written successfully, but the second call to H5Dwrite
fails with this error:
  Going to call: H5Dwrite(83886081, 50331690, 67108867, 67108869, 167772178, 33554432);
HDF5-DIAG: Error detected in HDF5 (1.9.130) MPI-process 0:
  #000: H5Dio.c line 266 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 674 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dmpio.c line 544 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
    major: Low-level I/O
    minor: Write failed
  #003: H5Dmpio.c line 1523 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #004: H5Dmpio.c line 1567 in H5D__final_collective_io(): optimized write failed
    major: Dataset
    minor: Write failed
  #005: H5Dmpio.c line 312 in H5D__mpio_select_write(): can't finish collective parallel write
    major: Low-level I/O
    minor: Write failed
  #006: H5Fio.c line 158 in H5F_block_write(): write through metadata accumulator failed
    major: Low-level I/O
    minor: Write failed
  #007: H5Faccum.c line 816 in H5F_accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #008: H5FDint.c line 185 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #009: H5FDmpio.c line 1844 in H5FD_mpio_write(): MPI_File_write_at_all failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #010: H5FDmpio.c line 1844 in H5FD_mpio_write(): Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(50): Other I/O error Invalid argument
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
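
For context on the replay side: handle values returned by the library at replay time are not guaranteed to equal the recorded ones, so a replayer generally has to translate each recorded handle to the live handle produced by the corresponding replayed call. The following is only a minimal sketch of such a translation table, not the tool's actual implementation and not a diagnosis of the failure above. All names (replay_map, replay_map_bind, replay_map_lookup, replay_map_unbind) are hypothetical.

```c
#include <stddef.h>

#define REPLAY_MAX_HANDLES 1024

/* One recorded-handle -> live-handle association. */
typedef struct {
    long recorded;
    long live;
} replay_pair;

/* Fixed-capacity table; a real tool would likely use a hash map. */
typedef struct {
    replay_pair pairs[REPLAY_MAX_HANDLES];
    size_t used;
} replay_map;

static void replay_map_init(replay_map *m)
{
    m->used = 0;
}

/* Bind a recorded handle to the live handle returned during replay.
 * Rebinding an existing recorded value overwrites it.
 * Returns 0 on success, -1 if the table is full. */
static int replay_map_bind(replay_map *m, long recorded, long live)
{
    for (size_t i = 0; i < m->used; i++) {
        if (m->pairs[i].recorded == recorded) {
            m->pairs[i].live = live;
            return 0;
        }
    }
    if (m->used == REPLAY_MAX_HANDLES)
        return -1;
    m->pairs[m->used].recorded = recorded;
    m->pairs[m->used].live = live;
    m->used++;
    return 0;
}

/* Translate a recorded handle; returns -1 if it was never bound. */
static long replay_map_lookup(const replay_map *m, long recorded)
{
    for (size_t i = 0; i < m->used; i++) {
        if (m->pairs[i].recorded == recorded)
            return m->pairs[i].live;
    }
    return -1;
}

/* Forget a binding once the recorded call closes the handle.
 * Returns 0 if a binding was removed, -1 if none existed. */
static int replay_map_unbind(replay_map *m, long recorded)
{
    for (size_t i = 0; i < m->used; i++) {
        if (m->pairs[i].recorded == recorded) {
            m->pairs[i] = m->pairs[m->used - 1]; /* fill the hole */
            m->used--;
            return 0;
        }
    }
    return -1;
}
```

With a table like this, the value returned by the replayed H5Dcreate2 would be bound to the recorded 83886081, and the replayed H5Dwrite would pass the looked-up live handle rather than the recorded number.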

To see whether the problem is in my replayer, I wrote a plain HDF5 program
by hand that does the same thing, and I had no problem replaying it. So I'm
guessing it's something in H5Part that I am missing?

Here's the HDF5 code that I wrote; I have no problem recording and
replaying it:
    /* Dataspace creation; step0_grp_id and mpio_prop come from code not shown here */
    hsize_t cur_dim[] = {8388608};
    hsize_t dmax = H5S_UNLIMITED;
    hid_t mem_space_id = H5Screate_simple(1, cur_dim, &dmax);
    hid_t second_simple_ds = H5Screate_simple(1, cur_dim, NULL);
    hid_t file_space_id = H5Screate_simple(1, cur_dim, NULL);

    /* Hyperslab selection; a NULL block means one element per block */
    hsize_t start[] = {0};
    hsize_t stride[] = {1};
    hsize_t count[] = {8388608};
    //hsize_t block[] = {0};
    H5Sselect_hyperslab(file_space_id, H5S_SELECT_SET, start, stride, count, NULL);
    printf("Before writing particles\n");

    /* Create dataset "x" */
    hid_t x_dataset = H5Dcreate(step0_grp_id, "x", H5T_NATIVE_FLOAT,
                                second_simple_ds, H5P_DEFAULT, H5P_DEFAULT,
                                H5P_DEFAULT);
    /* Fill a buffer and write it */
    hssize_t npoints = H5Sget_select_npoints(mem_space_id);
    size_t size_of_data_type = H5Tget_size(H5T_NATIVE_FLOAT);
    size_t total_size_written = (size_t)npoints * size_of_data_type;
    float* dummy_data = (float*) malloc(total_size_written);
    hssize_t i;
    for(i = 0; i < npoints; i++)
        dummy_data[i] = 99.99f;
    H5Dwrite(x_dataset, H5T_NATIVE_FLOAT, mem_space_id, file_space_id,
             mpio_prop, dummy_data);
    printf("Written variable 1\n");
    free(dummy_data);
    H5Dclose(x_dataset);

    /* Create dataset "y" */
    hid_t y_dataset = H5Dcreate(step0_grp_id, "y", H5T_NATIVE_FLOAT,
                                second_simple_ds, H5P_DEFAULT, H5P_DEFAULT,
                                H5P_DEFAULT);
    /* Fill a buffer and write it */
    npoints = H5Sget_select_npoints(mem_space_id);
    size_of_data_type = H5Tget_size(H5T_NATIVE_FLOAT);
    total_size_written = (size_t)npoints * size_of_data_type;
    dummy_data = (float*) malloc(total_size_written);
    for(i = 0; i < npoints; i++)
        dummy_data[i] = 89.99f;
    H5Dwrite(y_dataset, H5T_NATIVE_FLOAT, mem_space_id, file_space_id,
             mpio_prop, dummy_data);
    printf("Written variable 2\n");
    free(dummy_data);

···
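
A note on the buffer size computed in the listing: total_size_written is npoints times the element size, and H5Sget_select_npoints returns a signed hssize_t that is negative on failure. The sketch below shows one way to do that multiplication with an overflow check. The helper name buffer_bytes is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute count * elem_size into *out.
 * Returns 0 on success, -1 if the product would overflow size_t. */
static int buffer_bytes(size_t count, size_t elem_size, size_t *out)
{
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return -1;
    *out = count * elem_size;
    return 0;
}
```

For the 8388608 elements used here and a 4-byte float, the result is 33554432 bytes (32 MiB). That number happens to equal the last argument printed in the recorded H5Dwrite lines; whether the tracer prints the buffer size in that position is an assumption, not something the trace itself states.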
