This problem is a little hard to describe, but I will do my best.
I am working on a Record & Replay tool for HDF5 and have run into a very
strange problem. I am trying to record and replay an H5Part
application, and the relevant part of its trace looks like this:
H5Dcreate2(33554433,x,50331690,67108868,0,0,0) = 83886080 <0.00013>
H5Dwrite(83886080,50331690,67108867,67108869,167772178,33554432) = 0 <0.02262>
H5Dclose(83886080) = 0 <0.00003>
H5Dcreate2(33554433,y,50331690,67108868,0,0,0) = 83886081 <0.00013>
H5Dwrite(83886081,50331690,67108867,67108869,167772178,33554432) = 0 <0.02120>
However, when I replay this part by issuing exactly the same calls as shown
above, dataset "x" is written successfully, but the second call to H5Dwrite
(the one for dataset "y") fails with this error:
Going to call: H5Dwrite(83886081, 50331690, 67108867, 67108869, 167772178, 33554432);
HDF5-DIAG: Error detected in HDF5 (1.9.130) MPI-process 0:
  #000: H5Dio.c line 266 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 674 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dmpio.c line 544 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
    major: Low-level I/O
    minor: Write failed
  #003: H5Dmpio.c line 1523 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #004: H5Dmpio.c line 1567 in H5D__final_collective_io(): optimized write failed
    major: Dataset
    minor: Write failed
  #005: H5Dmpio.c line 312 in H5D__mpio_select_write(): can't finish collective parallel write
    major: Low-level I/O
    minor: Write failed
  #006: H5Fio.c line 158 in H5F_block_write(): write through metadata accumulator failed
    major: Low-level I/O
    minor: Write failed
  #007: H5Faccum.c line 816 in H5F_accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #008: H5FDint.c line 185 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #009: H5FDmpio.c line 1844 in H5FD_mpio_write(): MPI_File_write_at_all failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #010: H5FDmpio.c line 1844 in H5FD_mpio_write(): Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(50): Other I/O error Invalid argument
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
To find out whether the problem is in my replayer, I wrote plain HDF5 code
by hand that does the same thing, and I had no problem replaying it. So I'm
guessing it's something in H5Part that I am missing?
Here is the HDF5 code that I wrote, which records and replays without any
problem:
/* DataSpace creation */
hsize_t cur_dim[] = {8388608};
hsize_t dmax = H5S_UNLIMITED;
hid_t mem_space_id = H5Screate_simple(1, cur_dim, &dmax);
hid_t second_simple_ds = H5Screate_simple(1, cur_dim, NULL);
hid_t file_space_id = H5Screate_simple(1, cur_dim, NULL);

/* Hyperslab selection (H5S_SELECT_SET has the value 0 used originally) */
hsize_t start[] = {0};
hsize_t stride[] = {1};
hsize_t count[] = {8388608};
//hsize_t block[] = {0};
H5Sselect_hyperslab(file_space_id, H5S_SELECT_SET, start, stride, count, NULL);
printf("Before writing particles\n");

/* DataSet create */
hid_t x_dataset = H5Dcreate(step0_grp_id, "x", H5T_NATIVE_FLOAT,
                            second_simple_ds, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

/* Write the data */
size_t npoints = H5Sget_select_npoints(mem_space_id);
size_t size_of_data_type = H5Tget_size(H5T_NATIVE_FLOAT);
size_t total_size_written = npoints * size_of_data_type;
float* dummy_data = (float*) malloc(total_size_written);
size_t i;
for(i = 0; i < npoints; i++)
    dummy_data[i] = 99.99f;
H5Dwrite(x_dataset, H5T_NATIVE_FLOAT, mem_space_id, file_space_id,
         mpio_prop, dummy_data);
printf("Written variable 1\n");
free(dummy_data);
H5Dclose(x_dataset);

/* DataSet create */
hid_t y_dataset = H5Dcreate(step0_grp_id, "y", H5T_NATIVE_FLOAT,
                            second_simple_ds, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

/* Write the data */
npoints = H5Sget_select_npoints(mem_space_id);
size_of_data_type = H5Tget_size(H5T_NATIVE_FLOAT);
total_size_written = npoints * size_of_data_type;
dummy_data = (float*) malloc(total_size_written);
for(i = 0; i < npoints; i++)
    dummy_data[i] = 89.99f;
H5Dwrite(y_dataset, H5T_NATIVE_FLOAT, mem_space_id, file_space_id,
         mpio_prop, dummy_data);
printf("Written variable 2\n");
free(dummy_data);
···
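
In case it helps anyone reading the trace at the top: the numeric identifiers there look like they pack the object type into the upper bits and a running index into the lower 24 bits (for example, 83886080 is 5 << 24). The sketch below decodes them under that assumption. The bit layout is an internal detail inferred from the trace values, not a documented interface, and the helper names (trace_id_type, trace_id_index, trace_id_type_name) are made up for this post. The type numbers 1 to 5 are assumed to follow the H5I_type_t enum of this HDF5 generation. Inside a live program, H5Iget_type() is the supported way to ask what an identifier refers to.

```c
#include <stdint.h>

/* ASSUMPTION: 32-bit identifiers with the object type stored above a
 * 24-bit index field. Inferred from the trace values; internal to HDF5
 * and possibly different in other releases. */
enum { TRACE_ID_INDEX_BITS = 24, TRACE_ID_TYPE_MASK = 0x7F };

/* Extract the (assumed) object-type field from a traced identifier. */
static int trace_id_type(int32_t id)
{
    return (int)(((uint32_t)id >> TRACE_ID_INDEX_BITS) & TRACE_ID_TYPE_MASK);
}

/* Extract the (assumed) per-type running index from a traced identifier. */
static int32_t trace_id_index(int32_t id)
{
    return (int32_t)((uint32_t)id & ((1u << TRACE_ID_INDEX_BITS) - 1u));
}

/* Map the type field to a readable name. Values 1..5 are assumed to match
 * H5I_FILE .. H5I_DATASET; everything else is reported as "other". */
static const char *trace_id_type_name(int type)
{
    switch (type) {
    case 1:  return "file";
    case 2:  return "group";
    case 3:  return "datatype";
    case 4:  return "dataspace";
    case 5:  return "dataset";
    default: return "other";
    }
}
```

With these helpers, trace_id_type_name(trace_id_type(83886081)) gives "dataset" with index 1, and 33554433 decodes as a group with index 1, which matches how those values are used in the H5Dcreate2 and H5Dwrite lines of the trace.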