(I deleted my previous post because I realized my failure analysis was wrong.)
I’ve been tasked by the DKRZ (Deutsches Klimarechenzentrum GmbH, Hamburg) with evaluating VOL-Async. Please consider me a novice HDF user.
I’m in the process of switching to the async VOL by replacing the HDF5 calls with their async counterparts, as described here https://hdf5-vol-async.readthedocs.io/en/latest/gettingstarted.html#explicit-mode and here https://www.hdfgroup.org/wp-content/uploads/2021/10/HUG21-Async-VOL.pdf
hdf5-vol-async was installed with spack and loaded through modules, and the environment variables were set as instructed in the guide. The program was compiled with all necessary flags and executed with mpiexec.
The code is supposed to generate an HDF5 file of exactly 1GB. Prints were used for quick and dirty debugging.
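For reference, this is the size arithmetic I am relying on, as a small sketch. The helper name `payload_bytes` is made up for this post and is not part of HDF5. The dataset is declared as H5T_IEEE_F64LE (8 bytes per element), while the buffer I pass in holds 4-byte floats, so the raw data in the file should come to 1 GiB while the buffer in memory is only 512 MiB. The file on disk will presumably be slightly larger than that because of HDF5 metadata.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: raw bytes occupied by n_elements of a given size.
 * It ignores HDF5 metadata, chunk indexes and any other file overhead. */
uint64_t payload_bytes(uint64_t n_elements, size_t element_size)
{
    return n_elements * (uint64_t)element_size;
}
```

With 134217728 elements, `payload_bytes(134217728, 8)` is 1073741824 bytes for the file datatype and `payload_bytes(134217728, 4)` is 536870912 bytes for the float buffer.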
Something important seems to be missing, though, because the program fails, and it is hard for me to debug since I haven’t compiled the library with the debug lines uncommented. It’s probably something very simple, but I can’t figure it out given my lack of experience with HDF5. From the output of H5ESwait() I can tell that some operations are failing.
I would greatly appreciate any help. Thanks in advance.
void create_hdf5_async(int argc, char **argv, bool with_chunking)
{
    hid_t plist_id, file_id, dataspace_id, dataset_id; /* file identifier */
    herr_t status;
    hsize_t dims[1];
    hsize_t cdims[1];
    herr_t es_id;
    int mpi_thread_required = MPI_THREAD_MULTIPLE;
    int mpi_thread_provided = 0;

    /* Initialize MPI with threading support */
    MPI_Init_thread(&argc, &argv, mpi_thread_required, &mpi_thread_provided);
    es_id = H5EScreate();

    /* Create a new file using default properties. */
    printf("Create file \n");
    file_id = H5Fcreate_async("data/datasets/test_dataset_hdf5-c.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT, es_id);

    // setup dimensions
    int some_size = 134217728;
    printf("size: %d \n", some_size);
    dims[0] = some_size;
    dataspace_id = H5Screate_simple(1, dims, NULL);
    plist_id = H5Pcreate(H5P_DATASET_CREATE);
    if (with_chunking)
    {
        // setup chunking
        cdims[0] = 512;
        status = H5Pset_chunk(plist_id, 1, cdims);
    }

    // create Dataset
    printf("Create dataset \n");
    dataset_id = H5Dcreate_async(file_id, "/X", H5T_IEEE_F64LE, dataspace_id, H5P_DEFAULT, plist_id, H5P_DEFAULT, es_id);

    // fill buffer
    float *dset_data = calloc(some_size, sizeof(float));
    if (!dset_data)
    {
        fprintf(stderr, "Fatal: unable to allocate array\n");
        exit(EXIT_FAILURE);
    }
    int i, j, k;
    printf("Fill with values \n");
    for (i = 0; i < some_size; i++)
    {
        float rand_float = (float)rand() / RAND_MAX;
        // printf("i: %d, j: %d, k: %d, random float: %f \n", i, j, k, rand_float);
        dset_data[i] = (float)rand() / RAND_MAX;
    }

    printf("Write data to dataset \n");
    status = H5Dwrite_async(dataset_id, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data, es_id);
    printf("status: %d \n", status);
    free(dset_data);
    printf("Finish writing data to dataset \n");

    /* Terminate access to the file. */
    size_t num_in_progress;
    hbool_t op_failed;
    printf("Wait for async answer \n");
    H5ESwait(es_id, H5ES_WAIT_FOREVER, &num_in_progress, &op_failed);
    printf("Finish waiting for async, num in progess: %ld, failed: %d \n", num_in_progress, op_failed);

    printf("Close async dataset \n");
    status = H5Dclose_async(dataset_id, es_id);
    printf("Close sync dataspace \n");
    status = H5Sclose(dataspace_id);
    printf("Close async file \n");
    status = H5Fclose_async(file_id, es_id);
    status = H5ESclose(es_id);
}
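One detail I am unsure about: `es_id` is declared as `herr_t`, although H5EScreate() returns an `hid_t`. As far as I can tell from the HDF5 1.14 headers, `hid_t` is a 64-bit integer while `herr_t` is a plain `int`, so the event set handle may be getting truncated. That could be why H5ESwait() seems to leave `num_in_progress` and `op_failed` uninitialized in the output further down, but this is a guess on my part. The sketch below only illustrates the range problem; `fits_in_int` is a hypothetical helper, and the sample values are arbitrary, not real HDF5 handles.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a 64-bit handle value can be stored in an
 * int without losing information. int64_t stands in for hid_t and int for
 * herr_t; both mappings are assumptions about the HDF5 1.14 headers. */
bool fits_in_int(int64_t id)
{
    return id >= INT_MIN && id <= INT_MAX;
}
```

A small value such as 42 fits, but any value with bits set above the low 32 (for example `(int64_t)1 << 56`) does not, so a handle of that kind would no longer identify the same object after being stored in an `int`.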
Program output and stack trace:
Bench hdf5 variable async
Create file
size: 134217728
Create dataset
Fill with values
Write data to dataset
status: 0
Finish writing data to dataset
Wait for async answer
Finish waiting for async, num in progess: 139881184669379, failed: 203
Close async dataset
[leucht:51478] *** Process received signal ***
[leucht:51478] Signal: Segmentation fault (11)
[leucht:51478] Signal code: Address not mapped (1)
[leucht:51478] Failing at address: 0x7f386bfff010
[leucht:51478] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f38a04e0520]
[leucht:51478] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a67cd)[0x7f38a06447cd]
[leucht:51478] [ 2] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/hdf5-1.14.5-5tlcfqadprf4tpamuxv4tvn67bcastlj/lib/libhdf5.so.310(H5D__gather_mem+0x140)[0x7f38a07d4330]
[leucht:51478] [ 3] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/hdf5-1.14.5-5tlcfqadprf4tpamuxv4tvn67bcastlj/lib/libhdf5.so.310(H5D__scatgath_write+0x2ab)[0x7f38a07d502b]
[leucht:51478] [ 4] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/hdf5-1.14.5-5tlcfqadprf4tpamuxv4tvn67bcastlj/lib/libhdf5.so.310(H5D__contig_write+0x2d)[0x7f38a07ae26d]
[leucht:51478] [ 5] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/hdf5-1.14.5-5tlcfqadprf4tpamuxv4tvn67bcastlj/lib/libhdf5.so.310(H5D__write+0xe4d)[0x7f38a07c370d]
[leucht:51478] [ 6] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/hdf5-1.14.5-5tlcfqadprf4tpamuxv4tvn67bcastlj/lib/libhdf5.so.310(H5VL__native_dataset_write+0xba)[0x7f38a0a3302a]
[leucht:51478] [ 7] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/hdf5-1.14.5-5tlcfqadprf4tpamuxv4tvn67bcastlj/lib/libhdf5.so.310(H5VLdataset_write+0x142)[0x7f38a0a1f5c2]
[leucht:51478] [ 8] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/hdf5-vol-async-1.7-vbevlbxizecl24eswhgtycejo4lbfvft/lib/libh5async.so(+0x1bff8)[0x7f389f479ff8]
[leucht:51478] [ 9] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/argobots-1.2-jxiwbmgrzkanyu2vqjxtvcb6xjfqkvwx/lib/libabt.so.1(+0x223cc)[0x7f389f4453cc]
[leucht:51478] [10] /home/dev/spack/opt/spack/linux-ubuntu22.04-x86_64_v4/gcc-11.4.0/argobots-1.2-jxiwbmgrzkanyu2vqjxtvcb6xjfqkvwx/lib/libabt.so.1(+0x28b5f)[0x7f389f44bb5f]
[leucht:51478] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 51478 on node leucht exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
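Looking at the trace above, the crash happens in H5D__gather_mem, called from libh5async.so on an Argobots thread, and my code calls free(dset_data) before H5ESwait(). So one possibility is that the background write is still reading the buffer after I have already freed it. I have not confirmed this. The sketch below is not HDF5 code: it uses a plain pthread as a stand-in for the async VOL's background thread, and `job_t`, `background_sum` and `sum_after_wait` are names invented for this illustration. It only shows the ordering I suspect is required, which is to wait for the background work first and free the buffer afterwards.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Work item handed to the background thread (hypothetical, not HDF5). */
typedef struct {
    const float *buf; /* buffer owned by the caller */
    size_t n;         /* number of elements in buf */
    double sum;       /* result written by the worker */
} job_t;

/* Stand-in for the background write: it reads the caller's buffer. */
static void *background_sum(void *arg)
{
    job_t *job = (job_t *)arg;
    double s = 0.0;
    for (size_t i = 0; i < job->n; i++)
        s += job->buf[i];
    job->sum = s;
    return NULL;
}

/* Fills a buffer with 1.0f, lets a worker thread read it, and frees the
 * buffer only after the worker has finished. Returns the sum computed by
 * the worker, or -1.0 on allocation or thread-creation failure. */
double sum_after_wait(size_t n)
{
    float *buf = malloc(n * sizeof *buf);
    if (!buf)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        buf[i] = 1.0f;

    job_t job = { buf, n, 0.0 };
    pthread_t worker;
    if (pthread_create(&worker, NULL, background_sum, &job) != 0) {
        free(buf);
        return -1.0;
    }

    /* Plays the role of H5ESwait(): block until the worker is done. */
    pthread_join(worker, NULL);

    /* Only now is it safe to release the buffer. */
    free(buf);
    return job.sum;
}
```

If the same reasoning applies to the async VOL, the equivalent change in my code would be to move free(dset_data) to after the H5ESwait() call, and probably to wait on the event set once more after the async close calls before H5ESclose().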