Both MPI and single-processor file access

Hi

I write in collective mode by setting up the file access property list with H5Pset_fapl_mpio and the transfer property list with H5Pset_dxpl_mpio, with the transfer mode set to H5FD_MPIO_COLLECTIVE. This works well when each process needs to write a distinct piece of a dataset.
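
For reference, my setup looks roughly like this (simplified; the file name is a placeholder):

```c
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list: use the MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    hid_t file = H5Fopen("data.h5", H5F_ACC_RDWR, fapl); /* "data.h5" is a placeholder */

    /* Dataset transfer property list: request collective transfers. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* ... each rank selects its own hyperslab of the dataset and calls
       H5Dwrite(dset, type, mem_space, file_space, dxpl, buf) ... */

    H5Pclose(dxpl);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```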

Now I need a single process to write another piece by itself, with the other processes having nothing to contribute at that point. What works is to have all processes write the same data, with identical code on every process. But I am worried that this approach may cause performance issues with many processes, since they are all writing the same data at the same time. Am I right that this could be a problem? Or does HDF5 somehow handle the duplicate writes to the same data gracefully?

Alternatively, I tried to have the processes other than rank 0 write with a selection of 0 elements, hoping that they would simply not access the disk at all, but I get a deadlock. Is there another way to make this work? Note that I would like to avoid closing and reopening the file (without MPI) over and over.

Thanks

Let me specify that the deadlock I mentioned occurs when flushing the file.

UPDATE
The deadlock actually occurred because H5Dset_extent was called by the master process only. It now works to have all processes call H5Dset_extent and then all processes call H5Dwrite, with the non-zero ranks using an empty selection.

So I guess my question is a bit different now. Is it good, performance-wise, to do it this way? Is HDF5 clever enough to have only one process access the disk for both H5Dset_extent and H5Dwrite?

Hi @fredpz,

When it comes to writing data to datasets collectively where one or more ranks have nothing to contribute to the write operation, it is indeed standard to have the non-contributing ranks use a selection of 0 elements, often by having them call H5Sselect_none on a dataspace object that is provided for the file_space_id parameter to H5Dwrite.
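
A rough sketch of that pattern (assuming dset is the already-open dataset and dxpl is your collective transfer property list; sizes and names are placeholders):

```c
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

hsize_t nelems = 10;                       /* placeholder size */
double  buf[10] = {0};                     /* data only meaningful on rank 0 */

hid_t mem_space  = H5Screate_simple(1, &nelems, NULL);
hid_t file_space = H5Dget_space(dset);

if (rank == 0) {
    hsize_t start = 0;
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, &start, NULL, &nelems, NULL);
} else {
    /* Empty selections: these ranks participate in the collective call
       but contribute no data. */
    H5Sselect_none(mem_space);
    H5Sselect_none(file_space);
}

/* Every rank must make this call when the transfer is collective. */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mem_space, file_space, dxpl, buf);

H5Sclose(mem_space);
H5Sclose(file_space);
```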

When it comes to metadata operations like H5Dset_extent, it’s important to realize that parallel HDF5 requires all processes in the MPI communicator which opened the file to perform all operations that modify file metadata collectively; H5Dset_extent is one such operation. When determining whether an API call needs to be made by all processes or can be made by a subset of those which opened the file, it’s useful to refer to Collective Calling Requirements in Parallel HDF5 Applications first. If you encounter a deadlock after following that page, it would be best to ask about it here or on our GitHub so we can help.
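
Concretely, that means every rank must call H5Dset_extent with the same new dimensions; a rough sketch (new_size is a placeholder the application computes):

```c
/* Extending the dataset modifies file metadata, so all ranks must make
   this call, and with identical new dimensions. */
hsize_t new_size    = 100;                 /* placeholder; must be the same on every rank */
hsize_t new_dims[1] = { new_size };
H5Dset_extent(dset, new_dims);

/* ...then all ranks make the collective H5Dwrite shown above, with
   empty selections on the ranks that have nothing to write. */
```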

As for how the I/O is handled internally, the library will generally not try to optimize for the case of all processes writing the same data with H5Dwrite, except under very specific circumstances, such as all processes using H5S_ALL for their file dataspace selections.

When it comes to metadata, by default parallel HDF5 distributes the I/O among the different processes and has them perform it independently. The library can also be configured to have just rank 0 perform all the metadata I/O. Depending on the amount and size of metadata involved, both of these approaches can perform poorly in parallel, so there is the H5Pset_coll_metadata_write API routine, which signals to the library that metadata should be written out using collective MPI I/O; this should generally perform better on parallel file systems. There is also the H5Pset_all_coll_metadata_ops API routine, which controls whether metadata reads are performed using collective MPI I/O. In general, it’s usually recommended to use both routines to enable collective metadata writes and reads in parallel HDF5 applications, but if the amount and size of metadata involved is small, the overhead of collective MPI I/O may make performance worse than if the library had just performed the I/O independently.
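
Both routines are set on the file access property list before the file is created or opened; a minimal sketch (the file name is a placeholder):

```c
/* Enable collective metadata writes and reads on the MPI-IO file access
   property list used to open the file. */
hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
H5Pset_coll_metadata_write(fapl, 1);   /* metadata writes use collective MPI I/O */
H5Pset_all_coll_metadata_ops(fapl, 1); /* metadata reads use collective MPI I/O */

hid_t file = H5Fopen("data.h5", H5F_ACC_RDWR, fapl); /* placeholder file name */
```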

@jhenderson Thank you for your detailed explanation.

You said that H5Dset_extent is a metadata operation, and I don’t really understand why. I apply that function to a dataset, not to attributes or the like. And I would expect the extent to be much more than metadata, since the allocated disk space changes.

In any case, it seems that I have no option other than making a collective call to H5Dset_extent. But isn’t that a potential performance issue? Is there another approach that would not require a collective call, even when the file access is set to collective?

Would it be better to close the file and then reopen it without collective support? I am a bit afraid that opening and closing the file multiple times just to change the access mode would take even more time. And I imagine that I cannot open the same file twice for writing to two distinct datasets (one with collective access and the other without).

@fredpz H5Dset_extent is considered a metadata operation because of:

  • The possible need to reallocate space in the file
  • The possible need to reallocate dataset chunks

Both of these operations modify file metadata and must be done collectively so that all MPI processes have a consistent view of the file. If you look at the entry for H5Dset_extent in the collective calling requirements page I linked before, you can see that it says “All processes must participate only if the number of chunks in the dataset actually changes.” This is likely true: if an H5Dset_extent call doesn’t grow or shrink the dataset’s dimensions enough that new chunks have to be allocated or old chunks de-allocated, then no file metadata is modified and the call could be made independently on each MPI process. But you’d have to know that ahead of time or be able to calculate it, and I can’t immediately think of a use case for having different dataset extents across MPI processes, though I’d be interested to know if someone has one!

In general, it’s probably safest to always make a collective call to H5Dset_extent. The direct modifications that H5Dset_extent makes won’t usually be a performance problem, because they can often just be changes to the library’s in-memory state that don’t touch the file until the file is closed/truncated.

However, due to differences in how serial and parallel HDF5 deal with allocation of file space, there are some side effects of the call that can make a difference for performance. Parallel HDF5 requires all file space to be allocated up front at dataset creation time, while serial HDF5 has a little more freedom and by default will allocate file space for chunked datasets incrementally as chunks are written to. Since file space allocation time is when the library determines whether data fill values should be written to chunks, a parallel dataset extend operation could trigger a large write of fill value data, unless fill values are disabled by setting the fill time to H5D_FILL_TIME_NEVER using H5Pset_fill_time on the Dataset Creation Property List used to create the dataset. You’d likely notice a lot of overhead from this I/O if you do repeated dataset extends in parallel without having disabled dataset fill values.
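
Disabling fill values is done on the dataset creation property list when the dataset is created; a rough sketch (chunk size and dataset name are placeholders):

```c
/* Dataset creation property list: chunked layout with fill values disabled,
   so extending the dataset in parallel doesn't trigger writes of fill data. */
hsize_t chunk_dims[1] = {1024};                 /* placeholder chunk size */
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 1, chunk_dims);
H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);    /* never write fill values */

/* Extendible dataset: starts empty, can grow via H5Dset_extent. */
hsize_t dims[1]    = {0};
hsize_t maxdims[1] = {H5S_UNLIMITED};
hid_t space = H5Screate_simple(1, dims, maxdims);
hid_t dset  = H5Dcreate2(file, "extendable", H5T_NATIVE_DOUBLE, space,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);
```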

Note that as of the 1.14.0, 1.12.2 and 1.10.9 releases, the situation can be mitigated a bit if your datasets have filters applied to them. In that case, the parallel library can do incremental file space allocation for chunked datasets the same way that the serial library does, but if fill values aren’t disabled you would still pay the cost of writing those out at the time you initially write to a dataset chunk and allocate file space for it, unless the chunk is fully overwritten. You can read more about that in the “Incremental file space allocation support” section at https://www.hdfgroup.org/2022/03/parallel-compression-improvements-in-hdf5-1-13-1/.

You could also take the approach you mentioned where the file is closed and reopened for serial access, the dataset is extended and then the file is closed again and reopened for parallel access, but I would also imagine that would be worse than just extending the dataset collectively, assuming that fill values are disabled. That said, if you see that extending a dataset collectively is much worse than the time it takes to go through the cycle of multiple file opens to extend a dataset with serial access, we’d definitely like to investigate that.

For your last point, the “Attention” note in the documentation of H5Fopen may be helpful, but you’d likely run into file consistency issues if you have a file opened for both serial and parallel access at the same time and made modifications to the file opened with serial access.
