Hi all,
I have a problem related to the other thread here, "H5DOpen collective, driver MPIO", but this one is an obstacle further down the road.
I am able to write to (contiguous) datasets independently via the write_direct method, but I get stuck when trying to close the file. The high-level method does the following:
if self.mpi_rank == 0:
    if truncate_file:
        self.truncate_h5_file()
    if not self.f:
        self.open_h5_file_serial()
    self.ingest_metadata(image_path, spectra_path)
    self.close_h5_file()
self.comm.Barrier()
self.open_h5_file_parallel()
if self.mpi_rank == 0:
    self.distribute_work(self.image_path_list)
else:
    self.write_image_data()
self.comm.Barrier()
if self.mpi_rank == 0:
    self.distribute_work(self.spectra_path_list)
else:
    self.write_spectra_data()
self.close_h5_file()
but on the final self.close_h5_file() all processes hang (everyone consuming 100% CPU, the typical MPI busy-wait). Specifically, it happens in the call h5i.dec_ref(id_) in h5py's files.py, line 453:
def close(self):
    """ Close the file. All open objects become invalid """
    with phil:
        # Check that the file is still open, otherwise skip
        if self.id.valid:
            # We have to explicitly murder all open objects related to the file

            # Close file-resident objects first, then the files.
            # Otherwise we get errors in MPI mode.
            id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
            file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)

            id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
            file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]

            for id_ in id_list:
                while id_.valid:
                    h5i.dec_ref(id_)

            for id_ in file_list:
                while id_.valid:
                    h5i.dec_ref(id_)

            self.id.close()
            _objects.nonlocal_close()
The strange thing is that every process does call this method, runs the "while id_.valid" loop the same number of times (3x), and on the third run of h5i.dec_ref(id_) it hangs, presumably waiting for the other processes to make the collective call that releases the last reference to the file.
I have verified that the following program does not hang for any number of processes:
f = h5py.File(H5PATH, 'r+', driver='mpio', comm=MPI.COMM_WORLD)
f.close()
Flushing all datasets and the file before closing does not help either.
Thank you very much for your help!
Cheers,
Jiri


P.S. h5dump throws an exception when reading the file. By the way, I have tried omitting the creation of the region-reference datasets in case they were the source of the issue, and it stays broken.