Hi all,
I have a problem related to the other thread here, “H5DOpen collective, driver MPIO”, but it is an obstacle further down the road.
I am able to write to (contiguous) datasets independently via the write_direct method, but I get stuck when trying to close the file. The high-level method does the following:
if self.mpi_rank == 0:
    if truncate_file:
        self.truncate_h5_file()
    if not self.f:
        self.open_h5_file_serial()
    self.ingest_metadata(image_path, spectra_path)
    self.close_h5_file()
self.comm.Barrier()
self.open_h5_file_parallel()
if self.mpi_rank == 0:
    self.distribute_work(self.image_path_list)
else:
    self.write_image_data()
self.comm.Barrier()
if self.mpi_rank == 0:
    self.distribute_work(self.spectra_path_list)
else:
    self.write_spectra_data()
self.close_h5_file()
However, at the final self.close_h5_file() call all processes hang (each one consuming 100% CPU, the typical MPI busy wait). Specifically, the hang occurs in the call h5i.dec_ref(id_) in files.py, line 453:
def close(self):
    """ Close the file. All open objects become invalid """
    with phil:
        # Check that the file is still open, otherwise skip
        if self.id.valid:
            # We have to explicitly murder all open objects related to the file
            # Close file-resident objects first, then the files.
            # Otherwise we get errors in MPI mode.
            id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
            file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)
            id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
            file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]
            for id_ in id_list:
                while id_.valid:
                    h5i.dec_ref(id_)
            for id_ in file_list:
                while id_.valid:
                    h5i.dec_ref(id_)
            self.id.close()
            _objects.nonlocal_close()
The strange thing is that every process does call this method, goes through the “while id_.valid” loop the same number of times (3x), and hangs on the third call of h5i.dec_ref(id_), presumably waiting for the other processes to collectively release the last reference to the file.
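To check whether all ranks really hold the same number of open handles at close time, something like the following could be run just before self.close_h5_file(). This is only a sketch: ranks_out_of_step and report_open_objects are hypothetical helper names, and I am assuming h5f.get_obj_count accepts the same type mask that close() passes to h5f.get_obj_ids.

```python
from collections import Counter


def ranks_out_of_step(counts_by_rank):
    """Return the ranks whose open-object count differs from the majority.

    counts_by_rank: list where index i holds the count reported by rank i.
    """
    if not counts_by_rank:
        return []
    # The most common count is taken as the "expected" one; on a tie,
    # the value that appears first in the list is used.
    expected, _ = Counter(counts_by_rank).most_common(1)[0]
    return [rank for rank, n in enumerate(counts_by_rank) if n != expected]


def report_open_objects(comm, file_id, h5f):
    """Gather per-rank open-object counts on rank 0 and print the outliers.

    comm    : an mpi4py communicator (e.g. MPI.COMM_WORLD)
    file_id : the low-level id of the open file (f.id in h5py)
    h5f     : the h5py.h5f module, passed in so this sketch imports nothing
    """
    # Count every open object except file handles (same mask as in close()).
    local = h5f.get_obj_count(file_id, ~h5f.OBJ_FILE)
    counts = comm.gather(local, root=0)
    if comm.Get_rank() == 0:
        print("open objects per rank:", counts)
        print("ranks out of step:", ranks_out_of_step(counts))
```

It would be called as report_open_objects(self.comm, self.f.id, h5py.h5f), assuming self.f is the open h5py.File. If rank 0 (which only distributes work) reports a different count than the writing ranks, that would at least narrow down where the close sequence diverges.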
I have verified that the following program does not hang for any number of processes:
f = h5py.File(H5PATH, 'r+', driver='mpio', comm=MPI.COMM_WORLD)
f.close()
Flushing all datasets or files before closing does not help either.
Thank you very much for your help!
Cheers,
Jiri