I have a problem related to the other thread here, "H5DOpen collective, driver MPIO", but this one is an obstacle further down the road.
I am able to write to datasets (contiguous ones) independently via the write_direct method, but I get stuck when trying to close the file. The high-level method does the following:

    if self.mpi_rank == 0:
        if truncate_file:
            self.truncate_h5_file()
        if not self.f:
            self.open_h5_file_serial()
        self.ingest_metadata(image_path, spectra_path)
        self.close_h5_file()
    self.comm.Barrier()
    self.open_h5_file_parallel()
    if self.mpi_rank == 0:
        self.distribute_work(self.image_path_list)
    else:
        self.write_image_data()
    self.comm.Barrier()
    if self.mpi_rank == 0:
        self.distribute_work(self.spectra_path_list)
    else:
        self.write_spectra_data()
    self.close_h5_file()

However, on the final self.close_h5_file() call all processes hang (each one consuming 100% CPU, the typical MPI busy wait). Specifically, the hang occurs in h5i.dec_ref(id_) in files.py, line 453:

    def close(self):
        """ Close the file. All open objects become invalid """
        with phil:
            # Check that the file is still open, otherwise skip
            if self.id.valid:
                # We have to explicitly murder all open objects related to the file

                # Close file-resident objects first, then the files.
                # Otherwise we get errors in MPI mode.
                id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
                file_list = h5f.get_obj_ids(self.id, h5f.OBJ_FILE)

                id_list = [x for x in id_list if h5i.get_file_id(x).id == self.id.id]
                file_list = [x for x in file_list if h5i.get_file_id(x).id == self.id.id]

                for id_ in id_list:
                    while id_.valid:
                        h5i.dec_ref(id_)

                for id_ in file_list:
                    while id_.valid:
                        h5i.dec_ref(id_)

                self.id.close()
                _objects.nonlocal_close()

The strange thing is that every process does call this method and runs the "while id_.valid" loop the same number of times (3x), and each one hangs on the third call to h5i.dec_ref(id_), presumably waiting for the other processes to collectively release the last reference to the file.
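One way to narrow this down might be to list which identifiers are still open on each rank right before the close. The sketch below is only a diagnostic aid, not a fix: describe_open_ids is a hypothetical helper name, and it relies only on the low-level h5py calls h5f.get_obj_ids, h5i.get_type, h5i.get_name and h5i.get_ref. The reported reference counts may include the temporary reference held by the wrapper objects that get_obj_ids returns, so they are best compared across ranks rather than read as absolute values.

```python
import h5py
from h5py import h5f, h5i

# Map HDF5 identifier type codes to readable names.
_TYPE_NAMES = {
    h5i.FILE: "file",
    h5i.GROUP: "group",
    h5i.DATASET: "dataset",
    h5i.DATATYPE: "datatype",
    h5i.ATTR: "attribute",
}


def describe_open_ids(h5file):
    """Return a sorted list of (kind, name, refcount) tuples, one per
    identifier that is still open in the given h5py.File.

    Hypothetical diagnostic helper; it only reads state and closes nothing.
    """
    rows = []
    for obj_id in h5f.get_obj_ids(h5file.id, h5f.OBJ_ALL):
        kind = _TYPE_NAMES.get(h5i.get_type(obj_id), "other")
        name = h5i.get_name(obj_id)  # may be None for unnamed objects
        rows.append((kind, name, h5i.get_ref(obj_id)))
    # Sort so that the output of different ranks can be compared line by line.
    return sorted(rows, key=lambda r: (r[0], str(r[1])))
```

Calling this on every rank just before self.close_h5_file() and printing the result together with self.mpi_rank would show whether rank 0 and the worker ranks hold the same set of open datasets, groups and file handles when they enter close().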
I have verified that the following program does not hang for any number of processes:

    f = h5py.File(H5PATH, 'r+', driver='mpio', comm=MPI.COMM_WORLD)
    f.close()

Flushing all datasets or files before closing does not help either.
Thank you very much for your help!