Existing group/dataset not detected on files generated on a different computing node


#1

Hi,

I am facing a rather strange issue. Within a Fortran code, every MPI rank generates its own h5 file (open, create group/dataset, close). Later on, during the same simulation, the master MPI rank opens these files and collects the information stored in them. MPI barriers are in place to ensure the files have been generated before the master does that. This is done a couple of times, and every rank generates several distinct files containing different information.

For some reason, when the master rank reads one of these h5 files generated by every rank, it can only load the information from files generated on its own computing node. Files generated on another computing node are opened and closed without problem (error=0), but retrieving the group info returns null members. When I open the files with HDFView, the groups/datasets are actually there. Strangely, this only happens with one file and not with the others, which we handle in the same way.

Any clue about what could possibly be wrong here?

Thanks in advance.


#2

OK. Apparently adding a simple sleep() command after the point where files are generated and closed by mpi slaves, and before the master mpi reads them, has resolved the problem.

It looks like this is a sync issue. Is there any command to ensure that h5 files are properly closed before proceeding? (Modern Fortran intrinsics have, for example, something like an additional parameter ‘wait’=yes.) It looks like h5fclose_f returns error zero, but the work is not fully completed when the call returns.
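For reference, one thing that could be tried instead of sleep() is an explicit flush before the close, followed by a barrier. This is only a sketch, not a confirmed fix: the file name pattern and program name are made up for illustration, and h5fflush_f only hands HDF5's buffers to the operating system, so whether another node sees the data immediately still depends on how the file system caches.

```fortran
! Sketch only: one file per rank, flushed and closed before a barrier.
! The file name 'rank_<n>.h5' is a hypothetical example.
program flush_then_close
  use mpi
  use hdf5
  implicit none
  integer :: rank, mpierr, h5err
  integer(hid_t) :: file_id
  character(len=64) :: fname

  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  call h5open_f(h5err)

  write(fname, '(a,i0,a)') 'rank_', rank, '.h5'
  call h5fcreate_f(trim(fname), H5F_ACC_TRUNC_F, file_id, h5err)
  ! ... create groups and datasets here, and close each of them ...

  ! Hand HDF5's internal buffers to the operating system before closing.
  call h5fflush_f(file_id, H5F_SCOPE_GLOBAL_F, h5err)
  call h5fclose_f(file_id, h5err)

  ! No rank passes this point until every rank has closed its file.
  call MPI_Barrier(MPI_COMM_WORLD, mpierr)

  call h5close_f(h5err)
  call MPI_Finalize(mpierr)
end program flush_then_close
```

If the files are still incomplete on the reading node after this, the delay is probably in the file system rather than in HDF5.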


#3

Are you sure that there are no open HDF5 objects which are preventing the file from being closed immediately?

You could call fuser through a system call in the master to verify that no processes are using the file.
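A minimal sketch of such a check, using the Fortran 2008 intrinsic execute_command_line. The file name is a placeholder, and note that fuser only reports processes on the node where it runs, so it cannot see a writer on another computing node.

```fortran
! Sketch only: ask fuser whether any local process has the file open.
! 'rank_1.h5' is a hypothetical file name.
program check_fuser
  implicit none
  character(len=*), parameter :: fname = 'rank_1.h5'
  integer :: estat, cstat

  call execute_command_line('fuser ' // fname, exitstat=estat, cmdstat=cstat)

  if (cstat /= 0) then
     print *, 'fuser could not be executed'
  else if (estat == 0) then
     ! fuser exits with zero when it finds at least one process.
     print *, 'at least one local process has ', fname, ' open'
  else
     ! Non-zero means no process was found, or fuser hit an error.
     print *, 'no local process reported for ', fname
  end if
end program check_fuser
```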

Also, are you using mpi-io or sec2 driver for file-per-process I/O?


#4

As far as I can see, there are no other objects open.

A possible problem is that opening and closing the Fortran interface (CALL h5open_f(error)/CALL h5close_f(error)) might not be handled correctly: at the moment the interface is closed while the hdf5 file is kept open… and later on the fileID of that hdf5 file is used after re-opening the Fortran interface…

Actually I was wondering if there is any hdf5 subroutine to check whether the Fortran interface is open or not at a given point.

“Also, are you using mpi-io or sec2 driver for file-per-process I/O?”
Not sure what you mean by this. Each MPI rank independently calls the hdf5 subroutines to createFile/createGroup/CreateDS/close… them. In other words, we are not using any simultaneous MPI-IO hdf5 file access.


#5

The only object created in h5open_f is an h5Tcopy of the C datatypes, which should not be holding the file open.
You can use H5Fget_obj_count to verify there are no open objects.
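A sketch of that check from Fortran, using h5fget_obj_count_f just before the file is closed. The file name is a placeholder. With H5F_OBJ_ALL_F the file identifier itself should be included in the count, so a value of 1 would mean nothing else is still open.

```fortran
! Sketch only: count the identifiers still open in a file before closing it.
! 'count_demo.h5' is a hypothetical file name.
program count_open_objects
  use hdf5
  implicit none
  integer(hid_t)  :: file_id
  integer(size_t) :: n_open
  integer :: h5err

  call h5open_f(h5err)
  call h5fcreate_f('count_demo.h5', H5F_ACC_TRUNC_F, file_id, h5err)
  ! ... create groups and datasets here, and close each of them ...

  ! Count every open identifier that belongs to this file.
  call h5fget_obj_count_f(file_id, H5F_OBJ_ALL_F, n_open, h5err)
  if (n_open > 1) then
     print *, 'identifiers still open besides the file: ', n_open - 1
  end if

  ! Close the file first, and the Fortran interface last.
  call h5fclose_f(file_id, h5err)
  call h5close_f(h5err)
end program count_open_objects
```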

There is no hdf5 subroutine to check if the interface for Fortran is open. I’m assuming you mean whether h5open_f was called? If it is not called, then nothing will work. The fileID is not valid once it is closed, so I’m not sure what you mean by using it again. Are you making multiple calls to h5open_f and h5close_f? If so, why?

Are you using H5Pset_fapl_mpio_f with MPI_COMM_SELF or using H5P_DEFAULT fapl (i.e., sec2 driver)?


#6

Thanks. OK. I will try “H5Fget_obj_count”. Yes, that’s what I meant (check if h5open_f was called and not h5close_f).

“The fileID is not valid once it is closed”. Apparently it is still valid if the hdf5 file was not closed (h5fclose_f) but the interface was (h5close_f), and the interface is then re-opened… although I agree this might be wrong programming.

“Are you using H5Pset_fapl_mpio_f with MPI_COMM_SELF or using H5P_DEFAULT fapl (i.e., sec2 driver)?” I am not using either of those. I don’t know where I should do so either…

When creating an hdf5 file I am using: CALL h5fcreate_f(trim(fileH5Gauss), H5F_ACC_TRUNC_F, file_h5_hydGauss, h5err), so I guess that the defaults are taken for the optional arguments:

creation_prp INTEGER(HID_T), OPTIONAL
(Default value: H5P_DEFAULT_F)
access_prp INTEGER(HID_T), OPTIONAL
(Default value: H5P_DEFAULT_F)
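For reference, the place where the driver is chosen is the optional access_prp argument. Leaving it out means H5P_DEFAULT_F, i.e. the sec2 driver mentioned above. The sketch below shows the MPI-IO alternative with MPI_COMM_SELF. It assumes an HDF5 library built with parallel support, and the file name pattern is made up for illustration.

```fortran
! Sketch only: create a per-rank file through the MPI-IO driver.
! Assumes a parallel HDF5 build; 'fapl_demo_<n>.h5' is hypothetical.
program fapl_choice
  use mpi
  use hdf5
  implicit none
  integer :: rank, mpierr, h5err
  integer(hid_t) :: fapl_id, file_id
  character(len=64) :: fname

  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  call h5open_f(h5err)
  write(fname, '(a,i0,a)') 'fapl_demo_', rank, '.h5'

  ! Build a file access property list that selects MPI-IO on this rank only.
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, h5err)
  call h5pset_fapl_mpio_f(fapl_id, MPI_COMM_SELF, MPI_INFO_NULL, h5err)

  ! Pass the list as access_prp; omitting it gives H5P_DEFAULT_F.
  call h5fcreate_f(trim(fname), H5F_ACC_TRUNC_F, file_id, h5err, &
                   access_prp=fapl_id)
  call h5pclose_f(fapl_id, h5err)

  call h5fclose_f(file_id, h5err)
  call h5close_f(h5err)
  call MPI_Finalize(mpierr)
end program fapl_choice
```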