Hi Paul,
From the error message you provided, I think I can tell you
the proximate cause of the failure.
Briefly, HDF5 maintains metadata consistency across all processes
by requiring all processes to perform all operations that modify
metadata collectively so that all processes see the same stream of
dirty metadata.
This in turn allows us to use only the process zero metadata
cache to write dirty metadata to file -- all other processes are
required to hold dirty metadata in cache until informed by the
process 0 metadata cache that the piece of dirty metadata in question
has been written to file and is now clean.
As a sanity check, the non process 0 metadata caches verify that
all the entries listed in a "these entries are now clean" message
are both in cache and marked dirty upon receipt of the message.
It is this sanity check that is failing and causing your crash
on shutdown. It implies that process 0 thinks that some piece of
metadata is dirty, but at least one other process thinks the entry
is clean.
I can think of two ways for this to happen:
1) a bug in the HDF5 library.
2) a user program that either:
a) makes a library call that modifies metadata on
some but not all processes, or
b) makes library calls that modify metadata on all processes
but in different order on different processes.
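Case 2a usually looks like a rank-0-only guard around a metadata-modifying call. The sketch below uses the parallel HDF5 C API (the file handle, dataspace, and dataset name are illustrative; it is a fragment, not a complete program, and needs an MPI + parallel-HDF5 build to compile):

```c
#include <hdf5.h>

/* BROKEN: only rank 0 creates the dataset, so only rank 0's metadata
 * cache sees the modification.  The caches diverge, and the flush on
 * H5Fclose() later fails with "Listed entry not in cache?!?!?". */
void broken(hid_t file, int mpi_rank, hid_t space)
{
    if (mpi_rank == 0) {
        hid_t dset = H5Dcreate2(file, "/results", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dclose(dset);
    }
}

/* CORRECT: every rank makes the same metadata-modifying calls in the
 * same order.  (Raw data I/O such as H5Dwrite may still be done
 * independently or collectively, as you choose.) */
void correct(hid_t file, hid_t space)
{
    hid_t dset = H5Dcreate2(file, "/results", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dclose(dset);
}
```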
For a list of library calls that must be called collectively, please
see:
http://www.hdfgroup.org/HDF5/faq/parallel-apis.html#coll
Unless the above points to an obvious solution, please send us
the sample code that Elena mentioned. If there is a bug here, I'd
like to squash it.
Best regards,
John Mainzer
···
From hdf-forum-bounces@hdfgroup.org Tue Apr 20 08:50:50 2010
From: Elena Pourmal <epourmal@hdfgroup.org>
Date: Tue, 20 Apr 2010 08:52:59 -0500
To: HDF Users Discussion List <hdf-forum@hdfgroup.org>
Subject: Re: [Hdf-forum] Infinite closing loop with (parallel) HDF-1.8.4-1
Reply-To: HDF Users Discussion List <hdf-forum@hdfgroup.org>
Paul,
Any chance you can provide us with the example code that demonstrates the
problem? If so, could you please mail it to help@hdfgroup.org? We will
enter a bug report and take a look. It will also help if you can
indicate your OS, compiler version, and MPI I/O version.
Thank you!
Elena
On Apr 20, 2010, at 8:29 AM, Paul Hilscher wrote:
Dear all,
I have been trying to fix the following problem for more than three months but still have not succeeded; I hope
some of you gurus can help me out.
I am using HDF5 to store the results from a plasma turbulence code (basically 6-D and 3-D data,
and a table to store several scalar values). In a single-CPU run, HDF5 (and parallel HDF5) works fine,
but for a larger number of CPUs (and a large number of data output steps) I get the following error message
at the end of the simulation, when I want to close the HDF5 file:
********* snip ****
HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) MPI-process 24:
#000: H5F.c line 1956 in H5Fclose(): decrementing file ID failed
major: Object atom
minor: Unable to close file
#001: H5F.c line 1756 in H5F_close(): can't close file
major: File accessability
minor: Unable to close file
#002: H5F.c line 1902 in H5F_try_close(): unable to flush cache
major: Object cache
minor: Unable to flush data from cache
#003: H5F.c line 1681 in H5F_flush(): unable to flush metadata cache
major: Object cache
minor: Unable to flush data from cache
#004: H5AC.c line 950 in H5AC_flush(): Can't flush.
major: Object cache
minor: Unable to flush data from cache
#005: H5AC.c line 4695 in H5AC_flush_entries(): Can't propagate clean entries list.
major: Object cache
minor: Unable to flush data from cache
#006: H5AC.c line 4450 in H5AC_propagate_flushed_and_still_clean_entries_list(): Can't receive and/or process clean slist broadcast.
major: Object cache
minor: Internal error detected
#007: H5AC.c line 4595 in H5AC_receive_and_apply_clean_list(): Can't mark entries clean.
major: Object cache
minor: Internal error detected
#008: H5C.c line 5150 in H5C_mark_entries_as_clean(): Listed entry not in cache?!?!?.
major: Object cache
minor: Internal error detected
^[[0mHDF5: infinite loop closing library
D,G,A,S,T,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
****** snap ***
I get this error message deterministically if I increase the data output frequency (or the CPU count). Afterwards I cannot open
the file anymore, because HDF5 complains that it is corrupted (naturally, since it was not properly closed).
I get the same error on different computers (with different environments, e.g. compiler, Open MPI library, distribution).
Any idea how to fix this problem is highly appreciated.
Thanks for your help & time
Paul
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org