Just so it's clear: the fixes are mostly in the plugins and in how
PETSc calls into the HDF5 code. (PETSc should probably never have mixed
simple and null dataspaces in one collective write.) The fixes to
HDF5 itself are:
* Dana's observations with regard to the H5MM APIs:
* the inappropriate assert(size > 0) in H5MM_[mc]alloc in the
develop branch; and
* the recommendation to use H5allocate/resize/free_memory() rather
than the private APIs.
* The recommendation to sort by chunk address within each owner's
range of chunk entries, to avoid the risk of deadlock in the
H5D__chunk_redistribute_shared_chunks() code.
I haven't switched to H5allocate/resize/free_memory() yet, but here's
(minimally tested) code to handle the other two issues.
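For the memory-API switch, I'd expect the plugin-side change to look
roughly like this (an untested sketch; the helper names are mine, not
actual plugin code):

    /* Untested sketch: use HDF5's public memory routines instead of the
     * private H5MM_* APIs, so buffers handed back to the library are
     * allocated and freed by the same allocator HDF5 itself uses. */
    #include "hdf5.h"

    static void *
    filter_buf_alloc(size_t size)
    {
        /* second argument is a 'clear' flag (calloc-like when nonzero) */
        return H5allocate_memory(size, 0);
    }

    static void *
    filter_buf_grow(void *buf, size_t new_size)
    {
        return H5resize_memory(buf, new_size);
    }

    static void
    filter_buf_free(void *buf)
    {
        H5free_memory(buf);
    }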
Cheers,
- Michael
diff --git a/src/H5Dmpio.c b/src/H5Dmpio.c
index 79572c0..60e9f03 100644
--- a/src/H5Dmpio.c
+++ b/src/H5Dmpio.c
@@ -2328,14 +2328,22 @@ H5D__cmp_filtered_collective_io_info_entry(const void *filtered_collective_io_in
 static int
 H5D__cmp_filtered_collective_io_info_entry_owner(const void *filtered_collective_io_info_entry1, const void *filtered_collective_io_info_entry2)
 {
-    int owner1 = -1, owner2 = -1;
+    int owner1 = -1, owner2 = -1, delta = 0;
+    haddr_t addr1 = HADDR_UNDEF, addr2 = HADDR_UNDEF;

     FUNC_ENTER_STATIC_NOERR

     owner1 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry1)->owners.original_owner;
     owner2 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry2)->owners.original_owner;
···
-
-    FUNC_LEAVE_NOAPI(owner1 - owner2)
+    if (owner1 != owner2) {
+        delta = owner1 - owner2;
+    } else {
+        addr1 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry1)->chunk_states.new_chunk.offset;
+        addr2 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry2)->chunk_states.new_chunk.offset;
+        delta = H5F_addr_cmp(addr1, addr2);
+    }
+
+    FUNC_LEAVE_NOAPI(delta)
 } /* end H5D__cmp_filtered_collective_io_info_entry_owner() */
diff --git a/src/H5MM.c b/src/H5MM.c
index ee3b28f..3f06850 100644
--- a/src/H5MM.c
+++ b/src/H5MM.c
@@ -268,8 +268,6 @@ H5MM_malloc(size_t size)
 {
     void *ret_value = NULL;

-    HDassert(size);
-
     /* Use FUNC_ENTER_NOAPI_NOINIT_NOERR here to avoid performance issues */
     FUNC_ENTER_NOAPI_NOINIT_NOERR
@@ -357,8 +355,6 @@ H5MM_calloc(size_t size)
 {
     void *ret_value = NULL;

-    HDassert(size);
-
     /* Use FUNC_ENTER_NOAPI_NOINIT_NOERR here to avoid performance issues */
     FUNC_ENTER_NOAPI_NOINIT_NOERR
On Thu, Nov 9, 2017 at 2:27 PM, Michael K. Edwards <m.k.edwards@gmail.com> wrote:
That does appear to have been the problem. I modified H5Z-blosc to
allocate enough room for the BLOSC header, and to fall back to memcpy
mode (clevel=0) if the data expands during "compressed" encoding.
This unblocks me, though I think it might be a good idea for the
collective filtered I/O path to handle H5Z_FLAG_OPTIONAL properly.
Would it be helpful for me to send a patch once I've cleaned up my
debugging goop? What's a good way to do that -- github pull request?
Do you need a contributor agreement / copyright assignment / some such
thing?
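Concretely, the fix amounts to something like this (a rough sketch of the
idea, not the exact H5Z-blosc change; buffer handling is simplified and the
function name is illustrative):

    /* Rough sketch of the compression-side fix described above: size the
     * output buffer for the worst case (input size plus the Blosc header
     * overhead), and if real compression fails or expands the data, redo
     * it with clevel=0, which is effectively a memcpy plus the Blosc
     * header.  A real plugin would allocate with the HDF5 memory routines
     * discussed earlier rather than plain malloc(). */
    #include <stdlib.h>
    #include <blosc.h>

    static size_t
    compress_chunk(size_t nbytes, void **buf, size_t *buf_size,
                   int clevel, int doshuffle, size_t typesize)
    {
        size_t out_size = nbytes + BLOSC_MAX_OVERHEAD;
        void  *out_buf  = malloc(out_size);
        int    status;

        if (out_buf == NULL)
            return 0;                      /* 0 tells HDF5 the filter failed */

        status = blosc_compress(clevel, doshuffle, typesize, nbytes,
                                *buf, out_buf, out_size);
        if (status <= 0)                   /* expanded or error: store mode */
            status = blosc_compress(0, doshuffle, typesize, nbytes,
                                    *buf, out_buf, out_size);
        if (status <= 0) {
            free(out_buf);
            return 0;
        }

        free(*buf);                        /* hand the new buffer back to HDF5 */
        *buf      = out_buf;
        *buf_size = out_size;
        return (size_t)status;             /* bytes actually produced */
    }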
On Thu, Nov 9, 2017 at 1:44 PM, Michael K. Edwards <m.k.edwards@gmail.com> wrote:
I observe this comment in the H5Z-blosc code:
/* Allocate an output buffer exactly as long as the input data; if
   the result is larger, we simply return 0. The filter is flagged
   as optional, so HDF5 marks the chunk as uncompressed and
   proceeds.
*/
In my current setup, I have not marked the filter with
H5Z_FLAG_MANDATORY, for this reason. Is this comment accurate for the
collective filtered path, or is it possible that the zero return code
is being treated as "compressed data is zero bytes long"?
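For reference, this is roughly how the filter gets attached on my side
(sketch only; the filter ID is the registered Blosc ID and the cd_values
are placeholders):

    /* Sketch: attach the Blosc filter as optional on a dataset-creation
     * property list.  With H5Z_FLAG_OPTIONAL, a zero return from the
     * filter callback should mean "store this chunk uncompressed", not a
     * hard failure of the write. */
    #include "hdf5.h"

    #define FILTER_BLOSC 32001          /* registered HDF5 filter ID for Blosc */

    static herr_t
    attach_optional_blosc(hid_t dcpl_id)
    {
        unsigned cd_values[7] = {0};    /* plugin-specific parameters (placeholders) */

        return H5Pset_filter(dcpl_id, (H5Z_filter_t)FILTER_BLOSC,
                             H5Z_FLAG_OPTIONAL,    /* vs. H5Z_FLAG_MANDATORY */
                             7, cd_values);
    }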
On Thu, Nov 9, 2017 at 1:37 PM, Michael K. Edwards <m.k.edwards@gmail.com> wrote:
Thank you for the explanation. That's consistent with what I see when
I add a debug printf into H5D__construct_filtered_io_info_list(). So
I'm now looking into the filter situation. It's possible that the
H5Z-blosc glue is mishandling the case where the compressed data is
larger than the uncompressed data.
About to write 12 of 20
About to write 0 of 20
About to write 0 of 20
About to write 8 of 20
Rank 0 selected 12 of 20
Rank 1 selected 8 of 20
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3278 in H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
On Thu, Nov 9, 2017 at 1:02 PM, Jordan Henderson <jhenderson@hdfgroup.org> wrote:
For the purpose of collective I/O, it is true that all ranks must call
H5Dwrite() so that they can participate in the collective operations that
are necessary (the file space re-allocation and so on). However, even though
they called H5Dwrite() with a valid memspace, the fact that they have a NONE
selection in the given file space should cause their chunk-file mapping
struct (see lines 357-385 of H5Dpkg.h for the struct's definition, and the
code for H5D__link_chunk_filtered_collective_io() for how it uses this
built-up list of chunks selected in the file) to contain no entries in the
"fm->sel_chunks" field. That alone should mean that during the chunk
redistribution they will not actually send anything at all to any of the
ranks. They participate there only so that, were the method of
redistribution modified, ranks which previously had no chunks selected could
potentially be given some chunks to work on.
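In caller terms, the pattern looks something like this (a rough sketch with
illustrative handle names; the transfer property list is assumed to already
have collective MPI-IO enabled):

    /* Sketch: every rank calls H5Dwrite() collectively, but ranks with no
     * data select NONE in both the memory and file dataspaces, so they
     * contribute no chunk entries while still completing the collective
     * MPI operations underneath. */
    #include "hdf5.h"

    static herr_t
    collective_write_1d(hid_t dset, hid_t filespace, hid_t dxpl,
                        const double *data, hsize_t start, hsize_t count)
    {
        hsize_t mem_dims = (count > 0) ? count : 1;
        hid_t   memspace = H5Screate_simple(1, &mem_dims, NULL);
        herr_t  ret;

        if (count > 0) {
            H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL,
                                &count, NULL);
        } else {
            /* Zero-size contribution: empty selections, but still call
             * H5Dwrite() so the collective operations can complete. */
            H5Sselect_none(memspace);
            H5Sselect_none(filespace);
        }

        ret = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);
        H5Sclose(memspace);
        return ret;
    }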
For all practical purposes, every single chunk_entry seen in the list from
rank 0's perspective should be a valid I/O caused by some rank writing a
positive number of bytes to the chunk. On rank 0's side, you should be able
to check the io_size field of each of the chunk_entry entries and see how
big the I/O from the "original_owner" to that chunk is. If any of these are
0, something is likely very wrong. If that is indeed the case, you could
probably get away with a hacky workaround by manually removing them from the
list, but I'd be more concerned about the root of the problem if zero-size
I/O chunk_entry entries are being added to the list.
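Something along these lines (a quick, illustrative fragment meant to live
inside H5Dmpio.c; the list and count parameter names are placeholders)
would surface any zero-size entries:

    /* Debug sketch: on rank 0, walk the collected chunk list and report
     * any entry whose io_size is zero.  Field names follow the internal
     * H5D_filtered_collective_io_info_t struct used in the patch earlier
     * in this thread. */
    static void
    debug_report_zero_size_entries(const H5D_filtered_collective_io_info_t *chunk_list,
                                   size_t num_entries)
    {
        size_t i;

        for (i = 0; i < num_entries; i++)
            if (chunk_list[i].io_size == 0)
                HDfprintf(stderr,
                          "zero-size I/O: chunk at addr %a from original_owner %d\n",
                          chunk_list[i].chunk_states.new_chunk.offset,
                          chunk_list[i].owners.original_owner);
    }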