Collective IO and filters

Michael_K_Edwards · November 9, 2017, 8:47pm

So I think the distinction here is between "participating" for
synchronization purposes and having a nonzero slice of data locally.
I think (correct me if I'm wrong) that all ranks have to call
H5Dwrite() even if they have called H5Sselect_none() on the filespace.
That will cause them to send metadata describing their zero-sized
contributions to shared chunks to the rank 0 coordinator. They won't
get chosen as the new owner, but their metadata will be included in
the chunk_entry list sent from rank 0 to the new owner, which means
they will be expected to send chunks to the new owner. The crash
happens when these zero-sized chunks are decoded by the filter plugin;
even if I stop the plugin itself from crashing, it has to return size
0 to H5Z_pipeline(), which interprets that as filter failure and
crashes out in H5Z.c line 1256.

That's something I can probably work around, but before I go too far
down that road, I'd love it if you could correct any misapprehensions
in this. Is it the case that all ranks have to call H5Dwrite()? Is
there a way to know what the uncompressed data size will be, and skip
the zero-sized chunk_entry units somewhere up the stack?

···

On Thu, Nov 9, 2017 at 12:21 PM, Jordan Henderson <jhenderson@hdfgroup.org> wrote:

In the H5D__link_chunk_filtered_collective_io() function, all ranks (after
some initialization work) should first hit
H5D__construct_filtered_io_info_list(). Inside that function, at line 2741,
each rank counts the number of chunks it has selected. Only if a rank has
any selected should it then proceed with building its local list of chunks.
At that point, all the ranks which aren't participating should skip this and
wait for the other ranks to get done before everyone participates in the
chunk redistribution. Then, the non-participating ranks shouldn't have any
chunks assigned to them since they could not be considered among the crowd
of ranks writing the most to any of the chunks. They should then return from
the function back to H5D__link_chunk_filtered_collective_io(), with
chunk_list_num_entries telling them that they have no chunks to work on. At
that point they should skip the loop at 1471-1474 and wait for the others.
The only case I can currently imagine where the chunk redistribution could
get confused would be where no one at all is writing to anything.
Multi-chunk I/O specifically handles this but I'm not sure if Link-chunk I/O
will handle the case as well as Multi-Chunk does.

This is all of course if I understand what you mean by the zero-sized
chunks, which I believe I understand due to the fact that your file space
for the chunks is positive in size.

Michael_K_Edwards · November 9, 2017, 9:37pm

Thank you for the explanation. That's consistent with what I see when
I add a debug printf into H5D__construct_filtered_io_info_list(). So
I'm now looking into the filter situation. It's possible that the
H5Z-blosc glue is mishandling the case where the compressed data is
larger than the uncompressed data.

About to write 12 of 20
About to write 0 of 20
About to write 0 of 20
About to write 8 of 20
Rank 0 selected 12 of 20
Rank 1 selected 8 of 20
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't
finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in
H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3278 in
H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for
modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed

···

On Thu, Nov 9, 2017 at 1:02 PM, Jordan Henderson <jhenderson@hdfgroup.org> wrote:

For the purpose of collective I/O it is true that all ranks must call
H5Dwrite() so that they can participate in those collective operations that
are necessary (the file space re-allocation and so on). However, even though
they called H5Dwrite() with a valid memspace, the fact that they have a NONE
selection in the given file space should cause their chunk-file mapping
struct (see lines 357-385 of H5Dpkg.h for the struct's definition and the
code for H5D__link_chunk_filtered_collective_io() to see how it uses this
built up list of chunks selected in the file) to contain no entries in the
"fm->sel_chunks" field. That alone should mean that during the chunk
redistribution, they will not actually send anything at all to any of the
ranks. They only participate there for the sake that, were the method of
redistribution modified, ranks which previously had no chunks selected could
potentially be given some chunks to work on.

For all practical purposes, every single chunk_entry seen in the list from
rank 0's perspective should be a valid I/O caused by some rank writing some
positive amount of bytes to the chunk. On rank 0's side, you should be able
to check the io_size field of each of the chunk_entry entries and see how
big the I/O is from the "original_owner" to that chunk. If any of these are
0, something is likely very wrong. If that is indeed the case, you could
likely pull a hacky workaround by manually removing them from the list, but
I'd be more concerned about the root of the problem if there are zero-size
I/O chunk_entry entries being added to the list.

Michael_K_Edwards · November 9, 2017, 9:39pm

I don't think szip will work, because it bombs out when there isn't
enough data to reach its minimal compression unit (usually configured
as 32 bytes). I can try zlib (deflate).

···

On Thu, Nov 9, 2017 at 1:35 PM, Jordan Henderson <jhenderson@hdfgroup.org> wrote:

Also, now that the hanging issue has been resolved, would it be possible to
try this same code again with a different filter, perhaps within the
gzip/szip family? I'm curious as to whether the filter has anything to do
with this issue or not.

jhenderson · November 9, 2017, 10:45pm

As the filtered collective path simply calls through the filter pipeline by way of the H5Z_pipeline() function, it would seem that either the filter pipeline itself is not handling this case correctly, or this is somewhat unexpected behavior for the pipeline to deal with.

Either way, I think a pull request/diff file would be very useful for going over this. If you're able to generate a diff between what you have now and the current develop branch/H5Z-blosc code and put it here that would be useful. I don't think that there should be too much in the way of logistics for getting this code in, we just want to make sure that we approach the solution in the right way without breaking something else.

Michael_K_Edwards · November 9, 2017, 11:01pm

And here's the change to H5Z-blosc (still using the private H5MM APIs):

diff --git a/src/blosc_filter.c b/src/blosc_filter.c
index bfd8c3e..9bc1a42 100644
--- a/src/blosc_filter.c
+++ b/src/blosc_filter.c
@@ -16,6 +16,7 @@
#include <string.h>
#include <errno.h>
#include "hdf5.h"
+#include "H5MMprivate.h"
#include "blosc_filter.h"

#if defined(__GNUC__)
@@ -194,20 +195,21 @@ size_t blosc_filter(unsigned flags, size_t cd_nelmts,
/* We're compressing */
if (!(flags & H5Z_FLAG_REVERSE)) {

- /* Allocate an output buffer exactly as long as the input data; if
- the result is larger, we simply return 0. The filter is flagged
- as optional, so HDF5 marks the chunk as uncompressed and
- proceeds.
+ /* Allocate an output buffer BLOSC_MAX_OVERHEAD (currently 16) bytes
+ larger than the input data, to accommodate the BLOSC header.
+ If compression with the requested parameters causes the data itself
+ to grow (thereby causing the compressed data, with header, to exceed
+ the output buffer size), fall back to memcpy mode (clevel=0).
*/

- outbuf_size = (*buf_size);
+ outbuf_size = nbytes + BLOSC_MAX_OVERHEAD;

#ifdef BLOSC_DEBUG
fprintf(stderr, "Blosc: Compress %zd chunk w/buffer %zd\n",
nbytes, outbuf_size);
#endif

- outbuf = malloc(outbuf_size);
+ outbuf = H5MM_malloc(outbuf_size);

if (outbuf == NULL) {
PUSH_ERR("blosc_filter", H5E_CALLBACK,
@@ -217,7 +219,11 @@ size_t blosc_filter(unsigned flags, size_t cd_nelmts,

     blosc_set_compressor(compname);
     status = blosc_compress(clevel, doshuffle, typesize, nbytes,
- *buf, outbuf, nbytes);
+ *buf, outbuf, outbuf_size);
+ if (status < 0) {
+ status = blosc_compress(0, doshuffle, typesize, nbytes,
+ *buf, outbuf, outbuf_size);
+ }
     if (status < 0) {
       PUSH_ERR("blosc_filter", H5E_CALLBACK, "Blosc compression error");
       goto failed;
@@ -228,7 +234,7 @@ size_t blosc_filter(unsigned flags, size_t cd_nelmts,
     /* declare dummy variables */
     size_t cbytes, blocksize;

- free(outbuf);
+ H5MM_xfree(outbuf);

/* Extract the exact outbuf_size from the buffer header.

···

*
@@ -243,7 +249,14 @@ size_t blosc_filter(unsigned flags, size_t cd_nelmts,
fprintf(stderr, "Blosc: Decompress %zd chunk w/buffer %zd\n",
nbytes, outbuf_size);
#endif

- outbuf = malloc(outbuf_size);
+ if (outbuf_size == 0) {
+ H5MM_xfree(*buf);
+ *buf = NULL;
+ *buf_size = outbuf_size;
+ return 0; /* Size of compressed/decompressed data */
+ }
+
+ outbuf = H5MM_malloc(outbuf_size);

     if (outbuf == NULL) {
       PUSH_ERR("blosc_filter", H5E_CALLBACK, "Can't allocate
decompression buffer");
@@ -259,14 +272,14 @@ size_t blosc_filter(unsigned flags, size_t cd_nelmts,
   } /* compressing vs decompressing */

   if (status != 0) {
- free(*buf);
+ H5MM_xfree(*buf);
     *buf = outbuf;
     *buf_size = outbuf_size;
     return status; /* Size of compressed/decompressed data */
   }

failed:
- free(outbuf);
+ H5MM_xfree(outbuf);
return 0;

} /* End filter function */

On Thu, Nov 9, 2017 at 2:45 PM, Jordan Henderson <jhenderson@hdfgroup.org> wrote:

As the filtered collective path simply calls through the filter pipeline by
way of the H5Z_pipeline() function, it would seem that either the filter
pipeline itself is not handling this case correctly, or this is somewhat
unexpected behavior for the pipeline to deal with.

Either way, I think a pull request/diff file would be very useful for going
over this. If you're able to generate a diff between what you have now and
the current develop branch/H5Z-blosc code and put it here that would be
useful. I don't think that there should be too much in the way of logistics
for getting this code in, we just want to make sure that we approach the
solution in the right way without breaking something else.

jhenderson · November 9, 2017, 11:22pm

Thanks! I'll discuss this with others and see what the best way to proceed forward from this is. I think this has been a very productive discussion and very useful feedback.

···

________________________________
From: Michael K. Edwards <m.k.edwards@gmail.com>
Sent: Thursday, November 9, 2017 5:01:33 PM
To: Jordan Henderson
Cc: HDF Users Discussion List
Subject: Re: [Hdf-forum] Collective IO and filters