Collective IO and filters

Oddly enough, it is not the tag that is mismatched between receiver
and senders; it is io_info->comm. Something is decidedly out of whack
here.

Rank 0, owner 0 probing with tag 0 on comm -1006632942
Rank 2, owner 0 sent with tag 0 to comm -1006632952 as request 0
Rank 3, owner 0 sent with tag 0 to comm -1006632952 as request 0
Rank 1, owner 0 sent with tag 0 to comm -1006632952 as request 0

···

On Wed, Nov 8, 2017 at 2:51 PM, Michael K. Edwards <m.k.edwards@gmail.com> wrote:

I see that you're re-sorting by owner using a comparator called
H5D__cmp_filtered_collective_io_info_entry_owner() which does not sort
by a secondary key within items with equal owners. That, together
with a sort that isn't stable (which HDqsort() probably isn't on most
platforms; quicksort/introsort is not stable), will scramble the order
in which different ranks traverse their local chunk arrays. That will
cause deadly embraces between ranks that are waiting for each other's
chunks to be sent. To fix that, it's probably sufficient to use the
chunk offset as a secondary sort key in that comparator.

That's not the root cause of the hang I'm currently experiencing,
though. Still digging into that.

On Wed, Nov 8, 2017 at 1:50 PM, Dana Robinson <derobins@hdfgroup.org> wrote:
> Yes. All outside code that frees, allocates, or reallocates memory created
> inside the library (or that will be passed back into the library, where it
> could be freed or reallocated) should use these functions. This includes
> filters.
>
>
>
> Dana
>
>
>
> From: Jordan Henderson <jhenderson@hdfgroup.org>
> Date: Wednesday, November 8, 2017 at 13:46
> To: Dana Robinson <derobins@hdfgroup.org>, "M.K.Edwards@gmail.com"
> <m.k.edwards@gmail.com>, HDF List <hdf-forum@lists.hdfgroup.org>
> Subject: Re: [Hdf-forum] Collective IO and filters
>
>
>
> Dana,
>
>
>
> would it then make sense for all outside filters to use these routines? Due
> to Parallel Compression's internal nature, it uses buffers allocated via
> H5MM_ routines to collect and scatter data, which works fine for the
> internal filters like deflate, since they use these as well. However, since
> some of the outside filters use the raw malloc/free routines, causing
> issues, I'm wondering if having all outside filters use the H5_ routines is
> the cleanest solution.
>
>
>
> Michael,
>
>
>
> Based on the "num_writers: 4" field, the NULL "receive_requests_array" and
> the fact that for the same chunk, rank 0 shows "original owner: 0, new
> owner: 0" and rank 3 shows "original owner: 3, new_owner: 0", it seems as
> though everyone IS interested in the chunk that rank 0 is now working on, but
> now I'm more confident that at some point either the messages may have
> failed to send or rank 0 is having problems finding the messages.
>
>
>
> Since in the unfiltered case it won't hit this particular code path, I'm not
> surprised that that case succeeds. If I had to make another guess based on
> this, I would be inclined to think that rank 0 must be hanging on the
> MPI_Mprobe due to a mismatch in the "tag" field. I use the index of the
> chunk as the tag for the message in order to funnel specific messages to the
> correct rank for the correct chunk during the last part of the chunk
> redistribution and if rank 0 can't match the tag it of course won't find the
> message. Why this might be happening, I'm not entirely certain currently.

Replacing Intel's build of MVAPICH2 2.2 with a fresh build of MVAPICH2
2.3b got me farther along. The comm mismatch does not seem to be a
problem. I am guessing that the root cause was whatever bug is listed
in http://mvapich.cse.ohio-state.edu/static/media/mvapich/MV2_CHANGELOG-2.3b.txt
as:

    - Fix hang in MPI_Probe
        - Thanks to John Westlund@Intel for the report

I fixed the H5D__cmp_filtered_collective_io_info_entry_owner
comparator, and now I'm back to fixing things about my patch to PETSc.
I seem to be trying to filter a dataset that I shouldn't be.

HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 831 in H5D__write(): unable to adjust I/O info
for parallel I/O
    major: Dataset
    minor: Unable to initialize object
  #003: H5Dio.c line 1264 in H5D__ioinfo_adjust(): Can't perform
independent write with filters in pipeline.
    The following caused a break from collective I/O:
        Local causes:
        Global causes: one of the dataspaces was neither simple nor scalar
    major: Low-level I/O
    minor: Can't perform independent IO


It seems you're discovering the issues right as I'm typing this!

I'm glad you were able to solve the issue with the hanging. I was starting to suspect an issue with the MPI implementation but it's usually the last thing on the list after inspecting the code itself.

As you've seen, it seems that PETSc is creating a NULL dataspace for the ranks which are not contributing, instead of creating a Scalar/Simple dataspace on all ranks and calling H5Sselect_none() for those that don't participate. This would most likely explain the reason you saw the assertion failure in the non-filtered case, as the legacy code probably was not expecting to receive a NULL dataspace. On top of that, the NULL dataspace seems like it is causing the parallel operation to break collective mode, which is not allowed when filters are involved. I would need to do some research as to why this happens before deciding whether it's more appropriate to modify this in HDF5 or to have PETSc not use NULL dataspaces.

Avoiding deadlock from the final sort has been an issue I had to re-tackle a few different times due to the nature of the code's complexity, but I will investigate using the chunk offset as a secondary sort key and see if it will run into problems in any other cases. Ideally, the chunk redistribution might be updated in the future to involve all ranks in the operation instead of just rank 0, also allowing for improvements to the redistribution algorithm that may solve these problems, but for the time being this may be sufficient.

And that's because of this logic up in PETSc:

  if (n > 0) {
    PetscStackCallHDF5Return(memspace,H5Screate_simple,(dim, count, NULL));
  } else {
    /* Can't create dataspace with zero for any dimension, so create null dataspace. */
    PetscStackCallHDF5Return(memspace,H5Screate,(H5S_NULL));
  }

where n is the number of elements in the rank's slice of the data. I
think. There is a corresponding branch later in the code:

  if (n > 0) {
    PetscStackCallHDF5Return(filespace,H5Dget_space,(dset_id));
    PetscStackCallHDF5(H5Sselect_hyperslab,(filespace, H5S_SELECT_SET, offset, NULL, count, NULL));
  } else {
    /* Create null filespace to match null memspace. */
    PetscStackCallHDF5Return(filespace,H5Screate,(H5S_NULL));
  }

It seems clear that PETSc is mishandling this situation, but I'm not
sure how to fix it if the comment is right. Advice?
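
For concreteness, the H5Sselect_none() approach Jordan describes would look roughly like the following in plain HDF5 calls. This is a sketch only, not tested: the PetscStackCallHDF5 wrappers and error checking are dropped, memtype, dxpl_id and arr stand in for the corresponding PETSc variables, and count may contain zeros (zero-sized dimensions have been accepted since HDF5 1.8.7, as noted later in this thread).

  hid_t memspace, filespace;

  memspace  = H5Screate_simple(dim, count, NULL);   /* count may be all zeros on a non-contributing rank */
  filespace = H5Dget_space(dset_id);
  if (n > 0) {
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
  } else {
    /* Non-contributing ranks still call H5Dwrite() collectively,
       but select nothing in both spaces. */
    H5Sselect_none(memspace);
    H5Sselect_none(filespace);
  }
  H5Dwrite(dset_id, memtype, memspace, filespace, dxpl_id, arr);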


Thank you for the validation, and for the suggestion to use
H5Sselect_none(). That is probably the right thing for the dataspace.
Not quite sure what to do about the memspace, though; the comment is
correct that we crash if any of the dimensions is zero.


Apparently this has been reported before as a problem with PETSc/HDF5
integration: https://lists.mcs.anl.gov/pipermail/petsc-users/2012-January/011980.html


Actually, it's not the H5Screate() that crashes; that works fine since
HDF5 1.8.7. It's a zero-sized malloc somewhere inside the call to
H5Dwrite(), possibly in the filter. I think this is close to
resolution; just have to get tools on it.


It does appear as though it's the "update" chunk that is zero-sized.
Is there any way to know that before decompressing, and to skip the
update higher up in the stack (perhaps in
H5D__link_chunk_filtered_collective_io())?

···

On Thu, Nov 9, 2017 at 11:18 AM, Michael K. Edwards <m.k.edwards@gmail.com> wrote:

Would it be better for that read-decompress-update-recompress-write
operation to skip zero-sized chunks? I imagine it's a bit tricky if
the lowest-indexed rank's contribution to the chunk is zero-sized; but
can that happen? Doesn't ownership move to the rank that has the
largest contribution to the chunk that's being written?

On Thu, Nov 9, 2017 at 10:26 AM, Jordan Henderson <jhenderson@hdfgroup.org> wrote:

Since Parallel Compression operates by applying the filter on a
per-chunk-basis, this should be consistent with what you're seeing. However,
zero-sized chunks is a case I had not actually considered yet, and I could
reasonably see blosc failing due to a zero-sized allocation.

Since reading in the parallel case with filters doesn't affect the metadata,
the H5D__construct_filtered_io_info_list() function will simply cause each
rank to construct a local list of all the chunks they have selected in the
read operation, read their respective chunks into locally-allocated buffers,
and decompress the data on a chunk-by-chunk basis, scattering it to the read
buffer along the way. Writing works the same way in that each rank works on
their own local list of chunks, with the exception that some of the chunks
may get shifted around before the actual write operation of "pull data from
the read buffer, decompress the chunk, update the chunk, re-compress the
chunk and write it" happens. In general, it shouldn't cause an issue that
you're reading the Dataset with a different number of MPI ranks than it was
written with.
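
For reference, the reader side is then just an ordinary collective read; the number of reader ranks is independent of the writer because each rank only describes its own selection. A rough sketch (identifiers are placeholders, error checking elided):

  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

  hid_t filespace = H5Dget_space(dset_id);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
  hid_t memspace = H5Screate_simple(1, count, NULL);

  /* Filters are undone per chunk inside the library; the call itself
     looks the same as an unfiltered collective read. */
  H5Dread(dset_id, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, rbuf);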

I observe this comment in the H5Z-blosc code:

    /* Allocate an output buffer exactly as long as the input data; if
       the result is larger, we simply return 0. The filter is flagged
       as optional, so HDF5 marks the chunk as uncompressed and
       proceeds.
    */

In my current setup, I have not marked the filter with
H5Z_FLAG_MANDATORY, for this reason. Is this comment accurate for the
collective filtered path, or is it possible that the zero return code
is being treated as "compressed data is zero bytes long"?
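
For context, the mandatory/optional choice is made when the filter is attached to the dataset creation property list; a one-line sketch, where FILTER_BLOSC, cd_nelmts and cd_values stand in for whatever the plugin's setup code actually passes:

  /* Optional: if the filter returns 0, HDF5 may store the chunk uncompressed
     instead of failing the write. */
  H5Pset_filter(dcpl_id, FILTER_BLOSC, H5Z_FLAG_OPTIONAL, cd_nelmts, cd_values);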

···

On Thu, Nov 9, 2017 at 1:37 PM, Michael K. Edwards <m.k.edwards@gmail.com> wrote:

Thank you for the explanation. That's consistent with what I see when
I add a debug printf into H5D__construct_filtered_io_info_list(). So
I'm now looking into the filter situation. It's possible that the
H5Z-blosc glue is mishandling the case where the compressed data is
larger than the uncompressed data.

About to write 12 of 20
About to write 0 of 20
About to write 0 of 20
About to write 8 of 20
Rank 0 selected 12 of 20
Rank 1 selected 8 of 20
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't
finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in
H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3278 in
H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for
modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed

On Thu, Nov 9, 2017 at 1:02 PM, Jordan Henderson <jhenderson@hdfgroup.org> wrote:

For the purpose of collective I/O it is true that all ranks must call
H5Dwrite() so that they can participate in those collective operations that
are necessary (the file space re-allocation and so on). However, even though
they called H5Dwrite() with a valid memspace, the fact that they have a NONE
selection in the given file space should cause their chunk-file mapping
struct (see lines 357-385 of H5Dpkg.h for the struct's definition and the
code for H5D__link_chunk_filtered_collective_io() to see how it uses this
built up list of chunks selected in the file) to contain no entries in the
"fm->sel_chunks" field. That alone should mean that during the chunk
redistribution, they will not actually send anything at all to any of the
ranks. They only participate there for the sake that, were the method of
redistribution modified, ranks which previously had no chunks selected could
potentially be given some chunks to work on.

For all practical purposes, every single chunk_entry seen in the list from
rank 0's perspective should be a valid I/O caused by some rank writing some
positive amount of bytes to the chunk. On rank 0's side, you should be able
to check the io_size field of each of the chunk_entry entries and see how
big the I/O is from the "original_owner" to that chunk. If any of these are
0, something is likely very wrong. If that is indeed the case, you could
likely pull a hacky workaround by manually removing them from the list, but
I'd be more concerned about the root of the problem if there are zero-size
I/O chunk_entry entries being added to the list.

That does appear to have been the problem. I modified H5Z-blosc to
allocate enough room for the BLOSC header, and to fall back to memcpy
mode (clevel=0) if the data expands during "compressed" encoding.
This unblocks me, though I think it might be a good idea for the
collective filtered I/O path to handle H5Z_FLAG_OPTIONAL properly.
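
Roughly the shape of that change in the compression branch of the filter callback, as a sketch rather than the actual H5Z-blosc patch: clevel, doshuffle and typesize stand in for values parsed from cd_values, and it uses the public H5allocate_memory()/H5free_memory() calls per Dana's advice.

  /* Always leave room for the blosc header, and never hand HDF5 a
     zero-sized "compressed" chunk. */
  size_t outbuf_size = nbytes + BLOSC_MAX_OVERHEAD;
  void  *outbuf      = H5allocate_memory(outbuf_size, 0);
  int    status;

  if (NULL == outbuf)
      return 0;

  status = blosc_compress(clevel, doshuffle, typesize, nbytes,
                          *buf, outbuf, outbuf_size);
  if (status <= 0)
      /* Data expanded or compression failed: retry in memcpy mode
         (clevel 0) rather than returning 0 bytes. */
      status = blosc_compress(0, 0, typesize, nbytes, *buf, outbuf, outbuf_size);

  if (status > 0) {
      H5free_memory(*buf);
      *buf      = outbuf;
      *buf_size = outbuf_size;
      return (size_t)status;
  }

  H5free_memory(outbuf);
  return 0;   /* an optional filter is then skipped for this chunk */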

Would it be helpful for me to send a patch once I've cleaned up my
debugging goop? What's a good way to do that -- github pull request?
Do you need a contributor agreement / copyright assignment / some such
thing?


Just so it's clear: the fixes are mostly in the plugins and in how
PETSc calls into the HDF5 code. (It should probably never have mixed
simple and null dataspaces in one collective write.) The fixes to
HDF5 itself are:

* Dana's observations with regard to the H5MM APIs:
  * the inappropriate assert(size > 0) in H5MM_[mc]alloc in the
develop branch; and
  * the recommendation to use H5allocate/resize/free_memory() rather
than the private APIs.
* The recommendation to sort by chunk address within each owner's
range of chunk entries, to avoid risk of deadlock in the
H5D__chunk_redistribute_shared_chunks() code.

I haven't switched to H5allocate/resize/free_memory() yet, but here's
(minimally tested) code to handle the other two issues.

Cheers,
- Michael

diff --git a/src/H5Dmpio.c b/src/H5Dmpio.c
index 79572c0..60e9f03 100644
--- a/src/H5Dmpio.c
+++ b/src/H5Dmpio.c
@@ -2328,14 +2328,22 @@ H5D__cmp_filtered_collective_io_info_entry(const void *filtered_collective_io_in
 static int
 H5D__cmp_filtered_collective_io_info_entry_owner(const void *filtered_collective_io_info_entry1, const void *filtered_collective_io_info_entry2)
 {
-    int owner1 = -1, owner2 = -1;
+    int owner1 = -1, owner2 = -1, delta = 0;
+    haddr_t addr1 = HADDR_UNDEF, addr2 = HADDR_UNDEF;

     FUNC_ENTER_STATIC_NOERR

     owner1 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry1)->owners.original_owner;
     owner2 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry2)->owners.original_owner;

-
-    FUNC_LEAVE_NOAPI(owner1 - owner2)
+    if (owner1 != owner2) {
+        delta = owner1 - owner2;
+    } else {
+        addr1 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry1)->chunk_states.new_chunk.offset;
+        addr2 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry2)->chunk_states.new_chunk.offset;
+        delta = H5F_addr_cmp(addr1, addr2);
+    }
+
+    FUNC_LEAVE_NOAPI(delta)
 } /* end H5D__cmp_filtered_collective_io_info_entry_owner() */

diff --git a/src/H5MM.c b/src/H5MM.c
index ee3b28f..3f06850 100644
--- a/src/H5MM.c
+++ b/src/H5MM.c
@@ -268,8 +268,6 @@ H5MM_malloc(size_t size)
 {
     void *ret_value = NULL;

-    HDassert(size);
-
     /* Use FUNC_ENTER_NOAPI_NOINIT_NOERR here to avoid performance issues */
     FUNC_ENTER_NOAPI_NOINIT_NOERR

@@ -357,8 +355,6 @@ H5MM_calloc(size_t size)
 {
     void *ret_value = NULL;

-    HDassert(size);
-
     /* Use FUNC_ENTER_NOAPI_NOINIT_NOERR here to avoid performance issues */
     FUNC_ENTER_NOAPI_NOINIT_NOERR


In develop, H5MM_malloc() and H5MM_calloc() will throw an assert if size is zero. That should not be there and the function docs even say that we return NULL on size zero.

The bad line is at lines 271 and 360 in H5MM.c if you want to try yanking that out and rebuilding.

Dana


By zero-sized chunks do you mean to say that the actual chunks in the dataset are zero-sized or the data going to the write is zero-sized? It would seem odd to me if you were writing to an essentially zero-sized dataset composed of zero-sized chunks.

On the other hand, for ranks that aren't participating, they should never construct a list of chunks in the H5D__construct_filtered_io_info_list() function and thus should never participate in any chunk updating, only the collective file space re-allocations and re-insertion of chunks into the chunk index. That being said, if you are indeed seeing zero-sized malloc calls in the chunk update function, something must be wrong somewhere. While it is true that the chunks currently move to the rank with the largest contribution to the chunk which ALSO has the least amount of chunks currently assigned to it (to try and get a more even distribution of chunks among all the ranks), any rank which has a zero-sized contribution to a chunk should never have created a chunk struct entry for the chunk and thus should not be participating in the chunk updating loop (lines 1471-1474 in the current develop branch). They should pass that loop and wait at the subsequent H5D__mpio_array_gatherv() until the other ranks get done processing. Again, this is what should happen but in your case may not be the actuality of the situation.

In the H5D__link_chunk_filtered_collective_io() function, all ranks (after some initialization work) should first hit H5D__construct_filtered_io_info_list(). Inside that function, at line 2741, each rank counts the number of chunks it has selected. Only if a rank has any selected should it then proceed with building its local list of chunks. At that point, all the ranks which aren't participating should skip this and wait for the other ranks to get done before everyone participates in the chunk redistribution. Then, the non-participating ranks shouldn't have any chunks assigned to them since they could not be considered among the crowd of ranks writing the most to any of the chunks. They should then return from the function back to H5D__link_chunk_filtered_collective_io(), with chunk_list_num_entries telling them that they have no chunks to work on. At that point they should skip the loop at 1471-1474 and wait for the others. The only case I can currently imagine where the chunk redistribution could get confused would be where no one at all is writing to anything. Multi-chunk I/O specifically handles this but I'm not sure if Link-chunk I/O will handle the case as well as Multi-Chunk does.

This is all of course if I understand what you mean by the zero-sized chunks, which I believe I understand due to the fact that your file space for the chunks is positive in size.

Also, now that the hanging issue has been resolved, would it be possible to try this same code again with a different filter, perhaps within the gzip/szip family? I'm curious as to whether the filter has anything to do with this issue or not.
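
A quick way to run that experiment is to swap the external plugin for the built-in deflate filter on the dataset creation property list (a sketch; ndims and chunk_dims are whatever the chunked dataset already uses):

  hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl_id, ndims, chunk_dims);
  H5Pset_deflate(dcpl_id, 6);   /* gzip level 6 in place of the blosc filter */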

Thank you. That got me farther along. The crash is now in the
H5Z-blosc filter glue, and should be easy to fix. It's interesting
that the filter is applied on a per-chunk basis, including on
zero-sized chunks; it's possible that something is wrong higher up the
stack. I haven't really thought about collective read with filters
yet. Jordan, can you fill me in on how that's supposed to work,
especially if the reader has a different number of MPI ranks than the
writer had?

HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't
finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in
H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3277 in
H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for
modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
  #008: /home/centos/blosc/hdf5-blosc/src/blosc_filter.c line 250 in
blosc_filter(): Can't allocate decompression buffer
    major: Data filters
    minor: Callback failed



I added a debug printf (I am currently running a test with 4 ranks on
the same host), and here is what I see. The "M of N" numbers reflect
the size of the memspace and filespace respectively. The printf is
inserted immediately before H5Dwrite() in my modified version of
ISView_General_HDF5() (in PETSc's
src/vec/is/is/impls/general/general.c).

About to write 148 of 636
About to write 176 of 636
About to write 163 of 636
About to write 149 of 636
About to write 176 of 636
About to write 148 of 636
About to write 149 of 636
About to write 163 of 636
About to write 310 of 1136
About to write 266 of 1136
About to write 258 of 1136
About to write 302 of 1136
About to write 310 of 1136
About to write 266 of 1136
About to write 258 of 1136
About to write 302 of 1136
About to write 124 of 520
About to write 120 of 520
About to write 140 of 520
About to write 136 of 520
About to write 23 of 80
About to write 19 of 80
About to write 14 of 80
About to write 24 of 80
About to write 12 of 20
About to write 0 of 20
About to write 0 of 20
About to write 8 of 20
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't
finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in
H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3277 in
H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for
modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed

I'm trying to do this in the way you suggested, where non-contributing
ranks create a zero-sized memspace (with the appropriate dimensions)
and call H5Sselect_none() on the filespace, then call H5Dwrite() in
the usual way to participate in the collective write. Where in the
code would you expect the test that filters out zero-sized chunks to
be?
