Collective IO and filters

I'm trying to write an HDF5 file with dataset compression from an MPI
job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.)
After running into the "Parallel I/O does not support filters yet"
error message in release versions of HDF5, I have turned to the
develop branch. Clearly there has been much work towards collective
filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly
it is not quite ready for prime time yet. So far I've encountered a
livelock scenario with ZFP, reproduced it with SZIP, and, with no
filters at all, obtained this nifty error message:

ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion
`fm->m_ndims==fm->f_ndims' failed.
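
(For reference, the write path I'm exercising boils down to roughly the sketch below. The file/dataset names, sizes, and the SZIP settings are placeholders standing in for what PETSc and the ZFP/SZIP filters actually set up, not the real code.)

#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Parallel file access via MPI-IO */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked dataset with a filter; filters require a chunked layout */
    hsize_t dims[1]  = {1024};
    hsize_t chunk[1] = {128};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_szip(dcpl, H5_SZIP_NN_OPTION_MASK, 8);   /* placeholder filter choice */
    hid_t dset = H5Dcreate2(file, "v", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Collective transfer property list; this is the filtered collective path */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* ... each rank selects its hyperslab of "space", builds a matching
     *     memory space, and calls H5Dwrite(dset, H5T_NATIVE_FLOAT,
     *     memspace, space, dxpl, buf) ... */

    H5Pclose(dxpl);
    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}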

Has anyone on this list been able to write parallel HDF5 using a
recent state of the develop branch, with or without filters
configured?

Thanks,
- Michael

Hi Michael,

I have not tried this in parallel yet. That said, what scale are you trying to do this at? 1000 ranks or 1,000,000 ranks? Something in between?

My understanding is that there are some known scaling issues out past maybe 10,000 ranks. I haven't heard of outright assertion failures there, though.

Mark

"Hdf-forum on behalf of Michael K. Edwards" wrote:

I'm trying to write an HDF5 file with dataset compression from an MPI
job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.)
After running into the "Parallel I/O does not support filters yet"
error message in release versions of HDF5, I have turned to the
develop branch. Clearly there has been much work towards collective
filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly
it is not quite ready for prime time yet. So far I've encountered a
livelock scenario with ZFP, reproduced it with SZIP, and, with no
filters at all, obtained this nifty error message:

ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion
`fm->m_ndims==fm->f_ndims' failed.

Has anyone on this list been able to write parallel HDF5 using a
recent state of the develop branch, with or without filters
configured?

Thanks,
- Michael

···

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Closer to 1000 ranks initially. There's a bug in handling the case
where some of the writers don't have any data to contribute (because
there's a dimension smaller than the number of ranks), which I have
worked around like this:

diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c
index af6599a..9522478 100644
--- a/src/H5Dchunk.c
+++ b/src/H5Dchunk.c
@@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t *fm)
         /* Indicate that the chunk's memory space is shared */
         chunk_info->mspace_shared = TRUE;
     } /* end if */
+    else if(H5SL_count(fm->sel_chunks)==0) {
+        /* No chunks, because no local data; avoid HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */
+    } /* end else if */
     else {
         /* Get bounding box for file selection */
         if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) < 0)

That makes the assert go away. Now I'm investigating a hang in the
chunk redistribution logic in rank 0, with a backtrace that looks like
this:

#0 0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
#1 0x00007f4bd5d3b341 in psm_progress_wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#2 0x00007f4bd5d3012d in MPID_Mprobe () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#3 0x00007f4bd5cbeeb4 in PMPI_Mprobe () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#4 0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
local_chunk_array=0x17f0f80,
    local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041
#5 0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00)
    at H5Dmpio.c:2794
#6 0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
dx_plist=0x16f7230) at H5Dmpio.c:1447
#7 0x00007f4bd81a027d in H5D__chunk_collective_io
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at
H5Dmpio.c:933
#8 0x00007f4bd81a0968 in H5D__chunk_collective_write
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104,
file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at
H5Dmpio.c:1018
#9 0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010,
mem_type_id=216172782113783851, mem_space=0x17dc770,
file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at
H5Dio.c:835
#10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010,
direct_write=false, mem_type_id=216172782113783851,
mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384,
buf=0x17d6240)
    at H5Dio.c:394
#11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680,
mem_type_id=216172782113783851, mem_space_id=288230376151711749,
file_space_id=288230376151711750, dxpl_id=720575940379279384,
    buf=0x17d6240) at H5Dio.c:318

The other ranks have moved past this and are hanging here:

#0 0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
#1 0x00007feb6fe25341 in psm_progress_wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#2 0x00007feb6fdd8975 in MPIC_Wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#3 0x00007feb6fdd918b in MPIC_Sendrecv () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#4 0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#5 0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#6 0x00007feb6fca1534 in MPIR_Allreduce_impl () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#7 0x00007feb6fca1b93 in PMPI_Allreduce () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#8 0x00007feb72287c2a in H5D__mpio_array_gatherv
(local_array=0x125f2d0, local_array_num_entries=0,
array_entry_size=368, _gathered_array=0x7ffff083f1d8,
    _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4,
allgather=true, root=0, comm=-1006632952, sort_func=0x0) at
H5Dmpio.c:479
#9 0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280,
dx_plist=0x11cf240) at H5Dmpio.c:1479
#10 0x00007feb7228a27d in H5D__chunk_collective_io
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at
H5Dmpio.c:933
#11 0x00007feb7228a968 in H5D__chunk_collective_write
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74,
file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at
H5Dmpio.c:1018
#12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0,
mem_type_id=216172782113783851, mem_space=0x124b450,
file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at
H5Dio.c:835
#13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0,
direct_write=false, mem_type_id=216172782113783851,
mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384,
buf=0x1244e80)
    at H5Dio.c:394
#14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680,
mem_type_id=216172782113783851, mem_space_id=288230376151711749,
file_space_id=288230376151711750, dxpl_id=720575940379279384,
    buf=0x1244e80) at H5Dio.c:318

(I'm currently running with this patch atop commit bf570b1, on an
earlier theory that the crashing bug may have crept in after Jordan's
big merge. I'll rebase on current develop but I doubt that'll change
much.)

The hang may or may not be directly related to the workaround being a
bit of a hack. I can set you up with full reproduction details if you
like; I seem to be getting some traction on it, but more eyeballs are
always good, especially if they're better set up for MPI tracing than
I am right now.


Hi Michael,

During the design phase of this feature I tried to both account for and test the case where some of the writers do not have any data to contribute. However, it seems that your use case falls outside of what I have tested (perhaps I have not used enough ranks?). In particular, my test cases were small and simply had some of the ranks call H5Sselect_none(), which doesn't seem to trigger this particular assertion failure. Is this how you're approaching these particular ranks in your code, or is there a different way you are having them participate in the write operation?

As for the hanging issue, it looks as though rank 0 is waiting to receive some modification data from another rank for a particular chunk. Whether there is actually valid data that rank 0 should be waiting for, I cannot easily tell without tracing it through. The other ranks have finished modifying their particular sets of chunks, moved on, and are now waiting for everyone to get together and broadcast their new chunk sizes so that free space in the file can be collectively re-allocated; rank 0, of course, is not proceeding. My best guess is that either:

  * The "num_writers" field for the chunk struct corresponding to the particular chunk that rank 0 is working on has been incorrectly set, causing rank 0 to think that there are more ranks writing to the chunk than the actual amount and consequently causing rank 0 to wait forever for a non-existent MPI message

or

  * The "new_owner" field of the chunk struct for this chunk was incorrectly set on the other ranks, causing them to never issue an MPI_Isend to rank 0, also causing rank 0 to wait for a non-existent MPI message

This feature should still be regarded as being in beta, and its complexity can lead to difficult-to-track-down bugs such as the ones you are currently encountering. That said, your feedback is very useful and will help push this feature toward a production-ready level of quality. Also, if it is feasible to come up with a minimal example that reproduces the issue, that would be very helpful and would make it much easier to diagnose exactly why these failures are occurring.

Thanks,
Jordan


Thanks, Jordan. I recognize that this is very recent feature work and
my goal is to help push it forward.

My current use case is relatively straightforward, though there are a
couple of layers on top of HDF5 itself. The problem can be reproduced
by building PETSc 3.8.1 against libraries built from the develop
branch of HDF5, adding in the H5Pset_filter() calls, and running an
example that exercises them. (I'm using
src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to
induce HDF5 writes.) If you want, I can supply full details for you
to reproduce it locally, or I can do any experiments you'd like me to
within this setup. (It also involves patches to the out-of-tree H5Z
plugins to make them use H5MM_malloc/H5MM_xfree rather than raw
malloc/free, which in turn involves exposing H5MMprivate.h to the
plugins. Is this something you've solved in a different way?)
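
(Concretely, the change amounts to something like the sketch below on the dataset creation property list. The chunk shape and cd_values are placeholders, 32013 is the H5Z-ZFP registered filter id as I understand it, and the function name is made up; this is not the exact PETSc patch.)

#include <hdf5.h>

/* Sketch: enable a dynamically loaded third-party filter on a chunked dcpl. */
static hid_t make_filtered_dcpl(void)
{
    const H5Z_filter_t zfp_filter_id = 32013;   /* H5Z-ZFP's registered id, as I understand it */
    hsize_t  chunk[2]     = {64, 64};           /* illustrative chunk shape */
    unsigned cd_values[6] = {0};                /* placeholder; real ZFP mode/params go here */

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);               /* filters require a chunked layout */
    H5Pset_filter(dcpl, zfp_filter_id, H5Z_FLAG_MANDATORY, 6, cd_values);
    return dcpl;                                /* passed to H5Dcreate2(); plugin loads at write time */
}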


It's not even clear to me yet whether this is the same dataset that
triggered the assert. Working on getting complete details. But FWIW
the PETSc code does not call H5Sselect_none(). It calls
H5Sselect_hyperslab() in all ranks, and that's why the ranks in which
the slice is zero columns wide hit the "empty sel_chunks" pathway I
added to H5D__create_chunk_mem_map_hyper().
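
(In other words, every rank's selection setup looks roughly like the sketch below; the 2-D layout, sizes, and function name are illustrative, not the exact PETSc code.)

#include <hdf5.h>

/* Sketch: every rank participates in the collective write, including ranks
 * whose local slice is empty.  Dimensions and offsets are illustrative. */
static hid_t select_local_slice(hid_t filespace, hsize_t nrows,
                                hsize_t local_cols, hsize_t col_offset)
{
    hsize_t start[2] = {0, col_offset};
    hsize_t count[2] = {nrows, local_cols};   /* local_cols == 0 on some ranks */

    /* PETSc's path calls H5Sselect_hyperslab() on every rank, even when the
     * slice is zero columns wide; Jordan's tests instead had the empty ranks
     * call H5Sselect_none().  The zero-width case is what reaches the
     * "empty sel_chunks" pathway in H5D__create_chunk_mem_map_hyper(). */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* Matching (possibly zero-sized) memory space for H5Dwrite() */
    return H5Screate_simple(2, count, NULL);
}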


Also, I should add that the HDF5 files appear to be written properly
when run under "mpiexec -n 1", and valgrind doesn't report any bogus
malloc/free calls or wild pointers. So I don't think it's a problem
with how I've massaged the H5Z plugins or the PETSc code.


For ease of development I currently use the in-tree filters in my tests so I haven't had to deal with the issue of H5MM_ vs raw memory routines inside the filters, though I don't suspect this should make a difference anyway.

I had suspected that the underlying code might be approaching the write in a different way, and that will certainly need to be addressed. I am surprised, however, that this behavior hasn't been seen before: it is legacy code that parallel HDF5 operations without filters should have been hitting even before my merge of the new code, so this is worth looking into.

I should be able to look into building PETSc against the HDF5 develop branch with MVAPICH2, but if there are any "gotchas" I should be aware of beforehand, please let me know. Also, if you happen to run into any revelations about the behavior you're seeing, I'd be happy to discuss them and see what arises in the way of a workable solution.

The raw malloc/free calls inside the out-of-tree filters definitely
broke with the develop branch. The buffer pointer allocated by the
caller using H5MM_malloc(), and passed into H5Z_filter_zfp() with the
expectation that it will be replaced with a newly allocated buffer of
a different size, cannot be manipulated with raw free/malloc.
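
(The pattern in question looks roughly like the sketch below. It is not the actual H5Z-ZFP code, the function name is made up, and it assumes H5MMprivate.h is visible to the plugin, as in my patch.)

#include "H5MMprivate.h"   /* private header; exposed to the plugin in my patch */

/* Sketch of the H5Z_func_t buffer handoff, not the actual H5Z-ZFP code. */
static size_t
filter_example(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
               size_t nbytes, size_t *buf_size, void **buf)
{
    size_t out_size = nbytes;   /* placeholder; the real filter computes this */
    void  *out_buf;

    (void)flags; (void)cd_nelmts; (void)cd_values;

    if (NULL == (out_buf = H5MM_malloc(out_size)))
        return 0;               /* returning 0 signals filter failure */

    /* ... (de)compress nbytes from *buf into out_buf ... */

    H5MM_xfree(*buf);           /* the incoming buffer was H5MM_malloc'd by the library */
    *buf      = out_buf;        /* hand back a buffer the library can H5MM_xfree later */
    *buf_size = out_size;
    return out_size;            /* number of valid bytes now in *buf */
}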


In case it helps, here's an example of a patch to an out-of-tree
compressor plugin. It's not the right solution, because H5MMprivate.h
(and its dependencies) ought to stay private. Presumably plugins will
either need an isolated header with these two functions in it, or a
variant API that passes in a pair of function pointers.

H5Z-zfp.patch (2.79 KB)
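
To make the second of those ideas concrete, here is a purely hypothetical sketch of what a variant API passing allocator hooks into a plugin could look like; none of these types exist in HDF5 today, and the names are invented:

#include <stddef.h>

/* Hypothetical allocator hooks the library could hand to a plugin so the
 * plugin never needs H5MMprivate.h. */
typedef void *(*H5Z_malloc_func_t)(size_t size);
typedef void  (*H5Z_free_func_t)(void *mem);

typedef struct H5Z_mem_hooks_t {
    H5Z_malloc_func_t alloc;     /* would be backed by H5MM_malloc() internally */
    H5Z_free_func_t   dealloc;   /* would be backed by H5MM_xfree() internally  */
} H5Z_mem_hooks_t;

/* A filter callback variant that receives the hooks alongside the usual
 * H5Z_func_t arguments and uses them for any buffer it frees or replaces. */
typedef size_t (*H5Z_func_with_hooks_t)(unsigned int flags, size_t cd_nelmts,
                                        const unsigned int cd_values[],
                                        size_t nbytes, size_t *buf_size,
                                        void **buf,
                                        const H5Z_mem_hooks_t *mem_hooks);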

···

On Wed, Nov 8, 2017 at 12:38 PM, Michael K. Edwards <m.k.edwards@gmail.com> wrote:


Ah yes, I can see what you mean by the difference between the use of these causing issues between in-tree and out-of-tree plugins. This is particularly interesting in that it makes sense to allocate the chunk data buffers using the H5MM_ routines to be compliant with the standards of HDF5 library development, but causes issues with those plugins which use the raw memory routines. Conversely, if the chunk buffers were to be allocated using the raw routines, it would break compatibility with the in-tree filters. Thank you for bringing this to my attention; I believe I will need to think on this one, as there are a few different ways of approaching the problem, with some being more "correct" than others.

I'm reasonably confident now that this hang is unrelated to the
"writers contributing zero data" workaround. The three ranks that
have made it to H5Dmpio.c:1479 all have nonzero nelmts in the call to
H5D__chunk_collective_write() up the stack. (And I did check that
they're all still trying to write to the same dataset.)

Here's what I see in rank 0:

(gdb) p *chunk_entry
$5 = {index = 0, scaled = {0, 0, 0, 18446744073709551615 <repeats 30 times>},
  full_overwrite = false, num_writers = 4, io_size = 832, buf = 0x0,
  chunk_states = {chunk_current = {offset = 4720, length = 6},
    new_chunk = {offset = 4720, length = 6}},
  owners = {original_owner = 0, new_owner = 0},
  async_info = {receive_requests_array = 0x30c2870,
    receive_buffer_array = 0x30c2f20, num_receive_requests = 3}}

And here's what I see in rank 3:

(gdb) p *chunk_list
$3 = {index = 0, scaled = {0 <repeats 33 times>}, full_overwrite =
false, num_writers = 4, io_size = 592, buf = 0x0, chunk_states =
{chunk_current = {offset = 4720, length = 6}, new_chunk = {
      offset = 4720, length = 6}}, owners = {original_owner = 3,
new_owner = 0}, async_info = {receive_requests_array = 0x0,
receive_buffer_array = 0x0, num_receive_requests = 0}}

The loop index "j" in the receive loop in rank 0 is still 0, which
suggests that it has not received any messages from the other ranks.
The breakage could certainly be down in the MPI implementation. I am
running Intel's build of MVAPICH2 2.2 (as bundled with their current
Omni-Path release blob), and it visibly has performance "issues" in my
dev environment. It's not out of the realm of the plausible that it's
not delivering these messages. It's just odd that it manages to slog
through in the unfiltered case and not in this filtered case.

···

On Wed, Nov 8, 2017 at 12:57 PM, Jordan Henderson <jhenderson@hdfgroup.org> wrote:


The public H5allocate/resize/free_memory() API calls use the library's memory allocator to manage memory, if that is what you are looking for.

https://support.hdfgroup.org/HDF5/doc/RM/RM_H5.html

Dana Robinson
Software Developer
The HDF Group

···

From: Hdf-forum <hdf-forum-bounces@lists.hdfgroup.org> on behalf of Jordan Henderson <jhenderson@hdfgroup.org>
Reply-To: HDF List <hdf-forum@lists.hdfgroup.org>
Date: Wednesday, November 8, 2017 at 12:59
To: "M.K.Edwards@gmail.com" <m.k.edwards@gmail.com>
Cc: HDF List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Collective IO and filters


Thank you, Dana! Do you think it would be appropriate (not just as of
the current implementation, but in terms of the interface contract) to
use H5free_memory() on the buffer passed into an H5Z plugin, replacing
it with a new (post-compression) buffer allocated via H5allocate_memory()?

···

On Wed, Nov 8, 2017 at 1:23 PM, Dana Robinson <derobins@hdfgroup.org> wrote:


Yes. We already do this in our test harness. See test/dynlib3.c in the source distribution. It's a very short source file and should be easy to understand.

Dana
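
For illustration, here is a minimal sketch of a filter that manages the buffer swap with the public H5allocate_memory()/H5free_memory() calls, in the spirit of test/dynlib3.c; the filter name and ID are placeholders, and the dynamic-plugin boilerplate (H5PLget_plugin_info() and friends) is omitted:

#include <string.h>
#include "hdf5.h"

#define H5Z_FILTER_EXAMPLE 32000   /* placeholder ID, not a registered one */

static size_t
H5Z_filter_example(unsigned int flags, size_t cd_nelmts,
                   const unsigned int cd_values[], size_t nbytes,
                   size_t *buf_size, void **buf)
{
    size_t out_size = nbytes;                     /* a real filter computes this */
    void  *out_buf  = H5allocate_memory(out_size, 0 /*clear*/);

    (void)flags; (void)cd_nelmts; (void)cd_values;
    if (NULL == out_buf)
        return 0;

    memcpy(out_buf, *buf, nbytes);                /* stand-in for encode/decode */

    H5free_memory(*buf);    /* same allocator the library used for the buffer */
    *buf      = out_buf;    /* safe for the library to release later */
    *buf_size = out_size;
    return out_size;
}

const H5Z_class2_t H5Z_EXAMPLE[1] = {{
    H5Z_CLASS_T_VERS,                   /* H5Z_class_t version     */
    (H5Z_filter_t)H5Z_FILTER_EXAMPLE,   /* filter ID               */
    1, 1,                               /* encoder/decoder present */
    "example: library memory routines", /* filter name             */
    NULL, NULL,                         /* can_apply, set_local    */
    (H5Z_func_t)H5Z_filter_example,     /* the callback above      */
}};

Keeping both sides of the buffer swap on the library's allocator is what avoids the mismatch described earlier in the thread.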


···

On 11/8/17, 13:28, "Michael K. Edwards" <m.k.edwards@gmail.com> wrote:
    

Dana,

would it then make sense for all outside filters to use these routines? Due to Parallel Compression's internal nature, it uses buffers allocated via H5MM_ routines to collect and scatter data, which works fine for the internal filters like deflate, since they use these as well. However, since some of the outside filters use the raw malloc/free routines, causing issues, I'm wondering if having all outside filters use the H5*_memory() routines is the cleanest solution.

Michael,

Based on the "num_writers: 4" field, the NULL "receive_requests_array" and the fact that for the same chunk, rank 0 shows "original owner: 0, new owner: 0" and rank 3 shows "original owner: 3, new_owner: 0", it seems as though everyone IS interested in the chunk the rank 0 is now working on, but now I'm more confident that at some point either the messages may have failed to send or rank 0 is having problems finding the messages.

Since in the unfiltered case it won't hit this particular code path, I'm not surprised that that case succeeds. If I had to make another guess based on this, I would be inclined to think that rank 0 must be hanging on the MPI_Mprobe due to a mismatch in the "tag" field. I use the index of the chunk as the tag for the message in order to funnel specific messages to the correct rank for the correct chunk during the last part of the chunk redistribution, and if rank 0 can't match the tag it of course won't find the message. Why this might be happening, I'm not entirely certain yet.
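
As an illustration of the pattern being described (a simplified sketch, not the actual H5Dmpio.c code; the function and variable names are invented), the new owner of a chunk probes for messages whose tag is the chunk index and receives one from each of the other writers:

#include <stdlib.h>
#include <mpi.h>

/* Sketch: the rank that now owns a chunk collects the modification data sent
 * by the other writers, matching messages by the chunk index used as the tag. */
static int
receive_chunk_updates(MPI_Comm comm, int chunk_index, int num_writers,
                      char **bufs_out)
{
    /* The owner already holds its own data, so it expects num_writers - 1
     * messages (e.g. 3 when num_writers = 4, as in the gdb output above). */
    for (int j = 0; j < num_writers - 1; j++) {
        MPI_Message msg;
        MPI_Status  status;
        int         count = 0;

        /* Blocks until a message with this exact tag arrives from any rank;
         * a tag mismatch on the sending side would leave us waiting forever. */
        if (MPI_Mprobe(MPI_ANY_SOURCE, chunk_index, comm, &msg, &status) != MPI_SUCCESS)
            return -1;

        MPI_Get_count(&status, MPI_BYTE, &count);
        if (NULL == (bufs_out[j] = malloc((size_t)count)))
            return -1;

        if (MPI_Mrecv(bufs_out[j], count, MPI_BYTE, &msg, &status) != MPI_SUCCESS)
            return -1;
    }
    return 0;
}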

Great. What's the best way to communicate this to plugin developers,
so that their code gets updated appropriately in advance of the 1.12
release?

···

On Wed, Nov 8, 2017 at 1:41 PM, Dana Robinson <derobins@hdfgroup.org> wrote:


Yes. All outside code that frees, allocates, or reallocates memory created inside the library (or that will be passed back into the library, where it could be freed or reallocated) should use these functions. This includes filters.

Dana

···

From: Jordan Henderson <jhenderson@hdfgroup.org>
Date: Wednesday, November 8, 2017 at 13:46
To: Dana Robinson <derobins@hdfgroup.org>, "M.K.Edwards@gmail.com" <m.k.edwards@gmail.com>, HDF List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Collective IO and filters


I see that you're re-sorting by owner using a comparator called
H5D__cmp_filtered_collective_io_info_entry_owner(), which does not sort
by a secondary key within items that have equal owners. That, together
with a sort that isn't stable (which HDqsort() probably isn't on most
platforms; quicksort/introsort is not stable), will scramble the order
in which different ranks traverse their local chunk arrays. That will
cause deadly embraces between ranks that are waiting for each other's
chunks to be sent. To fix that, it's probably sufficient to use the
chunk offset as a secondary sort key in that comparator, as sketched below.
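
Here is a sketch of such a comparator (the struct and field names are modelled on the gdb output above, not on the actual H5Dmpio.c definitions); sorting by new owner and then by chunk index gives every rank the same per-owner ordering even with an unstable qsort():

#include <stdlib.h>

/* Simplified stand-in for the per-chunk entry; only the fields the
 * comparator needs are shown. */
struct chunk_entry {
    unsigned long long index;                          /* chunk index in the dataset */
    struct { int original_owner; int new_owner; } owners;
};

static int
cmp_chunk_entry_owner(const void *a, const void *b)
{
    const struct chunk_entry *ea = (const struct chunk_entry *)a;
    const struct chunk_entry *eb = (const struct chunk_entry *)b;

    /* Primary key: the rank that will own the chunk after redistribution. */
    if (ea->owners.new_owner != eb->owners.new_owner)
        return (ea->owners.new_owner < eb->owners.new_owner) ? -1 : 1;

    /* Secondary key: chunk index (the chunk's file offset would work too),
     * so ties are broken identically on every rank. */
    if (ea->index != eb->index)
        return (ea->index < eb->index) ? -1 : 1;
    return 0;
}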

That's not the root cause of the hang I'm currently experiencing,
though. Still digging into that.

···

On Wed, Nov 8, 2017 at 1:50 PM, Dana Robinson <derobins@hdfgroup.org> wrote:


A bit of historical context:

The H5*_memory() API calls were added primarily to help Windows users. In Windows, the C run-time is implemented in shared libraries tied to a particular version of Visual Studio. Even for a given version of Visual Studio, there are independent debug and release libraries. This caused problems when people allocated memory in, say, a release version of the HDF5 library and freed it in their debug-version application, since the different C runtimes have different memory allocator state that is not shared. Users on other systems care less about this problem since there is rarely a plethora of C libraries to link to (though it can be a problem when people use debug memory allocators).

H5free_memory() was initially introduced because a few of our API calls return buffers that the user must free. The allocate/reallocate calls came later, for use in filters on Windows. It's interesting that those functions will now be needed for parallel compression.

Dana

···

From: Hdf-forum <hdf-forum-bounces@lists.hdfgroup.org> on behalf of Dana Robinson <derobins@hdfgroup.org>
Reply-To: HDF List <hdf-forum@lists.hdfgroup.org>
Date: Wednesday, November 8, 2017 at 13:52
To: Jordan Henderson <jhenderson@hdfgroup.org>, "M.K.Edwards@gmail.com" <m.k.edwards@gmail.com>, HDF List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Collective IO and filters


Since in the unfiltered case it won't hit this particular code path, I'm not surprised that that case succeeds. If I had to make another guess based on this, I would be inclined to think that rank 0 must be hanging on the MPI_Mprobe due to a mismatch in the "tag" field. I use the index of the chunk as the tag for the message in order to funnel specific messages to the correct rank for the correct chunk during the last part of the chunk redistribution and if rank 0 can't match the tag it of course won't find the message. Why this might be happening, I'm not entirely certain currently.