Hi Michael,
During the design phase of this feature I tried to account for and test
the case where some of the writers have no data to contribute. However, it
seems your use case falls outside of what I tested (perhaps I did not use
enough ranks?). In particular, my test cases were small and simply had some
of the ranks call H5Sselect_none(), which does not seem to trigger this
assertion failure. Is that how the empty ranks participate in your code, or
do they take part in the write operation in a different way?
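For reference, the pattern my tests used was roughly the following (a minimal
sketch; names such as dset_id, coll_dxpl, local_count, start and count are just
placeholders, not anything from the library or your code): every rank makes the
collective H5Dwrite() call, but the ranks with nothing to write select nothing
in both dataspaces.

hid_t file_space = H5Dget_space(dset_id);
hid_t mem_space  = H5Screate_simple(1, mem_dims, NULL);

if (local_count == 0) {
    /* This rank contributes no data, but still participates
     * in the collective write call below. */
    H5Sselect_none(file_space);
    H5Sselect_none(mem_space);
}
else {
    /* Ranks with data select their portion of the file as usual. */
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);
}

/* coll_dxpl has H5Pset_dxpl_mpio(coll_dxpl, H5FD_MPIO_COLLECTIVE) applied. */
H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, mem_space, file_space, coll_dxpl, buf);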
As for the hanging issue, it looks as though rank 0 is waiting to receive
some modification data from another rank for a particular chunk. Whether
there is actually valid data that rank 0 should be waiting for, I cannot
easily tell without tracing it through. The other ranks have finished
modifying their own sets of chunks and have moved on: they are waiting for
everyone to get together and broadcast their new chunk sizes so that free
space in the file can be collectively re-allocated, but of course rank 0
never gets there. My best guess is that either:

1. the "num_writers" field of the chunk struct for the chunk rank 0 is
working on was set too high, so rank 0 thinks more ranks are writing to the
chunk than actually are and waits forever for an MPI message that will never
arrive; or

2. the "new_owner" field of that chunk struct was set incorrectly on the
other ranks, so they never issue an MPI_Isend to rank 0, again leaving rank 0
waiting for a non-existent message (see the sketch below).
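In rough terms, the exchange I have in mind looks like this. It is
illustrative only, not the actual code in H5Dmpio.c; the fields num_writers
and new_owner are the ones I mentioned above, while everything else
(chunk_tag, mod_data, mod_size, comm, ...) is a made-up stand-in.

if (chunk->new_owner == mpi_rank) {
    /* The rank that now owns the chunk expects one modification
     * message from every other writer of that chunk. */
    int msgs_expected = chunk->num_writers - 1;
    for (int i = 0; i < msgs_expected; i++) {
        MPI_Message msg;
        MPI_Status  status;

        /* If num_writers is too high, one of these probes never
         * completes and the owner hangs, as rank 0 does in your trace. */
        MPI_Mprobe(MPI_ANY_SOURCE, chunk_tag, comm, &msg, &status);
        /* ... MPI_Mrecv() the modification data here ... */
    }
}
else {
    /* Non-owning writers send their modifications to the new owner.
     * If new_owner were wrong on these ranks, the send would never be
     * addressed to rank 0 and the owner would hang just the same. */
    MPI_Request req;
    MPI_Isend(mod_data, mod_size, MPI_BYTE, chunk->new_owner, chunk_tag,
              comm, &req);
}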
This feature should still be regarded as beta, and its complexity can lead
to difficult-to-track-down bugs such as the ones you are currently hitting.
That said, your feedback is very useful and will help push the feature
towards production quality. If you can come up with a minimal example that
reproduces the issue, that would be very helpful and would make it much
easier to diagnose exactly why these failures occur.
Thanks,
Jordan
________________________________
From: Hdf-forum <hdf-forum-bounces@lists.hdfgroup.org> on behalf of Michael
K. Edwards <m.k.edwards@gmail.com>
Sent: Wednesday, November 8, 2017 11:23 AM
To: Miller, Mark C.
Cc: HDF Users Discussion List
Subject: Re: [Hdf-forum] Collective IO and filters
Closer to 1000 ranks initially. There's a bug in handling the case
where some of the writers don't have any data to contribute (because
there's a dimension smaller than the number of ranks), which I have
worked around like this:
diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c
index af6599a..9522478 100644
--- a/src/H5Dchunk.c
+++ b/src/H5Dchunk.c
@@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t *fm)
             /* Indicate that the chunk's memory space is shared */
             chunk_info->mspace_shared = TRUE;
         } /* end if */
+        else if(H5SL_count(fm->sel_chunks)==0) {
+            /* No chunks, because no local data; avoid HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */
+        } /* end else if */
         else {
             /* Get bounding box for file selection */
             if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) < 0)
That makes the assert go away. Now I'm investigating a hang in the
chunk-redistribution logic on rank 0, with a backtrace that looks like
this:
#0 0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
#1 0x00007f4bd5d3b341 in psm_progress_wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#2 0x00007f4bd5d3012d in MPID_Mprobe () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#3 0x00007f4bd5cbeeb4 in PMPI_Mprobe () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#4 0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
local_chunk_array=0x17f0f80,
local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041
#5 0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00)
at H5Dmpio.c:2794
#6 0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
dx_plist=0x16f7230) at H5Dmpio.c:1447
#7 0x00007f4bd81a027d in H5D__chunk_collective_io
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at
H5Dmpio.c:933
#8 0x00007f4bd81a0968 in H5D__chunk_collective_write
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104,
file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at
H5Dmpio.c:1018
#9 0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010,
mem_type_id=216172782113783851, mem_space=0x17dc770,
file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at
H5Dio.c:835
#10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010,
direct_write=false, mem_type_id=216172782113783851,
mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384,
buf=0x17d6240)
at H5Dio.c:394
#11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680,
mem_type_id=216172782113783851, mem_space_id=288230376151711749,
file_space_id=288230376151711750, dxpl_id=720575940379279384,
buf=0x17d6240) at H5Dio.c:318
The other ranks have moved past this and are hanging here:
#0 0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
#1 0x00007feb6fe25341 in psm_progress_wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#2 0x00007feb6fdd8975 in MPIC_Wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#3 0x00007feb6fdd918b in MPIC_Sendrecv () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#4 0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#5 0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#6 0x00007feb6fca1534 in MPIR_Allreduce_impl () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#7 0x00007feb6fca1b93 in PMPI_Allreduce () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#8 0x00007feb72287c2a in H5D__mpio_array_gatherv
(local_array=0x125f2d0, local_array_num_entries=0,
array_entry_size=368, _gathered_array=0x7ffff083f1d8,
_gathered_array_num_entries=0x7ffff083f1e8, nprocs=4,
allgather=true, root=0, comm=-1006632952, sort_func=0x0) at
H5Dmpio.c:479
#9 0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280,
dx_plist=0x11cf240) at H5Dmpio.c:1479
#10 0x00007feb7228a27d in H5D__chunk_collective_io
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at
H5Dmpio.c:933
#11 0x00007feb7228a968 in H5D__chunk_collective_write
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74,
file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at
H5Dmpio.c:1018
#12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0,
mem_type_id=216172782113783851, mem_space=0x124b450,
file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at
H5Dio.c:835
#13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0,
direct_write=false, mem_type_id=216172782113783851,
mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384,
buf=0x1244e80)
at H5Dio.c:394
#14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680,
mem_type_id=216172782113783851, mem_space_id=288230376151711749,
file_space_id=288230376151711750, dxpl_id=720575940379279384,
buf=0x1244e80) at H5Dio.c:318
(I'm currently running with this patch atop commit bf570b1, on an
earlier theory that the crashing bug may have crept in after Jordan's
big merge. I'll rebase on current develop but I doubt that'll change
much.)
The hang may or may not be directly related to the workaround being a
bit of a hack. I can set you up with full reproduction details if you
like; I seem to be getting some traction on it, but more eyeballs are
always good, especially if they're better set up for MPI tracing than
I am right now.
On Wed, Nov 8, 2017 at 8:48 AM, Miller, Mark C. <miller86@llnl.gov> wrote:
Hi Michael,
I have not tried this in parallel yet. That said, what scale are you trying
to do this at? 1000 ranks or 1,000,000 ranks? Something in between?
My understanding is that there are some known scaling issues out past maybe
10,000 ranks, but I have not heard of outright assertion failures there.
Mark
"Hdf-forum on behalf of Michael K. Edwards" wrote:
I'm trying to write an HDF5 file with dataset compression from an MPI
job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.)
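For context, the setup involved amounts to something like the following (a
simplified sketch with placeholder names and error checking omitted; my
actual writes go through PETSc, which does the equivalent internally):

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, ndims, chunk_dims);   /* chunked layout, required for filters */
H5Pset_deflate(dcpl, 6);                 /* or an SZIP/ZFP filter instead */

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

hid_t dset = H5Dcreate2(file_id, "data", H5T_NATIVE_DOUBLE, file_space,
                        H5P_DEFAULT, dcpl, H5P_DEFAULT);
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mem_space, file_space, dxpl, buf);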
After running into the "Parallel I/O does not support filters yet"
error message in release versions of HDF5, I have turned to the
develop branch. Clearly there has been much work towards collective
filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly
it is not quite ready for prime time yet. So far I've encountered a
livelock scenario with ZFP, reproduced it with SZIP, and, with no
filters at all, obtained this nifty error message:
ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion
`fm->m_ndims==fm->f_ndims' failed.
Has anyone on this list been able to write parallel HDF5 using a
recent state of the develop branch, with or without filters
configured?
Thanks,
- Michael
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5