H5Dread crashes on Blue Gene/Q with "out of memory"

Dear forum members,

This may be too specialized a problem, but maybe somebody has some
insight.

Our code (running on an IBM Blue Gene/Q machine) reads in some data
using HDF5. This is done collectively, on every core (each rank reads
the same data at the same time). It is not known a priori which
processor owns which part of the data; each rank has to work this out
itself and discard the data it doesn't own. The data file is ~9.4 MB in
a simple test case. The data uses a custom datatype: a nested struct
with two 32-bit integers and two 64-bit doubles that form a complex
number, 192 bits in total.
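
For concreteness, the compound type is built roughly along these lines
(the member names and struct layout here are illustrative; the actual
definition is in the attached code):

typedef struct { double re, im; } cplx_t;               /* 2 x 64-bit double */
typedef struct { int ix, iy; cplx_t val; } record_t;    /* 2 x 32-bit int + complex */

hid_t cplx_tid = H5Tcreate(H5T_COMPOUND, sizeof(cplx_t));
H5Tinsert(cplx_tid, "re", HOFFSET(cplx_t, re), H5T_NATIVE_DOUBLE);
H5Tinsert(cplx_tid, "im", HOFFSET(cplx_t, im), H5T_NATIVE_DOUBLE);

hid_t record_tid = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
H5Tinsert(record_tid, "ix",  HOFFSET(record_t, ix),  H5T_NATIVE_INT);
H5Tinsert(record_tid, "iy",  HOFFSET(record_t, iy),  H5T_NATIVE_INT);
H5Tinsert(record_tid, "val", HOFFSET(record_t, val), cplx_tid);   /* nested compound */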

If I use fewer than 1024 cores, there is no problem. However, for >=1024
cores, I get a crash with the error

"Out of memory in file
/bgsys/source/srcV1R2M3.12428/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c,
line 1073"

We use parallel HDF5 1.8.15; I've also tried 1.8.14. Another library
dependency is FFTW 3.3.3, but that should not really matter.

I traced the crash with TotalView to the call to H5Dread(). The
second-to-last call in the crash trace is MPIDO_Alltoallv, the last one
is PAMI_Context_trylock_advancev. I don't have exact calls or line
numbers, since the HDF5 library was not compiled with debug symbols.
[The file mentioned in the error message is not accessible to me.]

Is this an HDF5 problem, or a problem with IBM's MPI implementation?
Might it be an MPI buffer overflow?!? Or is there maybe a problem with
data contiguity in the struct?

The problem disappears if I read the file in chunks of less than
192 KiB at a time. A more workable workaround is to replace collective
I/O with independent I/O, in which case the problem also goes away:
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE); -->
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);
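
For context, the read itself looks roughly like this (simplified; the
handle names are placeholders, record_tid is the compound type from
above):

hid_t plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);   /* workaround; was H5FD_MPIO_COLLECTIVE */
herr_t status = H5Dread(dset_id, record_tid, memspace, filespace, plist_id, buffer);
H5Pclose(plist_id);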

Since the data file is quite small (usually a few hundred megabytes at
most), reading the file independently is not a huge performance problem
at this stage, but for very large simulations it might be.

In other, older parts of the code we're (successfully!) reading in up
to 256 GiB of data in predefined datatypes (double, float) using
H5FD_MPIO_COLLECTIVE without any problem, so I suspect this problem is
connected with the user-defined datatype in some way.

I attach some condensed code with all calls to the HDF5 library; I'm
not sure anyone is in a position to actually reproduce this problem, so
the main() routine and the data file are probably unnecessary. However,
I'd be happy to send those as well if need be.

Thanks in advance for any hints.

Best regards,
Wolf

contMech-9.hdf5.cpp (6.26 KB)

···

--

I probably have a number of silly/dumb questions, but someone is bound to ask...

···

From: Wolf Dapp <wolf.dapp@gmail.com>
Date: Tuesday, September 1, 2015 7:34 AM
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Subject: [Hdf-forum] H5Dread crashes on Blue Gene/Q with "out of memory"

    Our code (running on an IBM BlueGene/Q machine) reads in some data,
    using HDF5. This is done collectively, on each core (everyone reads in
    the same data, at the same time). It is not known a priori which
    processor owns which part of the data, they have to compute this
    themselves and discard the data they don't own.

Hmm. For this use case, I assume the data to be read is *always* small enough for a single core. Why not read it independently on one core and broadcast and/or MPI_Send it yourself? I understand that is not your current approach, but what I suggest will very likely be much more scalable at large core counts versus all cores attacking the filesystem for the same bunch of bytes.
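
A minimal sketch of that read-and-broadcast idea (untested; the file
name, dataset name, record_t and record_tid are placeholders for
whatever the application actually uses, and buf must already be sized
to the dataset's extent on every rank):

#include <hdf5.h>
#include <mpi.h>
#include <vector>

static void read_and_broadcast(std::vector<record_t>& buf, hid_t record_tid)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* plain serial open/read on rank 0 only */
        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/records", H5P_DEFAULT);
        H5Dread(dset, record_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf.data());
        H5Dclose(dset);
        H5Fclose(file);
    }
    /* ship the raw bytes to everyone; fine here because the file is only a few MB */
    MPI_Bcast(buf.data(), (int)(buf.size() * sizeof(record_t)), MPI_BYTE, 0, MPI_COMM_WORLD);
}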

    The data file is ~9.4MB
    in a simple test case.

That certainly sounds small enough that what you describe should work at any core count.

    The data is a custom data type of a nested struct
    with two 32-bit integers and two 64-bit doubles that form a complex
    number, with a total of 192 bits.

Do you happen to know if any HDF5 'filters' are involved in reading this data (compression, custom conversion, etc.)?
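
One quick way to check from the application side, assuming a dataset
handle like the one in the attached code:

hid_t dcpl = H5Dget_create_plist(dset_id);
int nfilters = H5Pget_nfilters(dcpl);   /* 0 means no filter pipeline */
printf("filters on the dataset: %d\n", nfilters);
H5Pclose(dcpl);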

    If I use less than 1024 cores, there is no problem. However, for >=1024
    cores, I get a crash with the error

Is there really 'no problem', or is the problem still happening but just not bad enough to cause an OOM? I mean, maybe 512 cores would fail on a file of 18.8 MB?

"Out of memory in file
/bgsys/source/srcV1R2M3.12428/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c,
line 1073"

That's only where the last allocation failed, causing the OOM, right? Can you run a small problem under valgrind (maybe with the massif heap-profiling tool) to see what's happening as far as mallocs and frees?

    We use parallel HDF5 1.8.15; I've also tried 1.8.14. Another library
    dependence is FFTW 3.3.3, but that should not really matter.

    I traced the crash with Totalview to the call of H5Dread(). The
    second-to-last call in the crash trace is MPIDO_Alltoallv, the last one
    is PAMI_Context_trylock_advancev. I don't have exact calls nor line
    numbers since the HDF5 library was not compiled with debug symbols. [the
    file mentioned in the error message is not accessible]

Again, that's only getting you to the last malloc that failed. You need to use some kind of tool, like valgrind or memtrace, to find out where all the memory is getting allocated.

    Is this an HDF5 problem, or a problem with IBM's MPI implementation?
    Might it be an MPI buffer overflow?!? Or is there maybe a problem with
    data contiguity in the struct?

It's not possible to say at this point.

    The problem disappears if I read in the file in chunks of less than
    192kiB at a time. A more workable workaround is to replace collective
    communication by independent communication, in which case, the problem
    disappears.
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE); -->
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);

    Since this data file is quite small (usually not larger than a few
    hundred megabytes at most), reading in the file independently is not a
    huge performance problem at this stage, but for very large simulations
    it might be.

    In other, older parts of the code, we're (successfully!) reading in (up
    to) 256 GiB of data in predefined data types (double, float) using
    H5FD_MPIO_COLLECTIVE without any problem, so I'm thinking this problem
    is connected with the user-defined data type in some way.

Are these collective calls all reading the same part of a dataset, or all reading different parts? The use case you described above sounded like all cores were reading the same (whole) dataset. And 256 GiB is large enough that no one core could hold all of it, so each core here *must* be reading a different part of the dataset. Perhaps that is relevant?

    I attach some condensed code with all calls to the HDF5 library; I'm not
    sure anyone is in the position to actually reproduce this problem, so
    the main() routine and the data file are probably unnecessary. However,
    I'd be happy to also send those if need be.

If you cannot run valgrind/massif on BG/Q, try running the same code on another machine where you *can* run valgrind/massif. If non-system code is the culprit, you will be able to reproduce memory growth elsewhere. OTOH, if there is a problem down in BG/Q system code, moving to another machine would hide the problem.

Sorry I have only questions, but it's worth asking.


--

I was also getting the same error with MOAB from ANL when we were
benchmarking small mesh reads with a large number of processors. When I
ran on 16384 processes, the job would terminate with:
Out of memory in file /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c, line 1073

A semi-discussion about the problem can be found here:

http://lists.mpich.org/pipermail/devel/2013-May/000154.html

We did not have time in the project to look into the problem any further.

Scot

···


--


    Our code (running on an IBM BlueGene/Q machine) reads in some data,
    using HDF5. This is done collectively, on each core (everyone reads in
    the same data, at the same time). It is not known a priori which
    processor owns which part of the data, they have to compute this
    themselves and discard the data they don't own.

    Hmm. For this use case, I assume the data to be read is *always* small
    enough for a single core. Why not read it independently to one core and
    broadcast and/or MPI_Send yourself? I understand that is not your use
    case but what I suggest will very likely be much more scalable for large
    core counts vs. all cores attacking the filesystem for the same bunch of
    bytes.

Thanks for your reply, Mark.

In the code snippet I sent, we read in the data in chunks if it is not
sufficiently small -- we limit each chunk to ~48 MiB, which should
always fit. We did consider reading with one process and then
broadcasting (or some mildly parallel version of the same procedure),
but decided against it. However, we might reconsider if the current
approach proves infeasible or becomes a performance issue.
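
For the record, such a chunked read boils down to selecting successive
hyperslabs -- an illustrative sketch only, not necessarily the exact
loop from the attached file (buf is assumed to be sized to the full
dataset, and the other names follow the earlier snippets):

hsize_t total = 0;
H5Sget_simple_extent_dims(filespace, &total, NULL);               /* 1-D dataset assumed */
const hsize_t max_chunk = 48ULL * 1024 * 1024 / sizeof(record_t); /* ~48 MiB per read */
for (hsize_t offset = 0; offset < total; ) {
    hsize_t count = (total - offset < max_chunk) ? (total - offset) : max_chunk;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
    hid_t memspace = H5Screate_simple(1, &count, NULL);
    H5Dread(dset_id, record_tid, memspace, filespace, plist_id, buf.data() + offset);
    H5Sclose(memspace);
    offset += count;
}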

    The data file is ~9.4MB
    in a simple test case.

    That certainly sounds small enough that what you describe should work at
    any core count.

Precisely. There's nothing else in memory in the test case, either, and
it crashes even when each core has 2 GiB of memory.

    The data is a custom data type of a nested struct
    with two 32-bit integers and two 64-bit doubles that form a complex
    number, with a total of 192 bits.

    Do you happen to know if any HDF5 'filters' are involved in reading this
    data (compression, custom conversion, etc.)?

No compression or further conversions, just nested structs of native
datatypes, as indicated in the sample code.

    If I use less than 1024 cores, there is no problem. However, for >=1024
    cores, I get a crash with the error

    Is there really 'no problem' or is the problem really happening but it's
    just not bad enough to cause OOM? I mean maybe 512 cores fails on files
    of 18.8MB size?

There really seems to be 'no problem' in that case. As I mentioned, for
fewer cores the problem goes away if the chunk size is 192 KiB. In a
different test, 512 ranks worked fine with much larger data sets, while
1024 ranks failed in every case.

    "Out of memory in file
    /bgsys/source/srcV1R2M3.12428/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c,
    line 1073"

    That's only where the last allocation failed causing OOM, right? Can you
    run a small problem with valgrind (maybe with massif heap sizing tool)
    to see what's happening as far as mallocs and frees?

[...]

    Again, that's only getting you to the last malloc that failed. You need
    to use some kind of tool to find out where all the memory is getting
    allocated like valgrind or memtrace or something.

Haven't done that yet, but I will try (if possible). On a standard
cluster, it worked fine in tests with up to 128 ranks (I couldn't try
1024 ranks there).

    The problem disappears if I read in the file in chunks of less than
    192kiB at a time. A more workable workaround is to replace collective
    communication by independent communication, in which case, the problem
    disappears.
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE); -->
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);

    Since this data file is quite small (usually not larger than a few
    hundred megabytes at most), reading in the file independently is not a
    huge performance problem at this stage, but for very large simulations
    it might be.

    In other, older parts of the code, we're (successfully!) reading in (up
    to) 256 GiB of data in predefined data types (double, float) using
    H5FD_MPIO_COLLECTIVE without any problem, so I'm thinking this problem
    is connected with the user-defined data type in some way.

    Are these collective calls all reading the same part of a dataset or all
    reading different parts? The use-case you described above sounded like
    all cores were reading the same part (whole) dataset. And, 256GiB is
    large enough that no one core could hold all of that so each core here
    *must* be reading a different part of a dataset. Perhaps that is relevant?

Yes, indeed, it's the fact that every rank reads /everything/ that
causes the problem. If everyone reads only part of the file, there's no
problem, but that's not our use case here.

I'm asking on the hdf-forum because the problem occurred in H5Dread(),
and the case seems innocent, simple, and small enough that this
shouldn't happen. But (see Scot's email in this thread) it seems that
MPICH is the real culprit, and no solution has been put forth in the
last two years, only workarounds.

Thanks again for your questions and suggestions.
Wolf

···

On 01.09.2015 at 17:14, Miller, Mark C. wrote:

--

Thanks for pointing out this discussion, Scot. It seems that not only
did you not have time to investigate the problem further, but neither
did IBM nor MPICH :)

I guess this indicates that, at heart, it's not an HDF5 problem but an
MPICH problem, and that some memory allocations scale with the number
of ranks.

Though it seems your team hit the "invisible barrier" much later than we
did.

Cheers,
Wolf

···

On 01.09.2015 at 17:43, Scot Breitenfeld wrote:


--

Hello! I'm pleased to see another Blue Gene user.

MPI collective I/O works at Blue Gene scale -- most of the time. The exception appears to be when the distribution of data among processes is lumpy; e.g. everyone reads the exact same data, or some processes have more to write than others. In those cases, some internal memory allocations end up exhausting Blue Gene's memory.

You can limit the size of the intermediate buffer by setting the "cb_buffer_size" hint. Doing this splits the read or write into more rounds and so indirectly limits the total memory used. It's only a band-aid, though.
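
For example, the hint can be passed to HDF5 through the file-access
property list (the 4 MiB value here is just an illustration, not a
recommendation):

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_buffer_size", "4194304");     /* 4 MiB intermediate buffer */
hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
hid_t file_id = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl);
MPI_Info_free(&info);
H5Pclose(fapl);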

The read-and-broadcast approach is the best for your workload, and the one I end up suggesting any time this comes up.

Why don't we do this inside the MPI-IO library? Glad you asked! It turns out that, for a lot of reasons (file views, etypes, ftypes, and the fact that different datatypes may have identical type maps yet there's no good way to compare types in that sense), answering "did you all want to read the same data?" is actually kind of challenging inside the MPI-IO library.

It's easier to detect identical reads in HDF5, because one need only look at the (hyperslab) selection: determining "you are all asking for the entire dataset" or "you are all asking for one row of this 3-D variable" requires only comparing two N-dimensional arrays. This comparison is likely expensive at scale, though, so "easier" does not necessarily mean "good idea" -- I don't think we'd want it turned on for every access.

So that leaves the application, which indeed knows that everyone is reading the same data. It sort of sounds like passing the buck, and perhaps it is, but not for lack of effort from the other layers of the software stack.

==rob

···

On 09/01/2015 02:01 PM, Wolf Dapp wrote:


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA