Slow H5Ocopy using parallel HDF5

Dear HDF developers,

I have stumbled upon a severe performance problem in H5Ocopy when using
parallel HDF5. Please see the attached test programs for reproducing
the issue.

In my MPI program I achieve collective write speeds of 2000 MB/s from
16 nodes on a GPFS filesystem, so parallel HDF5 is working fine in
general. However, when copying datasets between two parallel files,
the copy time increases roughly linearly with the number of nodes.

In the following, each test was repeated 10 times and the smallest
time is reported. The environment was Parallel HDF5 1.8.14, Intel MPI
4.1.2.040, GPFS 3.5.0 and CentOS 6.4 on Linux x86_64.

Consider first a small compact dataset (32K):

# mpirun -np 1 -ppn 1 ./h5copy_mpio_compact
0.0292 s
# mpirun -np 2 -ppn 1 ./h5copy_mpio_compact
0.0343 s
# mpirun -np 4 -ppn 1 ./h5copy_mpio_compact
0.0411 s
# mpirun -np 8 -ppn 1 ./h5copy_mpio_compact
0.0409 s
# mpirun -np 16 -ppn 1 ./h5copy_mpio_compact
0.0407 s

The copy time is constant with the number of MPI nodes. The dataset
has a compact layout, thus it consists purely of metadata. This test
indicates that metadata copying is working fine.

Now consider a larger contiguous dataset (32M):

# mpirun -np 1 -ppn 1 ./h5copy_mpio
0.0723 s
# mpirun -np 2 -ppn 1 ./h5copy_mpio
0.371 s
# mpirun -np 4 -ppn 1 ./h5copy_mpio
1.91 s
# mpirun -np 8 -ppn 1 ./h5copy_mpio
4.02 s
# mpirun -np 16 -ppn 1 ./h5copy_mpio
9.49 s

The copy time increases roughly linearly with the number of MPI nodes,
even though the size of the raw data being copied is the same for all
cases. Could it be that all processes are trying to write the same raw
data to the destination object, causing serious write contention?

I would expect that while all processes copy the metadata to their
respective metadata cache, only one process copies the raw data to
the output file. However, while trying to understand the source code
of H5Ocopy, I could not find any special handling of the MPIO case.

Can you reproduce the issue on your parallel filesystem?

Which part of H5Ocopy might be causing the issue?

Regards,
Peter

h5copy_mpio.c (1.35 KB)

h5copy_mpio_compact.c (1.44 KB)
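
In essence, each test program times a single H5Ocopy between two files
opened with the MPIO driver, along the lines of the sketch below. This is
only an illustration, not the attached code itself; the file and dataset
names ("source.h5", "copy.h5", "data") are placeholders.

/* Illustrative sketch only -- not the attached benchmark. It times one
 * H5Ocopy between two files opened collectively with the MPIO driver. */
#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* File access property list selecting the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* Both files are opened/created collectively by all processes. */
    hid_t src = H5Fopen("source.h5", H5F_ACC_RDONLY, fapl);
    hid_t dst = H5Fcreate("copy.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* Copy one dataset (assumed to be named "data") between the files. */
    H5Ocopy(src, "data", dst, "data", H5P_DEFAULT, H5P_DEFAULT);

    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%.3g s\n", t1 - t0);

    H5Fclose(dst);
    H5Fclose(src);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}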

I ran into something similar, which turned out to be an issue with the parallel file system we have: a version of Lustre < 2.7. The Lustre clients in these versions lock the output file, so multiple ranks on the same compute node get serialized. Here is the start of that thread: https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg11807.html. This sounds different, but I thought I'd mention it in case it helps with diagnosing the issue. It might also be interesting to spread your job across multiple compute nodes and see how it scales.

best,

David Schneider

Hi Peter,

H5Ocopy was not really intended to be used in parallel. It was intended as a support routine for the tool h5copy to work in serial mode.
So yes, the performance hit that you see is simply that every process is doing a copy of the same data; i.e., we don't support the parallel use case.
But you can open the file with process 0 only, do the copy, close the file, and reopen it with all processes, if that helps. You can also use the h5copy tool to do the copy outside your program.
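
For example, something along these lines (just a rough sketch; the helper
name is made up, and it assumes all processes have already closed their
handles to both files before calling it and reopen them in parallel
afterwards):

/* Illustrative sketch of the rank-0-only copy described above. */
#include <hdf5.h>
#include <mpi.h>

void copy_on_rank0(const char *src_file, const char *dst_file,
                   const char *obj_name, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        /* Serial (default driver) access on process 0 only; the
         * destination file is assumed to already exist. */
        hid_t src = H5Fopen(src_file, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dst = H5Fopen(dst_file, H5F_ACC_RDWR, H5P_DEFAULT);
        H5Ocopy(src, obj_name, dst, obj_name, H5P_DEFAULT, H5P_DEFAULT);
        H5Fclose(dst);
        H5Fclose(src);
    }

    /* All processes wait until the copy has completed before
     * reopening the files collectively. */
    MPI_Barrier(comm);
}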

I entered an issue for parallel support and improvements for H5Ocopy() in our Jira database (HDFFV-9435), but to be honest, I am not sure if we will have time to fix it for parallel unless someone funds it, since this isn't a high priority feature at the moment.

Thanks,
Mohamad

Hi Mohamad,

On Thu, Jul 02, 2015 at 01:56:40PM +0000, Mohamad Chaarawi wrote:
> I entered an issue for parallel support and improvements for
> H5Ocopy() in our Jira database (HDFFV-9435), but to be honest, I am
> not sure if we will have time to fix it for parallel unless someone
> funds it, since this isn't a high priority feature at the moment.

Thank you for confirming my guess. I will keep this in mind in case
I acquire funding of my own for a project using parallel HDF5.

I use H5Ocopy to make atomic snapshots of output datasets during a
simulation. The datasets have a chunked layout with time-varying
data and grow over the course of the simulation. If the simulation
is interrupted, the output file is left unreadable, since HDF5 does
not implement metadata journaling (yet?).

To make a consistent snapshot, I create another HDF5 file with a
temporary filename. All output datasets are copied to that snapshot
file. Then the file is flushed to storage with H5Fflush. When using
MPI, this implicitly invokes MPI_File_sync. Otherwise, in the serial
case, fsync must be invoked on the file descriptor retrieved with
H5Fget_vfd_handle. After the data has been written to storage, the
snapshot file is renamed to a non-temporary filename, which overwrites
the previous snapshot file.
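
For the serial case, the flush-and-rename step looks roughly like the
following simplified sketch; the function and argument names are
placeholders, and the default (sec2) driver is assumed so that the VFD
handle is a POSIX file descriptor:

/* Illustrative sketch: copy an object into a temporary file, flush it,
 * fsync the underlying file descriptor, and rename over the old snapshot. */
#include <hdf5.h>
#include <stdio.h>   /* rename() */
#include <unistd.h>  /* fsync() */

int snapshot_serial(hid_t output_file, const char *obj_name,
                    const char *tmp_name, const char *snap_name)
{
    hid_t snap = H5Fcreate(tmp_name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    H5Ocopy(output_file, obj_name, snap, obj_name, H5P_DEFAULT, H5P_DEFAULT);

    /* Flush HDF5 buffers, then force the data to storage. */
    H5Fflush(snap, H5F_SCOPE_GLOBAL);
    int *fd = NULL;
    H5Fget_vfd_handle(snap, H5P_DEFAULT, (void **) &fd);
    if (fd)
        fsync(*fd);
    H5Fclose(snap);

    /* Atomically replace the previous snapshot. */
    return rename(tmp_name, snap_name);
}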

Since H5Ocopy is a collective call, if the output file is opened by
all processes, the snapshot file must be opened by all processes as
well. For now I have worked around the issue by keeping the per-node
output data in memory until the end of the simulation, thus avoiding
H5Ocopy entirely.

Regards,
Peter
