size of "write operation" with pHDF5

What is the size of a "write operation" with parallel HDF5? That
terminology comes up a lot in my sole source of guidance on Lustre for
the machine I'm running on ( http://www.nics.tennessee.edu/io-tips ).

I am trying to choose ideal parameters for the lustre file system.

I experienced abysmal performance with my first attempt at writing a
single file containing 3D data from 30,000 cores, and I want to choose
better parameters. After 11 minutes only 62 GB had been written, and I
killed the job.

Each 3D array that I write from a core is 435,600 bytes, and my chunk
dimensions are the same as my array dimensions. Does that mean that
each core writes a chunk of data 435,600 bytes long? Would I therefore
want to set my stripe size to 435,600 bytes? That is smaller than the
default of 1 MB.
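
For concreteness, here is a minimal sketch of the per-rank chunked
dataset setup described above. All of the names, the float datatype,
and the 55 x 60 x 33 dimensions are hypothetical; they are chosen only
so that one chunk equals one rank's 435,600-byte array.

/* Hypothetical sketch: each rank owns a 55 x 60 x 33 block of floats
 * (4 * 55*60*33 = 435,600 bytes) and the chunk dimensions are set
 * equal to that per-rank block, so one chunk corresponds to one
 * rank's write. */
#include <hdf5.h>

#define NX 55
#define NY 60
#define NZ 33

hid_t create_chunked_dataset(hid_t file_id, hsize_t px, hsize_t py, hsize_t pz)
{
    /* Global dataset spans the whole (hypothetical) px x py x pz
     * process grid. */
    hsize_t global_dims[3] = { px * NX, py * NY, pz * NZ };
    hsize_t chunk_dims[3]  = { NX, NY, NZ };    /* chunk == per-rank array */

    hid_t space_id = H5Screate_simple(3, global_dims, NULL);

    hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl_id, 3, chunk_dims);       /* 435,600-byte chunks */

    hid_t dset_id = H5Dcreate2(file_id, "/field", H5T_NATIVE_FLOAT, space_id,
                               H5P_DEFAULT, dcpl_id, H5P_DEFAULT);

    H5Pclose(dcpl_id);
    H5Sclose(space_id);
    return dset_id;        /* caller closes with H5Dclose() */
}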

It seems that Lustre performs best when each "write operation" is
large (say 32 MB) and the stripe size matches it. However, each of our
cores writes a comparatively small chunk of data.

I am going to see if the folks on the kraken machine can help me with
optimizing lustre, but want to understand as much as possible about
how pHDF5 works before I do.

Thanks,

Leigh

···

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200

I'm not up on HDF5 internals, but I can't imagine any API would deal
effectively with such small writes, because the OS and disks aren't
going to cope with them well.

If HDF5 can coalesce writes, try enabling that. Otherwise, forward your
data to a subset of nodes for writing, such that each write is large.
Generally larger is better, but I would say shoot for 16 MB per write.
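
For illustration, here is a rough sketch of the "forward your data to a
subset of writer ranks" idea. The group size, buffer layout, and names
are all hypothetical, and the actual write call is left out.

/* Hypothetical sketch: every RANKS_PER_WRITER consecutive ranks
 * gather their data to one "writer" rank, which then issues a single
 * large write instead of many small ones. */
#include <mpi.h>
#include <stdlib.h>

#define RANKS_PER_WRITER 32       /* e.g. 32 * 435,600 B is roughly 14 MB */
#define LOCAL_BYTES      435600

void aggregate_and_write(char *local_buf, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Group consecutive ranks; rank 0 of each group is the writer. */
    MPI_Comm group_comm;
    MPI_Comm_split(comm, rank / RANKS_PER_WRITER, rank, &group_comm);

    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    char *big_buf = NULL;
    if (grank == 0)
        big_buf = malloc((size_t)gsize * LOCAL_BYTES);

    MPI_Gather(local_buf, LOCAL_BYTES, MPI_BYTE,
               big_buf,   LOCAL_BYTES, MPI_BYTE, 0, group_comm);

    if (grank == 0) {
        /* ... writer rank issues one large (gsize * 435,600 B) write here ... */
        free(big_buf);
    }
    MPI_Comm_free(&group_comm);
}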

-tom

On Sat, Mar 5, 2011 at 12:14 PM, Tjf (mobile) <tfogal@sci.utah.edu> wrote:

> I'm not up on HDF5 internals, but I can't imagine any API would deal
> effectively with such small writes, because the OS and disks aren't
> going to cope with them well.
>
> If HDF5 can coalesce writes, try enabling that. Otherwise, forward
> your data to a subset of nodes for writing, such that each write is
> large. Generally larger is better, but I would say shoot for 16 MB
> per write.

As I understand it from Mark & Quincey, when you write in collective
mode, it assigns writers and collects data to the writers so that the
chunks are larger, and aligns the data to the underlying FS stripe size
(at least with Lustre, which is what I am using). However, the details
of this are a mystery to me.
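
For reference, collective mode is requested per transfer via a
dataset-transfer property list. A minimal sketch, with the dataset,
dataspaces, and buffer assumed to be set up elsewhere:

#include <hdf5.h>

/* dset_id, mem_space, file_space, and buf are assumed to exist already. */
herr_t collective_write(hid_t dset_id, hid_t mem_space, hid_t file_space,
                        const float *buf)
{
    hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);  /* ask MPI-IO to aggregate */

    herr_t status = H5Dwrite(dset_id, H5T_NATIVE_FLOAT, mem_space, file_space,
                             dxpl_id, buf);
    H5Pclose(dxpl_id);
    return status;
}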

Leigh

Hi Leigh,

So, I don't use pHDF5 that much and have no experience with pHDF5 at
the scales you are working with here. For the record, we use serial HDF5 and
'Poor Man's Parallel I/O'
(http://visitbugs.ornl.gov/projects/hpc-hdf5/wiki/Poor_Mans_vs_Rich_Mans_Parallel_IO)
to achieve scalable I/O.

That said, I do know of some HDF5 properties you might try fiddling with
to affect aggregation (coalescing) of small writes. These, of course,
come at the expense of larger buffer allocations in the HDF5 library to
accommodate the aggregation...

First, try calling this...

herr_t H5Pset_small_data_block_size( hid_t fapl_id, hsize_t size )

with a large size, say 1-16 megabytes.

Also, you might try calling this...

herr_t H5Pset_meta_block_size( hid_t fapl_id, hsize_t size )

with a large size. The right value will really depend on how many
objects (e.g. datasets+groups+types+attributes+b-trees) you are creating
in a file. Again, play with the value, but maybe something on the order
of 1-4 megabytes would be good to try.
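
A minimal sketch of setting both of these on a file-access property
list; the 4 MB and 1 MB values are just placeholders to experiment with.

#include <hdf5.h>

hid_t make_fapl_with_block_sizes(void)
{
    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);

    /* Aggregate "small" raw-data allocations into larger blocks. */
    H5Pset_small_data_block_size(fapl_id, 4 * 1024 * 1024);   /* e.g. 4 MB */

    /* Aggregate metadata (object headers, B-tree nodes, ...) similarly. */
    H5Pset_meta_block_size(fapl_id, 1 * 1024 * 1024);          /* e.g. 1 MB */

    return fapl_id;   /* pass to H5Fcreate/H5Fopen, close with H5Pclose */
}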

Looks like all the chunk 'caching' features are turned off when using
pHDF5. So, those functions can't help you.

Are there any MPI 'hints' that your system understands that may need to
be passed down to affect the aggregation behavior of MPI-IO? I can't
recall specifically, but I think you pass those to HDF5 via the
MPI_Info arg in H5Pset_fapl_mpio.
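
A sketch of passing hints through the MPI_Info argument. The hint names
below are ROMIO-style examples and may or may not be honored by the MPI
library on a given system; the right names and values are something to
check with the site's MPI-IO documentation.

#include <hdf5.h>
#include <mpi.h>

hid_t make_fapl_with_hints(MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");    /* collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MB aggregation buffer */
    MPI_Info_set(info, "striping_unit",  "1048576");   /* match Lustre stripe size */

    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, comm, info);             /* hints ride along in info */

    MPI_Info_free(&info);
    return fapl_id;
}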

Have you tried switching between fapl_mpio and fapl_mpiposix? The
difference is that fapl_mpio uses MPI-IO for actual disk I/O while
mpiposix uses sec2 I/O routines. In theory, MPI-IO is designed to do
just the kind of I/O aggregation you need. Sometimes taking MPI-IO out
of the equation (by using fapl_mpiposix) can help to shake out issues
in the layers above or below it.
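
A small sketch of switching between the two drivers, assuming the
1.8-series H5Pset_fapl_mpiposix signature (communicator plus a GPFS
hints flag):

#include <hdf5.h>
#include <mpi.h>

hid_t make_fapl(int use_mpiposix)
{
    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    if (use_mpiposix)
        H5Pset_fapl_mpiposix(fapl_id, MPI_COMM_WORLD, 0 /* use_gpfs_hints */);
    else
        H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
    return fapl_id;
}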

Note, all of these are file-access property list settings, documented here:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html

Good luck. Very interested to hear what you learn from trying any of
these.

Mark

--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

Hi Leigh,

On Mar 5, 2011, at 2:56 PM, Leigh Orf wrote:

> As I understand it from Mark & Quincey, when you write in collective
> mode, it assigns writers and collects data to the writers so that the
> chunks are larger, and aligns the data to the underlying FS stripe
> size (at least with Lustre, which is what I am using). However, the
> details of this are a mystery to me.

  No, this isn't quite accurate. The chunks in the file are always set
at the size you use when creating the dataset, even when collective I/O
is used. You should use H5Pset_alignment() (as you mentioned in your
other email) to align the chunks on a "good" boundary for Lustre. Also,
if the datasets are fixed size, you can compute the number of chunks
that will be produced and set H5Pset_istore_k() to 1/2 of that value,
so that there is only one B-tree node for the chunked dataset's index,
which will speed up metadata operations for the dataset (this is being
addressed with new chunk indexing methods in the next major release of
HDF5 - 1.10.0). Also, you should move up to the recently released
1.8.6, which has all the performance improvements that we implemented
for the paper that Mark Howison wrote with us last year.
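
A minimal sketch of these two suggestions. The 1 MB alignment, the
64 KB threshold, and the chunk count are placeholders for whatever
matches the real stripe size and dataset shape; note that alignment is
a file-access property while istore_k is a file-creation property.

#include <hdf5.h>
#include <mpi.h>

hid_t create_aligned_file(const char *name, MPI_Comm comm, hsize_t n_chunks)
{
    /* File-access properties: align any allocation >= 64 KB on a
     * 1 MB (stripe-sized) boundary. */
    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, comm, MPI_INFO_NULL);
    H5Pset_alignment(fapl_id, 65536, 1048576);

    /* File-creation property: size the chunk B-tree so the whole
     * index fits in one node (istore_k = half the number of chunks,
     * which must be at least 1). */
    hid_t fcpl_id = H5Pcreate(H5P_FILE_CREATE);
    H5Pset_istore_k(fcpl_id, (unsigned)(n_chunks / 2));

    hid_t file_id = H5Fcreate(name, H5F_ACC_TRUNC, fcpl_id, fapl_id);
    H5Pclose(fcpl_id);
    H5Pclose(fapl_id);
    return file_id;
}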

  Quincey

On Mon, Mar 7, 2011 at 9:28 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

> No, this isn't quite accurate. The chunks in the file are always set
> at the size you use when creating the dataset, even when collective
> I/O is used. You should use H5Pset_alignment() (as you mentioned in
> your other email) to align the chunks on a "good" boundary for
> Lustre. Also, if the datasets are fixed size, you can compute the
> number of chunks that will be produced and set H5Pset_istore_k() to
> 1/2 of that value, so that there is only one B-tree node for the
> chunked dataset's index, which will speed up metadata operations for
> the dataset (this is being addressed with new chunk indexing methods
> in the next major release of HDF5 - 1.10.0). Also, you should move up
> to the recently released 1.8.6, which has all the performance
> improvements that we implemented for the paper that Mark Howison
> wrote with us last year.

That is very useful information. I assumed H5Pset_alignment was handled
"under the hood." Clearly I am therefore doing unaligned writes, which
is causing the bad performance. I will follow your suggestions and let
you know how it turns out.

Didn't version 1.8.5 have the performance improvements? You do mention
that version in the paper. Regardless, I will ask to have 1.8.6 built
on Kraken as well.

Thanks,

Leigh

Hi Leigh,

On Mar 7, 2011, at 2:24 PM, Leigh Orf wrote:

> That is very useful information. I assumed H5Pset_alignment was
> handled "under the hood." Clearly I am therefore doing unaligned
> writes, which is causing the bad performance. I will follow your
> suggestions and let you know how it turns out.
>
> Didn't version 1.8.5 have the performance improvements? You do
> mention that version in the paper. Regardless, I will ask to have
> 1.8.6 built on Kraken as well.

  Some of the improvements made it into 1.8.5, but some took longer and only made it into the 1.8.6 release.

  Quincey
