Serial writes to many files at once on a Lustre file system

This is somewhat related to my earlier queries about pHDF5 performance
on a lustre filesystem, but different enough to merit a new thread.

The cloud model I am using outputs two main types of 3D data:
snapshots and restart files.

The snapshot files are used for post-processing, visualization,
analysis, etc., and I am using pHDF5 to write a few files (maybe even
just one) containing data from all 30,000 ranks. I am still working on
testing this.

However, restart files (sometimes called checkpoint files), which
contain the minimum amount of data required to start the model up from
a particular state in time, are disposable, and hence my main concern
is that they be written and read as quickly as possible. I don't care
what format they're in.

Because of issues with ghost zones and other factors which make arrays
overlap in model space, it is much simpler to have each rank write its
own restart file rather than trying to merge them together using
pHDF5. I started going down the pHDF5 route and decided it wasn't
worth it.

Currently, the model defaults to one file per rank, but uses
unformatted fortran writes, i.e.:

open(unit=50,file=trim(filename),form='unformatted',status='unknown')
write(50) ua
write(50) va
write(50) wa
write(50) ppi
write(50) tha

etc. where each array is 3d (some 1d arrays are written as well).

With the current configuration I have, each restart file is approximately 12MB.

After reading through what literature I could find on lustre, I
decided that I would write no more than 3,000 files at one time, have
3,000 files per unique directory, and that I would set the stripe
count (the number of OSTs to stripe over) to 1. I set the stripe size
to 32 MB, which, in retrospect, was probably not an ideal choice given
that each restart file is only about 12 MB.

With this configuration, I wrote 353 GB of data, spanning 30,000
files, in about 6 minutes, getting an effective write bandwidth of
1.09 GB/s. A second try got better performance for no obvious reason,
reaching 2.8 GB/s. However, this is still much lower than the ~10 GB/s
maximum (presumably for aligned writes) quoted at
http://www.nics.tennessee.edu/io-tips.

I am assuming I am getting less than optimal results primarily because
the writes are unaligned. This brings me to my question: should I
bother to rewrite the checkpoint writing/reading code using HDF5 in
order to increase performance? I understand that with pHDF5 and
collective I/O, writes are automatically aligned, presumably because it
can detect the stripe size on the lustre filesystem (is this true?).

With serial HDF5, I see there is an H5Pset_alignment call. I also
assume that with serial HDF5, I would need to set the alignment
manually, as it defaults to unaligned writes. Would I benefit from
using H5Pset_alignment to set the alignment to the stripe size on the
lustre filesystem?

My arrays are roughly 28x21x330, with some slight variation. Sixteen
of these 4-byte floating-point arrays are written, giving approximately
12 MB per file (28 x 21 x 330 x 4 bytes is about 0.74 MB per array, and
16 of those come to roughly 11.8 MB).

So, as a rough guess, I am thinking of trying the following:

Set stripe size to 4 MB (4194304 bytes)

Try something like:

H5Pset_alignment(fapl, 1000, 4194304)

(I didn't set the second argument, the threshold, to 0 because I
really don't want to align the 4-byte integers, etc., that make up some
of the restart data, right?)

Chunk in Z only, so my chunk dimensions would be something like
28x21x30 (it's never been clear to me what chunk size to pick to
optimize I/O).

And keep the other parameters the same (1 stripe, and 3,000 files per
directory).
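
To make that concrete, here is a minimal sketch of what I have in mind
for the per-rank restart write (serial HDF5 called from Fortran, one
contiguous dataset per array, with the output directory pre-striped to
the 4 MB stripe size with lfs setstripe). The file name, array extents,
and variable names are just placeholders, and I would repeat the
dataset calls for the other 15 arrays:

  program write_restart
    use hdf5
    implicit none

    integer(hid_t)    :: fapl, file_id, space_id, dset_id
    integer(hsize_t)  :: dims(3), align_thresh, align_bytes
    integer           :: ierr
    real, allocatable :: ua(:,:,:)

    ! Placeholder extents; the real arrays are ~28x21x330 and vary slightly per rank.
    dims = (/ 28, 21, 330 /)
    allocate(ua(dims(1), dims(2), dims(3)))
    ua = 0.0

    call h5open_f(ierr)

    ! File access property list: align anything >= 1000 bytes on 4 MB
    ! boundaries, to match the 4 MB lustre stripe size (stripe count 1).
    align_thresh = 1000
    align_bytes  = 4194304
    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, ierr)
    call h5pset_alignment_f(fapl, align_thresh, align_bytes, ierr)

    call h5fcreate_f('restart_000000.h5', H5F_ACC_TRUNC_F, file_id, ierr, &
                     access_prp=fapl)

    ! One contiguous (unchunked) dataset per 3D array; repeat for va, wa, ppi, tha, ...
    call h5screate_simple_f(3, dims, space_id, ierr)
    call h5dcreate_f(file_id, 'ua', H5T_NATIVE_REAL, space_id, dset_id, ierr)
    call h5dwrite_f(dset_id, H5T_NATIVE_REAL, ua, dims, ierr)
    call h5dclose_f(dset_id, ierr)
    call h5sclose_f(space_id, ierr)

    call h5fclose_f(file_id, ierr)
    call h5pclose_f(fapl, ierr)
    call h5close_f(ierr)
  end program write_restart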

I guess what I'm mostly looking for is assurance that I will get
faster I/O going down this kind of route than the current way I am
doing unformatted I/O.

Thanks as always,

Leigh

···

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

This is somewhat related to my earlier queries about pHDF5 performance
on a lustre filesystem, but different enough to merit a new thread.

The cloud model I am using outputs two main types of 3D data:
snapshots and restart files.

The snapshot files are used for post-processing, visualization,
analysis, etc., and I am using pHDF5 to write a few files (maybe even
just one) containing data from all 30,000 ranks. I am still working on
testing this.

However, restart files (sometimes called checkpoint files), which
contain the minimum amount of data required to start the model up from
a particular state in time, are disposable, and hence my main concern
is that they be written and read as quickly as possible. I don't care
what format they're in.

Because of issues with ghost zones and other factors which make arrays
overlap in model space, it is much simpler to have each rank write its
own restart file rather than trying to merge them together using
pHDF5. I started going down the pHDF5 route and decided it wasn't
worth it.

Currently, the model defaults to one file per rank, but uses
unformatted fortran writes, i.e.:

open(unit=50,file=trim(filename),form='unformatted',status='unknown')
      write(50) ua
      write(50) va
      write(50) wa
      write(50) ppi
      write(50) tha

etc. where each array is 3d (some 1d arrays are written as well).

  Yow! Very 70's... :-)

With the current configuration I have, each restart file is approximately 12MB.

After reading through what literature I could find on lustre, I
decided that I would write no more than 3,000 files at one time, have
3,000 files per unique directory, and that I would set the stripe
count (the number of OSTs to stripe over) to 1. I set the stripe size
to 32 MB, which, in retrospect, was probably not an ideal choice given
that each restart file is only about 12 MB.

With this configuration, I wrote 353 GB of data, spanning 30,000
files, in about 6 minutes, getting an effective write bandwidth of
1.09 GB/s. A second try got better performance for no obvious reason,
reaching 2.8 GB/s. However, this is still much lower than the ~10 GB/s
maximum (presumably for aligned writes) quoted at
http://www.nics.tennessee.edu/io-tips.

I am assuming I am getting less than optimal results primarily because
the writes are unaligned. This brings me to my question: should I
bother to rewrite the checkpoint writing/reading code using HDF5 in
order to increase performance? I understand that with pHDF5 and
collective I/O, writes are automatically aligned, presumably because it
can detect the stripe size on the lustre filesystem (is this true?).

  Unless the MPI-IO layer is doing this, HDF5 doesn't do this by default.

With serial HDF5, I see there is an H5Pset_alignment call. I also
assume that with serial HDF5, I would need to set the alignment
manually, as it defaults to unaligned writes. Would I benefit from
using H5Pset_alignment to set the alignment to the stripe size on the
lustre filesystem?

  Yes, almost certainly. This is one of the ways that Mark Howison and I worked out to improve the I/O performance on the NERSC machines.

My arrays are roughly 28x21x330, with some slight variation. Sixteen
of these 4-byte floating-point arrays are written, giving approximately
12 MB per file (28 x 21 x 330 x 4 bytes is about 0.74 MB per array, and
16 of those come to roughly 11.8 MB).

So, as a rough guess, I am thinking of trying the following:

Set stripe size to 4 MB (4194304 bytes)

Try something like:

H5Pset_alignment(fapl, 1000, 4194304)

(I didn't set the second argument, the threshold, to 0 because I
really don't want to align the 4-byte integers, etc., that make up some
of the restart data, right?)

  Yes, that looks fine.

Chunk in Z only, so my chunk dimensions would be something like
28x21x30 (it's never been clear to me what chunk size to pick to
optimize I/O).

And keep the other parameters the same (1 stripe, and 3,000 files per
directory).

I guess what I'm mostly looking for is assurance that I will get
faster I/O going down this kind of route than the current way I am
doing unformatted I/O.

  This looks like a fruitful direction to go in. Do you really need chunking, though?

  Quincey

···

On Mar 4, 2011, at 3:48 PM, Leigh Orf wrote:

Hi Leigh, I'm not sure of the origin of the "3000" limit, but it is
true that if you are going to write file-per-processor on lustre, it
can be much better if you use a "sqrt(n)" file layout, meaning you
create sqrt(n) directories and fill each with sqrt(n) files. This
tends to alleviate pressure on the metadata server, which is often the
bottleneck when working with large numbers of files on a lustre FS.
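
For example (just a sketch, and the directory/file naming scheme here
is made up), each rank could derive its directory and file name like
this:

  program sqrt_layout
    implicit none
    integer :: nranks, myrank, ndirs, files_per_dir, idir
    character(len=64) :: dirname, filename

    nranks = 30000
    myrank = 12345                     ! in the model this would come from MPI_Comm_rank

    ndirs         = ceiling(sqrt(real(nranks)))    ! ~174 directories for 30,000 ranks
    files_per_dir = (nranks + ndirs - 1) / ndirs   ! ~173 files in each directory
    idir          = myrank / files_per_dir         ! directory index for this rank

    write(dirname,  '(a,i5.5)') 'restart_', idir
    write(filename, '(a,a,i6.6,a)') trim(dirname), '/rank_', myrank, '.h5'
    print *, trim(filename)
  end program sqrt_layout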
Mark

···

On Fri, Mar 4, 2011 at 4:48 PM, Leigh Orf <leigh.orf@gmail.com> wrote:


Hi Leigh,

open(unit=50,file=trim(filename),form='unformatted',status='unknown')
write(50) ua
write(50) va
write(50) wa
write(50) ppi
write(50) tha

etc. where each array is 3d (some 1d arrays are written as well).

   Yow! Very 70's... :-)

Don't blame me, it's not my code :-)

I am assuming I am getting less than optimal results primarily because
the writes are unaligned. This brings me to my question: should I
bother to rewrite the checkpoint writing/reading code using HDF5 in
order to increase performance? I understand that with pHDF5 and
collective I/O, writes are automatically aligned, presumably because it
can detect the stripe size on the lustre filesystem (is this true?).

   Unless the MPI-IO layer is doing this, HDF5 doesn't do this by default.

Good to know, I am going to set alignment manually from now on.

With serial HDF5, I see there is an H5Pset_alignment call. I also
assume that with serial HDF5, I would need to set the alignment
manually, as it defaults to unaligned writes. Would I benefit from
using H5Pset_alignment to set the alignment to the stripe size on the
lustre filesystem?

   Yes, almost certainly. This is one of the ways that Mark Howison and I worked out to improve the I/O performance on the NERSC machines.

Good....

My arrays are roughly 28x21x330, with some slight variation. Sixteen
of these 4-byte floating-point arrays are written, giving approximately
12 MB per file (28 x 21 x 330 x 4 bytes is about 0.74 MB per array, and
16 of those come to roughly 11.8 MB).

So, as a rough guess, I am thinking of trying the following:

Set stripe size to 4 MB (4194304 bytes)

Try something like:

H5Pset_alignment(fapl, 1000, 4194304)

(I didn't set the second argument, the threshold, to 0 because I
really don't want to align the 4-byte integers, etc., that make up some
of the restart data, right?)

   Yes, that looks fine.

Chunk in Z only, so my chunk dimensions would be something like
28x21x30 (it's never been clear to me what chunk size to pick to
optimize I/O).

And keep the other parameters the same (1 stripe, and 3,000 files per
directory).

I guess what I'm mostly looking for is assurance that I will get
faster I/O going down this kind of route than the current way I am
doing unformatted I/O.

   This looks like a fruitful direction to go in. Do you really need chunking, though?

Not sure; it's never been super clear to me what chunking gets you
beyond (1) the ability to do compression and (2) faster seeking through
large datasets when you want to access space towards the end of the
file. I may just forgo chunking and see where that gets me first.

Thanks again for your help, I am more optimistic today...

Leigh

···

On Mon, Mar 7, 2011 at 9:16 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

···

On Mar 7, 2011, at 3:01 PM, Leigh Orf wrote:

On Mon, Mar 7, 2011 at 9:16 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

Hi Leigh,

Chunk in Z only, so my chunk dimensions would be something like
28x21x30 (it's never been clear to me what chunk size to pick to
optimize I/O).

And keep the other parameters the same (1 stripe, and 3,000 files per
directory).

I guess what I'm mostly looking for is assurance that I will get
faster I/O going down this kind of route than the current way I am
doing unformatted I/O.

       This looks like a fruitful direction to go in. Do you really need chunking, though?

Not sure; it's never been super clear to me what chunking gets you
beyond (1) the ability to do compression and (2) faster seeking through
large datasets when you want to access space towards the end of the
file. I may just forgo chunking and see where that gets me first.

  Chunking is required if you want to have unlimited dimensions on your dataset's dataspace. I would rephrase (2) above as "faster I/O when your selection is a good match for the chunk size", which could be an exact match for the chunk size, or a selection with a well-aligned, good multiple or fraction of the chunk size. If you aren't using compression, don't need unlimited dimensions and aren't performing I/O on selections of the dataset, contiguous storage is probably a better fit.
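
For example, relative to the contiguous-storage sketch earlier in the
thread, requesting a chunked layout would just mean attaching a dataset
creation property list (dcpl and chunk_dims below are new variables,
and the chunk dimensions are only illustrative):

    integer(hid_t)   :: dcpl
    integer(hsize_t) :: chunk_dims(3)

    chunk_dims = (/ 28, 21, 30 /)
    call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, ierr)
    call h5pset_chunk_f(dcpl, 3, chunk_dims, ierr)
    ! pass dcpl as the optional dataset creation property list to h5dcreate_f
    call h5dcreate_f(file_id, 'ua', H5T_NATIVE_REAL, space_id, dset_id, ierr, dcpl)
    call h5pclose_f(dcpl, ierr)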

  Quincey

Hi Leigh,

Just to echo Quincey here, you will see optimal performance when the
chunk dimensions evenly divide the dataset dimensions, so that there
are no "partial" chunks. That said, there was some work done recently
in the HDF5 library to better detect chunk decomposition; Quincey can
speak more to that.

So the ideal chunk dimensions would be ones that evenly divide the
dataset and are close to a multiple of 1MB in terms of total data (so
that they need minimal padding when aligning to lustre stripe width).
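
For the arrays in this thread, for instance (assuming the Z extent is
exactly 330), 28x21x30 chunks divide the dataset evenly into 11 chunks
of only about 70 KB each, whereas a single 28x21x330 chunk is about
0.74 MB, which is much closer to that 1 MB guideline.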

Mark

···

On Mon, Mar 7, 2011 at 10:56 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:
