Poor write performance with 30,000 MPI ranks (pHDF5)

Some background before I get to the problem:

I am currently attempting the largest simulations I have ever done, so this
is uncharted territory for me. I am running on the kraken teragrid resource.
The application is a 3D cloud model, and the output consists mostly of 2D
and 3D floating point fields.

Each MPI rank runs on a core. I am not using any OpenMP/threads. This is not
an option right now with the way the model is written.

The full problem size is 3300x3000x350 and I'm using a 2D parallel
decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with
each rank having 22x15x350 points). This type of geometry is likely what we
are 'stuck' with unless we go with a 3D parallel decomposition, and that is
not an attractive option.

I have created a few different MPI communicators to handle I/O. The model
writes one single hdf5 file full of 2D and 1D floating point data, as well
as a tiny bit of metadata in the form of integers and attributes (I will
call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
communicator - so each of the 30,000 ranks writes to this file. I would
prefer not to split this 2D file (which is about 1 GB in size) up, as it's
used for a quick look at how the simulation is progressing, and can be
visualized directly with software I wrote. For this file, each rank is
writing a 22x15 'patch' of floating point data for each field.

With the files containing the 3D floating point arrays (call them the 3D
files), I have it set up such that a flexible number of ranks can each write
to an HDF5 file, so long as the numbers divide evenly into the full problem.
For instance, I currently have it set up such that each 3D HDF5 file is
written by 15x20 (300) ranks, and therefore a total of 100 3D HDF5 files are
written for a history dump. So each file contains 3D arrays of size
330x300x350. Hence, these 3D HDF5 files use a different communicator
than MPI_COMM_WORLD that I assemble before any I/O occurs.
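
In case it is useful to see the setup concretely, here is a simplified sketch
of how the per-file communicators get built and handed to HDF5 (this is not my
actual code, and the routine/variable names are just illustrative):

      subroutine open_3d_file(filenum, myrank, fname3d, filecomm, file3d_id)
        ! all ranks that write to the same 3D file pass the same 'filenum',
        ! so MPI_Comm_split gives each file its own communicator
        use hdf5
        use mpi
        implicit none
        integer, intent(in)          :: filenum, myrank
        character(len=*), intent(in) :: fname3d
        integer, intent(out)         :: filecomm
        integer(HID_T), intent(out)  :: file3d_id
        integer(HID_T) :: fapl_id
        integer        :: mpierr, ierror

        call MPI_Comm_split(MPI_COMM_WORLD, filenum, myrank, filecomm, mpierr)

        ! hand the subcommunicator to HDF5 through the file access property
        ! list (assumes h5open_f has already been called)
        call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierror)
        call h5pset_fapl_mpio_f(fapl_id, filecomm, MPI_INFO_NULL, ierror)
        call h5fcreate_f(trim(fname3d), H5F_ACC_TRUNC_F, file3d_id, ierror, &
                         access_prp=fapl_id)
        call h5pclose_f(fapl_id, ierror)
      end subroutine open_3d_file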

The 2D and 3D files are written at the same time (within the same routine).
For each field, I either write 2D and 3D data, or just 2D data. I can turn
off writing the 3D data and just write the 2D data, but not the other way
around (I could change this and may do so). I currently have a run in the
queue where only 2D data is written so I can determine whether the
bottleneck is with that file as opposed to the 3D files.

The problem I am having is abysmal I/O performance, and I am hoping that
maybe I can get some pointers. I fully realize that the lustre file system
on the kraken teragrid machine is not perfect and has its quirks. However,
after 10 minutes of writing the 2D file and the 3D files, I had only output
about 10 GB of data.

Questions:

1. Should I expect poor performance with 30,000 cores writing tiny 2D
patches to one file? I have considered creating another communicator and
doing MPI_GATHER on this communicator, reassembling the 2D data, and then
opening the 2D file using the communicator - this way fewer ranks would be
accessing it at once. Since I am not familiar with the internals of
parallel HDF5, I don't know if doing that is necessary or recommended.

2. Since I have flexibility with the number of 3D files, should I create
fewer? More?

3. There is a command (lfs) on kraken which controls striping patterns.
Could I perhaps see better performance by mucking with striping? I have
looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
Striping and Parallel I/O" but did not come back with any clear message
about how I should modify the default settings.

4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
independent (H5FD_MPIO_INDEPENDENT)?

Since I am unsure where the bottleneck is, I'm asking the hdf5 list first,
and as I understand it some of the folks here are familiar with the kraken
resource and have used parallel HDF5 with very large numbers of ranks. Any
tips or suggestions for how to wrestle this problem are greatly appreciated.

Thanks,

Leigh

···

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

3. There is a command (lfs) on kraken which controls striping patterns.
Could I perhaps see better performance by mucking with striping? I have
looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
Striping and Parallel I/O" but did not come back with any clear message
about how I should modify the default settings.

Yes, this is the first thing you should check. I don't have an account
on Kraken and can't verify, but my guess is that the default is only a
few stripes out of the hundreds of available OSTs (= Lustre I/O
servers). For large parallel I/O you want as many stripes as possible,
since this will aggregate the bandwidth of the individual OSTs. There
is a hard limit of 160 stripes in Lustre, even though some I/O systems
(like the one serving JaguarPF at ORNL) have hundreds of OSTs.

So I would start by setting your striping to 160 (I'm assuming Kraken
has >=160 OSTs). You may also see some benefit from increasing the
stripe size from 1 MB (the default on most Lustre file systems) to
4 MB or 8 MB.
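
If it is more convenient to set this from inside the application than with
lfs on the output directory, one alternative is to pass the standard Lustre
striping hints through an MPI_Info object to the HDF5 file access property
list. A minimal sketch (the file name and values are just examples, and it
assumes the MPI-IO layer honors striping_factor/striping_unit at file
creation, which the Cray and ROMIO Lustre drivers generally do):

      ! sketch: request 160 stripes of 4 MB each when the file is created
      program stripe_hints
        use hdf5
        use mpi
        implicit none
        integer        :: info, mpierr, ierror
        integer(HID_T) :: fapl_id, file_id

        call MPI_Init(mpierr)
        call h5open_f(ierror)

        call MPI_Info_create(info, mpierr)
        call MPI_Info_set(info, "striping_factor", "160", mpierr)     ! stripe count
        call MPI_Info_set(info, "striping_unit", "4194304", mpierr)   ! 4 MB stripe size

        call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierror)
        call h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, info, ierror)
        call h5fcreate_f("striped.h5", H5F_ACC_TRUNC_F, file_id, ierror, &
                         access_prp=fapl_id)

        call h5fclose_f(file_id, ierror)
        call h5pclose_f(fapl_id, ierror)
        call h5close_f(ierror)
        call MPI_Finalize(mpierr)
      end program stripe_hints

Note that Lustre striping is fixed when the file is created, so whichever
method you use, it has to be in place before the first write.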

I will try to respond to your other questions later tonight...

Mark

···

On Thu, Feb 17, 2011 at 3:49 PM, Leigh Orf <leigh.orf@gmail.com> wrote:

Hi Leigh,

Some background before I get to the problem:

I am recently attempting the largest simulations I have ever done, so this is uncharted territory for me. I am running on the kraken teragrid resource. The application is a 3D cloud model, and the output consists mostly of 2D and 3D floating point fields.

Each MPI rank runs on a core. I am not using any OpenMP/threads. This is not an option right now with the way the model is written.

The full problem size is 3300x3000x350 and I'm using a 2D parallel decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with each rank having 22x15x350 points). This type of geometry is likely what we are 'stuck' with unless we go with a 3D parallel decomposition, and that is not an attractive option.

I have created a few different MPI communicators to handle I/O. The model writes one single hdf5 file full of 2D and 1D floating point data, as well as a tiny bit of metadata in the form of integers and attributes (I will call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD communicator - so each of the 30,000 ranks writes to this file. I would prefer not to split this 2D file (which is about 1 GB in size) up, as it's used for a quick look at how the simulation is progressing, and can be visualized directly with software I wrote. For this file, each rank is writing a 22x15 'patch' of floating point data for each field.

With the files containing the 3D floating point arrays (call them the 3D files), I have it set up such that a flexible number of ranks can each write to an HDF5 file, so long as the numbers divide evenly into the full problem. For instance, I currently have it set up such that each 3D HDF5 file is written by 15x20 (300) ranks, and therefore a total of 100 3D HDF5 files are written for a history dump. So each file contains 3D arrays of size 330x300x350. Hence, these 3D HDF5 files use a different communicator than MPI_COMM_WORLD that I assemble before any I/O occurs.

  Excellent description, thanks!

The 2D and 3D files are written at the same time (within the same routine). For each field, I either write 2D and 3D data, or just 2D data. I can turn off writing the 3D data and just write the 2D data, but not the other way around (I could change this and may do so). I currently have a run in the queue where only 2D data is written so I can determine whether the bottleneck is with that file as opposed to the 3D files.

The problem I am having is abysmal I/O performance, and I am hoping that maybe I can get some pointers. I fully realize that the lustre file system on the kraken teragrid machine is not perfect and has its quirks. However, after 10 minutes of writing the 2D file and the 3D files, I had only output about 10 GB of data.

  That's definitely not a good I/O rate. :-/

Questions:

1. Should I expect poor performance with 30,000 cores writing tiny 2D patches to one file? I have considered creating another communicator and doing MPI_GATHER on this communicator, reassembling the 2D data, and then opening the 2D file using the communicator - this way fewer ranks would be accessing it at once. Since I am not familiar with the internals of parallel HDF5, I don't know if doing that is necessary or recommended.

  I don't know if this would help, but I'm definitely interested in knowing what happens if you do it.

2. Since I have flexibility with the number of 3D files, should I create fewer? More?

  Ditto here.

3. There is a command (lfs) on kraken which controls striping patterns. Could I perhaps see better performance by mucking with striping? I have looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre Striping and Parallel I/O" but did not come back with any clear message about how I should modify the default settings.

  Ditto here.

4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try independent (H5FD_MPIO_INDEPENDENT)?

  This should be easy to experiment with, but I don't think it'll help.

Since I am unsure where the bottleneck is, I'm asking the hdf5 list first, and as I understand it some of the folks here are familiar with the kraken resource and have used parallel HDF5 with very large numbers of ranks. Any tips or suggestions for how to wrestle this problem are greatly appreciated.

  I've got some followup questions, which might help future optimizations: Are you chunking the datasets, or are they contiguous? How many datasets are you creating each timestep? How many timesteps are going into each file?

  Quincey

···

On Feb 17, 2011, at 2:49 PM, Leigh Orf wrote:

Some background before I get to the problem:

I have created a few different MPI communicators to handle I/O. The model
writes one single hdf5 file full of 2D and 1D floating point data, as well
as a tiny bit of metadata in the form of integers and attributes (I will
call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
communicator - so each of the 30,000 ranks writes to this file. I would
prefer not to split this 2D file (which is about 1 GB in size) up, as it's
used for a quick look at how the simulation is progressing, and can be
visualized directly with software I wrote. For this file, each rank is
writing a 22x15 'patch' of floating point data for each field.

One big file, collectively accessed. Sounds great to me. What is the
version of MPT (the Cray MPI library) on kraken? At this point I
would be shocked if it's older than 3.2, but since you are using
collective I/O (yay!) make sure you are using MPT 3.2 or newer. (I
think the old ones are kept around.)

With the files containing the 3D floating point arrays (call them the 3D
files), I have it set up such that a flexible number of ranks can each write
to an HDF5 file, so long as the numbers divide evenly into the full problem.
For instance, I currently have it set up such that each 3D HDF5 file is
written by 15x20 (300) ranks and therefore a total of 100 3D HDF5 files are
written for a history dump. So each file contains 3D arrays of size
330x300x350. Hence, these 3D HDF5 files use a different communicator
than MPI_COMM_WORLD that I assemble before any I/O occurs.

This is clever, and does let you tune at the application level, but I
don't think it's necessary. Often the MPI-IO hints are better suited
for such tuning, but once you've upped the striping factor (Mark's
email) I don't think you'll need those either.

1. Should I expect poor performance with 30,000 cores writing tiny 2D
patches to one file? I have considered creating another communicator and
doing MPI_GATHER on this communicator, reassembling the 2D data, and then
opening the 2D file using the communicator - this way fewer ranks would be
accessing it at once. Since I am not familiar with the internals of
parallel HDF5, I don't know if doing that is necessary or recommended.

This workload (30,000 cores writing a tiny patch) is perfect for
collective I/O. This gather idea you have is kind of like what will
happen inside the MPI-IO library, except the MPI-IO library will
reduce the operation to several writers, not just one. Also, it's
been tested and debugged for you.

2. Since I have flexibility with the number of 3D files, should I create
fewer? More?

The usual parallel I/O advice is to do collective I/O to a single
shared file. Lustre does tend to perform better when more files are
used but for the sake of your post-processing sanity, let's see what
happens if we keep a single file for now.

3. There is a command (lfs) on kraken which controls striping patterns.
Could I perhaps see better performance by mucking with striping? I have
looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
Striping and Parallel I/O" but did not come back with any clear message
about how I should modify the default settings.

4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
independent (H5FD_MPIO_INDEPENDENT)?

I would suggest keeping HDF5 collective I/O enabled at all times. If
you find each process is writing on the order of 4 MiB of data, you
might want to force, at the MPI-IO level, independent I/O. Here again
you do so with MPI-IO tuning parameters. We can go into more detail
later, if it's even needed.
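
To give a flavor of what I mean by MPI-IO tuning parameters (a sketch only,
and nothing you should need yet; the hint names are the ROMIO-style ones,
which the Cray library understands), they ride along in the same info object
that HDF5 passes down to MPI-IO:

      ! fragment: leave HDF5 in collective mode and tune the MPI-IO layer
      ! underneath it instead
      use hdf5
      use mpi
      integer        :: info, mpierr, ierror
      integer(HID_T) :: fapl_id

      call MPI_Info_create(info, mpierr)
      ! tune the collective buffering (two-phase) aggregation buffer ...
      call MPI_Info_set(info, "cb_buffer_size", "16777216", mpierr)  ! 16 MB
      ! ... or, if every rank already writes large contiguous blocks, switch
      ! collective buffering off at the MPI-IO level:
      ! call MPI_Info_set(info, "romio_cb_write", "disable", mpierr)

      call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierror)
      call h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, info, ierror)
      ! ... then open/create the file with access_prp=fapl_id as usual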

==rob

···

On Thu, Feb 17, 2011 at 01:49:16PM -0700, Leigh Orf wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Quincey,

Answers to your questions below... no top-posting today. :)

Hi Leigh,

Some background before I get to the problem:

I am recently attempting the largest simulations I have ever done, so this
is uncharted territory for me. I am running on the kraken teragrid resource.
The application is a 3D cloud model, and the output consists mostly of 2D
and 3D floating point fields.

Each MPI rank runs on a core. I am not using any OpenMP/threads. This is
not an option right now with the way the model is written.

The full problem size is 3300x3000x350 and I'm using a 2D parallel
decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with
each rank having 22x15x350 points). This type of geometry is likely what we
are 'stuck' with unless we go with a 3D parallel decomposition, and that is
not an attractive option.

I have created a few different MPI communicators to handle I/O. The model
writes one single hdf5 file full of 2D and 1D floating point data, as well
as a tiny bit of metadata in the form of integers and attributes (I will
call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
communicator - so each of the 30,000 ranks writes to this file. I would
prefer not to split this 2D file (which is about 1 GB in size) up, as it's
used for a quick look at how the simulation is progressing, and can be
visualized directly with software I wrote. For this file, each rank is
writing a 22x15 'patch' of floating point data for each field.

With the files containing the 3D floating point arrays (call them the 3D
files), I have it set up such that a flexible number of ranks can each write
to an HDF5 file, so long as the numbers divide evenly into the full problem.
For instance, I currently have it set up such that each 3D HDF5 file is
written by 15x20 (300) ranks and therefore a total of 100 3D HDF5 files are
written for a history dump. So each file contains 3D arrays of size
330x300x350. Hence, these 3D HDF5 files use a different communicator
than MPI_COMM_WORLD that I assemble before any I/O occurs.

Excellent description, thanks!

The 2D and 3D files are written at the same time (within the same routine).
For each field, I either write 2D and 3D data, or just 2D data. I can turn
off writing the 3D data and just write the 2D data, but not the other way
around (I could change this and may do so). I currently have a run in the
queue where only 2D data is written so I can determine whether the
bottleneck is with that file as opposed to the 3D files.

The problem I am having is abysmal I/O performance, and I am hoping that
maybe I can get some pointers. I fully realize that the lustre file system
on the kraken teragrid machine is not perfect and has its quirks. However,
after 10 minutes of writing the 2D file and the 3D files, I had only output
about 10 GB of data.

That's definitely not a good I/O rate. :-/

Questions:

1. Should I expect poor performance with 30,000 cores writing tiny 2D
patches to one file? I have considered creating another communicator and
doing MPI_GATHER on this communicator, reassembling the 2D data, and then
opening the 2D file using the communicator - this way fewer ranks would be
accessing it at once. Since I am not familiar with the internals of
parallel HDF5, I don't know if doing that is necessary or recommended.

I don't know if this would help, but I'm definitely interested in knowing
what happens if you do it.

2. Since I have flexibility with the number of 3D files, should I create
fewer? More?

Ditto here.

3. There is a command (lfs) on kraken which controls striping patterns.
Could I perhaps see better performance by mucking with striping? I have
looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
Striping and Parallel I/O" but did not come back with any clear message
about how I should modify the default settings.

Ditto here.

4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
independent (H5FD_MPIO_INDEPENDENT)?

This should be easy to experiment with, but I don't think it'll help.

Since I am unsure where the bottleneck is, I'm asking the hdf5 list first,
and as I understand it some of the folks here are familiar with the kraken
resource and have used parallel HDF5 with very large numbers of ranks. Any
tips or suggestions for how to wrestle this problem are greatly appreciated.

I've got some followup questions, which might help future optimizations:
Are you chunking the datasets, or are they contiguous?

I am chunking the datasets by the dimensions of what is running on each
core (each MPI rank runs on 1 core). So, if I have 3d arrays dimensioned by
15x15x200 on each core, and 4x4 cores on each MPI communicator, the chunk
dimensions are 15x15x200 and the array dimension written to each HDF5 file
is 60x60x200.

A snippet from my code follows. The core dimensions are ni x nj x nk. The
file dimensions are ionumi x ionumj x nk. ionumi = ni * corex, where corex
is the number of cores in the x direction spanning 1 file, same for y. Since
I am only doing a 2d parallel decomposition, nk spans the full vertical
extent.

mygroupi goes from 0 to corex-1, mygroupj goes from 0 to corey-1.

      ! file dataspace covers this file's share of the domain; memory
      ! dataspace covers one rank's subdomain
      dims(1) = ionumi
      dims(2) = ionumj
      dims(3) = nk

      ! one chunk per rank's subdomain
      chunkdims(1) = ni
      chunkdims(2) = nj
      chunkdims(3) = nk

      count(1) = 1
      count(2) = 1
      count(3) = 1

      ! offset of this rank's hyperslab within the file dataset
      offset(1) = mygroupi * chunkdims(1)
      offset(2) = mygroupj * chunkdims(2)
      offset(3) = 0

      stride(1) = 1
      stride(2) = 1
      stride(3) = 1

      block(1) = chunkdims(1)
      block(2) = chunkdims(2)
      block(3) = chunkdims(3)

      call h5screate_simple_f(rank, dims, filespace_id, ierror)
      call h5screate_simple_f(rank, chunkdims, memspace_id, ierror)
      call h5pcreate_f(H5P_DATASET_CREATE_F, chunk_id, ierror)
      call h5pset_chunk_f(chunk_id, rank, chunkdims, ierror)
      call h5dcreate_f(file_id, trim(varname), H5T_NATIVE_REAL, filespace_id, &
                       dset_id, ierror, chunk_id)
      call h5sclose_f(filespace_id, ierror)

      ! select this rank's hyperslab in the file and write it
      call h5dget_space_f(dset_id, filespace_id, ierror)
      call h5sselect_hyperslab_f(filespace_id, H5S_SELECT_SET_F, offset, &
                                 count, ierror, stride, block)
      call h5pcreate_f(H5P_DATASET_XFER_F, plist_id, ierror)
      ! MPIO holds H5FD_MPIO_COLLECTIVE_F (or H5FD_MPIO_INDEPENDENT_F), set elsewhere
      call h5pset_dxpl_mpio_f(plist_id, MPIO, ierror)
      call h5dwrite_f(dset_id, H5T_NATIVE_REAL, core3d(1:ni,1:nj,1:nk), dims, ierror, &
                      file_space_id=filespace_id, mem_space_id=memspace_id, &
                      xfer_prp=plist_id)

> How many datasets are you creating each timestep?

This is a selectable option. Here is a typical scenario. In this case, just
for some background, corex=4, corey=6 (16 cores per file) and there are 16
files per full domain write. So each .cm1hdf5 file contains 1/16th of the
full domain. The .2Dcm1hdf5 file contains primarily 2D slices of the full
domain. It is written by *ALL* cores (and performance to this file is good,
even with 30,000 cores writing to it on kraken).

bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % ls -l

-rw-r--r-- 1 orf jmd  58393968 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0000.2Dcm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0000.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0001.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0002.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0003.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0004.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0005.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0006.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0007.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0008.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0009.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0010.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0011.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0012.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0013.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0014.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0015.cm1hdf5

bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % h5ls -rv
L500ang120_0.010_1000.0m.03600_0009.cm1hdf5 | grep Dataset
/2d/cpc Dataset {176/176, 140/140}
/2d/cph Dataset {176/176, 140/140}
/2d/cref Dataset {176/176, 140/140}
/2d/maxsgs Dataset {176/176, 140/140}
/2d/maxshs Dataset {176/176, 140/140}
/2d/maxsrs Dataset {176/176, 140/140}
/2d/maxsus Dataset {176/176, 140/140}
/2d/maxsvs Dataset {176/176, 140/140}
/2d/maxsws Dataset {176/176, 140/140}
/2d/minsps Dataset {176/176, 140/140}
/2d/sfcrain Dataset {176/176, 140/140}
/2d/uh Dataset {176/176, 140/140}
/3d/dbz Dataset {96/96, 176/176, 140/140}
/3d/khh Dataset {96/96, 176/176, 140/140}
/3d/khv Dataset {96/96, 176/176, 140/140}
/3d/kmh Dataset {96/96, 176/176, 140/140}
/3d/kmv Dataset {96/96, 176/176, 140/140}
/3d/ncg Dataset {96/96, 176/176, 140/140}
/3d/nci Dataset {96/96, 176/176, 140/140}
/3d/ncr Dataset {96/96, 176/176, 140/140}
/3d/ncs Dataset {96/96, 176/176, 140/140}
/3d/p Dataset {96/96, 176/176, 140/140}
/3d/pi Dataset {96/96, 176/176, 140/140}
/3d/pipert Dataset {96/96, 176/176, 140/140}
/3d/ppert Dataset {96/96, 176/176, 140/140}
/3d/qc Dataset {96/96, 176/176, 140/140}
/3d/qg Dataset {96/96, 176/176, 140/140}
/3d/qi Dataset {96/96, 176/176, 140/140}
/3d/qr Dataset {96/96, 176/176, 140/140}
/3d/qs Dataset {96/96, 176/176, 140/140}
/3d/qv Dataset {96/96, 176/176, 140/140}
/3d/qvpert Dataset {96/96, 176/176, 140/140}
/3d/rho Dataset {96/96, 176/176, 140/140}
/3d/rhopert Dataset {96/96, 176/176, 140/140}
/3d/th Dataset {96/96, 176/176, 140/140}
/3d/thpert Dataset {96/96, 176/176, 140/140}
/3d/tke Dataset {96/96, 176/176, 140/140}
/3d/u Dataset {96/96, 176/176, 140/140}
/3d/u_yzlast Dataset {96/96, 176/176}
/3d/uinterp Dataset {96/96, 176/176, 140/140}
/3d/upert Dataset {96/96, 176/176, 140/140}
/3d/upert_yzlast Dataset {96/96, 176/176}
/3d/v Dataset {96/96, 176/176, 140/140}
/3d/v_xzlast Dataset {96/96, 140/140}
/3d/vinterp Dataset {96/96, 176/176, 140/140}
/3d/vpert Dataset {96/96, 176/176, 140/140}
/3d/vpert_xzlast Dataset {96/96, 140/140}
/3d/w Dataset {97/97, 176/176, 140/140}
/3d/winterp Dataset {96/96, 176/176, 140/140}
/3d/xvort Dataset {96/96, 176/176, 140/140}
/3d/yvort Dataset {96/96, 176/176, 140/140}
/3d/zvort Dataset {96/96, 176/176, 140/140}
/basestate/pi0 Dataset {96/96}
/basestate/pres0 Dataset {96/96}
/basestate/qv0 Dataset {96/96}
/basestate/rh0 Dataset {96/96}
/basestate/th0 Dataset {96/96}
/basestate/u0 Dataset {96/96}
/basestate/v0 Dataset {96/96}
/grid/myi Dataset {1/1}
/grid/myj Dataset {1/1}
/grid/ni Dataset {1/1}
/grid/nj Dataset {1/1}
/grid/nodex Dataset {1/1}
/grid/nodey Dataset {1/1}
/grid/nx Dataset {1/1}
/grid/ny Dataset {1/1}
/grid/nz Dataset {1/1}
/grid/x0 Dataset {1/1}
/grid/x1 Dataset {1/1}
/grid/y0 Dataset {1/1}
/grid/y1 Dataset {1/1}
/mesh/dx Dataset {1/1}
/mesh/dy Dataset {1/1}
/mesh/xf Dataset {140/140}
/mesh/xh Dataset {140/140}
/mesh/yf Dataset {176/176}
/mesh/yh Dataset {176/176}
/mesh/zf Dataset {97/97}
/mesh/zh Dataset {96/96}
/time Dataset {1/1}

bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % h5ls -rv
L500ang120_0.010_1000.0m.03600_0009.cm1hdf5 | grep 3d | grep -v zlast | wc -l
      44

So there are 44 3D fields in this case. That's pretty much the kitchen sink;
normally I'd probably be writing half as many datasets.

Notice also I've got a bunch of tiny bits which serve as metadata (for
stitching things back together for analysis), some small 1d arrays, some 2d
arrays, and then the big 3d arrays. Except for the *zlast arrays, all of the
stuff in /3d is three-dimensional as you can see. The *zlast stuff is
because some of the variables have an extra point in the x or y direction
(GRR staggered grids) and I just write the last planes out in a separate
dataset. This is because I am splitting up the writes into separate hdf5
files. Were I writing only one file, it would be easier.

As far as the dimensions of the arrays you see here, don't take them too
seriously, this was from a run on another machine. I am holding off on
kraken until I can get at least a decent idea of what to try to improve I/O.

> How many timesteps are going into each file?

Only one.

···

On Mon, Feb 21, 2011 at 5:38 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

On Feb 17, 2011, at 2:49 PM, Leigh Orf wrote:

Quincey


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

> Some background before I get to the problem:
>
> I have created a few different MPI communicators to handle I/O. The model
> writes one single hdf5 file full of 2D and 1D floating point data, as well
> as a tiny bit of metadata in the form of integers and attributes (I will
> call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
> communicator - so each of the 30,000 ranks writes to this file. I would
> prefer not to split this 2D file (which is about 1 GB in size) up, as it's
> used for a quick look at how the simulation is progressing, and can be
> visualized directly with software I wrote. For this file, each rank is
> writing a 22x15 'patch' of floating point data for each field.

One big file, collectively accessed. Sounds great to me. What is the
version of MPT (the cray MPI library) on kraken? At this point I
would be shocked if it's older than 3.2 but since you are using
collective I/O (yay!) make sure you are using MPT 3.2 or newer. (I
think the old ones are kept around.)

module avail sez, among many other things:

xt-mpt/5.0.0(default)

There are other versions available up to xt-mpt/5.2.0

> With the files containing the 3D floating point arrays (call them the 3D
> files), I have it set up such that a flexible number of ranks can each write
> to an HDF5 file, so long as the numbers divide evenly into the full problem.
> For instance, I currently have it set up such that each 3D HDF5 file is
> written by 15x20 (300) ranks, and therefore a total of 100 3D HDF5 files are
> written for a history dump. So each file contains 3D arrays of size
> 330x300x350. Hence, these 3D HDF5 files use a different communicator
> than MPI_COMM_WORLD that I assemble before any I/O occurs.

This is clever, and does let you tune at the application level, but I
don't think it's necessary. Often the MPI-IO hints are better suited
for such tuning, but once you've upped the striping factor (mark's
email) I don't think you'll need those either.

Now you tell me. :) Well, since I'm preparing for 100,000 cores on the
upcoming blue waters machine, having one gazillobyte file containing the
full domain is not an attractive option... for several reasons. I have
always assumed we'd end up writing multiple files per history file dump, and
have been under the impression from the blue waters folks that something
between 1 and numcores files is probably going to provide the best
performance. So that's why I went down this path. Another logistical reason
is because only a small portion of the full model domain typically has the
interesting bits and rather than writing code to extract from the monster
file, it's easier to just pick the files I need. And, finally, we can toss
out parts of the domain we don't need (like along the edges). Etc. Our
proposed simulations are going to produce PB of data. Incidentally I did
receive an email from one of the blue waters folks who wants to work with me
on optimizing I/O (they have seen my code). So I will happily share anything
I learn from them.

> 1. Should I expect poor performance with 30,000 cores writing tiny 2D
> patches to one file? I have considered creating another communicator and
> doing MPI_GATHER on this communicator, reassembling the 2D data, and then
> opening the 2D file using the communicator - this way fewer ranks would be
> accessing it at once. Since I am not familiar with the internals of
> parallel HDF5, I don't know if doing that is necessary or recommended.

This workload (30,000 cores writing a tiny patch) is perfect for
collective I/O. This gather idea you have is kind of like what will
happen inside the MPI-IO library, except the MPI-IO library will
reduce the operation to several writers, not just one. Also, it's
been tested and debugged for you.

Indeed, after my initial email on this problem, I saw that the 2D file was
written quite quickly on kraken - it only took, say, 10 seconds to write with
30,000 cores using MPI_COMM_WORLD. That made me happy. So I think I'm good
with the all-to-one 2D file. On to 3D...

> 2. Since I have flexibility with the number of 3D files, should I create
> fewer? More?

The usual parallel I/O advice is to do collective I/O to a single
shared file. Lustre does tend to perform better when more files are
used but for the sake of your post-processing sanity, let's see what
happens if we keep a single file for now.

I am nervous about doing that. Partly because I am trying to make sense of
this:

http://www.nics.tennessee.edu/io-tips

which has some confusing information, at least to me. They claim performance
degradation with many cores when you are doing single-shared-file (your
suggestion) and also 1 file per process (what I used to do, which is fine for
fewer cores).

There is also the issue of somehow mapping your writes to the stripe size,
which is an option you can set with lfs. Check out the caption of their
Figure 3, which states:

"Write Performance for serial I/O at various Lustre stripe counts. File size
is 32 MB per OST utilized and write operations are 32 MB in size. Utilizing
more OSTs does not increase write performance. * The Best performance is
seen by utilizing a stripe size which matches the size of write operations.
*"

I have no idea how to control the size of "write operations," whatever they
are. Maybe there is a way to set this with HDF5?

> 3. There is a command (lfs) on kraken which controls striping patterns.
> Could I perhaps see better performance by mucking with striping? I have
> looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
> Striping and Parallel I/O" but did not come back with any clear message
> about how I should modify the default settings.
>
> 4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
> independent (H5FD_MPIO_INDEPENDENT)?

I would suggest keeping HDF5 collective I/O enabled at all times. If
you find each process is writing on the order of 4 MiB of data, you
might want to force, at the MPI-IO level, independent I/O. Here again
you do so with MPI-IO tuning parameters. We can go into more detail
later, if it's even needed.

Roger that. I kind of figured collective is always better if it's available.
Oddly enough, I *have* to use independent for blueprint (AIX machine) or it
barfs. But at least I can still do phdf5.

I have time on kraken and wish to do some very large but very short
"simulations" where I just start the model, run one time step, dump the
files, and do timings. My head hurts right now with too many knobs to turn
(# of pHDF5 files, lfs options, etc.).

==rob

···

On Mon, Feb 21, 2011 at 8:02 AM, Rob Latham <robl@mcs.anl.gov> wrote:

On Thu, Feb 17, 2011 at 01:49:16PM -0700, Leigh Orf wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

It is true that you need to align writes to Lustre stripe boundaries
to get reasonable performance to a single shared file. If you use
collective I/O, as Rob and Quincey have suggested, it will handle
this automatically (since mpt/3.2) by aggregating your data on a
subset of "writer" MPI tasks, then packaging the data into
stripe-sized writes. It will also try to set the number of writers to
the number of stripes.

Alternatively, if you are writing the same amount of data from every
task, you can use an independent I/O approach that combines the HDF5
chunking and alignment properties to guarantee stripe-sized writes.
The caveat is that your chunks will be padded with empty data out to
the stripe-size, so this potentially wastes space on disk. In some
cases, though, we have seen very good performance with independent I/O
even with up to thousands of tasks, for instance with our GCRM I/O
benchmark (based on a climate code) on Franklin and Jaguar (both Cray
XTs). You can read more about that in our "Tuning HDF5 for Lustre"
paper that you referenced in a previous email. If you go this route,
you will also want to use two other optimizations we describe in that
paper: disabling an ftruncate() call at file close that leads to
catastrophic delays on Lustre, and suspending metadata flushes until
file close (since the chunk indexing will generate considerable
metadata activity).
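
For the alignment piece specifically, the knob on the HDF5 side is
H5Pset_alignment on the file access property list. A rough sketch (the 1 MB
value below is just an assumed stripe size; match it to whatever you set
with lfs):

      ! ask HDF5 to place every object larger than 'threshold' at a file
      ! offset that is a multiple of 'alignment', so each rank's chunk
      ! starts on a Lustre stripe boundary
      use hdf5
      integer(HSIZE_T) :: threshold, alignment
      integer(HID_T)   :: fapl_id
      integer          :: ierror

      threshold = 65536                ! only bother aligning objects >= 64 kB
      alignment = 1048576              ! 1 MB, i.e. the assumed stripe size
      call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierror)
      call h5pset_alignment_f(fapl_id, threshold, alignment, ierror)
      ! ... then h5pset_fapl_mpio_f / h5fcreate_f as usual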

Mark

···

On Tue, Feb 22, 2011 at 7:36 PM, Leigh Orf <leigh.orf@gmail.com> wrote:

On Mon, Feb 21, 2011 at 8:02 AM, Rob Latham <robl@mcs.anl.gov> wrote:

On Thu, Feb 17, 2011 at 01:49:16PM -0700, Leigh Orf wrote:
> Some background before I get to the problem:
>
> I have created a few different MPI communicators to handle I/O. The model
> writes one single hdf5 file full of 2D and 1D floating point data, as well
> as a tiny bit of metadata in the form of integers and attributes (I will
> call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
> communicator - so each of the 30,000 ranks writes to this file. I would
> prefer not to split this 2D file (which is about 1 GB in size) up, as it's
> used for a quick look at how the simulation is progressing, and can be
> visualized directly with software I wrote. For this file, each rank is
> writing a 22x15 'patch' of floating point data for each field.

One big file, collectively accessed. Sounds great to me. What is the
version of MPT (the cray MPI library) on kraken? At this point I
would be shocked if it's older than 3.2 but since you are using
collective I/O (yay!) make sure you are using MPT 3.2 or newer. (I
think the old ones are kept around.)

module avail sez, among many other things:

xt-mpt/5.0.0(default)

There are other versions available up to xt-mpt/5.2.0

> With the files containing the 3D floating point arrays (call them the 3D
> files), I have it set up such that a flexible number of ranks can each write
> to an HDF5 file, so long as the numbers divide evenly into the full problem.
> For instance, I currently have it set up such that each 3D HDF5 file is
> written by 15x20 (300) ranks, and therefore a total of 100 3D HDF5 files are
> written for a history dump. So each file contains 3D arrays of size
> 330x300x350. Hence, these 3D HDF5 files use a different communicator
> than MPI_COMM_WORLD that I assemble before any I/O occurs.

This is clever, and does let you tune at the application level, but I
don't think it's necessary. Often the MPI-IO hints are better suited
for such tuning, but once you've upped the striping factor (mark's
email) I don't think you'll need those either.

Now you tell me. :) Well, since I'm preparing for 100,000 cores on the
upcoming blue waters machine, having one gazillobyte file containing the
full domain is not an attractive option... for several reasons. I have
always assumed we'd end up writing multiple files per history file dump, and
have been under the impression from the blue waters folks that something
between 1 and numcores files is probably going to provide the best
performance. So that's why I went down this path. Another logistical reason
is because only a small portion of the full model domain typically has the
interesting bits and rather than writing code to extract from the monster
file, it's easier to just pick the files I need. And, finally, we can toss
out parts of the domain we don't need (like along the edges). Etc. Our
proposed simulations are going to produce PB of data. Incidentally I did
receive an email from one of the blue waters folks who wants to work with me
on optimizing I/O (they have seen my code). So I will happily share anything
I learn from them.

> 1. Should I expect poor performance with 30,000 cores writing tiny 2D
> patches to one file? I have considered creating another communicator and
> doing MPI_GATHER on this communicator, reassembling the 2D data, and then
> opening the 2D file using the communicator - this way fewer ranks would be
> accessing it at once. Since I am not familiar with the internals of
> parallel HDF5, I don't know if doing that is necessary or recommended.

This workload (30,000 cores writing a tiny patch) is perfect for
collective I/O. This gather idea you have is kind of like what will
happen inside the MPI-IO library, except the MPI-IO library will
reduce the operation to several writers, not just one. Also, it's
been tested and debugged for you.

Indeed, after my initial email on this problem, I saw that the 2D files were
written quite quickly on kraken - it only took say 10 seconds to write, with
30,000 cores using MPI_COMM_WORLD. That made me happy. So I think I'm good
with the all-to-one 2d file. On to 3d...

> 2. Since I have flexibility with the number of 3D files, should I create
> fewer? More?

The usual parallel I/O advice is to do collective I/O to a single
shared file. Lustre does tend to perform better when more files are
used but for the sake of your post-processing sanity, let's see what
happens if we keep a single file for now.

I am nervous about doing that. Partly because I am trying to make sense of
this:

http://www.nics.tennessee.edu/io-tips

which has some confusing information, at least to me. They claim performance
degradation with many cores when you are doing single-shared-file (your
suggestion) and also 1 file per process (what I used to do, which is fine for
fewer cores).

There is also the issue of somehow mapping your writes to the stripe size,
which is an option you can set with lfs. Check out the figure caption to
figure 3 which states:

"Write Performance for serial I/O at various Lustre stripe counts. File size
is 32 MB per OST utilized and write operations are 32 MB in size. Utilizing
more OSTs does not increase write performance. The Best performance is seen
by utilizing a stripe size which matches the size of write operations. "

I have no idea how to control the size of "write operations" whatever they
are. Maybe there is a way to set this with hdf5?

> 3. There is a command (lfs) on kraken which controls striping patterns.
> Could I perhaps see better performance by mucking with striping? I have
> looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
> Striping and Parallel I/O" but did not come back with any clear message
> about how I should modify the default settings.
>
> 4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
> independent (H5FD_MPIO_INDEPENDENT)?

I would suggest keeping HDF5 collective I/O enabled at all times. If
you find each process is writing on the order of 4 MiB of data, you
might want to force, at the MPI-IO level, independent I/O. Here again
you do so with MPI-IO tuning parameters. We can go into more detail
later, if it's even needed.

Roger that. I kind of figured collective is always better if it's available.
Oddly enough, I *have* to use independent for blueprint (AIX machine) or it
barfs. But at least I can still do phdf5.

I have time on kraken and wish to do some very large but very short
"simulations" where I just start the model, run one time step, and dump
files, and do timings. My head hurts right now with too many knobs to turn
(#of phdf5 files, lfs options, etc.).

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200


Do I assume correctly that, with collective I/O (quoting the "Tuning
HDF5 for Lustre" document), pHDF5 will both "select the correct stripe count"
and "align operations to stripe boundaries"? Will this apply even if I
use subcommunicators to write several (or hundreds) of hdf5 files at the
same time? I just want to be sure.

It seems that collective I/O is the easy way to go if it takes care of the
underlying decisions to optimize writing. However, do any assumptions go
into this, or is HDF5 able to query the lfs parameters? On kraken, you can
set the following parameters: the number of bytes on each OST (the stripe
size), the index of the first stripe, and the number of OSTs to stripe
across. It seems the only parameter really in question is the stripe size;
the OST index of the first stripe should just be left at the default, and
the number of OSTs should be set to the maximum value (160 on kraken).

What strategy should I use to decide the number of bytes per OST? Should I
try to make it roughly the chunk size I am using for 3D data? Or... ? You
can set it anywhere from the kB to GB range.

Leigh

···

On Tue, Feb 22, 2011 at 5:49 PM, Mark Howison <mark.howison@gmail.com>wrote:

Hi Leigh,

It is true that you need to align writes to Lustre stripe boundaries
to get reasonable performance to a single shared file. If you use
collective I/O, as Rob and Quincey have suggested, it will handle
this automatically (since mpt/3.2) by aggregating your data on a
subset of "writer" MPI tasks, then packaging the data into
stripe-sized writes. It will also try to set the number of writers to
the number of stripes.

Alternatively, if you are writing the same amount of data from every
task, you can use an independent I/O approach that combines the HDF5
chunking and alignment properties to guarantee stripe-sized writes.
The caveat is that your chunks will be padded with empty data out to
the stripe-size, so this potentially wastes space on disk. In some
cases, though, we have seen very good performance with independent I/O
even with up to thousands of tasks, for instance with our GCRM I/O
benchmark (based on a climate code) on Franklin and Jaguar (both Cray
XTs). You can read more about that in our "Tuning HDF5 for Lustre"
paper that you referenced in a previous email. If you go this route,
you will also want to use two other optimizations we describe in that
paper: disabling an ftruncate() call at file close that leads to
catastrophic delays on Lustre, and suspending metadata flushes until
file close (since the chunk indexing will generate considerable
metadata activity).

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

OK, you answered one of the questions I just posted in another email.
There is aggregation going on with collective I/O.

My awful (60 GB in 11 minutes) performance occurred with 160 OSTs and
128 MB stripes. I chose 128 MB thinking 'bigger is better', since the
resultant file will be on the order of 500 GB or so. Maybe I went too
large...

Leigh

···

On Tue, Feb 22, 2011 at 5:49 PM, Mark Howison <mark.howison@gmail.com> wrote:

Hi Leigh,

It is true that you need to align writes to Lustre stripe boundaries
to get reasonable performance to a single shared file. If you use
collective I/O, as Rob and Quincey have suggested, it will handle
this automatically (since mpt/3.2) by aggregating your data on a
subset of "writer" MPI tasks, then packaging the data into
stripe-sized writes. It will also try to set the number of writers to
the number of stripes.

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

···

On Feb 23, 2011, at 11:16 AM, Leigh Orf wrote:

On Tue, Feb 22, 2011 at 5:49 PM, Mark Howison <mark.howison@gmail.com> wrote:
Hi Leigh,

It is true that you need to align writes to Lustre stripe boundaries
to get reasonable performance to a single shared file. If you use
collective I/O, as Rob and Quincey have suggested, it will handle
this automatically (since mpt/3.2) by aggregating your data on a
subset of "writer" MPI tasks, then packaging the data into
stripe-sized writes. It will also try to set the number of writers to
the number of stripes.

Alternatively, if you are writing the same amount of data from every
task, you can use an independent I/O approach that combines the HDF5
chunking and alignment properties to guarantee stripe-sized writes.
The caveat is that your chunks will be padded with empty data out to
the stripe-size, so this potentially wastes space on disk. In some
cases, though, we have seen very good performance with independent I/O
even with up to thousands of tasks, for instance with our GCRM I/O
benchmark (based on a climate code) on Franklin and Jaguar (both Cray
XTs). You can read more about that in our "Tuning HDF5 for Lustre"
paper that you referenced in a previous email. If you go this route,
you will also want to use two other optimizations we describe in that
paper: disabling an ftruncate() call at file close that leads to
catastrophic delays on Lustre, and suspending metadata flushes until
file close (since the chunk indexing will generate considerable
metadata activity).

Do I assume correctly that using collective I/O that (quoting the "tuning hdf5 for lustre" document) phdf5 will both "select the correct stripe count" and also "align operations to stripe boundaries"? Will this apply even if I use subcommunicators to write several (or hundreds) of hdf5 files at the same time? I just want to be sure.

It seems that collective I/O is the easy way to go if it takes care of the underlying decisions to optimize writing. However, do any assumptions go into this, or is HDF able to query the lfs parameters? On kraken, you can set the following parameters: number of bytes on each OST, index of the first stripe, and the number of OSTs to stripe. Seems the only parameter in question is the number of bytes per OST, and that the OST index of the first stripe should just be set to the default and that the number of OSTs should be set to the maximum value (160 on kraken).

What strategy should I use to decide the number of bytes per OST? Should I try to make it roughly the chunk size I am using for 3D data? Or... ? You can set it anywhere from the kB to GB range.

  Is there any possibility of digging up some funding to support helping to optimize your code? I'd really like to put some resources toward helping you, but it's going to take some hands on effort, not just email messages. :-/

  Quincey

Did another test on kraken and ended up with approximately 270 MB/s
performance. This appears to be in line with the "baseline" results of
the "Tuning HDF5 for Lustre File Systems" paper. I double-checked to
verify that I was using version 1.8.5 and had H5FD_MPIO_COLLECTIVE
set.

For this particular test, I wrote 12 files spanning 30,000 cores
simultaneously. I watched the data come in (ls -l every 5 seconds) and
noticed that it arrived in fits and starts; towards the end of the
writes, only a few hundred bytes remained to be written, and it took a
long time for those last bytes to get written. Something is weird.

I put some stuff on line if anyone wants to take a glance at it. The
output of h5stat and h5ls on one of the files is included, the output
of lfs getstripe is included, and a typescript file showing the ls -l
output every 5 seconds is included to show how the files grew over
time.

You can view the files here: http://orf5.com/hdf5/kraken

I have one question that may be at the root of this performance issue.
The Tuning paper talked about how chunks should be aligned. I have
chosen my own chunk dimensions, which are the same size as the array
dimensions. h5ls -rv shows that those chunk dimensions are preserved
(and these chunk dimensions are of course not aligned). Does this mean
I am overriding an internal mechanism in hdf5 which chooses its own
chunk dimensions based upon the Lustre stripe size? If I do not write
chunked data, will pHDF5 choose chunk dimensions for me which are
aligned?

Thanks,

Leigh

···

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

Did another test on kraken, ended up with approximately 270 MB/s
performance. This appears to be in line with the "baseline" results of
the "Tuning HDF5 for Lustre File Systems" paper. I double checked to
verify that I was using version 1.8.5 and have H5FD_MPIO_COLLECTIVE
set.

For this particular test, I wrote 12 files spanning 30,000 cores
simultaneously. I watched the data come in (ls -l every 5 seconds) and
noticed the data came in in 'fits and starts' and towards the end of
the writes, only a few hundred bytes remained to be written, and
it took a long time for those bytes to get written. Something is
weird.

I put some stuff on line if anyone wants to take a glance at it. The
output of h5stat and h5ls on one of the files is included, the output
of lfs getstripe is included, and a typescript file showing the ls -l
output every 5 seconds is included to show how the files grew over
time.

You can view the files here: http://orf5.com/hdf5/kraken

  Interesting... So, your datasets are all fixed size, with no filters. If you are writing the entire dataset in one I/O operation (via collective parallel I/O, or with serial I/O), you should try switching to using contiguous storage for all your datasets.
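
In terms of the snippet you posted earlier, that just means dropping the chunked dataset creation property list and creating the dataset directly on the file dataspace; roughly (reusing your variable names):

      ! contiguous layout: no H5P_DATASET_CREATE_F list, no h5pset_chunk_f
      call h5screate_simple_f(rank, dims, filespace_id, ierror)
      call h5screate_simple_f(rank, chunkdims, memspace_id, ierror)
      call h5dcreate_f(file_id, trim(varname), H5T_NATIVE_REAL, filespace_id, &
                       dset_id, ierror)
      ! the hyperslab selection and the collective h5dwrite_f stay as they are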

  Quincey

···

On Mar 5, 2011, at 12:49 PM, Leigh Orf wrote:

I have one question that may be at the root of this performance issue.
The Tuning paper talked about how chunks should be aligned. I have
chosen my own chunk dimensions, which are the same size as the array
dimensions. h5ls -rv shows that those chunk dimensions are preserved
(and these chunk dimensions are of course not aligned). Does this mean
I am overriding an internal mechanism in hdf5 which chooses its own
chunk dimensions based upon the Lustre stripe size? If I do not write
chunked data, will pHDF5 choose chunk dimensions for me which are
aligned?

Thanks,

Leigh

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200


Hi Leigh,

I've actually never tried to use chunking in conjunction with
collective buffering, and it may be that it is interacting poorly with
the CB algorithm in the Cray library. I would also advise trying
contiguous storage like Quincey has suggested, or switching to
independent mode and using chunking with alignment to pad the chunks
out to stripe-width.
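
If you do try the independent route, the only change at the HDF5 level
relative to the code you posted is the transfer property (combined with the
alignment setting I sketched earlier, so that each padded chunk lands on a
stripe boundary); a sketch:

      ! independent transfers instead of collective ones
      call h5pcreate_f(H5P_DATASET_XFER_F, plist_id, ierror)
      call h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_INDEPENDENT_F, ierror)
      call h5dwrite_f(dset_id, H5T_NATIVE_REAL, core3d(1:ni,1:nj,1:nk), dims, ierror, &
                      file_space_id=filespace_id, mem_space_id=memspace_id, &
                      xfer_prp=plist_id)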

Mark

···

On Mon, Mar 7, 2011 at 10:41 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:

Hi Leigh,

On Mar 5, 2011, at 12:49 PM, Leigh Orf wrote:

Did another test on kraken, ended up with approximately 270 MB/s
performance. This appears to be in line with the "baseline" results of
the "Tuning HDF5 for Lustre File Systems" paper. I double checked to
verify that I was using version 1.8.5 and have H5FD_MPIO_COLLECTIVE
set.

For this particular test, I wrote 12 files spanning 30,000 cores
simultaneously. I watched the data come in (ls -l every 5 seconds) and
noticed the data came in in 'fits and starts' and towards the end of
the writes, only a few hundreds of bytes remained to be written, and
it took a long time for those bytes to get written. Something is
weird.

I put some stuff on line if anyone wants to take a glance at it. The
output of h5stat and h5ls on one of the files is included, the output
of lfs getstripe is included, and a typescript file showing the ls -l
output every 5 seconds is included to show how the files grew over
time.

You can view the files here: http://orf5.com/hdf5/kraken

   Interesting... So, your datasets are all fixed size, with no filters. If you are writing the entire dataset in one I/O operation (via collective parallel I/O, or with serial I/O), you should try switching to using contiguous storage for all your datasets.

   Quincey

I have one question that may be at the root of this performance issue.
The Tuning paper talked about how chunks should be aligned. I have
chosen my own chunk dimensions, which are the same size as the array
dimensions. h5ls -rv shows that those chunk dimensions are preserved
(and these chunk dimensions are of course not aligned). Does this mean
I am overriding an internal mechanism in hdf5 which chooses its own
chunk dimensions based upon the Lustre stripe size? If I do not write
chunked data, will pHDF5 choose chunk dimensions for me which are
aligned?

Thanks,

Leigh

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200

