Quincey,
Answers to your questions below... no top-posting today.
Hi Leigh,
Some background before I get to the problem:
I have recently been attempting the largest simulations I have ever done, so
this is uncharted territory for me. I am running on the kraken teragrid
resource.
The application is a 3D cloud model, and the output consists mostly of 2D
and 3D floating point fields.
Each MPI rank runs on a core. I am not using any OpenMP/threads. This is
not an option right now with the way the model is written.
The full problem size is 3300x3000x350 and I'm using a 2D parallel
decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with
each rank having 22x15x350 points). This type of geometry is likely what we
are 'stuck' with unless we go with a 3D parallel decomposition, and that is
not an attractive option.
I have created a few different MPI communicators to handle I/O. The model
writes a single HDF5 file full of 2D and 1D floating point data, as well
as a tiny bit of metadata in the form of integers and attributes (I will
call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
communicator - so each of the 30,000 ranks writes to this file. I would
prefer not to split this 2D file (which is about 1 GB in size) up, as it's
used for a quick look at how the simulation is progressing, and can be
visualized directly with software I wrote. For this file, each rank is
writing a 22x15 'patch' of floating point data for each field.
With the files containing the 3D floating point arrays (call them the 3D
files), I have it set up such that a flexible number of ranks can each write
to an HDF5 file, so long as the numbers divide evenly into the full problem.
For instance, I currently have it set up such that each 3D HDF5 file is
written by 15x20 (300) ranks, and therefore a total of 100 3D HDF5 files are
written for a history dump. So each file contains 3D arrays of size
330x300x350. Hence, these 3D HDF5 files use a different communicator than
MPI_COMM_WORLD that I assemble before any I/O occurs.
Excellent description, thanks!
The 2D and 3D files are written at the same time (within the same routine).
For each field, I either write 2D and 3D data, or just 2D data. I can turn
off writing the 3D data and just write the 2D data, but not the other way
around (I could change this and may do so). I currently have a run in the
queue where only 2D data is written so I can determine whether the
bottleneck is with that file as opposed to the 3D files.
The problem I am having is abysmal I/O performance, and I am hoping that
maybe I can get some pointers. I fully realize that the lustre file system
on the kraken teragrid machine is not perfect and has its quirks. However,
after 10 minutes of writing the 2D file and the 3D files, I had only output
about 10 GB of data.
That's definitely not a good I/O rate. :-/
Questions:
1. Should I expect poor performance with 30,000 cores writing tiny 2D
patches to one file? I have considered creating another communicator and
doing MPI_GATHER on this communicator, reassembling the 2D data, and then
opening the 2D file using the communicator - this way fewer ranks would be
accessing at once. Since I am not familiar with the internals of parallel
HDF5, I don't know if doing that is necessary or recommended.
I don't know if this would help, but I'm definitely interested in knowing
what happens if you do it.
2. Since I have flexibility with the number of 3D files, should I create
fewer? More?
Ditto here.
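For context on experimenting with the file count, here is roughly how the
per-file communicator can be parameterized so the number of 3D files becomes
a single knob. The variable names are illustrative (not from the model), and
a row-major rank ordering in x is assumed.

! ranks_x_per_file x ranks_y_per_file ranks share one 3D file
! (15x20 in the current setup, giving 10x10 = 100 files for 150x200 ranks).
myi = mod(myrank, nrank_x)               ! this rank's (i,j) in the 2D decomposition
myj = myrank / nrank_x
filei = myi / ranks_x_per_file           ! which file column this rank belongs to
filej = myj / ranks_y_per_file
file_color = filej * (nrank_x / ranks_x_per_file) + filei
call MPI_Comm_split(MPI_COMM_WORLD, file_color, myrank, file_comm, ierror)
! file_comm is then the communicator handed to h5pset_fapl_mpio_f() when that
! 3D file is opened, so changing ranks_x_per_file / ranks_y_per_file changes
! the number of files (and writers per file) without touching anything else.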
3. There is a command (lfs) on kraken which controls striping patterns.
Could I perhaps see better performance by mucking with striping? I have
looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
Striping and Parallel I/O" but did not come back with any clear message
about how I should modify the default settings.
Ditto here.
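For reference, striping is set per directory (files inherit it when they are
created), so one cheap experiment is to put the 2D file and the 3D files in
directories with different stripe counts. The counts below are just starting
points to vary, not recommendations, and the exact option syntax should be
double-checked with man lfs on kraken:

lfs setstripe -c 4  <directory holding the 2D file>    # few stripes for many tiny writes
lfs setstripe -c 32 <directory holding the 3D files>   # wider striping for the large files
lfs getstripe <directory holding the 3D files>         # confirm what was applied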
4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
independent (H5FD_MPIO_INDEPENDENT)?
This should be easy to experiment with, but I don't think it'll help.
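Since the snippet later in this message already builds a dataset transfer
property list, the A/B test is a one-line change (both constants are part of
the parallel HDF5 Fortran API):

call h5pcreate_f(H5P_DATASET_XFER_F, plist_id, ierror)
! current behavior: collective transfers
call h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_COLLECTIVE_F, ierror)
! to compare, swap in independent transfers:
! call h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_INDEPENDENT_F, ierror)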
Since I am unsure where the bottleneck is, I'm asking the HDF5 list first,
and as I understand it some of the folks here are familiar with the kraken
resource and have used parallel HDF5 with very large numbers of ranks. Any
tips or suggestions for how to wrestle this problem are greatly appreciated.
I've got some followup questions, which might help future optimizations:
Are you chunking the datasets, or are they contiguous?
I am chunking the datasets by the dimensions of the subdomain running on each
core (each MPI rank runs on 1 core). So, if I have 3D arrays dimensioned by
15x15x200 on each core, and 4x4 cores on each MPI communicator, the chunk
dimensions are 15x15x200 and the array dimension written to each HDF5 file
is 60x60x200.
A snippet from my code follows. The core dimensions are ni x nj x nk. The
file dimensions are ionumi x ionumj x nk. ionumi = ni * corex, where corex
is the number of cores in the x direction spanning 1 file, same for y. Since
I am only doing a 2d parallel decomposition, nk spans the full vertical
extent.
mygroupi goes from 0 to corex-1, mygroupj goes from 0 to corey-1.
! File (per-file) dataspace is ionumi x ionumj x nk; each rank's chunk is ni x nj x nk
dims(1)=ionumi
dims(2)=ionumj
dims(3)=nk
chunkdims(1)=ni
chunkdims(2)=nj
chunkdims(3)=nk
! Each rank selects a single block (its own subdomain) in the file dataspace
count(1)=1
count(2)=1
count(3)=1
offset(1) = mygroupi * chunkdims(1)
offset(2) = mygroupj * chunkdims(2)
offset(3) = 0
stride(1) = 1
stride(2) = 1
stride(3) = 1
block(1) = chunkdims(1)
block(2) = chunkdims(2)
block(3) = chunkdims(3)

call h5screate_simple_f(rank,dims,filespace_id,ierror)
call h5screate_simple_f(rank,chunkdims,memspace_id,ierror)
! Chunked dataset creation property: one chunk per rank's subdomain
call h5pcreate_f(H5P_DATASET_CREATE_F,chunk_id,ierror)
call h5pset_chunk_f(chunk_id,rank,chunkdims,ierror)
call h5dcreate_f(file_id,trim(varname),H5T_NATIVE_REAL,filespace_id, &
                 dset_id,ierror,chunk_id)
call h5sclose_f(filespace_id,ierror)
call h5dget_space_f(dset_id, filespace_id, ierror)
call h5sselect_hyperslab_f(filespace_id, H5S_SELECT_SET_F, offset, &
                           count, ierror, stride, block)
! Collective MPI-IO transfer (MPIO is set to H5FD_MPIO_COLLECTIVE_F)
call h5pcreate_f(H5P_DATASET_XFER_F, plist_id, ierror)
call h5pset_dxpl_mpio_f(plist_id, MPIO, ierror)
call h5dwrite_f(dset_id, H5T_NATIVE_REAL, core3d(1:ni,1:nj,1:nk), &
                dims, ierror, &
                file_space_id = filespace_id, mem_space_id = memspace_id, &
                xfer_prp = plist_id)
How many datasets are you creating each timestep?
This is a selectable option. Here is a typical scenario. In this case, just
for some background, corex=4, corey=4 (16 cores per file) and there are 16
files per full domain write. So each .cm1hdf5 file contains 1/16th of the
full domain. The .2Dcm1hdf5 file contains primarily 2D slices of the full
domain. It is written by *ALL* cores (and performance to this file is good,
even with 30,000 cores writing to it on kraken).
bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % ls -l
-rw-r--r-- 1 orf jmd 58393968 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0000.2Dcm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0000.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0001.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0002.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0003.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0004.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0005.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0006.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0007.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0008.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0009.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0010.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0011.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0012.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0013.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0014.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0015.cm1hdf5
bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % h5ls -rv L500ang120_0.010_1000.0m.03600_0009.cm1hdf5 | grep Dataset
/2d/cpc Dataset {176/176, 140/140}
/2d/cph Dataset {176/176, 140/140}
/2d/cref Dataset {176/176, 140/140}
/2d/maxsgs Dataset {176/176, 140/140}
/2d/maxshs Dataset {176/176, 140/140}
/2d/maxsrs Dataset {176/176, 140/140}
/2d/maxsus Dataset {176/176, 140/140}
/2d/maxsvs Dataset {176/176, 140/140}
/2d/maxsws Dataset {176/176, 140/140}
/2d/minsps Dataset {176/176, 140/140}
/2d/sfcrain Dataset {176/176, 140/140}
/2d/uh Dataset {176/176, 140/140}
/3d/dbz Dataset {96/96, 176/176, 140/140}
/3d/khh Dataset {96/96, 176/176, 140/140}
/3d/khv Dataset {96/96, 176/176, 140/140}
/3d/kmh Dataset {96/96, 176/176, 140/140}
/3d/kmv Dataset {96/96, 176/176, 140/140}
/3d/ncg Dataset {96/96, 176/176, 140/140}
/3d/nci Dataset {96/96, 176/176, 140/140}
/3d/ncr Dataset {96/96, 176/176, 140/140}
/3d/ncs Dataset {96/96, 176/176, 140/140}
/3d/p Dataset {96/96, 176/176, 140/140}
/3d/pi Dataset {96/96, 176/176, 140/140}
/3d/pipert Dataset {96/96, 176/176, 140/140}
/3d/ppert Dataset {96/96, 176/176, 140/140}
/3d/qc Dataset {96/96, 176/176, 140/140}
/3d/qg Dataset {96/96, 176/176, 140/140}
/3d/qi Dataset {96/96, 176/176, 140/140}
/3d/qr Dataset {96/96, 176/176, 140/140}
/3d/qs Dataset {96/96, 176/176, 140/140}
/3d/qv Dataset {96/96, 176/176, 140/140}
/3d/qvpert Dataset {96/96, 176/176, 140/140}
/3d/rho Dataset {96/96, 176/176, 140/140}
/3d/rhopert Dataset {96/96, 176/176, 140/140}
/3d/th Dataset {96/96, 176/176, 140/140}
/3d/thpert Dataset {96/96, 176/176, 140/140}
/3d/tke Dataset {96/96, 176/176, 140/140}
/3d/u Dataset {96/96, 176/176, 140/140}
/3d/u_yzlast Dataset {96/96, 176/176}
/3d/uinterp Dataset {96/96, 176/176, 140/140}
/3d/upert Dataset {96/96, 176/176, 140/140}
/3d/upert_yzlast Dataset {96/96, 176/176}
/3d/v Dataset {96/96, 176/176, 140/140}
/3d/v_xzlast Dataset {96/96, 140/140}
/3d/vinterp Dataset {96/96, 176/176, 140/140}
/3d/vpert Dataset {96/96, 176/176, 140/140}
/3d/vpert_xzlast Dataset {96/96, 140/140}
/3d/w Dataset {97/97, 176/176, 140/140}
/3d/winterp Dataset {96/96, 176/176, 140/140}
/3d/xvort Dataset {96/96, 176/176, 140/140}
/3d/yvort Dataset {96/96, 176/176, 140/140}
/3d/zvort Dataset {96/96, 176/176, 140/140}
/basestate/pi0 Dataset {96/96}
/basestate/pres0 Dataset {96/96}
/basestate/qv0 Dataset {96/96}
/basestate/rh0 Dataset {96/96}
/basestate/th0 Dataset {96/96}
/basestate/u0 Dataset {96/96}
/basestate/v0 Dataset {96/96}
/grid/myi Dataset {1/1}
/grid/myj Dataset {1/1}
/grid/ni Dataset {1/1}
/grid/nj Dataset {1/1}
/grid/nodex Dataset {1/1}
/grid/nodey Dataset {1/1}
/grid/nx Dataset {1/1}
/grid/ny Dataset {1/1}
/grid/nz Dataset {1/1}
/grid/x0 Dataset {1/1}
/grid/x1 Dataset {1/1}
/grid/y0 Dataset {1/1}
/grid/y1 Dataset {1/1}
/mesh/dx Dataset {1/1}
/mesh/dy Dataset {1/1}
/mesh/xf Dataset {140/140}
/mesh/xh Dataset {140/140}
/mesh/yf Dataset {176/176}
/mesh/yh Dataset {176/176}
/mesh/zf Dataset {97/97}
/mesh/zh Dataset {96/96}
/time Dataset {1/1}
bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % h5ls -rv L500ang120_0.010_1000.0m.03600_0009.cm1hdf5 | grep 3d | grep -v zlast | wc -l
44
So there are 44 3D fields in this case. That's pretty much the kitchen sink;
normally I'd probably be writing half as many datasets.
Notice also that I've got a bunch of tiny datasets which serve as metadata
(for stitching things back together for analysis), some small 1D arrays, some
2D arrays, and then the big 3D arrays. Except for the *zlast arrays, all of
the stuff in /3d is three-dimensional, as you can see. The *zlast datasets
exist because some of the variables have an extra point in the x or y
direction (GRR, staggered grids), and I just write those last planes out as a
separate dataset. This is because I am splitting the writes up into separate
HDF5 files; were I writing only one file, it would be easier.
As far as the dimensions of the arrays you see here go, don't take them too
seriously; this was from a run on another machine. I am holding off on kraken
until I can get at least a decent idea of what to try to improve I/O.
How many timesteps are going into each file?
Only one.
On Mon, Feb 21, 2011 at 5:38 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:
On Feb 17, 2011, at 2:49 PM, Leigh Orf wrote:
Quincey
--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200