Slow performance writing hyperslabs

Hello,

I am updating a Monte Carlo particle transport code written in Fortran to
be run on Titan (http://en.wikipedia.org/wiki/Titan_(supercomputer)),
and I'm having some trouble achieving good performance writing to many
small sections of large datasets with independent I/O. Monitoring with du,
it looks like I'm writing only about 50 MB/s while creating a 78 GB file. I
was hoping for much better performance from the Lustre filesystem.

For my current case I'm trying to write to 3000 1D datasets in the same
file containing 3,534,960 doubles each, where each individual write will
always be a contiguous 1D chunk of 206 doubles. These writes are coming
from 980 different MPI processes, and each chunk of 206 that needs to be
written could be in any of the 3000 datasets, anywhere within the large 1D
array. I'm doing things in batches: each MPI process collects a bunch of
these chunks before doing a round of writing.

It's trivial to calculate where in the large datasets each chunk should be
written, so at first I just looped over the chunks and selected the
appropriate hyperslab before each write. I figured this might be slow
because it was jumping all around the datasets for each little write, so I
switched to a scheme where I first sort the writes into the order they
would appear in the dataset, and then create a large union of
irregularly-spaced hyperslabs and do one big write, similar to the
difference discussed here:
http://hdf-forum.184993.n3.nabble.com/reading-multiple-irregularly-spaced-hyperslabs-td426929.html

This gave me about a 2 or 3x speedup from where I was, but I'm still pretty
unsatisfied with these write speeds. I'm still a little new to HDF5, so
I'm worried I might be missing something fundamental. I've done a little
playing around with setting the chunking on these datasets to multiples of
206, but I haven't had any success there.
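
In case it's relevant, here's roughly how I set the chunking; a minimal
sketch, where dcpl and the multiplier of 10 are just placeholders for the
values I tried:

integer(HID_T) :: dcpl
integer(HSIZE_T) :: chunk_dims(1)

! Create the datasets with a chunked layout (chunk size a multiple of 206)
chunk_dims(1) = 206 * 10
call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, hdf5_err)
call h5pset_chunk_f(dcpl, 1, chunk_dims, hdf5_err)
call h5dcreate_f(group_id, 'data', H5T_NATIVE_DOUBLE, dspace, dset, &
    hdf5_err, dcpl_id = dcpl)
call h5pclose_f(dcpl, hdf5_err)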

Is there a way to accomplish the write pattern I described more efficiently
than what I'm doing? Example code is below showing what I'm doing now. This
is with HDF5 1.8.11.

Thanks in advance for any advice!

Nick


---

! (excerpt; assumes "use hdf5" and "use iso_c_binding" are in scope)
integer(HID_T) :: file_id, group_id, dset, dspace, memspace, plist
integer(HSIZE_T) :: dims(1), block(1), count(1), start(1)
integer :: hdf5_err, i, j, idx, n_chunks
integer, allocatable :: chunk_starts(:)
real(8), allocatable, target :: chunks(:)   ! target so c_loc() can be used
type(c_ptr) :: f_ptr

! All the datasets are created up front once

dims(1) = 3534960
do i = 1, 3000
  call h5fopen_f(filename, H5F_ACC_RDWR_F, file_id, hdf5_err)
  call h5gopen_f(file_id, groupname(i), group_id, hdf5_err)
  call h5screate_simple_f(1, dims, dspace, hdf5_err)
  call h5dcreate_f(group_id, 'data', H5T_NATIVE_DOUBLE, dspace, dset, &
      hdf5_err)
  call h5dclose_f(dset, hdf5_err)
  call h5sclose_f(dspace, hdf5_err)
  call h5gclose_f(group_id, hdf5_err)
  call h5fclose_f(file_id, hdf5_err)
end do

...

! The rest is run on each MPI process, which has n_chunks chunks of 206 to write
! The starting position of each chunk j within the large dataset, in units of
! 206-double chunks, is stored in chunk_starts(j)
! The chunks themselves are stored in one large contiguous array of doubles: chunks

! Setup file access property list with parallel I/O access
call h5pcreate_f(H5P_FILE_ACCESS_F, plist, hdf5_err)
call h5pset_fapl_mpio_f(plist, MPI_COMM_WORLD, MPI_INFO_NULL, hdf5_err)

! Open the file
call h5fopen_f(filename, H5F_ACC_RDWR_F, file_id, hdf5_err, &
    access_prp = plist)

! Close property list
call h5pclose_f(plist, hdf5_err)

! Create the property list to describe independent parallel I/O
call h5pcreate_f(H5P_DATASET_XFER_F, plist, hdf5_err)
call h5pset_dxpl_mpio_f(plist, H5FD_MPIO_INDEPENDENT_F, hdf5_err)

do i = 1, 3000

  block = 206
  count = 1

  ! Open the group
  call h5gopen_f(file_id, groupname(i), group_id, hdf5_err)

  ! Open the dataset
  call h5dopen_f(group_id, 'data', dset, hdf5_err)

  ! Open the dataspace and memory space
  call h5dget_space_f(dset, dspace, hdf5_err)
  call h5screate_simple_f(1, block * n_chunks, memspace, hdf5_err)

  ! Select the irregular hyperslab
  call h5sselect_none_f(dspace, hdf5_err)
  do j = 1, n_chunks

    idx = chunk_starts(j)
    start = (idx - 1) * block

    ! Select the hyperslab
    call h5sselect_hyperslab_f(dspace, H5S_SELECT_OR_F, start, &
        count, hdf5_err, block = block)

  end do

  ! Write the data
  chunks = chunks_for_group(i)
  f_ptr = c_loc(chunks(1))
  call h5dwrite_f(dset, H5T_NATIVE_DOUBLE, f_ptr, hdf5_err, &
      file_space_id = dspace, mem_space_id = memspace, &
      xfer_prp = plist)

  ! Close the dataspace, memory space, dataset, and group
  call h5sclose_f(dspace, hdf5_err)
  call h5sclose_f(memspace, hdf5_err)
  call h5dclose_f(dset, hdf5_err)
  call h5gclose_f(group_id, hdf5_err)

end do

! Close the transfer property list and file
call h5pclose_f(plist, hdf5_err)
call h5fclose_f(file_id, hdf5_err)

Just to follow up: I tried switching to collective I/O instead and saw a 2x
speedup (sketch below). I also played around with chunking and chunk cache
parameters a fair bit, but only ever got maybe another 1.5x out of that.
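
For reference, the collective switch was just a change to the data transfer
property list; a minimal sketch (everything else stays as in the code I
posted earlier):

! Request collective rather than independent MPI-IO transfers
call h5pcreate_f(H5P_DATASET_XFER_F, plist, hdf5_err)
call h5pset_dxpl_mpio_f(plist, H5FD_MPIO_COLLECTIVE_F, hdf5_err)

(With collective transfers every rank has to take part in each h5dwrite_f
call, so a rank with nothing to write for a given dataset still participates
with an empty selection via h5sselect_none_f.)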

After a little more searching I found this investigation
(http://confluence.diamond.ac.uk/display/HDF5DEV/HDF5+writer+performance+review)
that saw similarly poor performance for a few benchmarks when using MPI. I
also found this presentation
(https://www.nersc.gov/assets/Uploads/Readingand-WritingDataNUGFeb2012.ppt),
which observed similar performance with MPI-IO. It also led me to adjust
the Lustre OST striping parameters for the directory I'm creating the file
in, which helped bump the performance to around 200 MB/s.
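
For anyone trying the same thing: I changed the striping on the output
directory before the file is created (lfs setstripe). An alternative I
haven't verified on Titan would be to pass the usual ROMIO Lustre hints
through the file access property list; a rough sketch with placeholder
values:

integer :: info, mpi_err

! MPI-IO (ROMIO) striping hints; generally only honored when the file is
! first created, and support depends on the MPI implementation
call MPI_Info_create(info, mpi_err)
call MPI_Info_set(info, "striping_factor", "32", mpi_err)    ! number of OSTs (placeholder)
call MPI_Info_set(info, "striping_unit", "4194304", mpi_err) ! 4 MiB stripes (placeholder)

call h5pcreate_f(H5P_FILE_ACCESS_F, plist, hdf5_err)
call h5pset_fapl_mpio_f(plist, MPI_COMM_WORLD, info, hdf5_err)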

At this point I'm thinking that it's probably not HDF5 that is my problem,
but rather the MPI-IO implementation and the filesystem itself.

Nick


Hi Nick,

Check this FAQ out for some pointers on how to approach things:
http://www.hdfgroup.org/HDF5/faq/parallel.html#perf

It might also be worth upgrading to HDF5 1.8.14.

As you mentioned, you are writing a small amount of data per write operation. You will not get good performance that way from HDF5/MPI-IO/Lustre, or even from plain HDF5/POSIX I/O/Lustre. You need to use collective I/O to combine as much data as you can into each write. I believe this is the key thing you need to do to get good performance, beyond tuning some of the parameters from the FAQ above. The more data and information you give to the lower layers, the better performance you can achieve.

We are also working on a prototype for multi-dataset I/O, which lets you write to multiple datasets in a single I/O call so you can hand the lower layers as much information as possible in one operation. The prototype is still in progress (i.e., it might not be very stable), but it should be wrapped up, merged, and released in the 1.8 series soon. You can try it out from this branch:
https://svn.hdfgroup.uiuc.edu/hdf5/features/multi_rd_wd_coll_io/
The new functions are H5Dread_multi() and H5Dwrite_multi(). I have attached an old RFC, but it might be a little out of date.

Thanks,
Mohamad

H5HPC_MultiDset_RW_IO_RFC_v4_20130320.docx (307 KB)
