Question re: Howison et al Lustre mdc_config tuning recommendations

Hi Quincey,

   Yes, it is 1.8.6.

            Regards,

            John

···

From hdf-forum-bounces@hdfgroup.org Fri Feb 11 18:41:43 2011
Date: Fri, 11 Feb 2011 16:42:30 -0800
To: HDF Users Discussion List <hdf-forum@hdfgroup.org>
Subject: Re: [Hdf-forum] Question re: Howison et al Lustre mdc_config tuning recommendations

Hi John,

On Feb 10, 2011, at 5:33 PM, John Mainzer wrote:

From hdf-forum-bounces@hdfgroup.org Thu Feb 10 16:48:04 2011
From: Rhys Ulerich <rhys.ulerich@gmail.com>
Date: Thu, 10 Feb 2011 16:48:22 -0600
To: HDF Users Discussion List <hdf-forum@hdfgroup.org>
Subject: [Hdf-forum] Question re: Howison et al Lustre mdc_config tuning recommendations

Good day,

I decided to try some metadata caching parameters related to the
discussion on pages 3-4 of Howison et al
(http://www.hdfgroup.org/pubs/papers/howison_hdf5_lustre_iasds2010.pdf).
The paper gives sample code as follows in Figure 5:

  H5AC_cache_config_t mdc_config;
  hid_t file_id;
  file_id = H5Fopen("file.h5", H5ACC_RDWR, H5P_DEFAULT);
  mdc_config.version = H5AC__CURR_CACHE_CONFIG_VERSION;
  H5Pget_mdc_config(file_id, &mdc_config)
  mdc_config.evictions_enabled = 0 /* FALSE */;
  mdc_config.incr_mode = H5C_incr__off;
  mdc_config.decr_mode = H5C_decr__off;
  H5Pset_mdc_config(file_id, &mdc_config);

Attempting to directly implement this fails. Modifying the above so
the H5Pget_mdc_config/H5Pset_mdc_config operates on a file access
property list succeeds. However, I saw runtime errors like

HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) MPI-process 2:
#000: H5Pfapl.c line 1354 in H5Pset_mdc_config(): invalid metadata
cache configuration
  major: Invalid arguments to routine
  minor: Bad value
#001: H5AC.c line 2665 in H5AC_validate_config(): Can't disable
evictions while auto-resize is enabled.
  major: Invalid arguments to routine
  minor: Bad value

which I've remedied by glancing at H5AC.c:2665 and then adding the statement

  mdc_config.flash_incr_mode = H5C_flash_incr__off;

to the code block above. Is this modification in the spirit of what
Howison et al. suggests? Or is using H5C_flash_incr__off asking for
trouble in ways that the paper does not discuss?

Thanks for your time,
Rhys
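
(For reference, a minimal sketch of the amended sequence described above, applied to a file access property list and with the flash size-increment mode also switched off; variable names and the missing semicolon are tidied up, so treat it as an illustration rather than the paper's exact figure.)

  /* Assumes #include "hdf5.h"; error checking omitted. */
  H5AC_cache_config_t mdc_config;
  hid_t fapl, file_id;

  fapl = H5Pcreate(H5P_FILE_ACCESS);

  mdc_config.version = H5AC__CURR_CACHE_CONFIG_VERSION;
  H5Pget_mdc_config(fapl, &mdc_config);               /* start from the current settings */

  mdc_config.evictions_enabled = 0 /* FALSE */;       /* keep metadata resident in the cache */
  mdc_config.incr_mode = H5C_incr__off;               /* no adaptive size increases */
  mdc_config.decr_mode = H5C_decr__off;               /* no adaptive size decreases */
  mdc_config.flash_incr_mode = H5C_flash_incr__off;   /* required, per the error above */

  H5Pset_mdc_config(fapl, &mdc_config);

  file_id = H5Fopen("file.h5", H5F_ACC_RDWR, fapl);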


Hi Rhys,

  You did exactly as you should have.

  The metadata cache configuration in the paper worked due to an error
on my part when I implemented the flash cache size increment code. I have
since noticed and corrected the error -- hence the failure you encountered
when you tried to duplicate the code from the paper.

Is this fix included in the 1.8.6 release?

Thanks,
  Quincey

  To give you some background:

  In its default configuration, the metadata cache in HDF5 will attempt to
automatically adapt to the current metadata working set size in real time.
While there is no fundamental reason why this feature can't be active when
evictions are disabled, I can't think of any circumstances in which it would
be useful. Further, writing the test code required to verify proper behavior
under these circumstances would require significant effort.

  Hence the decision to require that adaptive metadata cache resizing be
disabled when evictions are disabled. Needless to say, this decision will
be re-visited if anyone comes up with a plausible reason to do so.

                                              Best regards,

                                              John Mainzer


  H5AC_cache_config_t mdc_config;
  hid_t file_id;
  file_id = H5Fopen("file.h5", H5ACC_RDWR, H5P_DEFAULT);
  mdc_config.version = H5AC__CURR_CACHE_CONFIG_VERSION;
  H5Pget_mdc_config(file_id, &mdc_config)
  mdc_config.evictions_enabled = 0 /* FALSE */;
  mdc_config.incr_mode = H5C_incr__off;
  mdc_config.decr_mode = H5C_decr__off;
  H5Pset_mdc_config(file_id, &mdc_config);

I couldn't find fortran bindings for these. Do they exist in any recent releases or svn branches.

thanks

JB

Hi John,


Fortran wrappers for these calls do not exist. Please let us know which Fortran wrappers you need and we will add them to our to-do list.

Elena

···


Hi,
  I'm trying to write a wrapper for H5TB and pfb the code. I'm able to call H5TBmake_table but I get -1 as the result every time :(. Any help is appreciated!

C++ :
Definition :
[DllImport("hdf5_hldll.dll",
           CharSet=CharSet::Auto,
           CallingConvention=CallingConvention::StdCall)]
extern "C"
herr_t _cdecl H5TBmake_table( [MarshalAs(UnmanagedType::LPStr)]
            String^ table_title,
            hid_t loc_id,

            [MarshalAs(UnmanagedType::LPStr)]
            String^ dset_name,

            hsize_t nfields,
            const hsize_t nrecords,
            size_t type_size,

            [MarshalAs(UnmanagedType::LPArray, ArraySubType=UnmanagedType::LPStr)]
            cli::array<String^>^ field_names,
            [MarshalAs(UnmanagedType::LPArray)]
            array<hsize_t>^ field_offset,
            [MarshalAs(UnmanagedType::LPArray)]
            array<hid_t>^ field_types,
            hsize_t chunk_size,
            void *fill_data,
            int compress,
            const void *data );

Signature definition :

namespace HDF5DotNet
{
  generic <class Type>
  herr_t H5TB::MakeTable(
      String^ title,
      H5LocId^ locId,
      String^ datasetName,
      hsize_t fieldsCount,
      hsize_t recordsCount,
      size_t typeSize,
      array<String^>^ fieldNames,
      array<hsize_t>^ fieldOffsets,
      array<hid_t>^ fieldTypes,
      hsize_t chunkSize,
      H5Array<Type>^ fillData,
      int compress,
      H5Array<Type>^ data
      );
}

Implementation :

namespace HDF5DotNet
{
  generic <class Type>
  herr_t H5TB::MakeTable(
      String^ title,
      H5LocId^ locId,
      String^ datasetName,
      hsize_t fieldsCount,
      hsize_t recordsCount,
      size_t typeSize,
      array<String^>^ fieldNames,
      array<hsize_t>^ fieldOffsets,
      array<hid_t>^ fieldTypes,
      hsize_t chunkSize,
      H5Array<Type>^ fillData,
      int compress,
      H5Array<Type>^ data
      )
  {
    pin_ptr<Type> _pin_fillData = fillData->getDataAddress();

    void* _fillData = _pin_fillData ;

    pin_ptr<Type> _pin_Data = data->getDataAddress();
    void* _data = _pin_Data;

    for(int i = 0; i < fieldNames->Length; i++)
      Console::WriteLine( fieldNames[0]->ToString() + "------" );

    herr_t err = H5TBmake_table( title, locId->Id, datasetName, fieldsCount, recordsCount, typeSize,
      fieldNames, fieldOffsets, fieldTypes, chunkSize, fillData, compress, data );
    if( err < 0 )
      Console::WriteLine("Failed creating table");
    return err;
  }
}

Kindly let me know where I am going wrong.

Thanks & Regards
Kailash K


Elena,

I was just replying to myself when your email came in.
I knocked up a quick wrapper to enable testing of the cache eviction stuff so I'm happy (ish).

However: I'm seeing puzzling behaviour with chunking.
Using collective IO (tweaking various params), I see transfer rates from 1GB/s to 3GB/s depending on stripe size, number of cb_nodes, etc.

However, when using chunking with independent IO, I set the stripe size to 6MB, chunk dims to match the 6MB, each node is writing 6MB, alignment is set to 6MB intervals, and I've followed all the tips I can find. I see (for 512 nodes writing 6MB each = 3GB total) a max throughput of around 150MB/s.

This is shockingly slow compared to collective IO and I'm quite surprised. I've been playing with this for a few days now and my general impression is that
chunking = rubbish
collective = nice

I did not expect chunking to be so bad compared to collective (which is a shame as I was hoping to use it for compression etc).

Can anyone suggest further tweaks that I should be looking at? (One thing, for example, that seems to make no difference is the H5Pset_istore_k(fcpl, btree_ik) call. I still don't quite know what the correct value for btree_ik is. I've read the man page, but I'm puzzled as to its meaning: if I know there will be 512 chunks, what is the 'right' value of btree_ik?)

Any clues gratefully received for optimizing chunking. I hoped the thread about 30,000 processes would carry on as I found it interesting to follow.

ttfn

JB
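
(A minimal sketch of the tuning calls mentioned above, assuming the parallel HDF5 1.8.x C API; the 6MB figures follow the message, and the btree_ik value shown is just the documented default, not a recommendation.)

  /* Assumes #include "hdf5.h" and <mpi.h>; error checking omitted. */
  hsize_t six_mb = 6 * 1024 * 1024;
  hsize_t chunk_dims[1] = { six_mb / sizeof(double) };    /* one 6MB chunk of doubles */

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);  /* MPI-IO file driver */
  H5Pset_alignment(fapl, six_mb, six_mb);                 /* align objects >= 6MB on 6MB boundaries */

  hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
  H5Pset_istore_k(fcpl, 32);                              /* chunk B-tree 1/2-rank; 32 is the default */

  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 1, chunk_dims);                      /* chunk size matches the stripe size */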

···


Hi John,

What platform and parallel file system is this on? Have you tried
using the MPI-POSIX VFD for independent access?

Thanks,
Mark
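
(A minimal sketch of selecting the MPI-POSIX driver on a file access property list, assuming an HDF5 1.8.x parallel build; the last argument controls GPFS hints and is assumed off here.)

  /* Assumes #include "hdf5.h" and <mpi.h>. */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpiposix(fapl, MPI_COMM_WORLD, 0);   /* 0 = do not use GPFS hints */
  hid_t file_id = H5Fopen("file.h5", H5F_ACC_RDWR, fapl);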

···


Mark

Newish Cray XE6, a small machine with 4224 cores and 4 OSTs / 28 OSSs (I think! We have two Lustre filesystems mounted and I'm playing on the new one, so I may be wrong). It's Magny-Cours, 24 cores per node, and the code is not OpenMP enabled, so there are 24 MPI tasks on each node - which I'm sure is a problem as they're all essentially trying to write at once (no surprise that collective IO helps there).

I have not tried the MPI_POSIX VFD, but I will look into it right away.

Thanks - I'll report back with further findings

JB

···


Have you tried using the MPI-POSIX VFD for independent access?

Thanks - I'll report back with further findings

Rubbish! I still only get 150MB/s with the mpiposix driver.

As Queen Victoria would have said "we are not amused"

I suspect I've got an error somewhere because something should have changed.

JB

···


If you look at some of my recent posts on this list, you'll find that
I am having the same problem with collective I/O with Lustre, trying
to have 30,000 ranks write to one file using collective pHDF5 I/O
(with or without chunking, I still get bad performance).

In fact, I have given up pursuing this approach and am now trying the
core driver with serial HDF5, which lets you do buffered I/O. For my
problem, I am able to buffer over 50 writes to memory before data
needs to be flushed to disk. However, you are stuck with
1-file-per-core with this method, and each file will contain multiple
time levels - but I can happily deal with this if I/O is respectable.
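
(A minimal sketch of the core, i.e. in-memory, driver being described; the 64MB allocation increment, the backing-store flag and the file name are placeholder assumptions.)

  /* Assumes #include "hdf5.h"; serial HDF5 build. */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_core(fapl, 64 * 1024 * 1024, 1);   /* grow in 64MB steps; 1 = write to disk on close */
  hid_t file_id = H5Fcreate("rank_00042.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  /* ... many H5Dwrite calls accumulate in memory; the file only hits disk at H5Fclose ... */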

I have not yet benchmarked performance with this buffered I/O approach
on kraken (100,000 core machine with Lustre) but I will soon. At least
with serial hdf5 I don't have to worry about exactly what's going on
at the MPI layer which makes it difficult to debug.

I will be closely monitoring this thread in case you are able to
find a solution, as I am still interested in getting collective pHDF5
to work with many cores on Lustre.

Leigh


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200

Hi John,

Do you know which version of MPT you are running? NERSC received an
updated version a few months back that fixed some problems with kernel
I/O buffering for >4 threads on the XE6. You may want to double check
that you have that update. 150MB/s is pretty bad. How many OSTs do you
have and are you striping over all of them?

Mark

···


If you look at some of my recent posts on this list, you'll find that
I am having the same problem with collective I/O with Lustre, trying
to have 30,000 ranks write to one file using collective pHDF5 I/O
(with or without chunking, I still get bad performance).

In fact, I have given up pursuing this approach and am now trying the
core driver with serial HDF5, which lets you do buffered I/O.

Hopefully, you have enough memory to write the whole file to core in all use
cases you are interested in.

For my
problem, I am able to buffer over 50 writes to memory before data
needs to be flushed to disk.

None of the serial VFDs currently do anything 'special' in the way of
buffering I/O requests from HDF5 to the underlying filesystem. The stdio
VFD may offer more, as it relies on whatever buffering the stdio
implementation on top of your filesystem provides.

However, you are stuck with
1-file-per-core with this method, and each file will contain multiple
time levels - but I can happily deal with this if I/O is respectable.

There is no reason you have to write a file per MPI task this way.
Certainly, it's the simplest thing to do. But it's almost as simple to collect data
from different MPI tasks into common files.

I've attached a header file for a simple interface (pmpio.h) that allows
you to run on, say, 100,000 MPI tasks but write to, say, just 128 files or
any number you pick at run time. I've also attached an example of how the
simple pmpio interface is used to do it.
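
(A rough sketch of the baton-passing idea behind pmpio.h, not the attached interface itself; the group size, message tag and file-naming scheme are assumptions.)

  /* Assumes #include "hdf5.h", <mpi.h>, <stdio.h>; serial HDF5 build. */
  int rank, nranks, group_size = 8, token = 0;
  char fname[64];
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  int group = rank / group_size;
  int first = (rank % group_size == 0);            /* first rank in the group creates the file */
  int last  = (rank % group_size == group_size - 1) || (rank == nranks - 1);
  snprintf(fname, sizeof fname, "data_%04d.h5", group);

  if (!first)                                      /* wait for the baton from the previous rank */
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  hid_t file = first ? H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT)
                     : H5Fopen(fname, H5F_ACC_RDWR, H5P_DEFAULT);
  /* ... create a group named after this rank and write its piece here ... */
  H5Fclose(file);

  if (!last)                                       /* hand the baton to the next rank in the group */
    MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);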

I have not yet benchmarked performance with this buffered I/O approach
on kraken (100,000 core machine with Lustre) but I will soon. At least
with serial hdf5 I don't have to worry about exactly what's going on
at the MPI layer which makes it difficult to debug.

And you can easily do compression, among other things.

pmpio.h (22.2 KB)

pmpio_hdf5_test.c (7.88 KB)


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

Mark,

> If you look at some of my recent posts on this list, you'll find that
> I am having the same problem with collective I/O with Lustre, trying
> to have 30,000 ranks write to one file using collective pHDF5 I/O
> (with or without chunking, I still get bad performance).
>
> In fact, I have given up pursuing this approach and am now trying the
> core driver with serial HDF5, which lets you do buffered I/O.

Hopefully, you have enough memory to write the whole file to core in all use
cases you are interested in.

> For my
> problem, I am able to buffer over 50 writes to memory before data
> needs to be flushed to disk.

None of the serial VFDs currently do anything 'special' in the way of
buffering I/O requests from HDF5 to the underlying filesystem. The stdio
VFD may offer more, as it relies on whatever buffering the stdio
implementation on top of your filesystem provides.

Understand that I just discovered the ability to do buffered I/O with hdf5.
I wasn't aware of the core serial driver until Friday!

My type of problem is characterized by a very small memory footprint per
core for the simulation code, and frequent writes of the model state to
disk. By simply using the core driver I can reduce the frequency of hitting
the disk by a factor of 50-100, which is huge; at least with
kraken/Lustre, I have found that the less I/O you do, the better.

However, you are stuck with
1-file-per-core with this method, and each file will contain multiple
time levels - but I can happily deal with this if I/O is respectable.

There is no reason you have to write a file per MPI task this way.

Certainly, it's the simplest thing to do. But it's almost as simple to collect data
from different MPI tasks into common files.

I've attached a header file for a simple interface (pmpio.h) that allows
you to run on, say, 100,000 MPI tasks but write to, say, just 128 files or
any number you pick at run time. I've also attached an example of how the
simple pmpio interface is used to do it.

I am going to look carefully at your code. At first glance, it appears to
be a similar approach to what I have tried but in my case I created new MPI
communicators which spanned any number of cores (but it has to divide evenly
into the full problem, unlike with your approach). In my case, each
subcommunicator would use pHDF5 collective calls to concurrently write to
its own file, and I could choose the number of files. I still had lousy
performance with all my choices of number of files.

It is not entirely clear to me that you are doing true collective parallel
HDF5 (where I have had problems but have been led to believe it is a path to
happiness) as you do not call h5pset_dxpl_mpio and set the
H5FD_MPIO_COLLECTIVE flag. You also do not construct a property list and
pass it to h5dwrite, instructing each I/O core to write its own piece of a
hdf5 file using offset arrays, h5sselect_hyperslab calls etc., which is what
the examples I have found led me to do. It seems you are effectively doing
serial hdf5 in parallel, which is what I am leaning towards at this point.
Your approach is more elegant than mine but I am (a) stuck with fortran and
(b) not a programmer by training, although C is my preferred language for
I/O. Not sure if I could call your code from fortran easily without going
through contortions (again forgive me, I am a weather guy who pretends he is
a programmer).
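
(For contrast, a minimal sketch of the 'true collective' pattern described above, using H5Pset_dxpl_mpio with H5FD_MPIO_COLLECTIVE and a per-rank hyperslab selection; the dataset name, shape and per-rank count are assumptions.)

  /* Assumes #include "hdf5.h" and <mpi.h>, parallel HDF5 build; error checking omitted. */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
  hid_t file = H5Fopen("shared.h5", H5F_ACC_RDWR, fapl);
  hid_t dset = H5Dopen(file, "data", H5P_DEFAULT);       /* dataset name assumed */

  hsize_t count = 1024;                                  /* elements written by this rank (assumed) */
  hsize_t offset = (hsize_t)rank * count;
  double buf[1024];                                      /* this rank's piece of the data (fill before writing) */

  hid_t filespace = H5Dget_space(dset);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
  hid_t memspace = H5Screate_simple(1, &count, NULL);

  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);          /* every rank must reach this H5Dwrite */

  H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);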

I fully embraced parallel hdf5 because I thought it could give me all the
flexibility I needed to essentially tune the number of files I wrote, giving
me the option of anywhere from 1 to ncores. While I succeeded in attaining
that flexibility, I have had awful luck with performance with 30k cores on
kraken/Lustre. It is very possible there is a solution that I haven't found
or that I am doing something stupid but I have spent enough time on this and
tried enough things that I am ready to try something new, and the core
driver currently has me very interested. As does your approach.

I have not yet benchmarked performance with this buffered I/O approach
on kraken (100,000 core machine with Lustre) but I will soon. At least
with serial hdf5 I don't have to worry about exactly what's going on
at the MPI layer which makes it difficult to debug.

And you can easily do compression, among other things.

Indeed!


Mark

I'm away at the moment, so I've been scanning this thread, but will read it all more thoroughly and answer properly on my return in a few days. From memory we have MPT 5.2 (?).

I was trying to stripe over all OSTs (only 4 on the small machine I believe), match chunk sizes to stripe sizes and otherwise align everything in such a way that all is clean. Might try using the split driver to separate the metadata in case that is somehow interfering....

JB
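
(A minimal sketch of the split-driver idea mentioned above, shown with the default serial drivers underneath; the extensions and file names are assumptions.)

  /* Assumes #include "hdf5.h"; error checking omitted. */
  hid_t meta_fapl = H5Pcreate(H5P_FILE_ACCESS);   /* could be tuned separately, e.g. to a core-driver fapl */
  hid_t raw_fapl  = H5Pcreate(H5P_FILE_ACCESS);

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_split(fapl, "-m.h5", meta_fapl, "-r.h5", raw_fapl);

  hid_t file_id = H5Fcreate("test", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  /* produces test-m.h5 (metadata) and test-r.h5 (raw data) */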

Understand that I just discovered the ability to do buffered I/O with
hdf5. I wasn't aware of the core serial driver until Friday!

Yeah, there are a lot of interesting dark corners of the HDF5 library that
are useful to know about. The core driver is definitely one of them. It
has saved my behind a few times when we've been in a bind on
performance.

I am going to look carefully at your code. At first glance, it
appears to be a similar approach to what I have tried but in my case I
created new MPI communicators which spanned any number of cores (but
it has to divide evenly into the full problem, unlike with your
approach). In my case, each subcommunicator would use pHDF5 collective
calls to concurrently write to its own file, and I could choose the
number of files. I still had lousy performance with all my choices of
number of files.

It is not entirely clear to me that you are doing true collective
parallel HDF5 (where I have had problems but have been led to believe

That's right. There is NOTHING I/O-wise that is parallel. That code is
designed to work with SERIAL-compiled HDF5. The only parallel parts are
the file management calls that orchestrate parallel I/O to multiple files
concurrently. It is the 'Poor Man's' approach to parallel I/O. It is
described a bit in the pmpio.h header file and more here...

http://visitbugs.ornl.gov/projects/hpc-hdf5/wiki/Poor_Man’s_vs_Rich_Mans’_Parallel_IO

it is a path to happiness) as you do not call h5pset_dxpl_mpio and
set the H5FD_MPIO_COLLECTIVE flag. You also do not construct a
property list and pass it to h5dwrite, instructing each I/O core to
write its own piece of a hdf5 file using offset arrays,
h5sselect_hyperslab calls etc., which is what the examples I have
found led me to. It seems you are effectively doing serial hdf5 in
parallel, which is what I am leaning towards at this point. Your
approach is more elegant than mine but I am (a) stuck with fortran and
(b) not a programmer by training, although C is my preferred language
for I/O. Not sure if I could call your code from fortran easily
without going through contortions (again forgive me, I am a weather
guy who pretends he is a programmer).

I fully embraced parallel hdf5 because I thought it could give me all
the flexibility I needed to essentially tune

So, I find the all-collective-all-the-time API for parallel HDF5 to be
way too 'inflexible' to handle sophisticated I/O patterns where data
type, size, shape, and even existence vary substantially from
processor to processor. For bread-and-butter data-parallel apps where
essentially the same few data structures (distributed arrays) are
distributed across processors, it works OK. But none of the simulation
apps I support have that kind of (simple) I/O pattern or even
approximate it, especially for plot outputs.

···

On Tue, 2011-04-12 at 10:39 -0700, Leigh Orf wrote:

--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

>
>
> Understand that I just discovered the ability to do buffered I/O with
> hdf5. I wasn't aware of the core serial driver until Friday!

Yeah, there are a lot of interesting dark corners of the HDF5 library that
are useful to know about. The core driver is definitely one of them. It
has saved my behind a few times when we've been in a bind on
performance.

Indeed. Write operations go pretty fast when there is no actual disk access!
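
For anyone else who just discovered it, here is a minimal sketch of how I
turn the core driver on (the function name is just for illustration; only
H5Pset_fapl_core itself is the real API):

  #include "hdf5.h"

  /* Buffer the whole file in memory; because backing_store is TRUE, the
     image is flushed to disk in one large write at H5Fclose() time. */
  hid_t open_core_file(const char *name)
  {
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_core(fapl, (size_t)64 * 1024 * 1024, /* grow in 64 MiB steps */
                       1 /* backing_store = TRUE */);
      hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
      H5Pclose(fapl);
      return file;
  }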

> I am going to look carefully at your code. At first glance, it
> appears to be a similar approach to what I have tried but in my case I
> created new MPI communicators which spanned any number of cores (but
> it has to divide evenly into the full problem, unlike with your
> approach). In my case, each subcommunicator would use pHDF5 collective
> calls to concurrently write to its own file, and I could choose the
> number of files. I still had lousy performance with all my choices of
> number of files.
>
> It is not entirely clear to me that you are doing true collective
> parallel HDF5 (where I have had problems but have been led to believe

That's right. There is NOTHING I/O-wise that is parallel. That code is
designed to work with SERIAL compiled HDF5. The only parallel parts are
the file management to orchestrate parallel I/O to multiple files
concurrently. It is the 'Poor Man's' approach to parallel I/O. It is
described a bit in the pmpio.h header file and in more detail here...

http://visitbugs.ornl.gov/projects/hpc-hdf5/wiki/Poor_Man’s_vs_Rich_Mans’_Parallel_IO

Yes, I am familiar with the concept and have found that link before. It seems
we're at a point with HPC where there are still unanswered questions about
the fastest and most practical way to get data from CPU space to disk space
in a way that also makes post-processing sufficiently simple.

I see more clearly now what you're doing. You store each core's chunk of the
data as an hdf5 group in a file, and only one core accesses a file at a time.
So each core gets its baton, opens, writes, closes, and hands off. I am
pretty sure there is very little overhead for opens and closes when a single
core is doing it (as opposed to thousands of concurrent opens, for instance)
but perhaps it will be non-negligible when multiplied by 100,000?
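
Just so I'm sure I have the pattern right, here is a rough sketch of what I
imagine each rank in a file group does -- plain MPI point-to-point messages
wrapped around serial HDF5 calls. This is only my guess at the idea, not
your actual pmpio.h API, and 'comm' is assumed to be a sub-communicator
holding just the ranks that share one file:

  #include <stdio.h>
  #include <mpi.h>
  #include "hdf5.h"

  void write_my_group(MPI_Comm comm, const char *fname)
  {
      int rank, nprocs, token = 0;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);

      /* wait for the baton from the previous rank in this file group */
      if (rank > 0)
          MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, comm, MPI_STATUS_IGNORE);

      /* first rank creates the file; the rest open it and append a group */
      hid_t file = (rank == 0)
          ? H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT)
          : H5Fopen(fname, H5F_ACC_RDWR, H5P_DEFAULT);

      char gname[32];
      snprintf(gname, sizeof gname, "/rank_%06d", rank);
      hid_t grp = H5Gcreate2(file, gname, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
      /* ... H5Dcreate2/H5Dwrite this rank's arrays under grp ... */
      H5Gclose(grp);
      H5Fclose(file);

      /* hand the baton to the next rank, if any */
      if (rank < nprocs - 1)
          MPI_Send(&token, 1, MPI_INT, rank + 1, 0, comm);
  }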

Your files contain multiple groups representing spatial things; my latest
attempt is multiple groups representing spatial information at a given
location. I see no reason why you couldn't have both multiple times (one
group hierarchy) and core contributions (another one) in a single file, all
designated by hdf5 groups.
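
Something like this hypothetical layout, for instance, with one level of
groups for output times and another for each core's contribution:

  /time_000100/rank_000000/u
  /time_000100/rank_000001/u
  /time_000200/rank_000000/u
  /time_000200/rank_000001/u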

Your approach is one of the few remaining I have yet to try. I'm not sure it
will give better I/O performance than one file per core, and that is my main
concern right now. I am kind of used to using actual unix directories in the
way that you are using hdf5 groups, to some degree, to reduce the number of
files in a given directory.

I notice you are a VisIt developer as well (I think we've corresponded
before). Is there a VisIt plugin using hdf5 for your PMMPI format? This
could be an additional motivator for me to try your approach, as I will be
using VisIt for visualization on huge runs soon. I've twice developed VisIt
plugins (hdf4, then 5) but only for file-per-core and my C++ skills are
somewhat lacking.

Leigh

> it is a path to happiness) as you do not call h5pset_dxpl_mpio and
> set the H5FD_MPIO_COLLECTIVE flag. You also do not construct a
> property list and pass it to h5dwrite, instructing each I/O core to
> write its own piece of a hdf5 file using offset arrays,
> h5sselect_hyperslab calls etc., which is what the examples I have
> found led me to. It seems you are effectively doing serial hdf5 in
> parallel, which is what I am leaning towards at this point. Your
> approach is more elegant than mine but I am (a) stuck with fortran and
> (b) not a programmer by training, although C is my preferred language
> for I/O. Not sure if I could call your code from fortran easily
> without going through contortions (again forgive me, I am a weather
> guy who pretends he is a programmer).
>
> I fully embraced parallel hdf5 because I thought it could give me all
> the flexibility I needed to essentially tune

So, I find the all-collective-all-the-time API for parallel HDF5 to be
way too 'inflexible' to handle sophisticated I/O patterns where data
type, size, shape, and even existence vary substantially from
processor to processor. For bread-and-butter data-parallel apps where
essentially the same few data structures (distributed arrays) are
distributed across processors, it works OK. But none of the simulation
apps I support have that kind of (simple) I/O pattern or even
approximate it, especially for plot outputs.

I think pHDF5 is very neat and useful in some situations. My own experience
is that with a modest number of cores (around 1k) performance is adequate,
but for whatever reason bumping it up another order of magnitude leads to
badness.

···

On Tue, Apr 12, 2011 at 11:57 AM, Mark Miller <miller86@llnl.gov> wrote:

On Tue, 2011-04-12 at 10:39 -0700, Leigh Orf wrote:

--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh and Mark,

I'm just getting caught up on this thread now and wanted to add a few comments:

1) The idea of passing off a baton to serialize access to a file is
something we tried in H5Part when we ran into a problem around 16K
concurrency where Lustre would actually *time out* while trying to
service that many independent writes to a shared file. This worked
pretty well. We used a token-passing mechanism to write in batches of
2000, but the PMPIO idea takes this even further, executing the writes
in batches whose size equals the number of files.

2) On a Lustre file system with N OSTs, you could conceivably get the
full bandwidth of the file system by writing into N separate files.
From Lustre's perspective, this looks like file-per-proc access, but
in the end you have many fewer files, which is of course much better
from a data management and post-analysis standpoint.

3) I also agree that collective buffering is mostly useful for
rectilinear grids, although I'll point out that it is also good for
variable-length 1D arrays, like the flattened 1D AMR meshes from the
CHOMBO benchmark that we tested in the HDF5/Lustre paper. These
actually performed pretty well in collective mode. The trade-off with
collective mode on the Cray is that it is nearly impossible to get the
full bandwidth of the file system because of the synchronization of
the aggregation phase (MPI_Gather). One optimization is to do
something more sophisticated, where the aggregators are broken into
two subsets that overlap gathering and writing. Of course, however it
is implemented, collective buffering does impose some degree of
synchronization that you can avoid with independent access or PMPIO.
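
For concreteness, here is a bare-bones sketch of a collective write with a
couple of MPI-IO hints passed through the file access property list. This
is not code from the paper or the benchmarks; the hint names are the usual
ROMIO ones, and the function names, variable names, and values are
placeholders, not recommendations:

  #include <mpi.h>
  #include "hdf5.h"

  void collective_write(MPI_Comm comm, const char *fname, const double *local,
                        hsize_t nlocal, hsize_t offset, hsize_t ntotal)
  {
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "romio_cb_write", "enable");  /* force collective buffering */
      MPI_Info_set(info, "striping_factor", "64");     /* Lustre stripe count hint */

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, info);
      hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      hid_t filespace = H5Screate_simple(1, &ntotal, NULL);
      hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

      /* each rank selects its own slab of the shared dataset */
      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &nlocal, NULL);
      hid_t memspace = H5Screate_simple(1, &nlocal, NULL);

      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);    /* all ranks call H5Dwrite */
      H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local);

      H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
      H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
      MPI_Info_free(&info);
  }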

Mark


I don't have a lot of data on this but on our BG/P-Lustre installation,
we've gone up to 128,000 files writing a file-per-processor without
incident. I think we've also run 64,000 cpus writing to 256 files using
baton passing a la pmpio.h without incident.

I think Leigh may have inquired about VisIt using pmpio.h in any of its
plugins; no, it doesn't. But VisIt is more or less only reading the
files and most (not all but most) plugins in VisIt are designed around
the poor man's parallel I/O model. I don't think we have any performance
data for VisIt for rich man's or poor man's parallel I/O.

Mark

···

On Tue, 2011-04-12 at 19:33, Mark Howison wrote:

Hi Leigh and Mark,

I'm just getting caught up on this thread now and wanted to add a few comments:

1) The idea of passing off a baton to serialize access to a file is
something we tried in H5Part when we ran into a problem around 16K
concurrency where Lustre would actually *time out*

--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511