HDF5 causes Fatal error in MPI_Gather

Dear all,

Recently, I've run into a problem with my parallel HDF5 writes. My
program works fine on 8k cores, but when I run it on 16k cores it
crashes when writing a data file through h5dwrite_f(...).
All file writes go through a single function, so the same code path is
used every time, yet for some reason I don't understand it writes one
file without problems while the second one aborts with the following
error message:

Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in
MPI_Gather: Invalid buffer pointer, error stack:
MPI_Gather(758): MPI_Gather(sbuf=0xa356f400, scount=16000, MPI_BYTE,
rbuf=(nil), rcount=16000, MPI_BYTE, root=0, comm=0x84000003) failed
MPI_Gather(675): Null buffer pointer

I've been looking through the HDF5 source code, and it seems to call
MPI_Gather in only one place: the function H5D_obtain_mpio_mode. In
that function, HDF5 tries to allocate a receive buffer using

recv_io_mode_info = (uint8_t *)H5MM_malloc(total_chunks * mpi_size);

This call returns the null pointer seen in rbuf=(nil) instead of a
valid buffer, so to me it seems it's HDF5 causing the problem and not
MPI.
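
For a rough sense of scale (my own back-of-the-envelope estimate,
reading total_chunks = 16000 from the scount above and assuming
mpi_size is about 16384 at 16k cores):

  total_chunks * mpi_size = 16000 * 16384 bytes, roughly 260 MB

which rank 0 would have to obtain from a single malloc.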

This problem occurs in both collective and independent IO mode.

Do you have any idea what might be causing this problem, or how to
resolve it? I'm not sure what other information you might need, but
I'll do my best to supply it.

Kind regards,

Stefan Frijters

Hi Stefan,

  This is a scalability problem we are aware of and are working to address, but in the meantime, can you increase the size of your chunks for your dataset(s)? (That will reduce the number of chunks, and with it the size of the buffer being allocated.)

  Quincey

Hi Quincey,

Thanks for the quick response. Currently each core handles its
datasets with a chunk size equal to the size of the local data (the
dims parameter in h5pset_chunk_f is equal to the dims parameter in
h5dwrite_f), because the local arrays are not that large anyway (on
the order of 20x20x20 reals). So if I understand things correctly, I'm
already using the maximum chunk size.

Do you have any idea why it doesn't crash the first time I write,
though? It's a different array, but of the same size and datatype as
the second. As far as I can see, I'm closing all used handles at the
end of my function, at least.

Kind regards,

Stefan Frijters

Hi Stefan,

  No, you don't have to make the chunks the same size as the local data, since the collective I/O should stitch them back together anyway. Try doubling the dimensions of your chunks.
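
  For example, something along these lines (just a sketch; the variable names are illustrative, and the 40x40x40 chunk assumes your 20x20x20 local blocks):

  INTEGER(HSIZE_T), DIMENSION(3) :: chunk_dims
  INTEGER(HID_T) :: plist_id
  INTEGER :: err

  chunk_dims = (/ 40, 40, 40 /)   ! 2x the 20x20x20 per-process block
  CALL h5pcreate_f(H5P_DATASET_CREATE_F, plist_id, err)
  CALL h5pset_chunk_f(plist_id, 3, chunk_dims, err)
  ! pass plist_id to h5dcreate_f; fewer, larger chunks shrink the
  ! total_chunks * mpi_size buffer allocated in H5D_obtain_mpio_mode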

  As for why it doesn't crash on the first write: hmm, I'm not certain...

    Quincey

Hi Quincey,

I can double one dimension of my chunk size (at the cost of really slow IO), but if I double them all I get errors like these:

HDF5-DIAG: Error detected in HDF5 (1.8.4) MPI-process 4:
  #000: H5D.c line 171 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 428 in H5D_create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1639 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1862 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: H5Gtraverse.c line 877 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 703 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: H5L.c line 1685 in H5L_link_cb(): unable to create object
    major: Object header
    minor: Unable to initialize object
  #007: H5O.c line 2677 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #008: H5Doh.c line 296 in H5O_dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: H5Dint.c line 1030 in H5D_create(): unable to construct layout information
    major: Dataset
    minor: Unable to initialize object
  #010: H5Dchunk.c line 420 in H5D_chunk_construct(): chunk size must be <= maximum dimension size for fixed-sized dimensions
    major: Dataset
    minor: Unable to initialize object

I am currently doing test runs on my local machine on 16 cores, because the large machine I run jobs on is unavailable at the moment and has a queueing system rather unsuited to quick test runs, so maybe this is an artefact of running on such a small number of cores? Although I *think* I tried this before and got the same type of error on several thousand cores as well.

Kind regards,

Stefan Frijters

Hi Stefan,

  You seem to have increased the chunk dimension to be larger than the dataset dimension. What is the chunk size and dataspace size you are using?
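
  If it helps, you can query the dataspace extents from the code and compare them against your chunk dimensions (a sketch; filespace is assumed to be your dataset's dataspace handle):

  INTEGER(HSIZE_T), DIMENSION(3) :: dims, maxdims
  INTEGER :: err
  CALL h5sget_simple_extent_dims_f(filespace, dims, maxdims, err)
  ! every chunk dimension must satisfy chunk_dims(i) <= dims(i) for
  ! fixed-size dimensions, or H5Dcreate2 fails as above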

  Quincey

Hi Quincey,

I managed to increase the chunk size - I had overlooked the fact that my blocks of data weren't cubes in my test case. However, it seems that performance can suffer a lot for certain chunk sizes (in my test case):

The size of my entire data array is 40 x 40 x 160. My MPI Cartesian grid is 4 x 4 x 1, so every core has a 10 x 10 x 160 subset. Originally I had the chunk size set to 10 x 10 x 160 as well (which explains why I couldn't double the 3rd component), and writes take less than a second. However, if I set the chunk size to 20 x 20 x 160 it's really slow (7 seconds), while 40 x 40 x 160 once again takes less than a second. I'd been reading up on chunking before, but I think I'm still ignorant of some of the subtleties. Am I violating a rule here that makes HDF fall back to independent IO?
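
For reference, counting the chunks in each case (my own tally):

  chunk 10 x 10 x 160  ->  16 chunks, one per core
  chunk 20 x 20 x 160  ->   4 chunks, each written by 4 cores
  chunk 40 x 40 x 160  ->   1 chunk, written by all 16 cores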

Is there some rule of thumb, or set of guidelines to get good performance out of it? I read your "Parallel HDF5 Hints" document and some others, but it hasn't helped me enough, apparently :-D. The time spent on IO in my application is getting to be somewhat of a hot item.

Thanks again for the continued support,

Stefan Frijters

Hi Stefan,

  Hmm, are your datatypes the same in memory and the file? If they aren't, HDF5 will break collective I/O down into independent I/O.

  As for a rule of thumb: the "Parallel HDF5 Hints" document you already read is where I'd point you.

    Quincey

Hi Quincey,

I'm using H5T_NATIVE_DOUBLE to write an array of real*8 and H5T_NATIVE_REAL for real*4. Is that okay?

Kind regards,

Stefan Frijters

Hi Stefan,

  That will work, but it will cause the I/O to be independent, rather than collective (if you request collective). Try writing to the file in the same datatype you use for your memory datatype and see if the performance is better.
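
  One way to check from the code (a sketch; dset_id is assumed to be your open dataset handle):

  INTEGER(HID_T) :: ftype
  LOGICAL :: same
  INTEGER :: err
  CALL h5dget_type_f(dset_id, ftype, err)
  CALL h5tequal_f(ftype, H5T_NATIVE_DOUBLE, same, err)
  ! if same is .FALSE., a datatype conversion is needed and the write
  ! falls back to independent I/O
  CALL h5tclose_f(ftype, err)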

  Quincey

Hi Quincey,

I'm not sure I understand what I'm doing wrong. Focussing on just one of the cases:

- I have a 3D array of real*8, which is equivalent to a C double, right?
- According to the manual, this should correspond to H5T_NATIVE_DOUBLE
- I then use that datatype in:

CALL h5dcreate_f(file_id, dsetname, H5T_NATIVE_DOUBLE, filespace, dset_id, err, plist_id)

and

CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, scalar, dims, err, file_space_id = filespace, mem_space_id = memspace, xfer_prp = plist_id)

What am I missing?

Also, a quick related performance question - when using a hyperslab to make a data selection, is there a performance difference between

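! form 1: selects a single nx x ny x nz block starting at offset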
count = (/1, 1, 1/)
stride = (/1, 1, 1/)
block = (/nx, ny, nz/)
CALL h5sselect_hyperslab_f (filespace, H5S_SELECT_SET_F, offset, count, err, stride, block)

and

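! form 2: selects nx*ny*nz unit blocks (stride and block default to 1);
! this picks the same elements as form 1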
count = (/nx, ny, nz/)
CALL h5sselect_hyperslab_f (filespace, H5S_SELECT_SET_F, offset, count, err)

I found the manual to be rather unclear on that point.

Kind regards,

Stefan Frijters

···

________________________________________
From: hdf-forum-bounces@hdfgroup.org [hdf-forum-bounces@hdfgroup.org] On Behalf Of Quincey Koziol [koziol@hdfgroup.org]
Sent: 26 March 2010 19:00
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] HDF5 causes Fatal error in MPI_Gather

Hi Stefan,

On Mar 26, 2010, at 9:25 AM, Frijters, S.C.J. wrote:

Hi Quincey,

I'm using H5T_NATIVE_DOUBLE to write an array of real*8 and H5T_NATIVE_REAL for real*4. Is that okay?

        That will work, but it will cause the I/O to be independent, rather than collective (if you request collective). Try writing to the file in the same datatype you use for your memory datatype and see if the performance is better.

        Quincey

Kind regards,

Stefan Frijters
________________________________________
From: hdf-forum-bounces@hdfgroup.org [hdf-forum-bounces@hdfgroup.org] On Behalf Of Quincey Koziol [koziol@hdfgroup.org]
Sent: 25 March 2010 22:40
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] HDF5 causes Fatal error in MPI_Gather

Hi Stefan,

On Mar 25, 2010, at 5:08 AM, Frijters, S.C.J. wrote:

Hi Quincey,

I managed to increase the chunk size - I overlooked the fact that my blocks of data weren't cubes in my testcase. However, it seems that performance can suffer a lot for certain chunk sizes (for my test case):

The size of my entire data array is 40 x 40 x 160. My MPI cartesian grid is 4 x 4 x 1, so every core has a 10 x 10 x 160 subset. Originally I had the chunk size set to 10 x 10 x 160 as well (which explains why I couldn't double the 3rd component), and writes take less than a second. However, if I set the chunk size to 20 x 20 x 160, it's really slow (7 seconds), while 40 x 40 x 160 once again takes less than a second. I'd been reading up on the whole chunking thing before, but I think I'm still ignorant of some of the subtleties. Am I violating a rule here so that HDF goes back to independent IO?

       Hmm, are your datatypes the same in memory and the file? If they aren't, HDF5 will break collective I/O down into independent I/O.

Is there some rule of thumb, or set of guidelines to get good performance out of it? I read your "Parallel HDF5 Hints" document and some others, but it hasn't helped me enough, apparently :-D. The time spent on IO in my application is getting to be somewhat of a hot item.

       That would be where I'd point you.

               Quincey

Thanks again for the continued support,

Stefan Frijters
________________________________________
From: hdf-forum-bounces@hdfgroup.org [hdf-forum-bounces@hdfgroup.org] On Behalf Of Quincey Koziol [koziol@hdfgroup.org]
Sent: 24 March 2010 17:12
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] HDF5 causes Fatal error in MPI_Gather

Hi Stefan,

On Mar 24, 2010, at 11:06 AM, Frijters, S.C.J. wrote:

Hi Quincey,

I can double one dimension on my chunk size (at the cost of really slow IO), but if I double them all I get errors like these:

HDF5-DIAG: Error detected in HDF5 (1.8.4) MPI-process 4:
#000: H5D.c line 171 in H5Dcreate2(): unable to create dataset
major: Dataset
minor: Unable to initialize object
#001: H5Dint.c line 428 in H5D_create_named(): unable to create and link to dataset
major: Dataset
minor: Unable to initialize object
#002: H5L.c line 1639 in H5L_link_object(): unable to create new link to object
major: Links
minor: Unable to initialize object
#003: H5L.c line 1862 in H5L_create_real(): can't insert link
major: Symbol table
minor: Unable to insert object
#004: H5Gtraverse.c line 877 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#005: H5Gtraverse.c line 703 in H5G_traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#006: H5L.c line 1685 in H5L_link_cb(): unable to create object
major: Object header
minor: Unable to initialize object
#007: H5O.c line 2677 in H5O_obj_create(): unable to open object
major: Object header
minor: Can't open object
#008: H5Doh.c line 296 in H5O_dset_create(): unable to create dataset
major: Dataset
minor: Unable to initialize object
#009: H5Dint.c line 1030 in H5D_create(): unable to construct layout information
major: Dataset
minor: Unable to initialize object
#010: H5Dchunk.c line 420 in H5D_chunk_construct(): chunk size must be <= maximum dimension size for fixed-sized dimensions
major: Dataset
minor: Unable to initialize object

I am currently doing test runs on my local machine on 16 cores, because the large machine I run jobs on is unavailable at the moment and has a queueing system rather unsuited to quick test runs, so maybe this is an artefact of running on such a small number of cores? Although I *think* I tried this before and got the same type of error on several thousand cores as well.

      You seem to have increased the chunk dimension to be larger than the dataset dimension. What is the chunk size and dataspace size you are using?

      Quincey
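
(The #010 entry in the stack above is the key one; the constraint it enforces can be sketched as follows, with illustrative dimensions and names:)

INTEGER(HSIZE_T), DIMENSION(3) :: dset_dims = (/40, 40, 160/)
INTEGER(HSIZE_T), DIMENSION(3) :: chunk_dims = (/80, 80, 160/)

CALL h5screate_simple_f(3, dset_dims, filespace, err)
CALL h5pcreate_f(H5P_DATASET_CREATE_F, plist_id, err)
! For fixed-size (non-extendible) dimensions, each chunk dimension must
! satisfy chunk_dims(i) <= dset_dims(i). With the values above, the
! subsequent h5dcreate_f fails exactly as in the H5Dchunk.c error;
! chunk_dims = (/40, 40, 160/) would be accepted.
CALL h5pset_chunk_f(plist_id, 3, chunk_dims, err)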

Kind regards,

Stefan Frijters

________________________________________
From: hdf-forum-bounces@hdfgroup.org [hdf-forum-bounces@hdfgroup.org] On Behalf Of Quincey Koziol [koziol@hdfgroup.org]
Sent: 24 March 2010 16:28
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] HDF5 causes Fatal error in MPI_Gather

Hi Stefan,

On Mar 24, 2010, at 10:10 AM, Stefan Frijters wrote:

Hi Quincey,

Thanks for the quick response. Currently, each core is handling its
datasets with a chunk size equal to the size of the local data (the dims
parameter in h5pset_chunk_f is equal to the dims parameter in
h5dwrite_f) because the local arrays are not that large anyway (in the
order of 20x20x20 reals), so if I understand things correctly I'm
already using maximum chunk size.

     No, you don't have to make them the same size, since the collective I/O should stitch them back together anyway. Try doubling the dimensions on your chunks.
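
(A sketch of this suggestion, assuming the 20 x 20 x 20 local blocks mentioned above; the variable names are illustrative:)

INTEGER(HID_T) :: plist_id
INTEGER(HSIZE_T), DIMENSION(3) :: chunk_dims
INTEGER :: err

! Doubling each chunk dimension relative to the per-rank block cuts the
! total number of chunks by roughly a factor of 8, which shrinks the
! total_chunks * mpi_size buffer allocated in H5D_obtain_mpio_mode.
chunk_dims = (/40, 40, 40/)
CALL h5pcreate_f(H5P_DATASET_CREATE_F, plist_id, err)
CALL h5pset_chunk_f(plist_id, 3, chunk_dims, err)
! plist_id then goes to h5dcreate_f as the dataset creation property list.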

Do you have an idea why it doesn't crash the first time I try to do it
though? It's a different array, but of the same size and datatype as the
second. As far as I can see I'm closing all used handles at the end of
my function at least.

     Hmm, I'm not certain...

             Quincey

Kind regards,

Stefan Frijters


Hi Stefan,

···

On Mar 29, 2010, at 8:53 AM, Frijters, S.C.J. wrote:

Hi Quincey,

I'm not sure I understand what I'm doing wrong. Focussing on just one of the cases:

- I have a 3D array of real*8, which is equivalent to a C double, right?
- According to the manual, this should correspond to H5T_NATIVE_DOUBLE
- I then use that datatype in:

CALL h5dcreate_f(file_id, dsetname, H5T_NATIVE_DOUBLE , filespace, dset_id, err, plist_id)

and

CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, scalar, dims, err, file_space_id = filespace, mem_space_id = memspace, xfer_prp = plist_id)

What am I missing?

  Ah, sorry, from your comments below, I thought your memory datatype was H5T_NATIVE_FLOAT and your file datatype was H5T_NATIVE_DOUBLE. Everything above looks OK to me.
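
(Putting the pieces together, the pattern might look roughly as below. Error checking is omitted, the dataset name "scalar" and the identifiers are illustrative, and note that the creation property list passed to h5dcreate_f and the transfer property list passed to h5dwrite_f are two distinct objects:)

USE HDF5
INTEGER(HID_T) :: file_id, filespace, memspace, dset_id, dcpl_id, xfer_id
INTEGER(HSIZE_T), DIMENSION(3) :: dims = (/10, 10, 160/) ! local block
REAL(KIND=8), DIMENSION(10,10,160) :: scalar             ! real*8 data
INTEGER :: err

CALL h5screate_simple_f(3, dims, memspace, err)

CALL h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, err)     ! creation plist
CALL h5pset_chunk_f(dcpl_id, 3, dims, err)

CALL h5pcreate_f(H5P_DATASET_XFER_F, xfer_id, err)       ! transfer plist
CALL h5pset_dxpl_mpio_f(xfer_id, H5FD_MPIO_COLLECTIVE_F, err)

! file_id comes from h5fcreate_f with an MPI-IO file access property
! list; filespace is the full 40 x 40 x 160 dataspace with this rank's
! hyperslab selected (see the h5sselect_hyperslab_f calls below).
CALL h5dcreate_f(file_id, "scalar", H5T_NATIVE_DOUBLE, filespace, &
     dset_id, err, dcpl_id)
CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, scalar, dims, err, &
     file_space_id = filespace, mem_space_id = memspace, xfer_prp = xfer_id)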

Also, a quick related performance question - when using a hyperslab to make a data selection, is there a performance difference between

count = (/1, 1, 1/)
stride = (/1, 1, 1/)
block = (/nx, ny, nz/)
CALL h5sselect_hyperslab_f (filespace, H5S_SELECT_SET_F, offset, count, err, stride, block)

and

count = (/nx, ny, nz/)
CALL h5sselect_hyperslab_f (filespace, H5S_SELECT_SET_F, offset, count, err)

I found the manual to be rather unclear on that point.

  No, they are equivalent.

    Quincey
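
(They are equivalent because omitted stride and block arguments default to 1 in every dimension: count = (/nx, ny, nz/) then selects nx*ny*nz single elements, which is the same element set as one nx x ny x nz block. A quick sanity check, sketched with illustrative names:)

INTEGER(HSSIZE_T) :: npoints
! Both forms of the selection should report nx*ny*nz selected elements:
CALL h5sget_select_npoints_f(filespace, npoints, err)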


Kind regards,

Stefan Frijters

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org