MPI-IO 32-bit offset bug?

Hi all,

Chris Calderon, a user at NERSC, is receiving the errors at the bottom
of the email during the following scenario:

- a subset of 40 MPI tasks are each opening their own HDF5 file with
MPI-IO in collective mode with the MPI_COMM_SELF communicator
- each task writes about 20,000 small datasets totaling 10GB per file

It's worth noting that we don't intend to use MPI-IO in independent mode, so
we don't really need to fix this error to make the code operational,
but we'd like to understand why the error occurred. At the lowest
level, the error is "can't convert from size to size_i" and looking up
the relevant code, I found:

size_i = (int)size;
if((hsize_t)size_i != size)
    HGOTO_ERROR...

So my guess is that the offsets at some point become large enough to
cause an int32 overflow. (Each file is about 10GB total, so the
overflow probably occurs around the 8GB mark since 2 billion elements
times 4 bytes per float = 8GB.) Is this a known bug in the MPI-IO VFD?
This suggests that the bug will also affect independent mode, but
another workaround is for us to use the MPI-POSIX VFD, which should
bypass this problem.
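
For reference, a minimal standalone sketch of the same cast-and-check pattern: any value above INT_MAX (about 2.1 billion) fails it, whether it is a byte count or an element count. The 3 GiB figure below is just an example, not the actual transfer size.

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t size = (uint64_t)3 * 1024 * 1024 * 1024;  /* hypothetical 3 GiB request */
    int size_i = (int)size;                            /* same cast as in H5FD_mpio_write() */

    if ((uint64_t)size_i != size)
        printf("%llu does not round-trip through a 32-bit int (INT_MAX = %d)\n",
               (unsigned long long)size, INT_MAX);
    return 0;
}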

I looked into using the CORE VFD per Mark Miller's suggestion in an earlier
thread, but the problem is that the 10GB of data will not fit into memory,
and I didn't see any API calls for requesting a "dump to file" before the
file close.
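
As a rough sketch (file name and increment are illustrative), this is how the CORE VFD is typically configured with a backing store; the in-memory image is only written to disk at close, so the full 10GB would have to fit in memory until then:

#include <hdf5.h>

int main(void)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* grow the in-memory image in 64 MB increments; backing_store = 1
       means the image is written to disk when the file is closed */
    H5Pset_fapl_core(fapl, (size_t)64 * 1024 * 1024, 1);

    hid_t file = H5Fcreate("core_backed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create and write datasets here; everything is held in memory ... */
    H5Fclose(file);   /* the in-memory image is dumped to disk only here */
    H5Pclose(fapl);
    return 0;
}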

Thanks
Mark


----

HDF5-DIAG: Error detected in HDF5 (1.8.4) MPI-process 16:
#000: H5Dio.c line 266 in H5Dwrite(): can't write data
  major: Dataset
  minor: Write failed
#001: H5Dio.c line 578 in H5D_write(): can't write data
  major: Dataset
  minor: Write failed
#002: H5Dmpio.c line 552 in H5D_contig_collective_write(): couldn't
finish shared collective MPI-IO
  major: Low-level I/O
  minor: Write failed
#003: H5Dmpio.c line 1586 in H5D_inter_collective_io(): couldn't
finish collective MPI-IO
  major: Low-level I/O
  minor: Can't get value
#004: H5Dmpio.c line 1632 in H5D_final_collective_io(): optimized write failed
  major: Dataset
  minor: Write failed
#005: H5Dmpio.c line 334 in H5D_mpio_select_write(): can't finish
collective parallel write
  major: Low-level I/O
  minor: Write failed
#006: H5Fio.c line 167 in H5F_block_write(): file write failed
  major: Low-level I/O
  minor: Write failed
#007: H5FDint.c line 185 in H5FD_write(): driver write request failed
  major: Virtual File Layer
  minor: Write failed
#008: H5FDmpio.c line 1726 in H5FD_mpio_write(): can't convert from
size to size_i
  major: Internal error (too specific to document in detail)
  minor: Out of range

Hi Mark,
  Sorry for the delay in replying...

  There is a limitation in the MPI standard which specifies that an 'int' type must be used for certain file operations, but we may be able to relax that for the MPI-POSIX driver. Could you give me the line number for the code snippet above? I'll take a look and see if it really needs to be there.

  Thanks,
    Quincey
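
For context, the MPI-2 prototype shows the asymmetry Quincey describes: the file offset is a 64-bit MPI_Offset, while the element count is a plain C int.

int MPI_File_write_at_all(MPI_File fh, MPI_Offset offset, void *buf,
                          int count, MPI_Datatype datatype, MPI_Status *status);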


Hi Quincey, it is from:

#008: H5FDmpio.c line 1726 in H5FD_mpio_write(): can't convert from
size to size_i

Mark


Yup... 2 billion elements will overflow the "count" parameter to
MPI_File_write_at_all. So you could either partition this dataset
across more processors, or maybe create a "100 ints" type (?) and write
100 million of those?

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
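
A rough sketch of Rob's suggestion (the names, the use of a float datatype, and the multiple-of-100 assumption are illustrative, not from the thread): group the elements into a derived datatype so the int count argument stays far below INT_MAX.

#include <mpi.h>
#include <stddef.h>

/* Write 'nelems' floats as chunks of 100. Assumes nelems is a multiple of
 * 100; a real version would handle the remainder with a second write. */
static int write_floats_chunked(MPI_File fh, MPI_Offset offset,
                                const float *buf, size_t nelems)
{
    MPI_Datatype hundred_floats;
    MPI_Type_contiguous(100, MPI_FLOAT, &hundred_floats);
    MPI_Type_commit(&hundred_floats);

    int count = (int)(nelems / 100);   /* e.g. 2.5e9 floats -> 25 million chunks */
    int rc = MPI_File_write_at_all(fh, offset, (void *)buf, count,
                                   hundred_floats, MPI_STATUS_IGNORE);

    MPI_Type_free(&hundred_floats);
    return rc;
}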

Hi Mark,

Hi Quincey, it is from:

#008: H5FDmpio.c line 1726 in H5FD_mpio_write(): can't convert from
size to size_i

  Ah, sorry! It is actually in the MPI-IO VFD (and was in your original message! :-) Hmm, that 'size' parameter is actually a length variable, not an offset variable. Are you sure all the datasets are small?

  Quincey


Hi Quincey,

I have the source for the application from Chris and will do some more
debugging to find out what is really going on...

Mark
