collective ops and Lustre

Hi folks,

we are running into some issues trying to get collective I/O
to work on our Lustre filesystem.
Most of the time only one of the MPI processes gets the data correctly.
Sometimes we get:

r1i0n0:heistand% mpiexec -np 1 ./a.out -f lustre:test_this -c -v
Parallel test files are:
   lustre:test_this/ParaEg0.h5
   lustre:test_this/ParaEg1.h5
--------------------------------
Proc 0: *** testing PHDF5 dataset using split communicators...
--------------------------------
Independent write test on file lustre:test_this/ParaEg0.h5 lustre:test_this/ParaEg1.h5
**filenoexist test_this/ParaEg0.h5**filenoexist test_this/ParaEg0.h5HDF5-DIAG: Error detected in HDF5 (1.8.0) MPI-process 0:
  #000: H5F.c line 1466 in H5Fcreate(): unable to create file
    major: File accessability
    minor: Unable to open file
  #001: H5F.c line 1205 in H5F_open(): unable to open file
    major: File accessability
    minor: Unable to open file
  #002: H5FD.c line 1086 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDmpio.c line 998 in H5FD_mpio_open(): MPI_File_open failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #004: H5FDmpio.c line 998 in H5FD_mpio_open(): **filenoexist 0
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
a.out: ph5example.c:914: test_split_comm_access: Assertion `fid != -1' failed.
mpispawn.c:303 Unexpected exit status

This is HDF5 1.8.0, MVAPICH for MPT, and the Intel compiler, if that matters/helps any.
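
For what it's worth, the call that fails above is H5Fcreate() with the MPI-IO file driver, and as far as I can tell the name (the "lustre:" prefix included) is handed down to MPI_File_open(), so that prefix is interpreted by the MPI-IO (ROMIO/ADIO) layer rather than by HDF5 itself. A minimal sketch of the same setup, assuming a parallel (MPI-enabled) HDF5 build and MPI_COMM_WORLD; the path is just an example:

/* Sketch: minimal parallel HDF5 file create over MPI-IO.
 * Assumes HDF5 was built with --enable-parallel; no MPI_Info hints. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list selecting the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* The "lustre:" prefix (if any) is passed through to MPI_File_open()
     * and only means something if the MPI's ROMIO knows that prefix. */
    hid_t fid = H5Fcreate("lustre:test_this/ParaEg0.h5", H5F_ACC_TRUNC,
                          H5P_DEFAULT, fapl);
    if (fid < 0)
        MPI_Abort(MPI_COMM_WORLD, 1);

    H5Fclose(fid);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}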

thanks

--
************************************************************************
Steve Heistand                           NASA Ames Research Center
SciCon Group                             Mail Stop 258-6
steve.heistand@nasa.gov  (650) 604-4369  Moffett Field, CA 94035-1000
************************************************************************
"Any opinions expressed are those of our alien overloads, not my own."


Hi Steve,

On Jun 2, 2008, at 4:42 PM, Steve Heistand wrote:

Hi folks,

we are running into some issues trying to get collective I/O
to work on our Lustre filesystem.
Most of the time only one of the MPI processes gets the data correctly.

  Hmm, when you run with more MPI processes, does this work correctly? I'm not too certain what's the issue here, but it might be a couple of things... I'll CC Albert and see if he's seen this before.

  Quincey


Quincey Koziol wrote:

Hi Steve,

Hi folks,

we are running into some issues trying to get collective I/O
to work on our Lustre filesystem.
Most of the time only one of the MPI processes gets the data correctly.

    Hmm, when you run with more MPI processes, does this work
correctly? I'm not too certain what's the issue here, but it might be a
couple of things... I'll CC Albert and see if he's seen this before.

    Quincey

The actual code runs with many MPI procs; only one process seems to
get the right answers, though. The tests I'm running here are just the example
code from the PHDF5 source tree. When I run with just one MPI proc I get good answers:

r6i3n14:heistand% mpiexec -np 1 ./a.out
Parallel test files are:
   ./ParaEg0.h5
   ./ParaEg1.h5
--------------------------------
Proc 0: *** testing PHDF5 dataset using split communicators...
--------------------------------
Proc 0: *** testing PHDF5 dataset independent write...
--------------------------------
Proc 0: *** testing PHDF5 dataset collective write...
--------------------------------
Proc 0: *** testing PHDF5 dataset independent read...
--------------------------------
Proc 0: *** testing PHDF5 dataset collective read...
--------------------------------

===================================
PHDF5 tests finished with no errors
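
(For reference, the "collective write" and "collective read" steps in that output come down to a dataset transfer property list with collective MPI-IO selected; roughly like the sketch below, not the exact ph5example.c code, with the dataset, dataspaces and buffer assumed to be set up elsewhere.)

/* Sketch: the collective-I/O knob the "collective" test steps exercise.
 * dset, memspace, filespace and buf are assumed to come from the caller. */
#include <hdf5.h>

static herr_t write_collective(hid_t dset, hid_t memspace, hid_t filespace,
                               const int *buf)
{
    hid_t  dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE); /* every rank must call H5Dwrite */
    herr_t ret  = H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);
    H5Pclose(dxpl);
    return ret;
}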

but with more processes I get mostly wrong answers, though oddly close ones
(expecting 1213 but getting 13 seems like a bit-width issue):

r6i3n14:heistand% mpiexec -np 2 ./a.out
--------------------------------
Parallel test files are:
Proc 1: *** testing PHDF5 dataset using split communicators...
--------------------------------
   ./ParaEg0.h5
   ./ParaEg1.h5
--------------------------------
Proc 0: *** testing PHDF5 dataset using split communicators...
--------------------------------
Proc 1: *** testing PHDF5 dataset independent write...
--------------------------------
Proc 0: *** testing PHDF5 dataset independent write...
--------------------------------
Proc 0: *** testing PHDF5 dataset collective write...
--------------------------------
Proc 1: *** testing PHDF5 dataset collective write...
--------------------------------
Proc 0: *** testing PHDF5 dataset independent read...
--------------------------------
Proc 1: *** testing PHDF5 dataset independent read...
--------------------------------
Proc 1: *** testing PHDF5 dataset collective read...
--------------------------------
Proc 0: *** testing PHDF5 dataset collective read...
--------------------------------
Dataset Verify failed at [0][0](row 0, col 12): expect 1213, got 13
Dataset Verify failed at [0][0](row 0, col 0): expect 1201, got 1
Dataset Verify failed at [0][1](row 0, col 13): expect 1214, got 14
Dataset Verify failed at [0][2](row 0, col 14): expect 1215, got 15
Dataset Verify failed at [0][3](row 0, col 15): expect 1216, got 16
Dataset Verify failed at [0][1](row 0, col 1): expect 1202, got 2
Dataset Verify failed at [0][4](row 0, col 16): expect 1217, got 17
Dataset Verify failed at [0][2](row 0, col 2): expect 1203, got 3
Dataset Verify failed at [0][5](row 0, col 17): expect 1218, got 18
Dataset Verify failed at [0][3](row 0, col 3): expect 1204, got 4
Dataset Verify failed at [0][6](row 0, col 18): expect 1219, got 19
Dataset Verify failed at [0][4](row 0, col 4): expect 1205, got 5
Dataset Verify failed at [0][7](row 0, col 19): expect 1220, got 20
Dataset Verify failed at [0][5](row 0, col 5): expect 1206, got 6
Dataset Verify failed at [0][8](row 0, col 20): expect 1221, got 21
Dataset Verify failed at [0][6](row 0, col 6): expect 1207, got 7
Dataset Verify failed at [0][9](row 0, col 21): expect 1222, got 22
Dataset Verify failed at [0][7](row 0, col 7): expect 1208, got 8
[more errors ...]
Dataset Verify failed at [0][8](row 0, col 8): expect 1209, got 9
288 errors found in dataset_vrfy
Dataset Verify failed at [0][9](row 0, col 9): expect 1210, got 10
[more errors ...]
288 errors found in dataset_vrfy
Dataset Verify failed at [0][0](row 0, col 0): expect 1201, got 1
Dataset Verify failed at [0][1](row 0, col 1): expect 1202, got 2
Dataset Verify failed at [0][2](row 0, col 2): expect 1203, got 3
Dataset Verify failed at [0][3](row 0, col 3): expect 1204, got 4
Dataset Verify failed at [0][4](row 0, col 4): expect 1205, got 5
Dataset Verify failed at [0][5](row 0, col 5): expect 1206, got 6
Dataset Verify failed at [0][6](row 0, col 6): expect 1207, got 7
Dataset Verify failed at [0][7](row 0, col 7): expect 1208, got 8
Dataset Verify failed at [0][8](row 0, col 8): expect 1209, got 9
Dataset Verify failed at [0][9](row 0, col 9): expect 1210, got 10
[more errors ...]
288 errors found in dataset_vrfy
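
One thing that might help narrow this down (just a sketch, with a made-up file name and test pattern): a pure MPI-IO collective write/read over the same Lustre path takes HDF5 out of the picture, so if this also loses data the problem is in the MPI-IO (ROMIO/Lustre) layer rather than in HDF5. Note that the "lustre:" prefix is only understood if the MPI's ROMIO was built with a Lustre driver, so the sketch uses a plain path.

/* Sketch: MPI-IO-only collective write/read check, no HDF5 involved. */
#include <mpi.h>
#include <stdio.h>

#define N 16  /* ints per rank */

int main(int argc, char **argv)
{
    char       path[] = "test_this/mpiio_check"; /* any file on the Lustre mount */
    int        rank, i, buf[N], check[N], errs = 0;
    MPI_File   fh;
    MPI_Offset off;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++)
        buf[i] = rank * 100 + i;              /* distinct pattern per rank */
    off = (MPI_Offset)rank * N * sizeof(int); /* each rank owns one block */

    /* Collective write: every rank writes its own contiguous block. */
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, off, buf, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* Reopen and read the same block back collectively, then verify. */
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_read_at_all(fh, off, check, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    for (i = 0; i < N; i++)
        if (check[i] != buf[i])
            errs++;
    printf("rank %d: %d verification errors\n", rank, errs);

    MPI_Finalize();
    return 0;
}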


--
************************************************************************
Steve Heistand                           NASA Ames Research Center
SciCon Group                             Mail Stop 258-6
steve.heistand@nasa.gov  (650) 604-4369  Moffett Field, CA 94035-1000
************************************************************************
"Any opinions expressed are those of our alien overloads, not my own."

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.