possible MPI and/or file system bug

Hi All,

   One of the bugs that has been holding up the HDF5 1.8.6 release appears
to be an MPI and/or file system bug. We believe we have reproduced it on
NCSA's Abe with an MPI program (tmpi.c).

   Two requests for help:

   First, we would appreciate it if those of you who are conversant with
MPI would take a look at tmpi.c (see below), and let us know if you see
any problems with it in terms of correctness. We think it is correct,
but MPI can be slippery so extra eyes would be useful.

   Second, we would like to know just how widespread an issue we are dealing
with. We know it is a problem on NCSA's Abe, and it may be a problem on TACC's
Ranger as well. If you are able to run tmpi.c on other machines and report
positive or negative results, that would give us a better idea of the scope of
the problem.

   An outline of tmpi.c follows, along with a description of how the failure
can be exposed on Abe. For similar systems, testing with a similar protocol
should be sufficient. In other cases, some experimentation may be required.
For example, we didn't see the issue on Abe until we ran with more than one
process per node. In your reports, please let us know what flavors of MPI
and file system (GPFS, Lustre, etc.) the target machine uses. If you succeed
in exposing the failure, please let us know exactly how you did it.

   Finally, the code for tmpi.c appears later in this email, followed by
sample output from Abe and Ranger.

   Do let us know if you can help on either front.

                                               Many thanks,

                                               John Mainzer

======================== description of tmpi.c =========================

   Briefly, tmpi.c consists of a loop in which all processes
proceed as follows. Note that the particulars of synchronization are
controlled by the SET_ATOMICITY and REOPEN #defines. The program fails
in the same way regardless of whether SET_ATOMICITY and/or REOPEN are
TRUE.

   1) Barrier

   2) Open the test file

      If SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

   3) Participate in a collective write of an integer vector
      to the file. Each process writes 10 integers starting
      at index (mpi_rank * 10). Each integer is set equal to
      its index in the vector.

   4) If REOPEN is TRUE

          close test file/barrier/open test file

          if SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

      else if neither SET_ATOMICITY nor REOPEN is TRUE

          Sync/Barrier/Sync

   5) Participate in a collective read of the integer vector
      from file. Each process reads the entire vector.

   6) Verify that the vector contains the expected data. If it
      does not, each process issues an error message. In addition,
      process 0 dumps the contents of the vector to stdout, and
      also prints the contents of the vector as an ASCII string
      starting at the point at which the data differs from the
      expected values.

   7) If REOPEN is TRUE

          close test file/barrier/open test file

          if SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

      else if neither SET_ATOMICITY nor REOPEN is TRUE

          Sync/Barrier/Sync

   8) Construct an array of 80 characters containing the string:

          "Independent write x/y."

      followed by null characters to the end of the array. In
      each string, x is replaced by the number of the pass through
      the loop, and y is replaced by the MPI rank of the process.

   9) Perform an independent write of the above array to location
      (mpi_rank * 80) in the file.

  10) Close the file.
     
The above loop is repeated 100 times.
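
   Because the collective write (step 3) and the independent write (step 9)
both start at the beginning of the file, each independent write lands on top
of part of the integer vector. The sketch below works out which vector
indices a given rank's independent write covers. It is only an illustration:
the helper names are hypothetical and not part of tmpi.c, and it assumes a
4-byte int.

```c
#include <assert.h>

#define BLOCK 10                /* integers per rank, as in tmpi.c */
#define IND_WRITE_BUF_SIZE 80   /* bytes per independent write, as in tmpi.c */
#define ASSUMED_INT_SIZE 4      /* assumption: sizeof(int) == 4 on the target */

/* First vector index covered by the independent write of the given rank. */
long first_index_hit(int rank)
{
    assert(rank >= 0);
    return ((long)rank * IND_WRITE_BUF_SIZE) / ASSUMED_INT_SIZE;
}

/* Number of vector indices covered by one independent write. */
long indices_per_write(void)
{
    return IND_WRITE_BUF_SIZE / ASSUMED_INT_SIZE;
}

/* Rank whose collective write owns the given vector index. */
int owner_of_index(long index)
{
    assert(index >= 0);
    return (int)(index / BLOCK);
}
```

   Under these assumptions, rank 2's independent write covers vector indices
40 through 59, which the collective write assigns to ranks 4 and 5. That
matches the sample output from Abe, where the stray strings written by rank 2
show up starting at location 40.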

   In the test code below, you will note that the construction of the derived
type used in the collective write is somewhat convoluted. This is done to
duplicate the behavior of HDF5 under the circumstances in which this issue
was first detected.
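
   For readers who want to check the derived type by hand, the following
sketch models the byte layout that the file type appears to describe: a
lower bound at 0, one block of data at (mpi_rank * block * sizeof(int)), and
an upper bound at (mpi_size * block * sizeof(int)). The struct and function
names are hypothetical, the code uses no MPI calls, and it assumes a view
displacement of 0 with an etype of one byte, as in tmpi.c.

```c
#include <assert.h>
#include <stddef.h>

/* Byte layout of one tile of the file view for a given rank.
 * Mirrors the MPI_LB / data / MPI_UB struct built in tmpi.c. */
typedef struct {
    size_t data_start;  /* displacement of this rank's data within the tile */
    size_t data_len;    /* bytes of data per tile */
    size_t extent;      /* tile size: distance from lower to upper bound */
} view_tile;

view_tile make_view_tile(int mpi_rank, int mpi_size, int block, size_t elem_size)
{
    view_tile t;

    assert(mpi_rank >= 0 && mpi_rank < mpi_size && block > 0 && elem_size > 0);

    t.data_len   = (size_t)block * elem_size;
    t.data_start = (size_t)mpi_rank * t.data_len;
    t.extent     = (size_t)mpi_size * t.data_len;
    return t;
}

/* Map a byte offset within the view to an absolute file offset,
 * assuming a view displacement of 0 and an etype of one byte. */
size_t view_to_file_offset(view_tile t, size_t view_offset)
{
    size_t tile   = view_offset / t.data_len;   /* which repetition of the type */
    size_t within = view_offset % t.data_len;   /* position inside the data block */

    return tile * t.extent + t.data_start + within;
}
```

   With six processes, a block of 10 and a 4-byte int, rank 2's view begins
at file byte 80, and any bytes written past the first 40 in the view spill
into the next tile, which begins at file byte 320.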

===================== reproducing the failure on Abe =====================

   On Abe, the program only fails if there are more processes than nodes -- my
tests were on the head nodes of Abe. The failure appears regularly on runs
with six processes distributed between the four head nodes.

   To compile and run on the head nodes of Abe, first start mpd's on all four
head nodes. I did this with:

  mpdboot -n 4 -f ~/mpd.hosts

where mpd.hosts contains:

  honest1
  honest2
  honest3
  honest4

To compile and run:

  mpicc tmpi.c
  mpiexec -n 6 ./a.out

   On Abe, the apparent bug is corruption observed in the vector when it
is read from file and compared with the expected values in steps 5 and 6
above. As you can see from the sample output, this corruption is occasional.
In cases where the corruption contains an identifiable string (for example,
see iteration 15 in the sample output), it appears to be data from the
independent writes of the previous iteration.
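
   The large integers in the dumped vector can be turned back into text to
confirm this. The helper below is a minimal sketch, not part of tmpi.c: the
function name is hypothetical, and it assumes the values were read on a
little-endian machine with a 4-byte int. It unpacks the bytes explicitly so
that it gives the same answer on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack n 32-bit values into bytes, least significant byte first
 * (assumption: the values were read on a little-endian machine with
 * a 4-byte int). out must hold at least 4 * n + 1 bytes; the result
 * is always NUL-terminated. */
void ints_to_string(const uint32_t *vals, size_t n, char *out, size_t out_size)
{
    size_t i, b;

    assert(out_size >= 4 * n + 1);

    for (i = 0; i < n; i++) {
        for (b = 0; b < 4; b++) {
            out[4 * i + b] = (char)((vals[i] >> (8 * b)) & 0xFFu);
        }
    }
    out[4 * n] = '\0';
}
```

   Feeding it the six nonzero values from iteration 15 of the sample output
(1701080649, 1684956528, 544501349, 1953067639, 875634789, 3027503) yields
"Independent write 14/2.", the same string that process 0 prints.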

========================== failure on Ranger ==============================

   While I don't have access to Ranger, a co-worker reports that an earlier
version of tmpi.c fails there as well -- albeit with a crash. As I didn't
run the test, I can't speak to the particulars. However, I have appended the
output reported to me on the off chance that it will be useful.

============================ test program tmpi.c ============================
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <string.h>

#define BLOCK 10
#define NITER 100
#define IND_WRITE_BUF_SIZE 80

/* set to 1 to use MPI_File_set_atomicity */
#define SET_ATOMICITY 0

/* set to 1 to close and reopen the file after writes and reads */
#define REOPEN 0

void construct_file_mpi_datatype(int mpi_rank,
                                 int mpi_size,
                                 int block,
                                 MPI_Datatype * file_type_ptr)
{
    int block_length[3];
    MPI_Datatype inner_type; /* Inner MPI Datatype */
    MPI_Datatype outer_type; /* Outer MPI Datatype */
    MPI_Datatype filetype; /* MPI File datatype */
    MPI_Datatype old_types[3];
    MPI_Aint extent_len;
    MPI_Aint displacement[3];

    /* Create base contiguous type */
    MPI_Type_contiguous(sizeof(int), MPI_BYTE, &inner_type);

    MPI_Type_vector(1, block, 1, inner_type, &outer_type);
    MPI_Type_free(&inner_type);

    MPI_Type_extent(outer_type, &extent_len);

    inner_type = outer_type;

    block_length[0] = 1;
    block_length[1] = 1;
    block_length[2] = 1;

    old_types[0] = MPI_LB;
    old_types[1] = outer_type;
    old_types[2] = MPI_UB;

    displacement[0] = 0;
    displacement[1] = mpi_rank * block * sizeof(int);
    displacement[2] = (mpi_size) * block * sizeof(int);

    MPI_Type_struct(3, block_length, displacement, old_types, &inner_type);

    MPI_Type_free(&outer_type);

    filetype = inner_type;

    MPI_Type_commit(&filetype);

    *file_type_ptr = filetype;

    return;

} /* construct_file_mpi_datatype() */

void do_independant_write(MPI_File fh,
                          int mpi_rank,
                          int mpi_size,
                          int generation,
                          MPI_Offset base_offset)
{
    char write_buf[IND_WRITE_BUF_SIZE];
    int i;
    int success = 1;

    for ( i = 0; i < IND_WRITE_BUF_SIZE; i++ ) {

        write_buf[i] = '\0';
    }

    sprintf(write_buf, "Independent write %d/%d.", generation, mpi_rank);

    assert(strlen(write_buf) < IND_WRITE_BUF_SIZE);

    MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

    MPI_File_write_at(fh,
                      base_offset + (mpi_rank * IND_WRITE_BUF_SIZE),
                      write_buf,
                      IND_WRITE_BUF_SIZE,
                      MPI_BYTE,
                      MPI_STATUS_IGNORE);

    return;

} /* do_independant_write() */

int main(int argc, char *argv[])
{
    int *wbuf = NULL; /* Write buffer */
    int *rbuf = NULL; /* Read buffer */
    int mpi_rank; /* MPI Rank */
    int mpi_size; /* MPI Size */
    int block = BLOCK;
    MPI_File fh; /* File */
    MPI_Datatype filetype; /* MPI File datatype */
    int failed = 0;
    int failure_point;
    int i, j, k;

    /* Setup */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    /* Loop NITER times */
    for(i=0; i<NITER; i++) {

        if ( mpi_rank == 0 ) {

            fprintf(stdout, "Itteration %d: block size == %d.\n", i, block);
        }

        /* construct the file mpi derived type */
        construct_file_mpi_datatype(mpi_rank, mpi_size, block, &filetype);

        /* Allocate buffers */
        /* All processes read the entire file */
        rbuf = (int *)malloc((mpi_size) * block * sizeof(int));

        wbuf = (int *)malloc(block * sizeof(int));

        /* Fill buffer: final file will be simply a series of increasing
         * integers: 0, 1, 2, 3... */
        for(j=0; j<block; j++)
            wbuf[j] = j + (mpi_rank * block);

        /* Barrier */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Open file collectively */
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR
                | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif

        /* Set the file view */
        MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

        /* Write the data */
        MPI_File_write_at_all(fh, 0, wbuf, (mpi_rank == 0 ? 2 : 1) * block
                * sizeof(int), MPI_BYTE, MPI_STATUS_IGNORE);

#if REOPEN
        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR,
                MPI_INFO_NULL, &fh);
#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif
#else
#if ( !( REOPEN || SET_ATOMICITY ) )
        /* Sync/Barrier/Sync */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);
#endif
#endif

        MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

        /* Read the data */
        MPI_File_read_at_all(fh, 0, rbuf, (mpi_size) * block * sizeof(int),
                MPI_BYTE, MPI_STATUS_IGNORE);

        /* Verify the read data */
        failed = 0;
        for(j = 0; !failed && j < (mpi_size) * block; j++)
            if(rbuf[j] != j) {
                failed = 1;
                failure_point = j;
                printf("Rank %d detected error on iteration %d at location %d!\n",
                        mpi_rank, i, j);
            }

        if ( ( mpi_rank == 0 ) && ( failed ) ) {

            k = 0;
            fprintf(stdout, "\n");
            for ( j = 0; j < (mpi_size) * block; j++ ) {

                fprintf(stdout, " %d", rbuf[j]);
                k++;
                if ( k >= 10 ) {

                    k = 0;
                    fprintf(stdout, "\n");
                }
            }
            fprintf(stdout, "\n");

            fprintf(stdout,
               "String representation of receive buffer starting at rbuf[%d]: \"%s\"\n\n",
               failure_point, (char *)(&(rbuf[failure_point])));
        }

#if REOPEN
        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR,
                MPI_INFO_NULL, &fh);
#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif
#else
#if ( ! ( REOPEN || SET_ATOMICITY ) )
        /* Sync/Barrier/Sync */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);
#endif
#endif

        do_independant_write(fh, mpi_rank, mpi_size, i, 0);

        MPI_Type_free(&filetype);

        MPI_File_close(&fh);

        free(wbuf);
        free(rbuf);
    }

    MPI_Finalize();

    return 0;
}
============================= sample output from Abe ============================
[mainzer@honest1 testpar]$ mpiexec -n 6 ./a.out
Itteration 0: block size == 10.
Itteration 1: block size == 10.
Itteration 2: block size == 10.
Itteration 3: block size == 10.
Itteration 4: block size == 10.
Itteration 5: block size == 10.
Itteration 6: block size == 10.
Itteration 7: block size == 10.
Itteration 8: block size == 10.
Itteration 9: block size == 10.
Itteration 10: block size == 10.
Itteration 11: block size == 10.
Itteration 12: block size == 10.
Itteration 13: block size == 10.
Itteration 14: block size == 10.
Itteration 15: block size == 10.
Rank 0 detected error on iteration 15 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 875634789 3027503 0 0 0 0
50 51 52 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 14/2."

Rank 2 detected error on iteration 15 at location 40!
Rank 5 detected error on iteration 15 at location 40!
Rank 3 detected error on iteration 15 at location 40!
Rank 4 detected error on iteration 15 at location 40!
Rank 1 detected error on iteration 15 at location 40!
Itteration 16: block size == 10.
Rank 0 detected error on iteration 16 at location 53!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 16 at location 53!
Rank 5 detected error on iteration 16 at location 53!
Rank 3 detected error on iteration 16 at location 53!
Rank 4 detected error on iteration 16 at location 53!
Rank 1 detected error on iteration 16 at location 53!
Itteration 17: block size == 10.
Itteration 18: block size == 10.
Itteration 19: block size == 10.
Itteration 20: block size == 10.
Itteration 21: block size == 10.
Itteration 22: block size == 10.
Itteration 23: block size == 10.
Itteration 24: block size == 10.
Itteration 25: block size == 10.
Itteration 26: block size == 10.
Itteration 27: block size == 10.
Itteration 28: block size == 10.
Itteration 29: block size == 10.
Itteration 30: block size == 10.
Rank 0 detected error on iteration 30 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 959586405 3027503 0 0 0 0
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 29/2."

Rank 2 detected error on iteration 30 at location 40!
Rank 5 detected error on iteration 30 at location 40!
Rank 3 detected error on iteration 30 at location 40!
Rank 4 detected error on iteration 30 at location 40!
Rank 1 detected error on iteration 30 at location 40!
Itteration 31: block size == 10.
Itteration 32: block size == 10.
Rank 0 detected error on iteration 32 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 825434213 3027503 0 0 0 0
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 31/2."

Rank 2 detected error on iteration 32 at location 40!
Rank 5 detected error on iteration 32 at location 40!
Rank 3 detected error on iteration 32 at location 40!
Rank 4 detected error on iteration 32 at location 40!
Rank 1 detected error on iteration 32 at location 40!
Itteration 33: block size == 10.
Itteration 34: block size == 10.
Itteration 35: block size == 10.
Rank 0 detected error on iteration 35 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 35 at location 50!
Rank 5 detected error on iteration 35 at location 50!
Rank 3 detected error on iteration 35 at location 50!
Rank 4 detected error on iteration 35 at location 50!
Rank 1 detected error on iteration 35 at location 50!
Itteration 36: block size == 10.
Itteration 37: block size == 10.
Rank 0 detected error on iteration 37 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 37 at location 50!
Rank 5 detected error on iteration 37 at location 50!
Rank 3 detected error on iteration 37 at location 50!
Rank 4 detected error on iteration 37 at location 50!
Rank 1 detected error on iteration 37 at location 50!
Itteration 38: block size == 10.
Itteration 39: block size == 10.
Itteration 40: block size == 10.
Rank 0 detected error on iteration 40 at location 53!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 40 at location 53!
Rank 5 detected error on iteration 40 at location 53!
Rank 3 detected error on iteration 40 at location 53!
Rank 4 detected error on iteration 40 at location 53!
Rank 1 detected error on iteration 40 at location 53!
Itteration 41: block size == 10.
Itteration 42: block size == 10.
Itteration 43: block size == 10.
Itteration 44: block size == 10.
Rank 0 detected error on iteration 44 at location 53!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 44 at location 53!
Rank 3 detected error on iteration 44 at location 53!
Rank 5 detected error on iteration 44 at location 53!
Rank 4 detected error on iteration 44 at location 53!
Rank 1 detected error on iteration 44 at location 53!
Itteration 45: block size == 10.
Itteration 46: block size == 10.
Itteration 47: block size == 10.
Itteration 48: block size == 10.
Itteration 49: block size == 10.
Itteration 50: block size == 10.
Itteration 51: block size == 10.
Rank 0 detected error on iteration 51 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 51 at location 50!
Rank 5 detected error on iteration 51 at location 50!
Rank 3 detected error on iteration 51 at location 50!
Rank 4 detected error on iteration 51 at location 50!
Rank 1 detected error on iteration 51 at location 50!
Itteration 52: block size == 10.
Itteration 53: block size == 10.
Itteration 54: block size == 10.
Rank 0 detected error on iteration 54 at location 53!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 54 at location 53!
Rank 5 detected error on iteration 54 at location 53!
Rank 3 detected error on iteration 54 at location 53!
Rank 4 detected error on iteration 54 at location 53!
Rank 1 detected error on iteration 54 at location 53!
Itteration 55: block size == 10.
Rank 0 detected error on iteration 55 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 875896933 3027503 0 0 0 0
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 54/2."

Rank 2 detected error on iteration 55 at location 40!
Rank 3 detected error on iteration 55 at location 40!
Rank 5 detected error on iteration 55 at location 40!
Rank 4 detected error on iteration 55 at location 40!
Rank 1 detected error on iteration 55 at location 40!
Itteration 56: block size == 10.
Rank 0 detected error on iteration 56 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 892674149 3027503 0 0 0 0
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 55/2."

Rank 2 detected error on iteration 56 at location 40!
Rank 1 detected error on iteration 56 at location 40!
Rank 5 detected error on iteration 56 at location 40!
Rank 3 detected error on iteration 56 at location 40!
Rank 4 detected error on iteration 56 at location 40!
Itteration 57: block size == 10.
Itteration 58: block size == 10.
Itteration 59: block size == 10.
Itteration 60: block size == 10.
Itteration 61: block size == 10.
Rank 0 detected error on iteration 61 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 61 at location 50!
Rank 5 detected error on iteration 61 at location 50!
Rank 3 detected error on iteration 61 at location 50!
Rank 4 detected error on iteration 61 at location 50!
Rank 1 detected error on iteration 61 at location 50!
Itteration 62: block size == 10.
Rank 0 detected error on iteration 62 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 825630821 3027503 0 0 0 0
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 61/2."

Rank 2 detected error on iteration 62 at location 40!
Rank 5 detected error on iteration 62 at location 40!
Rank 3 detected error on iteration 62 at location 40!
Rank 4 detected error on iteration 62 at location 40!
Rank 1 detected error on iteration 62 at location 40!
Itteration 63: block size == 10.
Itteration 64: block size == 10.
Rank 0 detected error on iteration 64 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 859185253 3027503 0 0 0 0
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 63/2."

Rank 2 detected error on iteration 64 at location 40!
Rank 5 detected error on iteration 64 at location 40!
Rank 3 detected error on iteration 64 at location 40!
Rank 4 detected error on iteration 64 at location 40!
Rank 1 detected error on iteration 64 at location 40!
Itteration 65: block size == 10.
Itteration 66: block size == 10.
Itteration 67: block size == 10.
Itteration 68: block size == 10.
Itteration 69: block size == 10.
Itteration 70: block size == 10.
Itteration 71: block size == 10.
Itteration 72: block size == 10.
Rank 0 detected error on iteration 72 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 72 at location 50!
Rank 5 detected error on iteration 72 at location 50!
Rank 3 detected error on iteration 72 at location 50!
Rank 4 detected error on iteration 72 at location 50!
Rank 1 detected error on iteration 72 at location 50!
Itteration 73: block size == 10.
Itteration 74: block size == 10.
Rank 0 detected error on iteration 74 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 74 at location 50!
Rank 5 detected error on iteration 74 at location 50!
Rank 3 detected error on iteration 74 at location 50!
Rank 4 detected error on iteration 74 at location 50!
Rank 1 detected error on iteration 74 at location 50!
Itteration 75: block size == 10.
Rank 0 detected error on iteration 75 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 75 at location 50!
Rank 5 detected error on iteration 75 at location 50!
Rank 3 detected error on iteration 75 at location 50!
Rank 4 detected error on iteration 75 at location 50!
Rank 1 detected error on iteration 75 at location 50!
Itteration 76: block size == 10.
Itteration 77: block size == 10.
Rank 0 detected error on iteration 77 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 909582437 3027503 0 0 0 0
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 76/2."

Rank 2 detected error on iteration 77 at location 40!
Rank 5 detected error on iteration 77 at location 40!
Rank 3 detected error on iteration 77 at location 40!
Rank 4 detected error on iteration 77 at location 40!
Rank 1 detected error on iteration 77 at location 40!
Itteration 78: block size == 10.
Itteration 79: block size == 10.
Rank 0 detected error on iteration 79 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 943136869 3027503 0 0 0 0
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 78/2."

Rank 2 detected error on iteration 79 at location 40!
Rank 5 detected error on iteration 79 at location 40!
Rank 3 detected error on iteration 79 at location 40!
Rank 4 detected error on iteration 79 at location 40!
Rank 1 detected error on iteration 79 at location 40!
Itteration 80: block size == 10.
Itteration 81: block size == 10.
Itteration 82: block size == 10.
Itteration 83: block size == 10.
Itteration 84: block size == 10.
Rank 0 detected error on iteration 84 at location 40!

Rank 2 detected error on iteration 84 at location 40!
Rank 3 detected error on iteration 84 at location 40!
Rank 5 detected error on iteration 84 at location 40!
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 859316325 3027503 0 0 0 0
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 83/2."

Rank 4 detected error on iteration 84 at location 40!
Rank 1 detected error on iteration 84 at location 40!
Itteration 85: block size == 10.
Itteration 86: block size == 10.
Rank 0 detected error on iteration 86 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 892870757 3027503 0 0 0 0
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 85/2."

Rank 2 detected error on iteration 86 at location 40!
Rank 3 detected error on iteration 86 at location 40!
Rank 5 detected error on iteration 86 at location 40!
Rank 4 detected error on iteration 86 at location 40!
Rank 1 detected error on iteration 86 at location 40!
Itteration 87: block size == 10.
Rank 0 detected error on iteration 87 at location 53!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 87 at location 53!
Rank 5 detected error on iteration 87 at location 53!
Rank 3 detected error on iteration 87 at location 53!
Rank 4 detected error on iteration 87 at location 53!
Rank 1 detected error on iteration 87 at location 53!
Itteration 88: block size == 10.
Rank 0 detected error on iteration 88 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 926425189 3027503 0 0 0 0
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 87/2."

Rank 2 detected error on iteration 88 at location 40!
Rank 5 detected error on iteration 88 at location 40!
Rank 3 detected error on iteration 88 at location 40!
Rank 4 detected error on iteration 88 at location 40!
Rank 1 detected error on iteration 88 at location 40!
Itteration 89: block size == 10.
Rank 0 detected error on iteration 89 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 943202405 3027503 0 0 0 0
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 88/2."

Rank 2 detected error on iteration 89 at location 40!
Rank 5 detected error on iteration 89 at location 40!
Rank 3 detected error on iteration 89 at location 40!
Rank 4 detected error on iteration 89 at location 40!
Rank 1 detected error on iteration 89 at location 40!
Itteration 90: block size == 10.
Itteration 91: block size == 10.
Rank 0 detected error on iteration 91 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 91 at location 50!
Rank 5 detected error on iteration 91 at location 50!
Rank 3 detected error on iteration 91 at location 50!
Rank 4 detected error on iteration 91 at location 50!
Rank 1 detected error on iteration 91 at location 50!
Itteration 92: block size == 10.
Itteration 93: block size == 10.
Itteration 94: block size == 10.
Itteration 95: block size == 10.
Itteration 96: block size == 10.
Rank 0 detected error on iteration 96 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 892936293 3027503 0 0 0 0
0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 95/2."

Rank 2 detected error on iteration 96 at location 40!
Rank 5 detected error on iteration 96 at location 40!
Rank 3 detected error on iteration 96 at location 40!
Rank 4 detected error on iteration 96 at location 40!
Rank 1 detected error on iteration 96 at location 40!
Itteration 97: block size == 10.
Itteration 98: block size == 10.
Itteration 99: block size == 10.
[mainzer@honest1 testpar]$
============================= sample output from Ranger ============================
=== Note that this is from a run of an earlier, somewhat different version of ===
=== the test code provided above. ===

···

====================================================================================
+ date
Wed Dec 29 14:11:04 CST 2010
+ ibrun ./john
TACC: Starting up job 1746647
TACC: Setting up parallel environment for MVAPICH ssh-based mpirun.
TACC: Setup complete. Running job script.
TACC: starting parallel tasks...
Itteration 0: block size == 10.
Itteration 1: block size == 10.
Itteration 2: block size == 10.
Itteration 3: block size == 10.
Itteration 4: block size == 10.
Itteration 5: block size == 10.
Itteration 6: block size == 10.
Itteration 7: block size == 10.
MPI process terminated unexpectedly
Exit code -5 signaled from i115-104.ranger.tacc.utexas.edu
cleanupKilling remote processes...MPI process terminated unexpectedly
DONE
TACC: MPI job exited with code: 1
TACC: Shutting down parallel environment.
TACC: Shutdown complete. Exiting.
+ date
Wed Dec 29 14:26:42 CST 2010
TACC: Cleaning up after job: 1746647
TACC: Done.

Hello again,

   Appended please find a corrected version of my post of yesterday.

   Unfortunately, when I was simplifying and tidying tmpi.c, I inserted
a buffer overrun error that masked the fact that one of my simplifications
caused the apparent bug to disappear.

   The revised version of my post repairs this error.

   The major change in the test program is that process 0 now writes
two blocks of integers to the vector -- one at the beginning, and one
at the end. The major code changes appear in the function
construct_file_mpi_datatype(). I also changed the name of the block
variable to block_len for readability.

   The failures exposed remain essentially the same.

   As the apparent failure on Ranger was obtained with the erroneous
test code, I have removed all reference to Ranger from my post.

   Please excuse the confusion.

                                           Best regards,

                                           John Mainzer

···

=========================================================================

Hi All,

   One of the bugs that has been holding up the HDF5 1.8.6 release appears
to be an MPI and/or file system bug. We believe we have reproduced it on
NCSA's Abe with an MPI program (tmpi.c).

   Two requests for help:

   First, we would appreciate it if those of you who are conversant with
MPI would take a look at tmpi.c (see below), and let us know if you see
any problems with it in terms of correctness. We think it is correct,
but MPI can be slippery so extra eyes would be useful.

   Second, we would like to know just how widespread an issue we are dealing
with. We know it is a problem on NCSA's Abe, and it may be a problem elsewhere.
If you are able to run tmpi.c on other machines and report positive or negative
results, that would give us a better idea of the scope of the problem.

   An outline of tmpi.c follows, along with a description of how the failure
can be exposed on Abe. For similar systems, testing with a similar protocol
should be sufficient. In other cases, some experimentation may be required.
For example, we didn't see the issue on Abe until we ran with more than one
process per node. In your reports, please let us know what flavors of MPI
and file system (GPFS, LUSTRE, etc) the target machine uses. If you succeed
in exposing the failure, please let us know exactly how you did it.

   Finally, the code for tmpi.c appears later in this email, followed by
sample output from Abe.

   Do let us know if you can help on either front.

                                               Many thanks,

                                               John Mainzer

======================== description of tmpi.c =========================

   Briefly, tmpi.c consists of a loop in which all processes
proceed as follows. Note that the particulars of synchronization are
controlled by the SET_ATOMICITY and REOPEN #defines. The program fails
in the same way regardless of whether SET_ATOMICITY and/or REOPEN are
TRUE.

   1) Barrier

   2) Open the test file

         If SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

   3) Participate in a collective write of an integer vector
      to the file. Each process writes 10 integers starting
      at index (mpi_rank * 10). In addition, process 0 writes
      another ten integers at index (mpi_size * 10) -- also as
      part of the collective write. Each integer is set equal
      to its index in the vector.

   4) If REOPEN is TRUE

         close test file/barrier/open test file

         if SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

      else if neither SET_ATOMICITY nor REOPEN is TRUE

         Sync/Barrier/Sync

   5) Participate in a collective read of the integer vector
      from file. Each process reads the entire vector.

   6) Verify that the vector contains the expected data. If it
      does not, each process issues an error message. In addition,
      process 0 dumps the contents of the vector to stdout, and
      also prints the contents of the vector as an ASCII string
      starting at the point at which the data differs from the
      expected values.

   7) If REOPEN is TRUE

         close test file/barrier/open test file

         if SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

      else if neither SET_ATOMICITY nor REOPEN is TRUE

         Sync/Barrier/Sync

   8) Construct an array of 80 characters containing the string:

         "Independent write x/y."

      followed by null characters to the end of the array. In
      each string, x is replaced by the number of the pass through
      the loop, and y is replaced by the MPI rank of the process.

   9) Perform an independent write of the above array to byte
      offset (mpi_rank * 80) in the file.

  10) Close the file.

The above loop is repeated 100 times.
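
   To see which part of the integer vector each independent write (step 9)
lands on, the byte offsets can be worked out directly. The following is a
minimal sketch and not part of tmpi.c: the helper independent_writer_of() is
a hypothetical name, and it assumes 4-byte integers.

```c
#include <assert.h>

#define IND_WRITE_BUF_SIZE 80 /* bytes per independent write, as in tmpi.c */
#define ASSUMED_INT_SIZE 4    /* assumption: sizeof(int) == 4 on the target */

/* Hypothetical helper (not in tmpi.c): return the rank whose independent
 * write covers the given index of the integer vector, or -1 if the index
 * lies beyond the last independent write. */
int independent_writer_of(int index, int mpi_size)
{
    long byte_offset;
    int rank;

    assert(index >= 0);
    assert(mpi_size > 0);

    /* Convert the integer index to a byte offset in the file. */
    byte_offset = (long)index * ASSUMED_INT_SIZE;

    /* Rank r writes bytes [r * 80, r * 80 + 80). */
    rank = (int)(byte_offset / IND_WRITE_BUF_SIZE);

    return (rank < mpi_size) ? rank : -1;
}
```

With six processes, indices 40 through 59 map to rank 2, which is consistent
with the "x/2" strings that show up at rbuf[40] in the sample output.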

   In the test code below, you will note that the construction of the derived
type used in the collective write is somewhat convoluted. This is done to
duplicate the behavior of HDF5 under the circumstances in which this issue
was first detected.
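
   Stripped of the MPI calls, the selection that the derived type is meant to
make can be described as a list of byte ranges per rank. The sketch below is
illustrative only: the byte_range struct and selected_ranges() are
hypothetical names, 4-byte integers are assumed, and it models what
construct_file_mpi_datatype() is intended to select, not what any particular
MPI implementation actually does.

```c
#include <assert.h>

#define ASSUMED_INT_SIZE 4 /* assumption: sizeof(int) == 4 on the target */

/* Hypothetical type: one contiguous region of the file. */
typedef struct {
    long offset; /* first byte of the range in the file */
    long length; /* number of bytes in the range */
} byte_range;

/* Hypothetical helper: fill out[] with the byte ranges a rank writes in
 * the collective write and return how many ranges were filled (1 or 2). */
int selected_ranges(int mpi_rank, int mpi_size, int block_len,
                    byte_range out[2])
{
    long block_bytes = (long)block_len * ASSUMED_INT_SIZE;

    assert(mpi_size > 0);
    assert(mpi_rank >= 0 && mpi_rank < mpi_size);
    assert(block_len > 0);

    /* Every rank owns the block at index (mpi_rank * block_len). */
    out[0].offset = (long)mpi_rank * block_bytes;
    out[0].length = block_bytes;

    if (mpi_rank != 0)
        return 1;

    /* Rank 0 also owns the final block at index (mpi_size * block_len). */
    out[1].offset = (long)mpi_size * block_bytes;
    out[1].length = block_bytes;

    return 2;
}
```

For a six-process run with a block length of 10, rank 0 gets bytes 0-39 and
240-279 and rank 2 gets bytes 80-119. Taken together, the ranks cover all 280
bytes of the vector exactly once.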

===================== reproducing the failure on Abe =====================

   On Abe, the program only fails if there are more processes than nodes -- my
tests were on the head nodes of Abe. The failure appears regularly on runs
with six processes distributed between the four head nodes.

   To compile and run on the head nodes of Abe, first start mpd's on all four
head nodes. I did this with:

  mpdboot -n 4 -f ~/mpd.hosts

where mpd.hosts contains:

  honest1
  honest2
  honest3
  honest4

To compile and run:

  mpicc tmpi.c
  mpiexec -n 6 ./a.out

   On Abe, the apparent bug is corruption observed in the vector when it
is read from file and compared with the expected values in steps 5 and 6
above. As you can see from the sample output, this corruption is occasional.
In cases where the corruption contains an identifiable string (for example,
see iteration 17 in the sample output), it appears to be data from the
independent writes of the previous iteration.
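
   The large integers in the dumps can be checked by hand. The sketch below
unpacks them back into characters. It is not part of tmpi.c: ints_to_ascii()
is a hypothetical helper, and it assumes 32-bit integers stored least
significant byte first (little-endian), which is an assumption about the
machine rather than something established above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: unpack `count` 32-bit values into `out`, least
 * significant byte first, and null-terminate the result.
 * `out` must have room for (count * 4 + 1) bytes. */
void ints_to_ascii(const uint32_t *vals, size_t count, char *out)
{
    size_t i;
    int b;

    for (i = 0; i < count; i++) {
        for (b = 0; b < 4; b++) {
            /* Byte b of the value sits at bits [8b, 8b + 8). */
            out[i * 4 + b] = (char)((vals[i] >> (8 * b)) & 0xFF);
        }
    }

    out[count * 4] = '\0';
}
```

Applied to the six values 1701080649 1684956528 544501349 1953067639 909189221
3027503 reported at rbuf[40] in iteration 17, this yields
"Independent write 16/2.", matching the string that process 0 printed.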

============================ test program tmpi.c ============================
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <string.h>

#define BLOCK 10
#define NITER 100
#define IND_WRITE_BUF_SIZE 80

/* set to 1 to use MPI_File_set_atomicity */
#define SET_ATOMICITY 0

/* set to 1 to close and reopen the file after the write */
#define REOPEN 0

void construct_file_mpi_datatype(int mpi_rank,
                                 int mpi_size,
                                 int block_len,
                                 MPI_Datatype * file_type_ptr)
{
    int block_length[3];
    MPI_Datatype inner_type; /* Inner MPI Datatype */
    MPI_Datatype outer_type; /* Outer MPI Datatype */
    MPI_Datatype filetype; /* MPI File datatype */
    MPI_Datatype old_types[3];
    MPI_Aint displacement[3];

    /* Create base contiguous type */
    MPI_Type_contiguous(sizeof(int), MPI_BYTE, &inner_type);

    if ( mpi_rank == 0 ) {
        /* Rank 0 operates on 2 blocks, other processes only operate on 1 */

        /* Select the first and last blocks for mpi_rank 0 */
        MPI_Type_vector(2, block_len, block_len * mpi_size, inner_type, &outer_type);
        MPI_Type_free(&inner_type);

        inner_type = outer_type;

        filetype = inner_type;

        MPI_Type_commit(&filetype);

    } else {

        /* Select the block corresponding to the mpi_rank */
        MPI_Type_vector(1, block_len, 1, inner_type, &outer_type);
        MPI_Type_free(&inner_type);

        inner_type = outer_type;

        block_length[0] = 1;
        block_length[1] = 1;
        block_length[2] = 1;

        old_types[0] = MPI_LB;
        old_types[1] = outer_type;
        old_types[2] = MPI_UB;

        displacement[0] = 0;
        displacement[1] = mpi_rank * block_len * sizeof(int);
        displacement[2] = (mpi_size + 1) * block_len * sizeof(int);

        MPI_Type_struct(3, block_length, displacement, old_types, &inner_type);

        MPI_Type_free(&outer_type);

        filetype = inner_type;

        MPI_Type_commit(&filetype);
    }

    *file_type_ptr = filetype;

    return;

} /* construct_file_mpi_datatype() */

void do_independant_write(MPI_File fh,
                          int mpi_rank,
                          int mpi_size,
                          int generation,
                          MPI_Offset base_offset)
{
    char write_buf[IND_WRITE_BUF_SIZE];
    int i;
    int success = 1;

    for ( i = 0; i < IND_WRITE_BUF_SIZE; i++ ) {

        write_buf[i] = '\0';
    }

    sprintf(write_buf, "Independent write %d/%d.", generation, mpi_rank);

    assert(strlen(write_buf) < IND_WRITE_BUF_SIZE);

    MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

    MPI_File_write_at(fh,
                      base_offset + (mpi_rank * IND_WRITE_BUF_SIZE),
                      write_buf,
                      IND_WRITE_BUF_SIZE,
                      MPI_BYTE,
                      MPI_STATUS_IGNORE);

    return;

} /* do_independant_write() */

int main(int argc, char *argv[])
{
    int *wbuf = NULL; /* Write buffer */
    int *rbuf = NULL; /* Read buffer */
    int mpi_rank; /* MPI Rank */
    int mpi_size; /* MPI Size */
    int block_len = BLOCK;
    MPI_File fh; /* File */
    MPI_Datatype filetype; /* MPI File datatype */
    int failed = 0;
    int failure_point;
    int i, j, k;

    /* Setup */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    /* Loop NITER times */
    for(i=0; i<NITER; i++) {

        if ( mpi_rank == 0 ) {

            fprintf(stdout, "Itteration %d: block size == %d.\n", i, block_len);
        }

        /* construct the file mpi derived type */
        construct_file_mpi_datatype(mpi_rank, mpi_size, block_len, &filetype);

        /* Allocate buffers */
        /* All processes read the entire file */
        rbuf = (int *)malloc((mpi_size + 1) * block_len * sizeof(int));

        if(mpi_rank == 0) {
            /* Rank 0 operates on 2 blocks, other processes only operate on 1 */
            wbuf = (int *)malloc(2 * block_len * sizeof(int));

            for(j=0; j<block_len; j++) {
                wbuf[j] = j;
                wbuf[j + block_len] = j + (mpi_size * block_len);
            }
        } else {
            wbuf = (int *)malloc(block_len * sizeof(int));

            /* Fill buffer: final file will be simply a series of increasing
             * integers: 0, 1, 2, 3... */
            for(j=0; j<block_len; j++)
                wbuf[j] = j + (mpi_rank * block_len);
        }

        /* Barrier */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Open file collectively */
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR
                | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif

        /* Set the file view */
        MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

        /* Write the data */
        MPI_File_write_at_all(fh, 0, wbuf,
                              (mpi_rank == 0 ? 2 : 1) * block_len * sizeof(int),
                              MPI_BYTE, MPI_STATUS_IGNORE);

#if REOPEN
        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR,
                MPI_INFO_NULL, &fh);
#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif
#else
#if ( !( REOPEN || SET_ATOMICITY ) )
        /* Sync/Barrier/Sync */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);
#endif
#endif

        MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

        /* Read the data */
        MPI_File_read_at_all(fh, 0, rbuf, (mpi_size + 1) * block_len * sizeof(int),
                MPI_BYTE, MPI_STATUS_IGNORE);

        /* Verify the read data */
        failed = 0;
        for(j = 0; !failed && j < (mpi_size + 1) * block_len; j++)
            if(rbuf[j] != j) {
                failed = 1;
                failure_point = j;
                printf("Rank %d detected error on iteration %d at location %d!\n",
                        mpi_rank, i, j);
            }

        if ( ( mpi_rank == 0 ) && ( failed ) ) {

            k = 0;
            fprintf(stdout, "\n");
            for ( j = 0; j < (mpi_size + 1) * block_len; j++ ) {

                fprintf(stdout, " %d", rbuf[j]);
                k++;
                if ( k >= 10 ) {

                    k = 0;
                    fprintf(stdout, "\n");
                }
            }
            fprintf(stdout, "\n");

            fprintf(stdout,
               "String representation of receive buffer starting at rbuf[%d]: \"%s\"\n\n",
               failure_point, (char *)(&(rbuf[failure_point])));
        }

#if REOPEN
        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR,
                MPI_INFO_NULL, &fh);
#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif
#else
#if ( ! ( REOPEN || SET_ATOMICITY ) )
        /* Sync/Barrier/Sync */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);
#endif
#endif

        do_independant_write(fh, mpi_rank, mpi_size, i, 0);

        MPI_Type_free(&filetype);

        MPI_File_close(&fh);

        free(wbuf);
        free(rbuf);
    }

    MPI_Finalize();

    return 0;
}
============================= sample output from Abe ============================
[mainzer@honest1 testpar]$ mpiexec -n 6 ./a.out
Itteration 0: block size == 10.
Itteration 1: block size == 10.
Itteration 2: block size == 10.
Itteration 3: block size == 10.
Itteration 4: block size == 10.
Itteration 5: block size == 10.
Itteration 6: block size == 10.
Itteration 7: block size == 10.
Itteration 8: block size == 10.
Itteration 9: block size == 10.
Itteration 10: block size == 10.
Itteration 11: block size == 10.
Itteration 12: block size == 10.
Itteration 13: block size == 10.
Rank 0 detected error on iteration 13 at location 53!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 0 0 0 0 0Rank 2 detected error on iteration 13 at location 53!
0 0
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[53]: ""

Rank 5 detected error on iteration 13 at location 53!
Rank 3 detected error on iteration 13 at location 53!
Rank 4 detected error on iteration 13 at location 53!
Rank 1 detected error on iteration 13 at location 53!
Itteration 14: block size == 10.
Itteration 15: block size == 10.
Itteration 16: block size == 10.
Itteration 17: block size == 10.
Rank 0 detected error on iteration 17 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 909189221 3027503 0 0 0 0
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[40]: "Independent write 16/2."

Rank 2 detected error on iteration 17 at location 40!
Rank 3 detected error on iteration 17 at location 40!
Rank 5 detected error on iteration 17 at location 40!
Rank 4 detected error on iteration 17 at location 40!
Rank 1 detected error on iteration 17 at location 40!
Itteration 18: block size == 10.
Rank 0 detected error on iteration 18 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0
Rank 2 detected error on iteration 18 at location 50!
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 3 detected error on iteration 18 at location 50!
Rank 4 detected error on iteration 18 at location 50!
Rank 1 detected error on iteration 18 at location 50!
Rank 5 detected error on iteration 18 at location 50!
Itteration 19: block size == 10.
Itteration 20: block size == 10.
Itteration 21: block size == 10.
Rank 0 detected error on iteration 21 at location 53!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 0 0 0 0 0 0 0
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 21 at location 53!
Rank 3 detected error on iteration 21 at location 53!
Rank 5 detected error on iteration 21 at location 53!
Rank 4 detected error on iteration 21 at location 53!
Rank 1 detected error on iteration 21 at location 53!
Itteration 22: block size == 10.
Rank 0 detected error on iteration 22 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 3 detected error on iteration 22 at location 50!
Rank 2 detected error on iteration 22 at location 50!
Rank 5 detected error on iteration 22 at location 50!
Rank 4 detected error on iteration 22 at location 50!
Rank 1 detected error on iteration 22 at location 50!
Itteration 23: block size == 10.
Itteration 24: block size == 10.
Itteration 25: block size == 10.
Itteration 26: block size == 10.
Rank 0 detected error on iteration 26 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 26 at location 50!
Rank 5 detected error on iteration 26 at location 50!
Rank 3 detected error on iteration 26 at location 50!
Rank 1 detected error on iteration 26 at location 50!
Rank 4 detected error on iteration 26 at location 50!
Itteration 27: block size == 10.
Rank 0 detected error on iteration 27 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 909254757 3027503 0 0 0 0
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
Rank 2 detected error on iteration 27 at location 40!

String representation of receive buffer starting at rbuf[40]: "Independent write 26/2."

Rank 3 detected error on iteration 27 at location 40!
Rank 5 detected error on iteration 27 at location 40!
Rank 1 detected error on iteration 27 at location 40!
Rank 4 detected error on iteration 27 at location 40!
Itteration 28: block size == 10.
Rank 0 detected error on iteration 28 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 28 at location 50!
Rank 5 detected error on iteration 28 at location 50!
Rank 3 detected error on iteration 28 at location 50!
Rank 4 detected error on iteration 28 at location 50!
Rank 1 detected error on iteration 28 at location 50!
Itteration 29: block size == 10.
Itteration 30: block size == 10.
Rank 0 detected error on iteration 30 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0
60 61 62 63 64 65 66Rank 2 detected error on iteration 30 at location 50!
67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 3 detected error on iteration 30 at location 50!
Rank 5 detected error on iteration 30 at location 50!
Rank 1 detected error on iteration 30 at location 50!
Rank 4 detected error on iteration 30 at location 50!
Itteration 31: block size == 10.
Itteration 32: block size == 10.
Itteration 33: block size == 10.
Itteration 34: block size == 10.
Itteration 35: block size == 10.
Itteration 36: block size == 10.
Rank 0 detected error on iteration 36 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 892543077 3027503 0 0 0 0
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[40]: "Independent write 35/2."

Rank 2 detected error on iteration 36 at location 40!
Rank 3 detected error on iteration 36 at location 40!
Rank 5 detected error on iteration 36 at location 40!
Rank 4 detected error on iteration 36 at location 40!
Rank 1 detected error on iteration 36 at location 40!
Itteration 37: block size == 10.
Itteration 38: block size == 10.
Itteration 39: block size == 10.
Itteration 40: block size == 10.
Itteration 41: block size == 10.
Itteration 42: block size == 10.
Itteration 43: block size == 10.
Itteration 44: block size == 10.
Rank 0 detected error on iteration 44 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 44 at location 50!
Rank 3 detected error on iteration 44 at location 50!
Rank 5 detected error on iteration 44 at location 50!
Rank 4 detected error on iteration 44 at location 50!
Rank 1 detected error on iteration 44 at location 50!
Itteration 45: block size == 10.
Itteration 46: block size == 10.
Itteration 47: block size == 10.
Rank 0 detected error on iteration 47 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 3 detected error on iteration 47 at location 50!
Rank 4 detected error on iteration 47 at location 50!
Rank 2 detected error on iteration 47 at location 50!
Rank 5 detected error on iteration 47 at location 50!
Rank 1 detected error on iteration 47 at location 50!
Itteration 48: block size == 10.
Itteration 49: block size == 10.
Itteration 50: block size == 10.
Itteration 51: block size == 10.
Itteration 52: block size == 10.
Itteration 53: block size == 10.
Itteration 54: block size == 10.
Itteration 55: block size == 10.
Itteration 56: block size == 10.
Rank 0 detected error on iteration 56 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 56 at location 50!
Rank 3 detected error on iteration 56 at location 50!
Rank 5 detected error on iteration 56 at location 50!
Rank 4 detected error on iteration 56 at location 50!
Rank 1 detected error on iteration 56 at location 50!
Itteration 57: block size == 10.
Itteration 58: block size == 10.
Itteration 59: block size == 10.
Itteration 60: block size == 10.
Itteration 61: block size == 10.
Itteration 62: block size == 10.
Itteration 63: block size == 10.
Itteration 64: block size == 10.
Itteration 65: block size == 10.
Itteration 66: block size == 10.
Itteration 67: block size == 10.
Itteration 68: block size == 10.
Itteration 69: block size == 10.
Rank 0 detected error on iteration 69 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 69 at location 50!
Rank 5 detected error on iteration 69 at location 50!
Rank 3 detected error on iteration 69 at location 50!
Rank 4 detected error on iteration 69 at location 50!
Rank 1 detected error on iteration 69 at location 50!
Itteration 70: block size == 10.
Itteration 71: block size == 10.
Itteration 72: block size == 10.
Rank 0 detected error on iteration 72 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 825696357 3027503 0 0 0 0
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[40]: "Independent write 71/2."

Rank 2 detected error on iteration 72 at location 40!
Rank 3 detected error on iteration 72 at location 40!
Rank 5 detected error on iteration 72 at location 40!
Rank 4 detected error on iteration 72 at location 40!
Rank 1 detected error on iteration 72 at location 40!
Itteration 73: block size == 10.
Itteration 74: block size == 10.
Itteration 75: block size == 10.
Itteration 76: block size == 10.
Itteration 77: block size == 10.
Rank 0 detected error on iteration 77 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15Rank 2 detected error on iteration 77 at location 40!
16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 909582437 3027503 0 0 0 0
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[40]: "Independent write 76/2."

Rank 3 detected error on iteration 77 at location 40!
Rank 5 detected error on iteration 77 at location 40!
Rank 1 detected error on iteration 77 at location 40!
Rank 4 detected error on iteration 77 at location 40!
Itteration 78: block size == 10.
Itteration 79: block size == 10.
Itteration 80: block size == 10.
Itteration 81: block size == 10.
Itteration 82: block size == 10.
Itteration 83: block size == 10.
Itteration 84: block size == 10.
Itteration 85: block size == 10.
Rank 0 detected error on iteration 85 at location 50!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
0 0 0 0 0 0 0 0 0 0
60 61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 85 at location 50!
Rank 5 detected error on iteration 85 at location 50!
Rank 3 detected error on iteration 85 at location 50!
Rank 4 detected error on iteration 85 at location 50!
Rank 1 detected error on iteration 85 at location 50!
Itteration 86: block size == 10.
Itteration 87: block size == 10.
Rank 0 detected error on iteration 87 at location 40!

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
1701080649 1684956528 544501349 1953067639 909647973 3027503 0 0 0 0
50 51 52 53 54 55 56 57 58 59
60Rank 2 detected error on iteration 87 at location 40!
61 62 63 64 65 66 67 68 69

String representation of receive buffer starting at rbuf[40]: "Independent write 86/2."

Rank 3 detected error on iteration 87 at location 40!
Rank 5 detected error on iteration 87 at location 40!
Rank 4 detected error on iteration 87 at location 40!
Rank 1 detected error on iteration 87 at location 40!
Itteration 88: block size == 10.
Itteration 89: block size == 10.
Itteration 90: block size == 10.
Itteration 91: block size == 10.
Itteration 92: block size == 10.
Itteration 93: block size == 10.
Itteration 94: block size == 10.
Itteration 95: block size == 10.
Itteration 96: block size == 10.
Itteration 97: block size == 10.
Itteration 98: block size == 10.
Itteration 99: block size == 10.
[mainzer@honest1 testpar]$