can not open/close a file more than 1010 times

Dear all,

I am using the parallel HDF5 API on a Cray XE6. It had been working fine until a few weeks ago. The problem is probably related to this machine (perhaps since a maintenance?), so I have sent the issue to the helpdesk for this Cray, but I am submitting it to you as well. I see the problem only on this machine; another (non-Cray) machine I use shows no problem.

Problem:
With the parallel version of HDF5, we cannot open and close the same HDF5 file more than a fixed number of times (1010). Moreover, if we open and close, for example, two different HDF5 files, this maximum number is divided by two. Is there a limit on the number of HDF5 file openings allowed?

Observations:
The problem is reproducible, always with the same number of open/close cycles allowed.
It does not depend on the number of MPI processes involved (tested up to 512 cores, still 1010 openings before the crash), on the amount of data written, or on the operations performed on an open file (such as creating datasets, attributes, or groups).
We checked the status of each HDF5 operation (hdferror argument): none reports an error before the crash, so every file seems to be correctly opened and closed. The behavior is the same if we force the file close degree to H5F_CLOSE_STRONG_F.
The HDF5 file created and used before the crash is still readable, and all data written to it beforehand are intact.

Setup:
This problem occurs with hdf5-parallel versions 1.8.5.0 and 1.8.8, both compiled with the Intel compiler 12.0.3.174, and only on a Cray XE6 (I can send the module configuration used).

A basic program that reproduces the structure of our HDF5 calls is attached; it produces the error described above. It is compiled with the command "h5pfc -FR main.F -o test".

Here is part of the standard error output, produced when the code reached the 1010th opening:

HDF5-DIAG: Error detected in HDF5 (1.8.5) MPI-process 0:
  #000: H5F.c line 1495 in H5Fopen(): unable to open file
    major: File accessability
    minor: Unable to open file
  #001: H5F.c line 1195 in H5F_open(): unable to open file
    major: File accessability
    minor: Unable to open file
  #002: H5FD.c line 1088 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDmpio.c line 999 in H5FD_mpio_open(): MPI_File_open failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #004: H5FDmpio.c line 999 in H5FD_mpio_open(): Other I/O error , error stack:
ADIOI_UFS_OPEN(108): Other I/O error Too many open files
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
HDF5-DIAG: Error detected in HDF5 (1.8.5) MPI-process 0:
  #000: H5F.c line 1943 in H5Fclose(): invalid file identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.5) MPI-process 1:
  #000: H5F.c line 1495 in H5Fopen(): unable to open file
    major: File accessability
    minor: Unable to open file
  #001: H5F.c line 1195 in H5F_open(): unable to open file
    major: File accessability
    minor: Unable to open file
  #002: H5FD.c line 1088 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
HDF5-DIAG: Error detected in HDF5 (1.8.5) HDF5-DIAG: Error detected in HDF5 (1.8.5) HDF5-DIAG: Error detected in HDF5 (1.8.5) #003: H5FDmpio.c line 999 in H5FD_mpio_open(): MPI_File_open failed
MPI-process 3MPI-process 6 major: Internal error (too specific to document in detail)
HDF5-DIAG: Error detected in HDF5 (1.8.5) :
HDF5-DIAG: Error detected in HDF5 (1.8.5) MPI-process 7HDF5-DIAG: Error detected in HDF5 (1.8.5) MPI-process 2:
:
  #000: H5F.c line 1495 in H5Fopen(): unable to open file
    minor: Some MPI function failed
  #000: H5F.c line 1495 in H5Fopen(): unable to open file
  #000: H5F.c line 1495 in H5Fopen(): unable to open file
    major: File accessability
    major: File accessability

I have also attached the complete standard error output.

I would appreciate any help! Do not hesitate to ask me for additional details if needed.

Thanks,

Best regards,

Stephane

main.F (1.99 KB)

test-hdf5.out.bz2 (151 KB)
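
The "Too many open files" message in the log above is the text of the POSIX EMFILE error, which a process receives once it reaches its per-process limit on open file descriptors. A minimal sketch, in Python for brevity, of how one might inspect that limit and check the arithmetic is given here. It assumes that each open/close cycle leaves one descriptor open per file, which the thread does not confirm. The numbers 1024 and 14 are assumptions chosen to match the report (1024 is a common Linux soft limit; neither value was measured on the XE6), and `count_open_fds` and `cycles_before_failure` are hypothetical helper names.

```python
import os
import resource


def count_open_fds(max_fd=4096):
    """Count descriptors currently open in this process by probing each number."""
    count = 0
    for fd in range(max_fd):
        try:
            os.fstat(fd)  # succeeds only if fd refers to an open file
        except OSError:
            continue
        count += 1
    return count


def cycles_before_failure(soft_limit, already_open, leaked_per_cycle):
    """Open/close cycles that succeed if each cycle leaks `leaked_per_cycle` descriptors."""
    return (soft_limit - already_open) // leaked_per_cycle


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    print("descriptors open now:", count_open_fds())
    # Hypothetical numbers chosen to match the report, not measured on the XE6.
    print(cycles_before_failure(1024, 14, 1))  # one file per cycle  -> 1010
    print(cycles_before_failure(1024, 14, 2))  # two files per cycle -> 505
```

If the value returned by `count_open_fds` grew with every open/close cycle of the real application, that would point to a descriptor not being released somewhere below the application code.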

Hi Stephane,

I could not replicate the problem you are seeing. I ran the sample program that you sent on an XE6 and it worked fine.

I am not aware of any limit on the number of times you can open/close a file (at least on the HDF5 side). I am not sure whether the filesystem you are working on imposes such a limit, but I don't see why it would.

Thanks,
Mohamad

···

On 02/05/2012 04:20 PM, Stéphane Backaert wrote:

[...]

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Stephane,

Make sure you point out that this only occurs when using a Lustre filesystem.

JB

Which XE6 is this? (which OS rev?)

-john

···

On Feb 7, 2012, at 1:39 AM, Biddiscombe, John A. wrote:

[...]

Stephane,

I just tried on an XK6 system using the GNU compilers and I do get the same issue. I tried with 1.8.8 (compiled myself) and 1.8.7 (from Cray). It seems to appear only when I write to the Lustre file system; it does not seem to occur otherwise.

         1002
         1003
         1004
         1005
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
   #000: /project/csvis/soumagne/apps/src/rosa/hdf5-vfd-1.8.8/source/src/H5F.c line 1522 in H5Fopen(): unable to open file
     major: File accessability
     minor: Unable to open file
   #001: /project/csvis/soumagne/apps/src/rosa/hdf5-vfd-1.8.8/source/src/H5F.c line 1211 in H5F_open(): unable to open file: time = Tue Feb 7 10:42:17 2012
, name = 'test.h5', tent_flags = 1
     major: File accessability
     minor: Unable to open file

regards,

Jerome

I tried this on Hopper on /scratch2, which is also Lustre, but it works fine for me.
I used the Cray-provided HDF5 (1.8.7).

Since I could not replicate the problem, did you try opening/closing the file using plain MPI-I/O:
     MPI_File_open()
     MPI_File_close()

If you get the same problem there, I would go one step further down and use plain POSIX I/O, just to get an idea of the layer at which the issue appears.

Thanks,
Mohamad
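
The lowest layer suggested above can be exercised with a loop like the following. This is only a sketch: it uses Python's `os.open` and `os.close`, which are thin wrappers around the POSIX calls, and a temporary file stands in for a file on the Lustre mount. `open_close_cycles` is a hypothetical helper name, and 2000 is an arbitrary cycle count chosen to exceed the 1010 reported.

```python
import os
import tempfile


def open_close_cycles(path, cycles):
    """Open and close `path` up to `cycles` times; return the number of cycles completed."""
    completed = 0
    for _ in range(cycles):
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError as err:
            print("open failed after", completed, "cycles:", err)
            break
        os.close(fd)  # a correct close releases the descriptor for reuse
        completed += 1
    return completed


if __name__ == "__main__":
    # A temporary file stands in for the file on the Lustre mount (an assumption).
    with tempfile.NamedTemporaryFile() as tmp:
        print(open_close_cycles(tmp.name, 2000))
```

To run the same check against the filesystem in question, one would pass a path on that mount instead of the temporary file. If this loop completes while the MPI-I/O loop fails, the problem lies between the two layers.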

···

On 02/07/2012 05:22 AM, Jerome Soumagne wrote:

[...]

I've just tried with MPI_File_open/MPI_File_close and I get a "Failure to get stripe info: Bad file descriptor" message after about 1012 iterations, so it is definitely not a problem with HDF5.

Jerome
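
A failure after a fixed number of cycles is what one would expect if some layer left one descriptor open per cycle, although the thread does not establish that this is the cause here. The sketch below reproduces that pattern in isolation, in Python: it lowers the soft descriptor limit of the current process (128 is an arbitrary value), deliberately leaks descriptors, and counts how many opens succeed before EMFILE ("Too many open files"). `leaky_cycles_until_emfile` is a hypothetical name.

```python
import errno
import os
import resource


def leaky_cycles_until_emfile(soft_limit=128):
    """Leak one descriptor per cycle under a lowered soft limit; return cycles before EMFILE."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never try to raise the limit: cap the requested value at the current soft limit.
    if old_soft != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    leaked = []
    try:
        # At most soft_limit descriptors can be open, so the loop ends by then.
        for _ in range(soft_limit + 1):
            try:
                leaked.append(os.open(os.devnull, os.O_RDONLY))  # never closed: the leak
            except OSError as err:
                if err.errno != errno.EMFILE:
                    raise
                break
        return len(leaked)
    finally:
        # Release the leaked descriptors and restore the original limit.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


if __name__ == "__main__":
    print(leaky_cycles_until_emfile())
```

The count comes out a little below the limit because some descriptors (standard input, output, and error at least) are already in use, which is the same shape as 1010 successes under a limit slightly above that.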

···

On 02/07/2012 03:38 PM, Mohamad Chaarawi wrote:

I tried this on hopper on /scratch2 which is Lustre also, but it works fine for me..
I used the Cray provided HDF5 (1.8.7)

Since I could not replicate the problem, did you try opening/closing the file using plain MPI-I/O:
    MPI_File_open()
    MPI_File_close()

If you get the same problem there, I would go one step down further using plain posix I/O, just to try and get an idea at what layer you see that issue.

Thanks,
Mohamad

On 02/07/2012 05:22 AM, Jerome Soumagne wrote:

Stephane,

I just tried on an XK6 system using gnu compilers and I do get the same issue. I tried with 1.8.8 (compiled myself) and 1.8.7 (from cray). It seems that it only appears when I write to the lustre file system... and it does not seem to occur otherwise.

        1002
        1003
        1004
        1005
HDF5-DIAG: Error detected in HDF5 (1.8.8) MPI-process 0:
  #000: /project/csvis/soumagne/apps/src/rosa/hdf5-vfd-1.8.8/source/src/H5F.c line 1522 in H5Fopen(): unable to open file
    major: File accessability
    minor: Unable to open file
  #001: /project/csvis/soumagne/apps/src/rosa/hdf5-vfd-1.8.8/source/src/H5F.c line 1211 in H5F_open(): unable to open file: time = Tue Feb 7 10:42:17 2012
, name = 'test.h5', tent_flags = 1
    major: File accessability
    minor: Unable to open file

regards,

Jerome

On 02/07/2012 10:39 AM, Biddiscombe, John A. wrote:

Sephane,

Make sure you point out that this only occurs when using a lustre filesystem.

JB

*From:*hdf-forum-bounces@hdfgroup.org [mailto:hdf-forum-bounces@hdfgroup.org] *On Behalf Of *St�phane Backaert
*Sent:* 05 February 2012 23:21
*To:* hdf-forum@hdfgroup.org
*Subject:* [Hdf-forum] can not open/close a file more than 1010 times


_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org


Just for information, this issue seems to have been introduced with Cray MPT versions >= 5.4.1.
It has been reported to Cray.

Jerome

On 02/07/2012 04:16 PM, Jerome Soumagne wrote:

I've just tried with a plain MPI_File_open/MPI_File_close loop and I get a "Failure to get stripe info: Bad file descriptor" message after about 1012 iterations, so that's definitely not a problem with HDF5.
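The numbers reported in this thread (a failure after roughly 1010 cycles, half that when two files are used, and "Too many open files" in the error stack) are consistent with one descriptor being leaked per open until a per-process limit is reached. That reading is an assumption, not something confirmed here. The sketch below, in Python with a hypothetical function name, lowers the soft RLIMIT_NOFILE of the current process to a small value and deliberately leaks descriptors, to show how the count at failure relates to the limit.

```python
import errno
import os
import resource
import tempfile


def leak_until_failure(path, soft_limit):
    """Open `path` without closing it until open() fails.

    Temporarily lowers the soft RLIMIT_NOFILE to `soft_limit`, then
    restores it. Returns (descriptors_opened, error_name).
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_hard)  # soft may not exceed hard
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    leaked = []
    error_name = None
    try:
        while error_name is None:
            try:
                # Deliberate leak: the descriptor is kept, never closed here.
                leaked.append(os.open(path, os.O_RDWR | os.O_CREAT, 0o644))
            except OSError as exc:
                error_name = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
    return len(leaked), error_name


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"current limits: soft={soft}, hard={hard}")
    with tempfile.TemporaryDirectory() as tmp:
        opened, err = leak_until_failure(os.path.join(tmp, "test.h5"), 64)
        print(f"opened {opened} descriptors before {err} (soft limit 64)")
```

The count comes out somewhat below the limit because stdin, stdout, stderr and any other descriptors already open in the process occupy slots. If the soft limit on the compute nodes is 1024 (`ulimit -n` shows it), a failure slightly above 1000 opens would fit that picture.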

Jerome

On 02/07/2012 03:38 PM, Mohamad Chaarawi wrote:

I tried this on Hopper, on /scratch2, which is also Lustre, and it works fine for me.
I used the Cray-provided HDF5 (1.8.7).

Since I could not replicate the problem, did you try opening/closing the file using plain MPI-I/O:
    MPI_File_open()
    MPI_File_close()

If you get the same problem there, I would go one step further down and use plain POSIX I/O, just to get an idea of the layer at which the issue appears.

Thanks,
Mohamad
