HDF5 library hangs in call to H5Dwrite

Hi,

I have (yet) another problem with the HDF5 library. I am trying to write some data in parallel to a file, where each process writes its data to its own dataset. The datasets are first created (as collective operations), and then H5Dwrite hangs when the data are to be written. No error messages are printed; the processes just hang. I have used GDB on the hanging processes (all of them) and confirmed that it is indeed H5Dwrite that hangs.

The strange thing is that this does not always happen; sometimes it works fine. To make it even stranger, it seems that the probability of failure increases with problem size and the number of processes (or is that really strange?). These writes are in a time loop, and sometimes a few steps finish before one write hangs.

I have also found that if I set the transfer mode to H5FD_MPIO_INDEPENDENT, everything seems to work fine.
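For reference, this is roughly how the transfer mode is selected (a minimal sketch, not taken from example.C; the identifiers are illustrative):

```c
/* Sketch only: choosing the MPI-I/O transfer mode via a dataset transfer
 * property list. dset, memspace, filespace and data are placeholders,
 * not names from example.C. */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

/* Collective transfer -- the mode that hangs for me: */
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
/* Independent transfer -- the mode that works:
 * H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT); */

H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);
H5Pclose(dxpl);
```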

I have tried this on two computers, one workstation and one cluster. The workstation uses OpenMPI with HDF5 1.8.4, and the cluster uses SGI's MPT-MPI with HDF5 1.8.7. Given the completely different MPI packages and systems, I think MPI and other system issues can be ruled out. The remaining sources of error are then my code (probably) and HDF5 (not so sure about that).

I have attached an example code that shows how I am doing the HDF5 part. Unfortunately it is not runnable on its own, but at least you can see how I create and write to the datasets.

Thanks in advance for all help.

Best regards,
Håkon Strandenes

example.C (2.06 KB)

I think the problem may be that you are trying to execute a collective
write to separate datasets. That would explain why collective hangs and
independent succeeds.

I am a bit rusty on HDF5's parallel I/O semantics, but AFAIK a collective
write can only be done to the same dataset. That does NOT mean each
processor has to have an identical dsetID (e.g.
memcmp(&proc_1_dsetID, &proc_2_dsetID, sizeof(hid_t)) may be nonzero),
but it does mean that the dataset object in the file that each
processor's dsetID refers to has to be the same. In other words, the
name (or path) of the dataset used in the create/open call needs to have
been the same.

To issue writes to different datasets simultaneously in parallel, I
think your only option is independent.

I wonder if you're aiming to do collective writes to different datasets
because you expect that collective I/O will be more easily 'coordinated'
by the underlying filesystem and therefore has a better chance of higher
performance than independent I/O. If so, I don't know whether that very
often turns out to be true or even possible in practice.

I hope others with a little more parallel I/O experience might chime
in :wink:

Mark

--
Mark C. Miller, Lawrence Livermore National Laboratory

Yes, Mark is correct. Your program is erroneous.
The current interface for reading and writing datasets collectively requires all processes to take part in every read/write call. You can correct your program by having every process participate in each read/write call, using a NULL (empty) selection for every dataset except the one that belongs to that process, or just use independent I/O.
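Something along these lines (only a rough sketch; nprocs, my_rank, dset[], n_local and local_data are placeholders for whatever your code uses):

```c
/* Sketch of the empty-selection workaround: every rank takes part in the
 * collective H5Dwrite for every dataset, but only the owning rank selects
 * any elements. */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

for (int i = 0; i < nprocs; i++) {
    hid_t   fspace = H5Dget_space(dset[i]);
    hsize_t nloc   = n_local;
    hid_t   mspace = H5Screate_simple(1, &nloc, NULL);

    if (i != my_rank) {
        /* Not my dataset: participate with an empty selection,
         * so nothing is actually transferred from this rank. */
        H5Sselect_none(fspace);
        H5Sselect_none(mspace);
    }
    H5Dwrite(dset[i], H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, local_data);

    H5Sclose(mspace);
    H5Sclose(fspace);
}
H5Pclose(dxpl);
```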

We are working on a new interface that would allow collective access to multiple datasets simultaneously, so stay tuned :slight_smile:

Thanks,
Mohamad


Thanks, I had a suspicion about that. Some more problems have appeared (H5Dcreate2 freezes/hangs after a few writes), but I will try to debug some more before I ask you...

Anyway, I have a few questions about how HDF5 does the writes. When I now do independent I/O (each process writes to its own dataset), will the actual data transfer happen in parallel, assuming the underlying file system is parallel (Lustre)? The reason I ask is of course that I have a large parallel simulation (with thousands of processes), and if each rank has to wait for the lower ranks to finish (i.e. rank 4 must wait for rank 3, rank 3 must wait for rank 2, etc.), the I/O operations could take a tremendous amount of time.

Since the data is domain decomposed, I also thought it would be easiest to write each domain to its own dataset, instead of trying to "stitch" the domains together before writing (which would require quite a bit of communication and CPU cycles).
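For comparison, here is roughly what I understand the "stitched" alternative would look like, with one shared dataset and a hyperslab selection per rank (a minimal sketch; a 1-D decomposition and the names file, nprocs, my_rank, n_local and local_data are just placeholders):

```c
/* Sketch only: one shared dataset, each rank writes its block as a
 * hyperslab. A 1-D decomposition with equal block sizes is assumed. */
hsize_t global[1] = { (hsize_t)nprocs * n_local };
hsize_t local[1]  = { n_local };
hsize_t offset[1] = { (hsize_t)my_rank * n_local };

hid_t fspace = H5Screate_simple(1, global, NULL);
hid_t dset   = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, fspace,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

hid_t mspace = H5Screate_simple(1, local, NULL);
H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, NULL, local, NULL);

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* all ranks target the same dataset */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, local_data);

H5Pclose(dxpl);
H5Sclose(mspace);
H5Sclose(fspace);
H5Dclose(dset);
```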

Håkon


Hi Håkon,

Thanks, I had a suspicion about that. Some more problems have appeared (H5Dcreate2 freezes/hangs after a few writes), but I will try to debug some more before I ask you...

Anyway, I have a few questions about how HDF5 does the writes. When I now do independent I/O (each process writes to its own dataset), will the actual data transfer happen in parallel, assuming the underlying file system is parallel (Lustre)?

Yes, independent transfer just translates to independent MPI-I/O read/write operations, so if your file system supports parallel file access, the HDF5 operations will occur in parallel. There is of course the Lustre issue that accesses to the same OST are serialized, but if your datasets are large enough it should not be a huge problem.

The reason I ask is of course that I have a large parallel simulation (with thousands of processes), and if each rank has to wait for the lower ranks to finish (i.e. rank 4 must wait for rank 3, rank 3 must wait for rank 2, etc.), the I/O operations could take a tremendous amount of time.

Since the data is domain decomposed, I also thought it would be easiest to write each domain to its own dataset, instead of trying to "stitch" the domains together before writing (which would require quite a bit of communication and CPU cycles).

Yes, this is one use case we are considering supporting better in terms of non-collective metadata access (so you don't have to call H5Dcreate n times). Another ongoing (but separate) piece of work is what I mentioned earlier: H5Dread/write_multi, which lets you access several datasets in one call, either collectively or independently.

Thanks,
Mohamad


Thanks for that. Everything is much clearer now.

As I said in my previous post, fixing this issue immediately caused another issue to appear (we are still in the same code). However, whereas the previous issue appeared only after a few steps and was quite irregular, this one at least occurs deterministically.

The problem is much like the previous one, except that it is now a call to H5Dcreate2 that hangs. I have tried a few problem sizes (in number of cells), and it seems that the crucial factor is the number of processes. I have made these tests:

1: 16 processes successfully write the mesh + 100 time steps, a total of approximately 12 GB. No issues.

2: 32 processes write the mesh, then hang when creating the datasets for time step no. 4 (same problem size as above).

3: 64 processes write the mesh, then hang when creating the datasets for time step no. 1 (same problem size as above).

Above 64 processes, the code hangs already while the mesh is being written (the mesh datasets are created in the same way as the field-data datasets).

Again, no error messages appear. If I attach GDB to the hanging processes, I find that they stop while creating exactly the same dataset each time.

I have double- and triple-checked that I close all resources after use, and I found no errors.

Does any of you have any idea about what the error could be?

Thanks,
Håkon



Just to make sure: are you creating a dataset (with a different name) for each process? Are you calling H5Dcreate collectively, on all processes, for every dataset created? And are you issuing those calls from all processes in the same order?
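In other words, the creation phase should look roughly like this (only a sketch; the group id grp, the dataset names and the per-rank sizes are placeholders for whatever your code uses):

```c
/* Sketch: every process creates every dataset, in the same order,
 * including the datasets that belong to other ranks.
 * grp, nprocs, my_rank, dset_size[] and my_dset are placeholders. */
for (int i = 0; i < nprocs; i++) {
    char name[32];
    snprintf(name, sizeof(name), "proc_%03d", i);

    hsize_t dims[1] = { dset_size[i] };   /* size of rank i's dataset */
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(grp, name, H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    if (i == my_rank)
        my_dset = dset;     /* keep the dataset this rank will write to */
    else
        H5Dclose(dset);     /* the others can be closed right away */
    H5Sclose(space);
}
```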

If you are doing all of that, then I cannot tell what the problem is unless you can send me a small program that I can run to replicate the issue.

Thanks,
Mohamad


I found the problem. The logic that should trigger the writes would, after a while, fail on some processes (i.e. some processes tried to do a write while others tried to continue the time loop). I changed the logic that controls this from floating-point to integer math, and everything works great now.
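Roughly speaking, the broken check and the fix looked something like this (simplified; the actual variable names in my code are different):

```c
/* Simplified illustration only -- not the actual code.
 * With a floating-point test, round-off can make the comparison true on
 * some ranks and false on others, so they disagree on whether to write: */
if (fmod(time, write_interval) < dt)   /* old, fragile check */
    write_fields();

/* With integer step counters, every rank makes the same decision: */
if (step % write_every == 0)           /* new, robust check */
    write_fields();
```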

This was an embarrassing failure, sorry for bothering you.

Thanks,
Håkon
