Is anyone aware of troubles with PHDF5 and IntelMPI? A test code that
reads an HDF5 file in parallel has trouble when scaling if I run it with
IntelMPI, but no trouble if I run it, for example, with POE.
I'm using Intel compilers 13.0.1, IntelMPI 4.1.3.049, and HDF5 1.8.10.
The code just reads an 800x800x800 HDF5 file, and the times I get for
reading it are:
The same code (compiled with the above modules), when submitted with
IBM's POE instead of IntelMPI, has no trouble with 1600 procs (in fact,
no trouble at all with up to 4096 procs) and it reads the file in
0.8963E+01 secs.
Any help appreciated,
--
Ángel de Vicente http://www.iac.es/galeria/angelv/
> Is anyone aware of troubles with PHDF5 and IntelMPI? A test code that
> reads an HDF5 file in parallel has trouble when scaling if I run it with
> IntelMPI, but no trouble if I run it, for example, with POE.
The Curie web site says "Global File System" and "Lustre", so I don't know which one you're using.
> The same code (compiled with the above modules), when submitted with
> IBM's POE instead of IntelMPI, has no trouble with 1600 procs (in fact,
> no trouble at all with up to 4096 procs) and it reads the file in
> 0.8963E+01 secs.
>> Is anyone aware of troubles with PHDF5 and IntelMPI? A test code that
>> reads an HDF5 file in parallel has trouble when scaling if I run it with
>> IntelMPI, but no trouble if I run it, for example, with POE.
>
> The Curie web site says "Global File System" and "Lustre", so I don't know which
> one you're using.
Thanks, but this issue is not happening on CURIE, but on MareNostrum,
which uses GPFS.
--
Ángel de Vicente http://www.iac.es/galeria/angelv/
Good to know. While Intel MPI does not include any GPFS optimizations, there's really only one optimization that matters for GPFS writes: aligning ROMIO file domains to file system block boundaries.
Set the MPI-IO hint "striping_unit" to the GPFS block size.
Setting MPI-IO hints through HDF5 requires property lists and some other gyrations. Here's a good example, except you would set different hints:
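A minimal sketch of the idea (not the original example): assuming Fortran, a hypothetical file name "data.h5", and an 8 MiB GPFS block size, the hint goes into an MPI_Info object that is attached to the file access property list and then passed to h5fopen_f:

program hint_example
  use mpi
  use hdf5
  implicit none

  integer        :: ierr, info
  integer(hid_t) :: fapl_id, file_id

  call MPI_Init(ierr)
  call h5open_f(ierr)

  ! ROMIO hints travel in an MPI_Info object; the block size here (8 MiB)
  ! is only an assumption, use your file system's actual block size
  call MPI_Info_create(info, ierr)
  call MPI_Info_set(info, "striping_unit", "8388608", ierr)

  ! File access property list: select MPI-IO and attach the info object
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierr)
  call h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, info, ierr)

  ! Open the file collectively with the hints in effect
  call h5fopen_f("data.h5", H5F_ACC_RDONLY_F, file_id, ierr, access_prp=fapl_id)

  call h5fclose_f(file_id, ierr)
  call h5pclose_f(fapl_id, ierr)
  call MPI_Info_free(info, ierr)
  call h5close_f(ierr)
  call MPI_Finalize(ierr)
end program hint_example

h5pset_fapl_mpio_f is what hands the info object (and with it the ROMIO hints) down to MPI_File_open, so any hint you want ROMIO to see has to be set before the file is opened.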
>>> Is anyone aware of troubles with PHDF5 and IntelMPI? A test code that
>>> reads an HDF5 file in parallel has trouble when scaling if I run it with
>>> IntelMPI, but no trouble if I run it, for example, with POE.
>>
>> The Curie web site says "Global File System" and "Lustre", so I don't know which
>> one you're using.
>
> Thanks, but this issue is not happening on CURIE, but on MareNostrum,
> which uses GPFS.
> Good to know. While Intel MPI does not include any GPFS optimizations, there's
> really only one optimization that matters for GPFS writes: aligning ROMIO file
> domains to file system block boundaries.
>
> Set the MPI-IO hint "striping_unit" to the GPFS block size.
But this problem is happening when reading a file, not writing it (in
any case, I have tried setting the striping_unit as well, but it made no
difference). So far I have no idea what is going on. ~1500 procs is
where the trouble begins, but the number of processors that breaks the
program is not fixed. I ran it successfully with 1515 processors, then it
failed with 1480...
Any pointers appreciated,
--
Ángel de Vicente http://www.iac.es/galeria/angelv/
>> Set the MPI-IO hint "striping_unit" to the GPFS block size.
>
> But this problem is happening when reading a file, not writing it
Ah, it's right there in the subject. Sorry about that.
> (in
> any case, I have tried setting the striping_unit as well, but it made no
> difference). So far I have no idea what is going on. ~1500 procs is
> where the trouble begins, but the number of processors that breaks the
> program is not fixed. I ran it successfully with 1515 processors, then it
> failed with 1480...
I suppose all one can do is get a backtrace from a few processes (by, for example, attaching to a hung process with gdb) and see if you are stuck in communication, or if the processes are making very many teeny-tiny read operations (so not stuck, but performing I/O so poorly as to be making imperceptible progress).
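For instance, something along these lines (just a sketch; <PID> stands for the process ID of one hung rank on a compute node you can log in to):

# attach to one hung rank, print backtraces for all of its threads,
# then exit (gdb detaches automatically on exit)
gdb --batch -p <PID> -ex "thread apply all bt"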
>> (in
>> any case, I have tried setting the striping_unit as well, but it made no
>> difference). So far I have no idea what is going on. ~1500 procs is
>> where the trouble begins, but the number of processors that breaks the
>> program is not fixed. I ran it successfully with 1515 processors, then it
>> failed with 1480...
>
> I suppose all one can do is get a backtrace from a few processes (by, for
> example, attaching to a hung process with gdb) and see if you are stuck in
> communication, or if the processes are making very many teeny-tiny read
> operations (so not stuck, but performing I/O so poorly as to be making
> imperceptible progress).
I will try to attach to some process and see if I can get somewhere, but
the issue definitely seems to be a communication one: I changed the program so
that no actual reading is done, just opening the file and closing it,
and it still hangs at the h5fopen_f call, so for some reason the file
cannot even be opened when I go beyond ~1500 procs...
Thanks,
--
Ángel de Vicente http://www.iac.es/galeria/angelv/