How to include parallelization in the program

Good morning everyone,

I’ve recently started using parallel HDF5 at my company, as we want to save analysed data to multiple files at a time. It would be an N:N case, with one output stream per file.
The main program itself is written in C#, but we already have an API that lets us call HDF5 and MPI from C and C++. The program retrieves data from an external device, runs some analysis and then saves the data, and parallelizing these three stages would speed up the process. However, I’m not quite sure how to implement such parallelization for the third stage:
So far I’ve seen that parallelization is usually set up right off the bat: the program is started with mpiexec (I’m on Windows) with a specified number of processes, like “mpiexec -n x Program.exe”. Unfortunately, running multiple instances of the whole program in parallel would be problematic, but I’ve seen that one should be able to spawn processes later at runtime with MPI_Comm_spawn(), pointing it at a target executable (provided that the “main” process, the program itself, has been started with “mpiexec -n 1 Program.exe”, for example).
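If I’ve understood the docs correctly, a minimal C sketch of that spawn would look something like this (“Writer.exe” and the worker count are just placeholders for our case):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* parent started as: mpiexec -n 1 Program.exe */

        MPI_Comm intercomm;
        int errcodes[4];

        /* Spawn 4 worker processes at runtime from a separate executable.
           "Writer.exe" is a placeholder for whatever the workers are. */
        MPI_Comm_spawn("Writer.exe", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0 /* root rank */, MPI_COMM_SELF, &intercomm, errcodes);

        /* ... communicate with the workers over intercomm ... */

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }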
This second method could work for us, but I was wondering if there is a more elegant way to achieve parallel output writing, like calling a function from my own program instead of launching a separate executable.

Bonus question, just to make sure I’ve got the basics of PHDF5 right in the first place: do I need a separate process for each action that I want to perform in parallel, be it writing N streams to N files or writing N streams to a single file?

Thank you in advance

Stefano


Hello,

The requirements for PHDF5 are listed here:
<https://support.hdfgroup.org/HDF5/PHDF5/>. It may be a good idea to check
whether you actually get a speed-up from a parallel file system + PHDF5 setup.

In my interpretation, PHDF5 pays off when you have a fully parallel system
backed by a parallel file system capable of handling parallel I/O; large
supercomputing (batch) environments are like that. At the other end of the
spectrum you have a single computer with a single drive and multiple cores;
AWS EC2 instances without local HDD are like that.

In the latter case, using PHDF5 pulls in extra code and some restrictions
(no filters, ...), because at some choke point there must be a mechanism to
serialise all the READ/WRITE operations. If you have that setup, a separate
writer process plus a reliable software fabric (i.e. ZeroMQ + protocol
buffers, or a similar queue) gets you the result.
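A minimal sketch of that single-writer idea, using ZeroMQ and serial HDF5
(the endpoint, buffer size and dataset names are made up, and each message
is assumed to fit in the buffer):

    #include <zmq.h>
    #include <hdf5.h>
    #include <stdio.h>

    /* Dedicated writer process: everything funnels through this one
       process, so the HDF5 calls are naturally serialised. */
    int main(void)
    {
        void *ctx  = zmq_ctx_new();
        void *pull = zmq_socket(ctx, ZMQ_PULL);
        zmq_bind(pull, "tcp://*:5555");          /* producers connect here */

        hid_t file = H5Fcreate("collected.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);

        double buf[1024];
        for (int i = 0; ; ++i) {
            /* one data block per message; assumed to fit in buf */
            int n = zmq_recv(pull, buf, sizeof(buf), 0);
            if (n <= 0) break;

            hsize_t dims[1] = { (hsize_t)(n / sizeof(double)) };
            hid_t space = H5Screate_simple(1, dims, NULL);

            char name[32];
            snprintf(name, sizeof(name), "block_%06d", i);
            hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_DOUBLE, space,
                                    H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
            H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                     H5P_DEFAULT, buf);
            H5Dclose(dset);
            H5Sclose(space);
        }

        H5Fclose(file);
        zmq_close(pull);
        zmq_ctx_destroy(ctx);
        return 0;
    }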
There is also another approach: write into separate files, on a local
in-memory file system, and then either

1) copy all the files into one single HDF5 container, or
2) use a separate HDF5 file with external links to stitch the files into a
single image (see the sketch after this list).

The copy/collect version works on batch systems if your 'collector' script
is scheduled to run after the MPI job.
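For 2), a minimal sketch using H5Lcreate_external (the worker file and
dataset names are assumed for the example):

    #include <hdf5.h>
    #include <stdio.h>

    /* Build a small "master" file whose links point into the N worker
       files, so readers see a single logical image. */
    int main(void)
    {
        hid_t master = H5Fcreate("master.h5", H5F_ACC_TRUNC,
                                 H5P_DEFAULT, H5P_DEFAULT);

        for (int i = 0; i < 4; ++i) {
            char target[64], link[64];
            snprintf(target, sizeof(target), "worker_%d.h5", i);
            snprintf(link, sizeof(link), "stream_%d", i);

            /* /stream_i in master.h5 now resolves to /data in worker_i.h5 */
            H5Lcreate_external(target, "/data", master, link,
                               H5P_DEFAULT, H5P_DEFAULT);
        }

        H5Fclose(master);
        return 0;
    }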

Of course, if you have a true parallel environment, you should indeed
benefit from parallel I/O.

best,
steve


Hi Stefano,

I am not sure I understand your question or the problem(s) you are hoping to solve.

You mention the “N:N case”, so I am assuming you are talking about N processes writing to N files. You don’t need parallel HDF5 to do that. You can use serial HDF5 because each stream is a wholly independent file.

The only situation in which you *need* parallel HDF5 is when you want multiple MPI processes (parts of a distributed parallel executable) to write to the *same* file concurrently. Then their work on the file has to be coordinated (e.g. creation of HDF5 objects), and their I/O to read/write data from/to objects in the file can be done either collectively (coordinated) or independently. But it sounds like MPI parallelism is not really what you are looking for, and that is especially true if you only want the N:N case.
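Just for context, that MPI case looks something like the following sketch: every rank collectively creates one shared file via the MPI-IO file driver and writes its own hyperslab (this assumes an HDF5 build with parallel support; the names and sizes here are arbitrary):

    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* All ranks open the same file collectively via MPI-IO. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One row per rank; object creation is a coordinated operation. */
        hsize_t dims[2] = { (hsize_t)size, 1024 };
        hid_t fspace = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, fspace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Each rank selects and writes its own row, collectively. */
        hsize_t start[2] = { (hsize_t)rank, 0 }, count[2] = { 1, 1024 };
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(2, count, NULL);

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        double row[1024];
        for (int i = 0; i < 1024; ++i) row[i] = (double)rank;
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, row);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }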

Now, if what you *really* want is multiple different OS processes (or maybe threads within a single process) to be able to write concurrently to a single file, then there are not really many options for you *without* taking on some degree of responsibility for coordinating those processes *yourself*. HDF5 will not do much to help you here. The support in HDF5 for calling it from multiple threads within the same executable is pretty limited: the locking is very coarse-grained and in all likelihood winds up serializing the threads. And there is nothing HDF5 itself (nor any other I/O library, for that matter) can do if you want multiple different OS processes to write to the same file without doing a lot of the *work* yourself to coordinate them.

Finally, your note gives me the impression that maybe what you are looking for is a set of processes whose number grows (and maybe shrinks) over the life of the application, where each process needs to write some data. If that is your ultimate goal, I think there are various ways you could implement it, both with and without MPI and parallel HDF5.

For example, if you went with MPI and had a loose upper bound on the total number of processes you needed, you could mpiexec that number but idle/sleep all those that don’t need to be running at a particular time. You could dynamically create MPI communicators that represent the current number of tasks you need, and open and use a *single* HDF5 file on that communicator, shared among those processes. Then, if you need to change the number of tasks, you would close the HDF5 file, close the communicator, create a new communicator on a different number of tasks, and re-open the file with that new communicator. There is a lot involved there, but I think it could be made to work. That is *only* worthwhile, though, if you want a single file that is routinely written to by a varying number of tasks.

If you really just want N files from N tasks, where N varies with time, then why not just use the OS to spawn I/O tasks, with each task opening a uniquely named HDF5 file, perhaps named by an internal task id or something? (A sketch follows below.)
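That last option needs nothing beyond serial HDF5; each spawned task might do little more than this sketch (the id is assumed to arrive as a command-line argument):

    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Run one of these per spawned I/O task; no MPI, no PHDF5 needed.
       The task id comes from however the parent spawns the task
       (assumed here to arrive as argv[1]). */
    int main(int argc, char **argv)
    {
        int task_id = (argc > 1) ? atoi(argv[1]) : 0;

        char name[64];
        snprintf(name, sizeof(name), "stream_%d.h5", task_id);

        hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* ... create datasets and write this task's stream with the
           ordinary serial API ... */

        H5Fclose(file);
        return 0;
    }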

It might also be worth having a look at this HDF5 blog post….

Not sure any of this is helpful but I thought I would mention some ideas.

Good luck.

Mark

"Hdf-forum on behalf of Stefano Salvadè" wrote:

Good morning everyone,

I’ve recently started using parallel HDF5 for my company, as we wish to save analysed data on multiple files at a time. It would be an N:N case, with an output stream for each file.
The main program itself is written in C#, but we already have an API that allows us to make calls to hdf5 and MPI in C and C++. It retrieves data from an external device, executes some analysis and then saves the data, and parallelizing these three parts would speed up the process. However i’m not quite sure how to implement such parallelization on the third bit:
So far i’ve seen that parallelization is usually implemented right off the bat: the program is started with mpiexec (i’m on Windows), with a specified number of processes. (like “mpiexec -n x Program.exe). Unfortunately running multiple instances of the whole program in parallel would be problematic, but i’ve seen that one should be able to spawn processes later during runtime with MPI_Spawn(), indicating an executable as a target (provided that the “main” process, the program itself, has been started with “mpiexec -n 1 Program.exe” for example).
This second method could do it for us, but I was wondering if there is a more elegant way to achieve parallel output writing, like calling a function from my own program instead of an executable.

Bonus question, just to make sure i’ve got the basics of PHDF5 right in the first place: I do need to have a process for each parallel action that I want to perform in parallel, be it writing N streams to N files, or writing N streams to a single file?

Thank you in advance

Stefano

···

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

First of all, thank you all for your replies; they were very helpful!

We actually started our implementation attempt right from that blog article you linked! However, as we gained more and more insight into how PHDF5 and MPI actually work, we kept changing the plan until it became somewhat confused...
As both of you pointed out, we should just use OS process management, since we ultimately want an N:N setup.
Just one small question for you, Mark: at the end of your reply, when you talk about OS-spawned I/O tasks, do you mean actual tasks/threads, or processes? Because as far as I (believe to) know, if I want to open and interact with two different HDF5 files at the very same time (not sequentially), I need to do it in two different processes. Or can I actually use tasks, as long as each file is opened inside them?

Thank you again,

Stefano


Yes, that’s correct…processes. Two threads in the same executable wind up using the HDF5 library’s current thread-safety locking semantics, which are too coarse-grained to permit any concurrency. So you need to use full processes *and* write to different HDF5 files.

Mark

"Hdf-forum on behalf of Stefano Salvadè" wrote:

Just one small question for you, Mark: at the end of your reply, when you talk about OS spawned I/O tasks, you mean actual tasks/threads, or processes? Because as far as I (believe to) know, if I want to open and interact with two different hdf5 files at the very same time (not sequentially), I need to do it on two different processes. Or can I actually use tasks, as long as each file is opened inside them?