Hi Stefano,
I am not sure I understand your question or the problem(s) you are hoping to solve.
You mention the “N:N case”, so I am assuming you mean N processes writing to N files. You don’t need parallel HDF5 to do that. You can use serial HDF5, because each stream is a wholly independent file.
The only situation in which you *need* parallel HDF5 is when you want multiple MPI processes (parts of a distributed parallel executable) to write to the *same* file concurrently. Then their work on the file has to be coordinated (e.g. creation of HDF5 objects), and their I/O reading/writing data from/to objects in the file can be done either collectively (coordinated) or independently. But it sounds like MPI parallelism is not really what you are looking for, and that is especially true if you only want the N:N case.
Now, if what you *really* want is multiple different OS processes (or maybe threads within a single process) to be able to write concurrently to a single file, then there are not really many options *without* taking some degree of responsibility for coordinating those processes *yourself*. HDF5 will not do much to help you here. The support in HDF5 for calling it from multiple threads within the same executable is pretty limited; the locking is very coarse-grained and in all likelihood winds up serializing the threads. And there is nothing HDF5 itself (nor any other I/O library, for that matter) can do if you want multiple different OS processes to write to the same file without your doing a lot of the coordination *work* yourself.
Finally, your note gives me the impression that maybe what you are looking for is a set of processes whose number grows (and maybe shrinks) over the life of the application, where each process needs to write some data. If that is your ultimate goal, I think there are various ways you could try to implement it, both with and without MPI and parallel HDF5. For example, if you went with MPI and had a loose upper bound on the total number of processes you needed, you could mpiexec that number but then idle/sleep all those that don’t need to be running at a particular time. You could dynamically create MPI communicators that represent the current number of tasks you need, and you could open and use a *single* HDF5 file on that communicator, shared among those processes. Then, if you need to change the number of tasks, you would close the HDF5 file, free the communicator, create a new communicator over a different number of tasks, and re-open the file with that new communicator. There is a lot involved there, but I think it could be made to work. But that is *only* if you want a single file that is routinely being written to by a varying number of tasks. If you really just want N files from N tasks and N varies with time, then why not just use the OS to spawn I/O tasks, with each task opening a uniquely named HDF5 file, perhaps named by an internal task id or something?
Also, it might be worth having a look at this HDF5 blog post…
Not sure any of this is helpful, but I thought I would mention some ideas.
Good luck.
Mark
"Hdf-forum on behalf of Stefano Salvadè" wrote:
Good morning everyone,
I’ve recently started using parallel HDF5 for my company, as we wish to save analysed data on multiple files at a time. It would be an N:N case, with an output stream for each file.
The main program itself is written in C#, but we already have an API that allows us to make calls to HDF5 and MPI in C and C++. It retrieves data from an external device, executes some analysis, and then saves the data; parallelizing these three parts would speed up the process. However, I’m not quite sure how to implement such parallelization for the third part:
So far I’ve seen that parallelization is usually set up right off the bat: the program is started with mpiexec (I’m on Windows) with a specified number of processes (e.g. “mpiexec -n x Program.exe”). Unfortunately, running multiple instances of the whole program in parallel would be problematic, but I’ve seen that one should be able to spawn processes later at runtime with MPI_Comm_spawn(), indicating an executable as a target (provided that the “main” process, the program itself, has been started with “mpiexec -n 1 Program.exe”, for example).
This second method could work for us, but I was wondering whether there is a more elegant way to achieve parallel output writing, such as calling a function from my own program instead of launching a separate executable.
Bonus question, just to make sure I’ve got the basics of PHDF5 right in the first place: do I need a separate process for each action I want to perform in parallel, whether that is writing N streams to N files or writing N streams to a single file?
Thank you in advance
Stefano