HDF5: parallel writing to a single file in different groups from different Python programs

We have started using HDF5 files for saving our data.
Data is received from several Python programs; each program runs on different hardware, but all machines are connected over an Ethernet network.
We want to write all the received data into a single HDF5 file, creating a separate, independent group for each Python program.
We are using the mpi4py package for this purpose.
The problem is that we are unable to write the data in parallel: only the group that was created first/last is able to write to the file.
Is it possible to write to a single .h5 file from different Python programs?
If so, how should we do it? Can you share an example of how to achieve this?

  Software versions used:
   Ubuntu : 16.04
   Python : 3.5
   MPI : mpi4py (is there any alternative other than MPI?)
   HDF5 : 1.8.20 (built with --enable-parallel, --enable-shared, --enable-threadsafe; did we miss any?)
   h5py : 2.7.1
Is parallel writing to a single file possible?

We need to append data from multiple Python programs to an .h5 file.
Please suggest the best way to do it.

Hello Nagendar,

It looks like you need multiple independent writers to one file, which is something HDF5 doesn't support yet.

If you use MPI, you will need to have one Python program and use the HDF5 parallel programming model (see the HDF5 Parallel Tutorial) to write your data.

Thank you!

Elena

Dear Elena,
Thanks for your reply.
Can you share an example program for parallel writing of an .h5 file?

Dear Elena,

Can you provide a sample program in Python?

Thank you…

Hi Elena,
Thanks for your reply.
I have tried the example you mentioned; below is the problem I faced when a single Python program tries to access an .h5 file using multi-threading.
The dataset in the first group is updated properly, but the datasets in the second and third groups are corrupted.
Can you share example code for parallel writing to an .h5 file from a single Python program?

Dear Elena,

Is it possible to create the datasets dynamically?
In my case the number of groups is not fixed; it will change every time. The number of datasets created in each group is also not fixed; datasets are created based on time and data reception.
So we have to create the datasets dynamically and write data into them. Is that possible?
Please help.
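
(For reference, h5py does let you create groups and datasets on the fly, at the moment data arrives; they do not have to be declared up front. A minimal serial sketch, with made-up group/dataset names and payloads:)

```python
import h5py
import numpy as np

# Sketch only: names and payloads are invented. Each incoming message gets a
# group (created on demand) and a new dataset within that group.
incoming = {'program_1': np.arange(5), 'program_2': np.arange(3)}  # simulated payloads

with h5py.File('dynamic.h5', 'a') as f:
    for source, payload in incoming.items():
        grp = f.require_group(source)             # creates the group only if it is absent
        name = 'dataset_{}'.format(len(grp))      # next free index within this group
        grp.create_dataset(name, data=payload)
```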

Dear Nagendar,

Unfortunately, I don’t have any examples of MPI Python code. Maybe someone on this list could provide you with an example, or perhaps you could post your question on the h5py mailing list?

Sorry!

Elena

Hi Nagendar,

I am not sure how multi-threaded Python works. Could you please share your program? If you were using a multi-threaded C program and a thread-safe build of the HDF5 library, you should be able to do what you are doing. Once again, maybe people on the h5py mailing list will be more helpful. Sorry!

Elena
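
One pattern that matches what is described above, if you stay with a single Python process, is to let many threads produce data but have only one thread call into h5py, so the HDF5 library is never entered concurrently. A hedged sketch (all names and payloads made up):

```python
import queue
import threading
import numpy as np
import h5py

# Sketch: several producer threads, but only ONE writer thread ever touches
# the HDF5 file, so no thread-safety in the h5py layer is relied upon.
work_q = queue.Queue()
STOP = object()  # sentinel telling the writer to finish

def producer(group_name, n_items):
    for _ in range(n_items):
        work_q.put((group_name, np.random.rand(10)))  # simulate received data

def writer(path):
    counters = {}
    with h5py.File(path, 'w') as f:
        while True:
            item = work_q.get()
            if item is STOP:
                break
            group_name, data = item
            grp = f.require_group(group_name)                       # group created on demand
            idx = counters.get(group_name, 0)
            grp.create_dataset('sample_{}'.format(idx), data=data)  # one dataset per message
            counters[group_name] = idx + 1

producers = [threading.Thread(target=producer, args=('program_{}'.format(i), 5))
             for i in range(3)]
writer_thread = threading.Thread(target=writer, args=('threaded.h5',))

writer_thread.start()
for t in producers:
    t.start()
for t in producers:
    t.join()
work_q.put(STOP)       # producers are done; let the writer drain the queue and exit
writer_thread.join()
```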

I believe the example you are looking for, in Python using MPI, can be found here:
http://docs.h5py.org/en/latest/mpi.html
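
The example on that page opens one shared file from every MPI rank with the mpio driver and has each rank write its own slice. A slightly expanded sketch along the same lines, with one group per rank (group/dataset names are made up, and this requires h5py built against a parallel, MPI-enabled HDF5):

```python
# Run with, e.g.:  mpiexec -n 4 python demo_mpi.py
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# One shared file, opened collectively by every rank through the MPI-IO driver.
f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=comm)

# Metadata operations (creating groups/datasets) are collective:
# every rank must make the same calls in the same order.
for r in range(size):
    f.create_group('program_{}'.format(r)).create_dataset('values', (10,), dtype='f8')

# Raw data writes can be independent: each rank fills only its own group.
f['program_{}/values'.format(rank)][:] = float(rank)

f.close()
```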

The Python Global Interpreter Lock, or GIL, is in simple terms a mutex (a lock) that allows only one thread to hold control of the Python interpreter. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads. What the GIL prevents, then, is using more than one CPU core, or separate CPUs, to run threads in parallel.
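
A toy (non-HDF5) demonstration of that behaviour, if it helps: on CPython, running a CPU-bound loop in two threads is typically no faster than running it twice sequentially.

```python
import threading
import time

# CPU-bound busy loop: threads gain nothing here because the GIL lets only
# one thread execute Python bytecode at a time.
def count(n):
    while n > 0:
        n -= 1

N = 10 * 1000 * 1000

start = time.perf_counter()
count(N)
count(N)
print('sequential :', time.perf_counter() - start)

start = time.perf_counter()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print('two threads:', time.perf_counter() - start)
```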

Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck rather than the Python code. It is not suitable for parallelizing computationally intensive Python code: because of the GIL, Python threads are interleaved but effectively executed serially, so they only help when I/O-bound tasks can overlap. For real parallelism you should use the multiprocessing module, which forks multiple processes that run truly in parallel, or delegate the heavy work to a dedicated external library.
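
As an illustration of that split, here is a hedged sketch (function and file names are invented) in which the CPU-heavy work runs in a pool of worker processes, while only the parent process writes the collected results to a single HDF5 file, so no parallel HDF5 build is needed:

```python
import multiprocessing as mp
import numpy as np
import h5py

def heavy_compute(seed):
    # Stand-in for an expensive, CPU-bound computation.
    rng = np.random.RandomState(seed)
    data = rng.rand(100000)
    return seed, float(data.sum())

if __name__ == '__main__':
    # The worker processes bypass the GIL; each can run on its own core.
    with mp.Pool(processes=4) as pool:
        results = pool.map(heavy_compute, range(8))

    # Only the parent process touches the HDF5 file.
    with h5py.File('results.h5', 'w') as f:
        for seed, value in results:
            f.create_dataset('job_{}'.format(seed), data=value)
```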