How to use phdf5 in Spark

fusy · August 23, 2023, 10:45am

The parallel computing framework I am currently using is Spark. However, I have learned that parallel HDF5 file reading is typically implemented using MPI. I would like to inquire if there is a method or technique available that would allow me to achieve parallel HDF5 file reading functionality within Spark.
Thank you in advance for any assistance or guidance you can provide.

gheber · August 23, 2023, 11:39am

Where is your data stored?

You can have Spark tasks reading the same or different HDF5 files independently using the serial HDF5 library. Assuming you are using Java, HDFql might be a good choice. And there are other fine choices, such as jHDF and JavaCPP.
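For illustration, here is a minimal Python sketch of that idea, assuming h5py (built on the serial HDF5 library) and PySpark are installed on every executor. The function names, the row-wise split, and the partial-sum reduction are made up for this example; it also assumes the dataset is sliced along its first dimension and that every executor can see the file at the same path.

```python
def row_ranges(n_rows, n_tasks):
    """Split [0, n_rows) into at most n_tasks contiguous (start, stop) ranges."""
    base, extra = divmod(n_rows, n_tasks)
    ranges, start = [], 0
    for i in range(n_tasks):
        # The first `extra` ranges get one additional row.
        stop = start + base + (1 if i < extra else 0)
        if stop > start:  # skip empty ranges when n_tasks > n_rows
            ranges.append((start, stop))
        start = stop
    return ranges


def read_rows(path, dataset, start, stop):
    """Runs inside a Spark task: open the file read-only, read one row range."""
    import h5py  # imported lazily so only the executors need it

    with h5py.File(path, "r") as f:
        return f[dataset][start:stop]  # NumPy array holding only this slice


def parallel_sum(spark, path, dataset, n_rows, n_tasks):
    """Driver side: one Spark task per row range, each returning a partial sum."""
    ranges = row_ranges(n_rows, n_tasks)
    rdd = spark.sparkContext.parallelize(ranges, len(ranges))
    partial = rdd.map(
        lambda r: float(read_rows(path, dataset, r[0], r[1]).sum())
    )
    return partial.sum()
```

Each task opens the file on its own and reads only its slice, so no MPI or parallel HDF5 build is involved. For example, `row_ranges(10, 3)` gives `[(0, 4), (4, 7), (7, 10)]`.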

G.

fusy · August 23, 2023, 11:48am

Thanks for your help.
The data is stored in a distributed file system, such as Lustre.
However, because each individual HDF5 file is large, reading it takes a long time. We therefore want to speed up the reads by parallelizing them.
Currently, we are using Python as the implementation language.

gheber · August 23, 2023, 11:45pm

This blog post covered a slightly different topic, but the general idea should work in your case. If you have one or a few large HDF5 files, you need a strategy for deciding which Spark task reads which files, datasets, or hyperslabs. Just create a CSV file that lists the portion of the HDF5 inventory that you want each Spark task to read, and take it from there.
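A rough Python sketch of the CSV inventory idea, under stated assumptions: the column names (`file`, `dataset`, `start`, `stop`), the function names, and the row-wise hyperslabs along the first dimension are all invented for illustration, and h5py is assumed to be available on the executors.

```python
import csv


def plan_hyperslabs(datasets, rows_per_task):
    """Yield (file, dataset, start, stop) blocks covering every dataset.

    datasets: iterable of (file_path, dataset_name, n_rows) tuples.
    """
    for path, name, n_rows in datasets:
        for start in range(0, n_rows, rows_per_task):
            yield (path, name, start, min(start + rows_per_task, n_rows))


def write_inventory(csv_path, plan):
    """Write the plan as a CSV file with one row per Spark task."""
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "dataset", "start", "stop"])
        writer.writerows(plan)


def read_inventory(csv_path):
    """Load the inventory back as a list of (file, dataset, start, stop)."""
    with open(csv_path, newline="") as fh:
        return [
            (row["file"], row["dataset"], int(row["start"]), int(row["stop"]))
            for row in csv.DictReader(fh)
        ]


def read_hyperslab(entry):
    """Runs inside a Spark task: read the hyperslab named by one inventory row."""
    import h5py  # imported lazily so only the executors need it

    path, name, start, stop = entry
    with h5py.File(path, "r") as f:
        return f[name][start:stop]
```

On the driver, something like `entries = read_inventory("inventory.csv")` followed by `sc.parallelize(entries, len(entries)).map(read_hyperslab)` would give one task per inventory row (`sc` being the SparkContext and `inventory.csv` a hypothetical file name). In practice you would reduce or write out each slice inside the task rather than collect all the arrays back on the driver.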

Remember that Spark tasks don’t communicate like MPI tasks. If you want MPI in the mix, you could look at MPI4Spark, e.g., the paper "Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI" (IEEE Xplore).

OK? G.