Bug when reading subgroup names in parallel?

Hello,

I have a problem reading subgroup names in parallel. I have to work with HDF5 files containing many subgroups (>100k). Due to time requirements it would be nice to parallelize the task of reading all subgroup names of a specific group. I have created a minimal example, but it hangs in some internal MPI communication:

#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>
#include <iostream>
#include "hdf5.h"
#include "mpi.h"

#define FILENAME "myFile.h5"
#define GROUPNAME "myGroup"
#define NUMSUBGROUPS 100

void createTestFile(){
  hid_t file = H5Fcreate(FILENAME, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hid_t group = H5Gcreate (file, GROUPNAME, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  
  /* Create multiple subgroups */
  for (uint32_t i = 0; i < NUMSUBGROUPS; i++){
    const std::string cSubgroupName = std::to_string(i);
    hid_t subgroup = H5Gcreate (group, cSubgroupName.c_str(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Gclose (subgroup);
  }
  
  H5Gclose (group);
  H5Fclose (file);
}

void getChunkOfSubgroupNames(const hid_t cFile, const std::string& cGroupName, const hsize_t cStartIdx, const hsize_t cNumSubgroupNamesToRead,
                             std::vector<std::string>* pReadSubgroupNames){
  
  const size_t cNameArraySize = 64;
  char* pNameArray = (char*)std::malloc(sizeof(char) * cNameArraySize);
  
  for (hsize_t i = cStartIdx; i < cStartIdx + cNumSubgroupNamesToRead; i++){
    H5Lget_name_by_idx(cFile, cGroupName.c_str(), H5_INDEX_NAME, H5_ITER_NATIVE, i, pNameArray, cNameArraySize, H5P_DEFAULT);
    const std::string cFoundSubgroupName(pNameArray);
    pReadSubgroupNames->push_back(cFoundSubgroupName);
  }
  
  std::free(pNameArray);
}

int main(int argc, char** argv){
  
  MPI_Init(&argc, &argv);
  
  MPI_Comm cComm = MPI_COMM_WORLD;
  int procId = 0;
  int nProcs = 0;
  MPI_Comm_rank(cComm, &procId);
  MPI_Comm_size(cComm, &nProcs);
  
  /* Create file by process 0 */
  if (procId == 0){
    createTestFile();
  }
  MPI_Barrier(cComm);
  
  /* Open created HDF5 file parallel for all processes */
  hid_t parallel_plist = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(parallel_plist, cComm, MPI_INFO_NULL);
  
  hid_t file = H5Fopen(FILENAME, H5F_ACC_RDONLY, parallel_plist);
//  hid_t file = H5Fopen(FILENAME, H5F_ACC_RDONLY, H5P_DEFAULT);
  H5Pclose(parallel_plist);
  
  /* Read subgroup names parallel */
  const hsize_t cLocNumSubgroupNamesToRead = NUMSUBGROUPS / nProcs;
  const hsize_t cLocOffsetSubgroupNamesToRead = procId * cLocNumSubgroupNamesToRead;
  std::cout << "Proc " << procId << " tries to read " << cLocNumSubgroupNamesToRead << " subgroup names at offest " << cLocOffsetSubgroupNamesToRead << "." << std::endl;
  
  std::vector<std::string> localReadSubgroupNames = {};
  getChunkOfSubgroupNames(file, GROUPNAME, cLocOffsetSubgroupNamesToRead, cLocNumSubgroupNamesToRead, &localReadSubgroupNames);
  
  std::cout << "Proc " << procId << " has found " << localReadSubgrouNames.size() << " subgroups." << std::endl;
  
  H5Fclose (file);
  
  MPI_Finalize();
  
  return 0;
}

This can be compiled with:

g++ -c -g -std=c++11 -MMD -MP -MF main.o.d -o main.o main.cpp
g++ -o parallelloopsubgroups main.o -lhdf5 -lmpi

Running the program with

mpirun -np 2 parallelloopsubgroups

leads to a deadlock.

If I simply open the file without the MPI communicator, i.e. with
H5Fopen(FILENAME, H5F_ACC_RDONLY, H5P_DEFAULT);
then everything works fine, but this is not intended.

My environment:
Open MPI 4.1.2
Parallel HDF5 1.12.1

Can anybody help?
Thanks


I’m not sure why it hangs, but even if it worked, it would be slow. It’s a perfect “read storm.” Collective I/O might help with that, but H5Lget_name_by_idx is an expensive call. A better way would be to use H5Literate on rank 0 and then broadcast, but even that is going to be slow, especially with a parallel file system, because of tons of small reads. The thought of pushing this to >100k subgroups is scary. 1k would be scary already. Do you really need groups, or would a dataset of object references do the trick?
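
For illustration, the H5Literate part on rank 0 would look roughly like this (an untested sketch, error checking omitted; the broadcast is then just an MPI_Bcast of the collected names):

#include <string>
#include <vector>
#include "hdf5.h"

/* H5Literate callback: append each link name to a std::vector<std::string>. */
static herr_t collectName(hid_t /*group*/, const char* name,
                          const H5L_info_t* /*info*/, void* opData){
  static_cast<std::vector<std::string>*>(opData)->push_back(name);
  return 0; /* 0 = continue iteration */
}

std::vector<std::string> readAllSubgroupNames(hid_t file, const char* groupName){
  std::vector<std::string> names;
  hid_t group = H5Gopen(file, groupName, H5P_DEFAULT);
  H5Literate(group, H5_INDEX_NAME, H5_ITER_INC, nullptr, collectName, &names);
  H5Gclose(group);
  return names;
}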

Best, G.

I’ve tried your idea using H5Literate before, but it’s too slow. If possible, I need some kind of implementation which scales well with the number of processes.

Therefore I tried the variant with the more expensive H5Lget_name_by_idx calls to distribute the computational effort over several processes. I have to admit that I don’t quite understand the mechanics of reading and broadcasting the metadata behind these API calls.
Unfortunately, the format of the HDF5 files is fixed for me and cannot be changed. But it would be possible to add tracking and indexing of link creation order when the files are created, if that would help.
By the way, the name of a subgroup is not arbitrary; in my use case it always consists of 10 digits.
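
Regarding the creation-order remark: on the writer side I could probably do something like this when the parent group is created (untested sketch, analogous to the H5Gcreate call in my toy example):

/* Track and index link creation order on the parent group so that
   index-based lookups by creation order become possible. */
hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
H5Pset_link_creation_order(gcpl, H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED);
hid_t group = H5Gcreate(file, GROUPNAME, H5P_DEFAULT, gcpl, H5P_DEFAULT);
H5Pclose(gcpl);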

My target environment is an HPC system with a parallel file system where I can run about 500-1000 processes.

Thank you for your feedback!

OK, maybe before we get too deep into the weeds of technical details, can you describe your use case a bit? I’m sure your use case is not, “I would like to (repeatedly) determine the link names in a group, in parallel.” What’s the bigger picture? I understand that you cannot change the existing files & let’s assume they are read-only. That doesn’t mean that we cannot create new HDF5 files that reference relevant content in existing files. What is that “relevant content” and how would you like to process it?

Best, G.

The big picture problem can be described as follows:
My application gets a read-only HDF5 file with many subgroups. Each subgroup holds datasets and attributes. For the data linked in each subgroup we have to do some complex number crunching. Hence it is worth parallelizing in such a way that each process reads the content of a disjoint subset of subgroups.
The main problem here is that the names of the subgroups are a kind of ID (10 digits) which I don’t know a priori. In order to assign, for example, the first N subgroups to the first process, I have to determine the names of the first N subgroups (in lexicographic order).

It would be possible to create our own HDF5 files while the application runs, if that helps to tackle the problem.

I hope this sheds some light on the use case.

Thanks

  1. Are you running against a parallel file system or a local file system or NFS or …?
  2. Are you reading each file only once or several times?
  3. What’s a typical file size?

G.

  1. We are using BeeGFS as parallel file system.
  2. We read each file once at the beginning of the processing. After full data extraction of the file we start the number crunching and the file is not read again.
  3. One file is about 1-2 GB.

I think you have at least two options:

  1. Have multiple ranks process different files simultaneously
  2. Have rank 0 preprocess the file (determine link names) and then broadcast to other ranks

Details:

Option 1 assumes the files fit in their entirety into memory. Just read (as many as you can fit) the HDF5 files as HDF5 file images and then process them completely in memory (no further I/O). In this case, you’d read the full 1-2 GB file into a memory buffer and then open that as an HDF5 file. All operations on such an HDF5 file image will be in-memory and as fast as they can possibly be. You can have multiple ranks work on multiple files, and you’d only be limited by how fast BeeGFS can feed those files to the MPI ranks.
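
A rough sketch of option 1, using the high-level H5LTopen_file_image call (the helper name openAsFileImage is just made up; link against hdf5_hl as well; error checking mostly omitted):

#include <cstdio>
#include <vector>
#include "hdf5.h"
#include "hdf5_hl.h"

/* Read the whole file into 'buffer' and open it as an HDF5 file image.
   All subsequent HDF5 calls on the returned file handle are in-memory.
   'buffer' must stay alive until the file is closed. */
hid_t openAsFileImage(const char* fileName, std::vector<char>& buffer){
  std::FILE* fp = std::fopen(fileName, "rb");
  if (!fp) return H5I_INVALID_HID;
  std::fseek(fp, 0, SEEK_END);
  const long fileSize = std::ftell(fp);
  std::fseek(fp, 0, SEEK_SET);
  buffer.resize(static_cast<size_t>(fileSize));
  const size_t nRead = std::fread(buffer.data(), 1, buffer.size(), fp);
  std::fclose(fp);
  if (nRead != buffer.size()) return H5I_INVALID_HID;
  /* DONT_COPY/DONT_RELEASE: the library works directly on 'buffer' and
     will not try to free it on H5Fclose. */
  return H5LTopen_file_image(buffer.data(), buffer.size(),
                             H5LT_FILE_IMAGE_DONT_COPY | H5LT_FILE_IMAGE_DONT_RELEASE);
}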

Under option 2, you load the file completely into memory, as under option 1. You determine the link names and broadcast them to the other ranks. (Since H5Literate will run in memory, it’ll be as fast as can be.)
Once the other ranks have their link names/assignments, they can go back to BeeGFS to read the data they need.
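
The broadcast itself is cheap, because your names are fixed-width (10 digits). A sketch, assuming the vector names was filled on rank 0 (e.g., by H5Literate) and is empty on the other ranks:

#include <cstdio>
#include <string>
#include <vector>
#include "mpi.h"

/* Broadcast a vector of fixed-width (10-digit) names from rank 0 to all ranks. */
void broadcastSubgroupNames(std::vector<std::string>& names, MPI_Comm comm){
  int rank = 0;
  MPI_Comm_rank(comm, &rank);

  const int cNameLen = 11;  /* 10 digits + terminating '\0' */
  int count = (rank == 0) ? static_cast<int>(names.size()) : 0;
  MPI_Bcast(&count, 1, MPI_INT, 0, comm);

  /* Pack into one flat buffer of fixed-width entries and broadcast once. */
  std::vector<char> flat(static_cast<size_t>(count) * cNameLen, '\0');
  if (rank == 0)
    for (int i = 0; i < count; i++)
      std::snprintf(&flat[i * cNameLen], cNameLen, "%s", names[i].c_str());
  MPI_Bcast(flat.data(), count * cNameLen, MPI_CHAR, 0, comm);

  if (rank != 0)
    for (int i = 0; i < count; i++)
      names.emplace_back(&flat[i * cNameLen]);
}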

OK? G.

There are other variations. If the files don’t fit into memory, you could open the file with the core VFD and use a, say, 1 MB increment. Since the symbol table and local heap for the “big group” will be fairly localized, you’ll be reading only a very small portion of the file. Then you can again do H5Literate on rank 0, broadcast the link names, and you’re back to option 2.

The key is really to do the iteration in memory and not to keep going back to BeeGFS for breadcrumbs.

G.

Loading the HDF5 file into memory sounds interesting. This may be a solution for the problem.
Do you have some reference on how to do this?

The easiest way to check this out would be to use a different file access property list. Instead of H5Pset_fapl_mpio, use H5Pset_fapl_core. And that should be it for a single-rank run. The rest of your code should just work as is. Note that this is a sequential driver, and there’s probably little point in reading the same file on all ranks. You’d rather be processing different (sets of) files on different ranks.
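
In your toy example the change would look roughly like this (the increment and backing-store values are just reasonable guesses for a read-only file):

hid_t core_plist = H5Pcreate(H5P_FILE_ACCESS);
/* 1 MiB memory allocation increment; backing_store = false since we only
   read and don't want the in-memory image written back on close. */
H5Pset_fapl_core(core_plist, 1024 * 1024, false);
hid_t file = H5Fopen(FILENAME, H5F_ACC_RDONLY, core_plist);
H5Pclose(core_plist);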

For more advanced uses of HDF5 file images, see the RFC.

OK? G.

Using H5Pset_fapl_core on the toy example works fine. I’ll try to get this to work in my application.
Thank you very much for your advice!