HDF5 split files with MPI I/O, and help with reading many small datasets

Good morning everyone,

We’re attempting to create a largely read-only file containing many nested groups and datasets. Profiling the reads shows that 40 to 60 percent of the runtime is spent in H5Gopen. Therefore, we would like to investigate whether a split file, i.e. H5Pset_fapl_split, leads to better performance.

In order to create the file in a timely manner we want to first allocate all datasets sequentially and then fill the datasets using MPI I/O.

When attempting to do so, we found that the application deadlocks when opening the file a second time. The relevant parts of the reproducer are:

    if(comm_rank == 0) {
        // Rank 0 creates the split file serially and allocates all datasets.
        auto fcpl = H5P_DEFAULT;
        auto fapl = H5Pcreate(H5P_FILE_ACCESS);

        H5Pset_libver_bounds(fapl, H5F_LIBVER_V110, H5F_LIBVER_V110);
        H5Pset_fapl_split(
            fapl,
            ".meta", H5P_DEFAULT,
            ".raw", H5P_DEFAULT
        );

        hid_t fid = H5Fcreate(filename.c_str(), H5F_ACC_EXCL, fcpl, fapl);

        // Allocate datasets here.

        H5Pclose(fapl);
        H5Fclose(fid);
    }
    MPI_Barrier(comm);

    {
        auto fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_libver_bounds(fapl, H5F_LIBVER_V110, H5F_LIBVER_V110);

        auto mpio_fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(mpio_fapl, comm, MPI_INFO_NULL);

        H5Pset_fapl_split(fapl, ".meta", mpio_fapl, ".raw", mpio_fapl);

        // This line deadlocks.
        hid_t fid = H5Fopen(filename.c_str(), H5F_ACC_RDWR, fapl);

        // Fill the datasets.

        H5Pclose(mpio_fapl);
        H5Pclose(fapl);
        H5Fclose(fid);
    }

The link to a full reproducer including backtraces can be found at the end of this post.

The questions are:

  • Do split files work with MPI-IO?
  • Is there anything obviously wrong in the way we’re trying to use MPI-IO with split files?

Many thanks for your attention.


No. Apologies for another gap in the documentation.

Can you describe your use case a little more? How is the data laid out and written, and how were you planning on reading it? How many datasets do you have and how big are they?

Best, G.

Thank you for clearing that up.

We’d like to store digital reconstructions of neurons, which we’ll call morphologies. Imagine a graph of cylinders. Each morphology consists of one “points” dataset of shape (N, 4), which stores the start and end point of each cylinder and the radius of the “cylinder” at each end point. You can absolutely ignore the fact that these are variable-radius “cylinders” (I will from now on). The other dataset, “structure”, is a list of integer offsets, one row per branch (a branch is a sequence of cylinders in the graph without branching).

More to the point, we have N groups that look like this:

$ h5dump -g "000/00295" -H morphologies.h5
      GROUP "000" {
         GROUP "00295" {
            DATASET "points" {
               DATATYPE  H5T_IEEE_F32LE
               DATASPACE  SIMPLE { ( 6734, 4 ) / ( 6734, 4 ) }
            }
            DATASET "structure" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 217, 3 ) / ( 217, 3 ) }
            }
         }
      }

The size of these files depends on the number of morphologies, which depends on the size of the region of interest. We’re testing 1k, 10k, 100k, 1M and 10M. We’d like to reach (just under) 100M. The 100k file is 19GB, which means about 200kB per group (on average).
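
For context, the read path for a single morphology is roughly the following (a sketch rather than our exact code; `fid` is an already opened file and `group_name` is e.g. "000/00295"):

    hid_t gid = H5Gopen(fid, group_name.c_str(), H5P_DEFAULT);   // 40-60% of the runtime is here

    hid_t did = H5Dopen(gid, "points", H5P_DEFAULT);
    hid_t space = H5Dget_space(did);
    hsize_t dims[2];
    H5Sget_simple_extent_dims(space, dims, nullptr);

    std::vector<float> points(dims[0] * dims[1]);
    H5Dread(did, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, points.data());

    H5Sclose(space);
    H5Dclose(did);
    // ... same again for "structure" with H5T_NATIVE_INT32 ...
    H5Gclose(gid);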

File formats we’re investigating:

  • Vanilla H5F_LIBVER_V110,
  • Page allocated files with 16 kB page sizes (created roughly as sketched below).
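
For reference, a sketch of how such a paged file with a 16 kB page size is created (error checking omitted):

    // Sketch: request paged aggregation and a 16 kB page size at file creation.
    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
    H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, false, 1);
    H5Pset_file_space_page_size(fcpl, 16 * 1024);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_V110, H5F_LIBVER_V110);

    hid_t fid = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, fcpl, fapl);
    // ... create groups and datasets ...
    H5Fclose(fid);
    H5Pclose(fapl);
    H5Pclose(fcpl);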

Thanks to Elena Pourmal and John Mainzer for suggesting page allocated files. As a baseline for comparison (and our current solution) we’re also storing each group in its own HDF5 file, with the same folder structure as the group structure in the HDF5 files described previously.

As for access patterns, we’re interested in reading either the entire file or a subset of the groups once. There are two patterns:

  • random access of a subset,
  • optimal order.

The precise semantics of “optimal order” can be decided. I currently believe it is the order in which the groups were created, since that should maximize metadata reuse and access the file (mostly) in order.

Since I don’t need to get split files working, I’d like to finish the measurements over a reasonable range of parameters and report back with reliable numbers.

However, what we’re observing is that for a 100k file, randomly accessing 10k morphologies is roughly 2x slower compared to the baseline of loading the same 10k morphologies when each morphology is stored in its own file. Accessing the first 10k morphologies in optimal order is roughly 10x faster than the baseline. This is from one MPI rank.

For concurrent read performance, the new feature of reading multiple datasets in one (optionally collective) call, i.e. H5Dread_multi, appears to us to hold some promise (especially when accessing large fractions of the file). Would you agree it’s worth pursuing that route?
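
As a rough illustration (a sketch against the 1.14 API; `points_id`, `structure_id` and the buffers are assumed to be set up as in the read path above), reading both datasets of a morphology in one call would look something like:

    // Sketch: one H5Dread_multi call instead of two H5Dread calls.
    hid_t dsets[2]       = {points_id, structure_id};
    hid_t mem_types[2]   = {H5T_NATIVE_FLOAT, H5T_NATIVE_INT32};
    hid_t mem_spaces[2]  = {H5S_ALL, H5S_ALL};
    hid_t file_spaces[2] = {H5S_ALL, H5S_ALL};
    void* bufs[2]        = {points.data(), structure.data()};

    // For collective MPI-IO one would pass a transfer property list set up with
    // H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE) instead of H5P_DEFAULT.
    H5Dread_multi(2, dsets, mem_types, mem_spaces, file_spaces, H5P_DEFAULT, bufs);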

In case they’re of interest, here are the numbers we measured last night.

First the sequential runs, i.e. only one MPI rank. The container is 100k morphologies or 19GB. We’ll read 10k morphologies, either the first 10k or a randomly selected 10k (no duplicates). Runtimes are in seconds.

shuffle     storage_format             amin        mean        amax
-------------------------------------------------------------------
False       directory             95.993962  114.585441  121.863693
            paged                 11.652172   12.110191   12.939824
            v110                  11.276624   11.585234   11.912627

True        directory            155.241535  157.650023  160.042551
            paged                357.397163  364.791514  369.858405
            v110                 256.900153  263.119904  274.165728

Next we’ll measure the time it takes for multiple MPI ranks to read all morphologies. Each morphology is read once, not once per MPI rank. Same two modes of reading, either in order or shuffled. Runtimes are again in seconds.

shuffle     format     comm_size       amin       mean       amax
-----------------------------------------------------------------
False       directory         36     43.957     45.418     47.843
                             576      4.284      4.795      6.507
            paged             36      4.868      5.432      6.202
                             576      0.929      1.826      4.277
            v110              36      3.867      4.051      4.230
                             576      0.809      0.896      1.027

True        directory         36      38.28     41.233     46.601
                             576       3.69      4.808     11.696
            paged             36      78.41     91.722    106.275
                             576      22.77     30.572     43.215
            v110              36      52.30     57.925     66.406
                             576       8.14     11.189     18.516

I’d like to note that while the page allocated file seems to have no benefit over the regular v110 file format here, a difference becomes visible at higher numbers of MPI ranks, e.g. 4608. The paged file format reaches the effective/measured systemwide bandwidth of our parallel file system, while the v110 is between 1.5 and 2x slower. This is only true when shuffle == False, which seems intuitive, since with shuffling the access pattern is very wild and the page buffer is unlikely to be effective.

For the most part these measurements seem unsurprising. To me, the cost of randomly accessing groups of datasets, compared to accessing the same datasets when each group is stored in its own file, is not what I would have expected. Nothing too concerning, but it shows why I wanted to see whether split files would help (without compromising the fast access pattern).

Thanks for the description. What drives the group structure? Is /000/00295 a morphology identifier? How deeply nested are these groups?

If the leaf-level structure were the same in all cases (points and structure), why not fuse them into global datasets and add morphology descriptor (offset/range) datasets you’d keep in memory? This would perhaps look less intuitive (more like Fortran :wink:), but it will be much more effective because you are not wasting time chasing links.

The good news is you can have both views: Have those global points and structure datasets plus descriptors, and keep the current group hierarchy for those who prefer it. The only difference would be that the leaf-level datasets would contain dataset region references into the global datasets. If the global datasets were concatenations of what you have now, those region references would be just simple hyperslab selections and fit in attributes of the leaf-level groups, or they could just be scalar (singleton) datasets.

Does that make sense?

For casual random access, it’s OK to traverse the link structure. Otherwise and in parallel, you’d always first cache the descriptors in memory and construct selections against the global datasets and then read (in parallel).
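
A minimal sketch of that parallel-friendly read path, assuming a global /points dataset of shape (total_points, 4) and a cached descriptor giving each morphology’s row range [row_begin, row_end) (names made up):

    // Sketch: read one morphology's rows from the global /points dataset
    // via a hyperslab selection built from the in-memory descriptor.
    hsize_t start[2] = {row_begin, 0};
    hsize_t count[2] = {row_end - row_begin, 4};

    hid_t dset   = H5Dopen(fid, "/points", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, nullptr, count, nullptr);

    hid_t mspace = H5Screate_simple(2, count, nullptr);
    std::vector<float> points(count[0] * count[1]);
    H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, points.data());

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);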

G.

I forgot to mention that you can get away without dataset region references. Just create virtual datasets at the leaf level. This is not the most common use of VDS, but it would maintain the appearance of a dataset.
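
Roughly along these lines for one leaf (a sketch; the extents and the [row_begin, row_end) range are made up, and "." as the source file name refers to the file the virtual dataset lives in):

    // Sketch: a virtual "points" dataset under 000/00295 that maps onto
    // rows [row_begin, row_end) of the global /points dataset in the same file.
    hsize_t n_rows   = row_end - row_begin;
    hsize_t vdims[2] = {n_rows, 4};
    hid_t vspace = H5Screate_simple(2, vdims, nullptr);

    hsize_t src_dims[2] = {total_points, 4};
    hid_t src_space = H5Screate_simple(2, src_dims, nullptr);
    hsize_t start[2] = {row_begin, 0};
    H5Sselect_hyperslab(src_space, H5S_SELECT_SET, start, nullptr, vdims, nullptr);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_virtual(dcpl, vspace, ".", "/points", src_space);

    hid_t vdset = H5Dcreate(fid, "000/00295/points", H5T_IEEE_F32LE,
                            vspace, H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(vdset);
    H5Pclose(dcpl);
    H5Sclose(src_space);
    H5Sclose(vspace);

G.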

What drives the group structure? Is /000/00295 a morphology identifier? How deeply nested are these groups?

“Performance”, i.e. a desire to avoid 10M or 100M toplevel groups. Hence, we nest them once; since sqrt(100M) is 10k, that seems okay. (What we haven’t recently checked is whether this is also a requirement for H5F_LIBVER_V110 files.)

The good news is you can have both views: […]

This is fantastic news! The major reason we’ve not gone the completely flat route is because it’s daunting. To us morphologies are in some ways what grids are to FVM. They’re the basic geometric object on top of which everything is run. Therefore, we’ve only considered the less invasive options. The threat of countless plotting-/post-processing scripts breaking because they read morphologies directly instead of using the library is quite real.

The other reason to avoid it was that I assume it’s unintuitive to non-HPC people. For HPC people it’s a lot like struct-of-array vs. array-of-struct. This is awesome, since it enables a path to only optimize those codes that need it. The rest can do “slow” random access, if reading morphologies isn’t a bottleneck anyway.

The numbers are interesting, but I need more time to stare at them. The 16KB page size strikes me as a little on the low end. What does h5stat say about the metadata-to-data ratio?

G.

Good point, the 16kB is what we found to be an optimal tradeoff in a different investigation. However, those files were smaller, IIRC. I’ll need to check bigger page sizes. Thank you!

As for the metadata-to-raw-data ratio: in both the paged and the vanilla v110 file it is 0.00396 (or 0.396%).

The times for shuffle=True are disappointing (compared to the baseline and to vanilla v110) and an indication that we are not playing our cards well. The goal of using paged allocation would be to ensure that all metadata we care about (presumably groups in this case) ends up on one or a few pages. (That assumes link traversal is indeed the long pole in the tent.) We should have a winner if we can achieve that and keep those pages in the page buffer. There should also be an effect for shuffle=False.

I don’t know how you are creating your files, but for an experiment, you could force this with a two-pass approach where we first create all groups (and ensure all fit onto a few pages) and then add the datasets in a second pass. If that showed no effect, we’d be barking up the wrong tree.
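
Roughly along these lines (a sketch; `group_names` is a hypothetical list of all group paths):

    // Sketch of the two-pass experiment: pass 1 creates all groups so their
    // metadata packs onto a few pages, pass 2 adds the datasets.
    // (For nested paths like "000/00295", create the parent groups first or
    //  use a link creation plist with H5Pset_create_intermediate_group.)
    for (const auto& name : group_names) {
        hid_t gid = H5Gcreate(fid, name.c_str(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Gclose(gid);
    }

    for (const auto& name : group_names) {
        hid_t gid = H5Gopen(fid, name.c_str(), H5P_DEFAULT);
        // ... create and write "points" and "structure" as before ...
        H5Gclose(gid);
    }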

G.

Thank you, another good suggestion! Previously, we’ve seen paged allocation to be really effective at making metadata access fast. This again was in the context of “the other project”.

Ordered by simplicity I’ll proceed with the following:

  1. Create files with larger page sizes: 64, 256, 1024 and 4096 KB.
  2. Create a benchmark to measure reading only metadata, say the offset and shape of each dataset (see the sketch after this list).
  3. Try creating the files as you describe.
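
For item 2, the per-dataset probe would be roughly the following (a sketch; `did` is an open dataset handle, and H5Dget_offset only returns a defined address for contiguous layouts):

    // Sketch: touch only metadata, never the raw data.
    hsize_t dims[2];
    hid_t space = H5Dget_space(did);
    H5Sget_simple_extent_dims(space, dims, nullptr);   // shape
    haddr_t offset = H5Dget_offset(did);               // location in the file
    H5Sclose(space);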

My suspicion is that the files are fragmented, in the sense that we always read the points and structure datasets right after each other, yet they’re very different in size and their space is allocated from the same pool. We create the file by looping over all groups, writing first points, then structure for each group before proceeding to the next. Therefore, I suspect the points will use a small number of pages and then leave a gap the size of multiple structure datasets. This is something I’ll check after we know that paged allocation is effective at reducing the metadata access times.

Thank you for the suggestions and sorry for the silence.

Summary: Using a 4MB pagesize fixes the performance issues.

The performance issues vanish when using a page size that matches the block size of GPFS, 4 MB. I believe GPFS also has quarter-blocks, which would be 1 MB (this page size also worked reasonably well, but I’ll only show results for 4 MB).

The plot shows the runtime of reading for the random-access use case, i.e. shuffled = True. For paged files we have two lines: one that uses a page buffer of size 512 MB (dotted lines) and one that doesn’t use the page buffer (solid lines). (Note: we use the parallel build of HDF5 with a small patch to remove an overly strict assert; so far we never use MPI I/O.) In terms of bandwidth the fastest runs are just over 10 GB/s. The peak measured multi-node bandwidth of our system is 40 GB/s.
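
For completeness, the page-buffered runs open the file roughly like this (a sketch; the 512 MB value corresponds to the dotted lines):

    // Sketch: open a paged file for reading with a 512 MB page buffer.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_V110, H5F_LIBVER_V110);
    H5Pset_page_buffer_size(fapl, 512 * 1024 * 1024, 0, 0);

    hid_t fid = H5Fopen(filename.c_str(), H5F_ACC_RDONLY, fapl);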

Similarly, the performance when accessing a random subset (10%) of the file is also improved by using 4MB pages.

access_pattern  storage_format  page_buffer (MB)        amin      median        amax
-------------------------------------------------------------------------------------
shuffled        directory              --         165.235454  167.788314  169.260345
                paged-16                0         240.687168  246.892209  267.338655
                                      512         179.929818  182.062031  187.918414
                paged-4096              0         227.527992  240.635965  242.022138
                                      512          88.981740   91.454263   94.485499
                v110                   --         264.749585  302.302440  418.548688

(Median is a bold term since we only have 3 measurements.)

If we assume that we’re doing batched random access, i.e. we’re not free to pick which groups need to be read, but we can pick the order in which the groups are read, we can further improve the performance when using a single MPI rank. In this scenario the page buffer really shines: paged-4096 with a 512 MB page buffer takes 13 s to run.


@luc.grosheintz, thanks for sharing those numbers. They are impressive. I wish it were easier to configure the library and be in a better spot out of the box.

The solid black line is also interesting, showing the mounting pressure on the metadata server.

G.