I have 700-1000 HDF5 files, each containing about 1000 groups and 16K datasets in total. All datasets are chunked and compressed. To obtain the statistics of the datasets (i.e. dimension sizes, data types, names, etc.), I call H5Ovisit to iterate over all the objects and call H5Dopen, H5Dget_space, H5Sget_simple_extent_dims, and H5Dget_type in the callback function.
I can see many system read calls (or MPI_File_read) made during the iteration. I wonder what would be the fastest way to read the metadata in my case. Does HDF5 have a setting to enable metadata prefetching? I have tried in-memory I/O (H5Pset_fapl_core) and it did give me better performance, but this approach is only feasible for small files.
In fact, I tried both the POSIX and MPI-IO drivers, just to see which one works better for me. My application must first obtain all the file metadata and then perform parallel writes on a new file. The speed of metadata reading will decide my I/O strategy for the writes.
Here is a timing result for reading metadata from one of my HDF5 files on Cori @NERSC, running one MPI process on one Haswell node, reading one file.
File size 293 MB
Number of groups: 999
Number of datasets: 15,973
POSIX I/O driver took 39 seconds
In-memory I/O took 2 seconds
Given such a big difference, I am hoping there is some HDF5 trick that can help me reduce this gap, particularly for my other, larger files (same number of objects, but each object is bigger).
What’s the file size distribution? 1,000 files may fit into the Cori Burst Buffer. You could then use GNU parallel and HDF command line tools or Python scripts to generate the statistics. You could also do Hadoop streaming (Not sure if that’s available on Cori.) G.
Thanks for the suggestion. However, the off-line approaches to collect metadata do not work for my case. I am developing an MPI C/C++ program whose first task is to gather metadata of a given set of input files. The set of input files is provided by users as a command-line argument which can be changed each time the program runs. I am looking for solutions that can be used in C programs.
Maybe there’s a crossover between staging the whole lot into BB and a clever sampling? (Staging 1,000 x 300 MB files is neither super fast nor super slow. Staging 200 or 400 might be good enough.) What BB “hit rate” would be necessary for the overall performance (including the misses where you’d have to go to Lustre) to be acceptable? (My assumption is that POSIX to BB performance is comparable to Core VFD.) G.
My result provided earlier was obtained on Lustre. I do believe it can be improved using DataWarp on Cori. But I also believe a similar performance gap between using H5Pset_fapl_core and not using it should persist on DataWarp. Since H5Pset_fapl_core makes H5Fopen read the entire file into an internal buffer, all subsequent metadata reads become just memory copy operations.
Without H5Pset_fapl_core, each metadata read (H5Dopen, etc.) issues a system read against the file. Given that the metadata blocks may be dispersed all over the file, system-level prefetching may not perform as efficiently as prefetching inside HDF5 would, if HDF5 had it. Can I safely assume that such metadata prefetching is yet to be implemented in HDF5?
I can’t speak to the metadata prefetching portion. The metadata cache image work comes to mind, but that won’t work out-of-the-box (i.e., you’d have to create MCIs first) and it works only with 1.10+. I’m not convinced that the core VFD would be that much faster, if faster at all. Maybe there’s a margin for read-only workloads, but definitely not for writes. On SSD-based file systems, I haven’t seen a noticeable difference between POSIX and core VFD and don’t bother anymore w/ core VFD not least because the interface is so clunky.