HDF5 Dataset Size and Number Questions

Hi, I’m currently exploring the performance impact of using different numbers and sizes of datasets within HDF5 files (specifically HDF5-1.14.0).

While I’m using h5bench for my studies, the tool has a fixed number of datasets that does not appear to be adjustable. To gain a broader understanding, I’m reaching out for your insights on typical practices in scientific data storage with HDF5.

Specifically, I’m interested in learning about:

(1) Typical dataset quantity: In your experience, what is a typical range or order of magnitude for the number of datasets stored within a single HDF5 group (or file)? Are there any known limits or performance considerations regarding the number of datasets within a group?

(2) Dataset size: Similarly, for dataset sizes within HDF5 files, could you share any insights on typical ranges or orders of magnitude (in terms of data points or memory usage)? Are there any best practices or performance considerations you’d recommend regarding dataset size for efficient storage and retrieval?

If you have any recommendations for common and extreme scenarios (a very high number of datasets, or very large/small dataset sizes) suitable for benchmarking, I would be very grateful.

Any insights you can provide on these points would be greatly appreciated.

I don’t know whether you are referring to parallel or serial I/O. For parallel I/O, having many datasets can adversely affect performance, and small dataset sizes will result in abysmal parallel I/O performance. The main objective on parallel file systems is to issue the largest I/O operations with the fewest calls, which is one of the reasons the multi-dataset HDF5 APIs were introduced.
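As a concrete illustration, here is a minimal sketch of the multi-dataset write API added in HDF5 1.14 (H5Dwrite_multi()), which batches several dataset writes into a single call. The file and dataset names are placeholders, and error checking is omitted for brevity.

```c
#include "hdf5.h"

/* Sketch: write two datasets with a single H5Dwrite_multi() call (HDF5 1.14+).
 * Names are placeholders; error checking omitted for brevity. */
int main(void)
{
    hid_t file = H5Fcreate("multi.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = {1024};
    hid_t   space   = H5Screate_simple(1, dims, NULL);

    hid_t dsets[2];
    dsets[0] = H5Dcreate2(file, "temperature", H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    dsets[1] = H5Dcreate2(file, "pressure", H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    double temp[1024], pres[1024];
    for (int i = 0; i < 1024; i++) { temp[i] = 1.0 * i; pres[i] = 2.0 * i; }

    hid_t       mem_types[2]   = {H5T_NATIVE_DOUBLE, H5T_NATIVE_DOUBLE};
    hid_t       mem_spaces[2]  = {H5S_ALL, H5S_ALL};
    hid_t       file_spaces[2] = {H5S_ALL, H5S_ALL};
    const void *bufs[2]        = {temp, pres};

    /* One call issues both writes, letting the library combine the I/O. */
    H5Dwrite_multi(2, dsets, mem_types, mem_spaces, file_spaces,
                   H5P_DEFAULT, bufs);

    H5Dclose(dsets[0]);
    H5Dclose(dsets[1]);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```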

I see. I am mainly experimenting with serial I/O for now.
Is there a common number of datasets you’ve seen people use within a group and within a file? If so, how large are the datasets usually?

If not, is there any documentation on best practices for the number and size of datasets to use in a single HDF5 file?

There is no single “common” number or size of datasets. It depends on the specific application and data structure, which can vary widely. Here’s why:

  • Data Specificity: Scientific data, for example, might involve numerous related datasets (e.g., temperature, pressure, readings from different sensors) within a group representing a specific experiment. On the other hand, an imaging application might store just one large dataset per file.

However, some general suggestions can be made:

  • Multiple Datasets per Group: It’s common to have several datasets, especially for scientific data. This allows logical organization based on data categories within an experiment or simulation.
  • Performance Considerations: For good performance, avoid an extreme number of datasets (e.g., millions), since every dataset carries its own metadata and the overhead of traversing that metadata grows with the object count.
  • Dataset Size Varies: Dataset sizes range widely, from kilobytes for small sensor readings to terabytes for large-scale simulations or image collections.
  • Data Organization: If your data naturally falls into distinct categories, create separate datasets for each category.

In summary, prioritize logical organization over any particular dataset count; as long as you avoid the extremes, performance should be adequate.
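To make the “several datasets per group” layout concrete, here is a minimal sketch under assumed names (run_001, temperature, pressure, and humidity are placeholders, not anything prescribed by HDF5):

```c
#include "hdf5.h"

/* Sketch (hypothetical names): one group per experiment, one dataset per
 * measured quantity, mirroring the "several datasets per group" layout. */
int main(void)
{
    hid_t file  = H5Fcreate("experiment.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/run_001", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = {100000};
    hid_t   space   = H5Screate_simple(1, dims, NULL);

    const char *names[] = {"temperature", "pressure", "humidity"};
    for (int i = 0; i < 3; i++) {
        hid_t dset = H5Dcreate2(group, names[i], H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        /* ... H5Dwrite() the corresponding buffer here ... */
        H5Dclose(dset);
    }

    H5Sclose(space);
    H5Gclose(group);
    H5Fclose(file);
    return 0;
}
```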

Hi, @mtang11!

You can check NASA HDF sample files under Comprehensive Examples (hdfeos.org). You can download a few HDF files and collect statistics on them.

You’ll see that there are usually fewer than 10 datasets per group; it’s very rare to see 100 datasets per group.
Each dataset is typically smaller than 2 GB.

In my opinion, NASA HDF is a gold standard for managing a large collection of data in HDF. You can check how well NASA HDF works for the general public through System Performance and Metrics | Earthdata (nasa.gov).
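If you want to collect those stats yourself, the h5stat command-line tool that ships with HDF5 reports object counts and storage information. Alternatively, here is a rough C sketch (the file and group names are placeholders, and it assumes the group’s links all point to datasets) that counts the datasets under a group and prints their storage sizes:

```c
#include "hdf5.h"
#include <stdio.h>

/* Sketch: count the datasets directly under one group and report their
 * storage sizes. Assumes the group contains only datasets; file and group
 * names are placeholders for whatever sample file you download. */
int main(void)
{
    hid_t file  = H5Fopen("sample.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t group = H5Gopen2(file, "/some_group", H5P_DEFAULT);

    H5G_info_t info;
    H5Gget_info(group, &info);
    printf("links in group: %llu\n", (unsigned long long)info.nlinks);

    for (hsize_t i = 0; i < info.nlinks; i++) {
        char name[256];
        H5Lget_name_by_idx(group, ".", H5_INDEX_NAME, H5_ITER_NATIVE, i,
                           name, sizeof(name), H5P_DEFAULT);
        hid_t dset = H5Dopen2(group, name, H5P_DEFAULT);
        if (dset >= 0) {
            printf("  %-24s %llu bytes\n", name,
                   (unsigned long long)H5Dget_storage_size(dset));
            H5Dclose(dset);
        }
    }

    H5Gclose(group);
    H5Fclose(file);
    return 0;
}
```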


This is super helpful, thank you so much!
