HDF5 2.0.0: High-Performance HDF5 Data Access Directly from S3

Coming soon, HDF5 2.0.0 is the first release with well-rounded capabilities for accessing cloud-optimized HDF5 files in AWS S3 and any S3-compatible object store.

  • The Read-Only S3 (ROS3) virtual file driver uses AWS’s own aws-c-s3 library for S3 communication. Benefits:

    • Improved stability for long-running processes through intelligent handling of S3 responses, especially failed but non-fatal S3 requests.
    • Automatic sourcing of S3 configuration and credential information per AWS specification.
    • Support for advanced S3 authentication schemes.
    • Automatic splitting and parallelization of large S3 requests.
    • Referencing HDF5 files using S3 URIs.
    • Enabling debug information from both the ROS3 driver and the aws-c-s3 library via environment variables. This information can be very useful for understanding problems or performance bottlenecks.
  • New library defaults that significantly improve performance for cloud-optimized HDF5 files by reducing the overall number of S3 requests required:

    • Default page buffer cache size of 64 MiB when using the ROS3 driver keeps more raw file content in memory.
    • Default dataset chunk cache of 8 MiB holds more chunks.
    • These new defaults help the many software stacks that offer no way to apply advanced HDF5 library configuration.
  • The command-line tools h5dump, h5ls, and h5stat automatically switch to the ROS3 file driver whenever the input HDF5 file is referenced with an S3 URI. Example: h5dump s3://mybucket/myfile.h5.
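
To give a rough feel for what the larger cache defaults mean, here is a back-of-the-envelope sketch in plain Python. It is not HDF5 API code. The dataset chunk shape and the file page size are hypothetical values picked for illustration, and the 1 MiB figure is the chunk cache default as I understand it from HDF5 1.x documentation, not something stated in this post.

```python
from math import prod

MIB = 1024 * 1024

def chunks_in_cache(cache_bytes, chunk_shape, itemsize):
    """Whole uncompressed chunks that fit in a chunk cache of cache_bytes."""
    chunk_bytes = prod(chunk_shape) * itemsize
    return cache_bytes // chunk_bytes

def pages_in_buffer(buffer_bytes, page_bytes):
    """Whole file pages that fit in a page buffer of buffer_bytes."""
    return buffer_bytes // page_bytes

# Hypothetical dataset: 1-D float64 chunks of 100,000 elements (800,000 bytes each).
chunk_shape = (100_000,)
itemsize = 8

# Assumed earlier default of 1 MiB versus the 8 MiB default described above.
old_chunks = chunks_in_cache(1 * MIB, chunk_shape, itemsize)
new_chunks = chunks_in_cache(8 * MIB, chunk_shape, itemsize)

# Hypothetical file page size of 8 MiB with the 64 MiB page buffer default.
pages = pages_in_buffer(64 * MIB, 8 * MIB)

print(f"chunks cached: {old_chunks} -> {new_chunks}; pages buffered: {pages}")
```

Under these assumptions the chunk cache goes from holding one chunk to holding ten, so revisiting recently read chunks is less likely to trigger another S3 request. Real numbers depend on each file's chunk shape, data type, and page size.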

These enhancements come at the right time for NASA’s ICESat-2 mission, which has become the first NASA mission to publish some of its data as cloud-optimized HDF5 files. This achievement represents a major step forward for cloud-based scientific computing with HDF5 data, offering:

  • More efficient and scalable access to data subsets directly from the cloud.
  • Potential cost savings from reduced cloud data egress.
  • Enhanced performance compared to traditional HDF5 files.

The HDF Group was proud to collaborate with the NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC) and the ICESat-2 satellite mission on this initiative, which sets a powerful model for other NASA missions looking to optimize their large HDF5 data collections for cloud computing.

Learn more about the cloud-optimized HDF5 data from NASA here:
