External links pointing to S3 object store

Hello!

I am writing an application that needs to dynamically combine multiple HDF5 datasets from various sources into a single HDF5 file. The sources can be local HDF5 files or HDF5 files in an S3-based object store.

How does the ros3 driver propagate S3 bucket locations and credentials? Is there a way to let HDF5 know the credentials to S3 object stores? Here are two example use cases I would be interested in:

  1. I have a local HDF5 file and would like to create an external link to a dataset in an HDF5 file in an S3 object store. Is this possible?

  2. I have an HDF5 file in an S3 object store and would like to create a relative external link to a file in the same object store. Is this possible, or must it always be an absolute URL/path?

I am using h5py and don’t have much in-depth knowledge of the HDF5 API.
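
For context, my understanding is that a single file in S3 can be opened roughly like this with the ros3 driver (a sketch; it assumes h5py/HDF5 were built with ros3 support, and the bucket URL and credentials are placeholders), but I don't see how this would extend to external links:

import h5py

# ros3 is a read-only driver; credentials are passed per file, as bytes.
f = h5py.File(
    'https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1/data.h5',
    'r',
    driver='ros3',
    aws_region=b'us-west-2',
    secret_id=b'<access-key-id>',       # placeholder
    secret_key=b'<secret-access-key>',  # placeholder
)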

Possibly related thread: External links & VFDs

It is possible to store this information in an external link; however, the library currently won’t be able to resolve this kind of external link. Unless I’m mistaken, the function H5Lcreate_external will let you supply pretty much arbitrary strings for the file_name and obj_name parameters. (It won’t try to resolve or validate the information supplied.)

import h5py

# The link target strings are stored verbatim; neither h5py nor the
# HDF5 library validates them at creation time.
with h5py.File('myfile.h5', 'w') as f:
    f['cloudy'] = h5py.ExternalLink('https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1', 'puppy.jpg')

The h5dump output looks like this

HDF5 "myfile.h5" {
GROUP "/" {
   EXTERNAL_LINK "cloudy" {
      TARGETFILE "https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1"
      TARGETPATH "puppy.jpg"
   }
}
}

h5ls -r yields

/                        Group
/cloudy                  External Link {https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1//puppy.jpg}

But that’s about where it stops. The library currently won’t resolve those links for you.

Yes, these are two “arbitrary” strings, and you can put anything there (not that I recommend it, though…).
Until we have library support for such external links, you can detect and resolve them in your own code (a sketch follows below), but it makes no sense for everyone to reinvent the wheel. Would you like to help us develop an RFC on which use cases we want to support and how the library should behave?
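
For example, here is a minimal sketch of what manual resolution could look like (resolve_link is a hypothetical helper; it assumes h5py with ros3 support and HTTPS-style S3 targets):

import h5py

def resolve_link(parent_file, link_name):
    """Follow a link by hand, treating S3 URLs specially."""
    link = parent_file.get(link_name, getlink=True)
    if not isinstance(link, h5py.ExternalLink):
        # Hard and soft links resolve normally within the file.
        return parent_file[link_name]
    if link.filename.startswith('https://'):
        # Resolve "cloud" targets with the ros3 driver instead of the
        # local file-system lookup the library would otherwise attempt.
        target = h5py.File(link.filename, 'r', driver='ros3')
    else:
        target = h5py.File(link.filename, 'r')
    return target[link.path]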

@werner’s post was right up that alley and is a good starting point.

G.

Thanks for your response! I feared as much.

From my point of view, an implementation in HDF5 would have to address the following points:

  • Possibly multiple different access credentials for individual endpoints, or even for individual objects.
  • It might make sense to distinguish/subclass ExternalLink into an ROS3Link, or to add a driver/type field for that; S3 might not be the only kind of “cloud” storage that external links point to in the future (see the sketch after this list).
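
Purely as illustration, such a typed link record could carry something like this (hypothetical; nothing like it exists in HDF5 or h5py today):

from dataclasses import dataclass
from typing import Optional

@dataclass
class CloudExternalLink:
    # Hypothetical extension of ExternalLink with a driver/type field,
    # so the library knows how the target should be resolved.
    driver: str                           # e.g. 'ros3', or a future cloud VFD
    filename: str                         # endpoint/bucket URL of the target file
    path: str                             # object path within the target file
    credentials_id: Optional[str] = None  # key into a per-endpoint credential store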

I will probably resolve this issue at a higher level (without the use of ExternalLinks) in my project for now. I am also using s3fs instead of ros3 due to dependency issues, so I will have to go this route anyway (see the sketch below). But I am happy to contribute my use cases to an RFC!
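
For reference, the s3fs route looks roughly like this (a sketch; the bucket and key names are placeholders):

import h5py
import s3fs

# s3fs picks up credentials from the usual AWS sources (environment
# variables, ~/.aws/credentials, instance profiles, ...).
fs = s3fs.S3FileSystem()

# h5py accepts any Python file-like object, so no ros3 build is needed.
with fs.open('DOC-EXAMPLE-BUCKET1/data.h5', 'rb') as s3file:
    with h5py.File(s3file, 'r') as f:
        print(list(f.keys()))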

Another approach you may want to consider is using HSDS (GitHub - HDFGroup/hsds: Cloud-native, service based access to HDF data). In HSDS you can have external links or datasets that reference HDF5 files stored in S3. You can even have a dataset that concatenates multiple HDF5 datasets in different files. See the blog: https://www.hdfgroup.org/2022/12/aggregation-for-cloud-storage/ for a description of how this works.

On the authentication front, the client only needs to authenticate with HSDS. Authentication from HSDS to S3 is handled independently (e.g. via an AWS access key).
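
A minimal sketch with the h5pyd client (the endpoint, domain, and credentials are placeholders):

import h5pyd  # h5py-like client library for HSDS

# The client authenticates only against HSDS; HSDS itself holds the
# S3 credentials and reads the underlying objects on your behalf.
f = h5pyd.File('/home/myuser/combined.h5', 'r',
               endpoint='http://hsds.example.com',
               username='myuser', password='mypass')
print(list(f.keys()))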

Let me know if you have questions about this approach!


I hope these existing solutions can save you some application development time.

Thanks for these suggestions!

HSDS certainly looks interesting. Currently, we are using CKAN (https://ckan.org/) as our authentication provider, and I am planning to use boto3 (Presigned URLs - Boto3 1.26.144 documentation) to facilitate direct access to S3 objects.
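
For completeness, generating such a presigned URL with boto3 looks like this (the bucket and key are placeholders):

import boto3

s3 = boto3.client('s3')

# A time-limited URL that grants direct GET access to a single object
# without handing out the underlying credentials.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'DOC-EXAMPLE-BUCKET1', 'Key': 'data.h5'},
    ExpiresIn=3600,  # seconds
)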