I am writing an application that needs to dynamically combine multiple HDF5-datasets from various sources into one single HDF5-file. The sources can be local HDF5 files or HDF5 files in an S3-based object store.
How does the ros3 driver propagate S3 bucket locations and credentials? Is there a way to let HDF5 know the credentials to S3 object stores? Here are two example use cases I would be interested in:
I have a local HDF5 file and would like to create an external link to a dataset in an HDF5 file in an S3 object store. Is this possible?
I have an HDF5 file in an S3 object store and would like to create a relative external link to a file in the same object store. Is this possible, or must it always be an absolute URL/path?
I am using h5py and don’t have in-depth knowledge of the HDF5 API.
It is possible to store this information in an external link, but the library currently cannot resolve this kind of external link. Unless I’m mistaken, the function H5Lcreate_external lets you supply pretty much arbitrary strings for the file_name and object_name parameters; it does not try to resolve or validate them.
import h5py
with h5py.File('myfile.h5', 'w') as f:
    f['cloudy'] = h5py.ExternalLink('https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1', 'puppy.jpg')
/ Group
/cloudy External Link {https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1//puppy.jpg}
But that’s about where it stops: the library currently won’t resolve those links for you.
Yes, these are two “arbitrary” strings, and you can put anything there (not that I recommend it, though…).
Until we have library support for such external links, you can detect and resolve them in your code. But it makes no sense for everyone to reinvent the wheel. Would you like to help us develop an RFC on what use cases we want to support and how the library should behave?
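As a rough illustration of detecting such links in application code, here is a minimal sketch. The helper names (`is_remote_target`, `find_remote_links`) and the set of URL schemes are my own assumptions, not part of h5py or HDF5, and the scan only covers the root group. It relies on h5py's `Group.get(name, getlink=True)`, which returns the link object without trying to follow it.

```python
from urllib.parse import urlparse

# Assumption: these schemes mark a link target as "remote"; adjust as needed.
REMOTE_SCHEMES = {"http", "https", "s3"}


def is_remote_target(filename):
    """Return True if an external link's file name looks like a URL, not a local path."""
    return urlparse(filename).scheme.lower() in REMOTE_SCHEMES


def find_remote_links(h5_path):
    """List (link_name, file_name, object_path) for remote external links in the root group."""
    import h5py  # imported here so is_remote_target() is usable without h5py installed

    found = []
    with h5py.File(h5_path, "r") as f:
        for name in f:
            # getlink=True returns the link itself; the library does not resolve it.
            link = f.get(name, getlink=True)
            if isinstance(link, h5py.ExternalLink) and is_remote_target(link.filename):
                found.append((name, link.filename, link.path))
    return found
```

Run against the `myfile.h5` created above, `find_remote_links('myfile.h5')` should report the `cloudy` link, and the application can then decide how to fetch the target itself.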
@werner's post was right up that alley and is a good starting point.
From my point of view, an implementation in HDF5 would have to address the following points:
Possibly multiple different access credentials for individual endpoints or even objects.
It might make sense to distinguish/subclass the ExternalLink into a ROS3Link, or to add a driver/type field for that. S3 might not be the only kind of “cloud” storage that external links point to in the future.
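To make the first point concrete, one way an application could keep different credentials per endpoint or per bucket is a longest-prefix lookup. This is only a sketch of the idea under my own assumptions: `CredentialRegistry` is a hypothetical class, not an HDF5 or h5py feature, and the credentials are opaque dictionaries.

```python
from urllib.parse import urlparse


class CredentialRegistry:
    """Hypothetical helper: map URL prefixes to credentials, most specific prefix wins."""

    def __init__(self):
        self._entries = {}  # (host, path_prefix) -> credentials

    def register(self, url_prefix, credentials):
        parts = urlparse(url_prefix)
        self._entries[(parts.netloc, parts.path.rstrip("/"))] = credentials

    def lookup(self, url):
        """Return the credentials registered for the longest matching prefix, or None."""
        parts = urlparse(url)
        best, best_len = None, -1
        for (host, prefix), creds in self._entries.items():
            if host != parts.netloc:
                continue
            # Match whole path components only, so "/a" does not match "/ab/...".
            if parts.path == prefix or parts.path.startswith(prefix + "/"):
                if len(prefix) > best_len:
                    best, best_len = creds, len(prefix)
        return best


registry = CredentialRegistry()
# Placeholder values for illustration only.
registry.register("https://s3.us-west-2.amazonaws.com", {"profile": "default"})
registry.register("https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1", {"profile": "team"})
```

An entry registered for a whole endpoint acts as the fallback, while a bucket or object prefix overrides it.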
I will probably resolve this issue at a higher level (without the use of ExternalLinks) in my project for now. I am also using s3fs instead of ros3 due to dependency issues, so I will have to go this route anyway. But I am happy to contribute my use cases to an RFC!
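For readers who want to try the s3fs route, a minimal sketch might look like the following. The function names and the file name `data.h5` are hypothetical, the URL is assumed to be path-style (`https://host/bucket/key`), and whatever keyword arguments you pass are forwarded unchanged to `s3fs.S3FileSystem` (for example `anon=True` for public buckets). h5py accepts the file-like object that s3fs returns.

```python
from urllib.parse import urlparse


def bucket_key_from_url(url):
    """Turn a path-style S3 URL (https://host/bucket/key) into 'bucket/key' for s3fs."""
    return urlparse(url).path.lstrip("/")


def open_remote_object(url, object_path, **s3_options):
    """Open an HDF5 file in S3 through s3fs and return (file, object).

    The caller is responsible for closing the returned file.
    """
    import h5py  # third-party; imported lazily so the URL helper works on its own
    import s3fs

    fs = s3fs.S3FileSystem(**s3_options)
    remote = fs.open(bucket_key_from_url(url), "rb")
    f = h5py.File(remote, "r")  # h5py can read from Python file-like objects
    return f, f[object_path]
```

Combined with a scan for remote external links, this lets the application follow a link by hand until the library can do so itself.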
On the authentication front, the client only needs to authenticate with HSDS. Authentication from HSDS to S3 is handled independently (e.g. via an AWS access key).
Let me know if you have questions about this approach!