Partial I/O through S3 VFD

wjiang2 · March 17, 2020, 4:23am

First of all, I am very excited about the new ros3 VFD offered by libhdf5 1.12. It will be very useful for porting our existing software to the cloud if it works as expected.

I wonder how partial reads work with remote S3 storage. For example, if I only want to load one dataset, or a subset of one big dataset, from an HDF5 file, how does libhdf5 know to fetch only the requested bytes from S3 through ranged GETs? I guess the requested subset and the bytes actually read from S3 will not match exactly. If so, how many extra bytes will be downloaded? Or is partial I/O not implemented yet, so that the entire file is always downloaded?

Also, how soon will S3 write support be available? And how does it affect (or help) partial I/O?

Thanks!

gheber · March 17, 2020, 11:28am

Think of it this way: with a file system, the library “translates” all your H5Dread calls into combinations of seek and read calls. (It also reads a bunch of metadata on the side.) The ros3 VFD replaces those calls with S3 GETs on byte ranges (https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html). There’s no need to read the entire S3 object (the blob of contiguous bytes that is your file). So yes, it does partial reads the same way as with a file system.

Writes are harder, because there is no such thing as an S3 object PUT on byte ranges. It can be done, but it is a matter of priorities and resources. In the meantime, you should check out HSDS (https://github.com/HDFGroup/hsds) or Kita (https://www.hdfgroup.org/solutions/hdf-kita/). G.
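To make the seek/read analogy concrete, here is a minimal sketch (not the ros3 VFD's actual code) of how a read of `length` bytes at `offset` maps onto an HTTP `Range` header for a GET. The helper names and the bucket URL are hypothetical, and the request is only constructed, never sent.

```python
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Return the HTTP Range value for `length` bytes starting at `offset`.

    The end position in an HTTP byte range is inclusive, hence the -1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def build_ranged_get(url: str, offset: int, length: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for one byte range of an object."""
    req = urllib.request.Request(url, method="GET")
    req.add_header("Range", range_header(offset, length))
    return req


# Hypothetical object URL; the equivalent of seek(2048) followed by read(4096).
req = build_ranged_get("https://example-bucket.s3.amazonaws.com/data.h5", 2048, 4096)
print(req.get_header("Range"))  # bytes=2048-6143
```

Each such request transfers only the requested range. How many requests the library issues for a given H5Dread, and how much metadata it fetches along the way, depends on the file layout and is not shown here.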

gheber · March 17, 2020, 11:30am

I forgot to mention that the ros3 vfd is also available in HDF5 1.10.6. G.

