Virtual datasets in HSDS?

malonetc · July 13, 2022, 4:27pm

Is it possible to create a virtual dataset with data on HSDS? I’m running a local implementation of HSDS and using h5pyd to access the data but do not see the VirtualSource class that h5py has.

jreadey · July 15, 2022, 6:17pm

Sorry, that’s an HDF5 feature that’s not yet implemented on HSDS. Partly it’s a matter of not having had the time to implement it, but also I’m not entirely convinced that it’s as important a feature for HSDS as it is for the library. Here’s my thinking…

In the POSIX world there are certain practical constraints in working with really large files (say 100GB and up), so it’s useful to use virtual datasets to construct a larger virtual HDF5 dataset that pulls in content from smaller HDF5 files.

In contrast, with HSDS you can construct “files” of virtually any size, since with the sharded storage model no particular object needs to be larger than a few MBs. If you are looking to create HSDS datasets that pull data from HDF5 files, you can do a 1-1 mapping between HDF5 and HSDS datasets (this is what you get with the hsload --link option), but you can also have HSDS datasets that reference multiple HDF5 datasets in different files. See H5D_CHUNKED_REF_INDIRECT in https://github.com/HDFGroup/hsds/blob/master/docs/design/single_object/SingleObject.md.

Granted this is not quite the same as VDS (e.g. the datasets need to be aligned on chunk boundaries), but can be useful to assemble large HSDS datasets out of many HDF5 files.

Let me know if this helps. If you could outline your use case briefly, I’m prepared to be convinced on the need for VDS for HSDS!

malonetc · July 18, 2022, 4:15pm

Hello @jreadey ,

I have multiple large 3-dimensional data sets that I need to concatenate into a 4-dimensional data set in order to run statistical analyses along the 4th dimension. VDS works great for doing this as it allows me to create the 4D data set without duplicating the 3D data sets. This is important since we often create multiple 4D data sets using different subsets of the 3D data for various analyses.
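For readers unfamiliar with the h5py workflow described here, a minimal sketch follows. It uses h5py's `VirtualLayout` and `VirtualSource` (which the thread notes are not available in h5pyd), and it assumes each source file holds one 3-D dataset under the same name. The function name `build_stacked_vds`, the dataset name `data`, and the `f4` dtype are illustrative assumptions, not details from the thread.

```python
import h5py


def build_stacked_vds(source_paths, vds_path, shape3d, dtype="f4", name="data"):
    """Stack same-shaped 3-D datasets along a new 4th axis without copying data.

    Assumes every file in source_paths contains a dataset called `name`
    with shape `shape3d` (hypothetical layout, for illustration only).
    """
    # The virtual dataset has one extra trailing dimension: one slot per source.
    layout = h5py.VirtualLayout(shape=(*shape3d, len(source_paths)), dtype=dtype)
    for i, path in enumerate(source_paths):
        # Map the i-th slice along the 4th axis to the whole source dataset.
        layout[:, :, :, i] = h5py.VirtualSource(path, name, shape=shape3d)
    with h5py.File(vds_path, "w", libver="latest") as f:
        # fillvalue is returned for any source file that is missing at read time.
        f.create_virtual_dataset(name, layout, fillvalue=0)
```

Because the virtual dataset only stores the mapping, several 4-D datasets can be built over different subsets of the same 3-D files by calling the function with different `source_paths` lists.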

Is this possible using the H5D_CHUNKED_REF_INDIRECT layout? Are there any examples on creating this type of data set in HSDS using the command line tools or h5pyd?

Thank you for your help!

jreadey · July 18, 2022, 9:01pm

For your files, is the extent in the 4th dimension aligned with the chunk shape? e.g. if the chunk layout is (x,y,z,10) is the extent of the 4th dimension always divisible by 10? If not, you’ll have “gaps” where the datasets join together which may or may not be a problem for you.
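The divisibility check described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic; the helper names `is_chunk_aligned` and `gap_sizes` are hypothetical and not part of HSDS or h5pyd.

```python
def is_chunk_aligned(extents, chunk_extent):
    """Return True if every sub-dataset extent is a whole number of chunks."""
    return all(e % chunk_extent == 0 for e in extents)


def gap_sizes(extents, chunk_extent):
    """Padding (in elements) after each sub-dataset up to the next chunk boundary."""
    return [(-e) % chunk_extent for e in extents]


# With a chunk extent of 10 along the joined dimension:
print(is_chunk_aligned([10, 20, 30], 10))  # all extents divisible by 10
print(gap_sizes([10, 25, 7], 10))          # non-zero entries are the "gaps"
```

A non-zero gap means the joined dataset would contain unfilled elements between that sub-dataset and the next one.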

Do you need to link to existing HDF5 files or can everything be in the HSDS format?

Definitely can do this with h5pyd. Don’t have a command line option currently, but conversations like this are helpful in determining what functionality is needed.

malonetc · July 19, 2022, 1:53pm

@jreadey Yes, I believe the extent in the 4th dimension would be aligned with the chunk shape. Wouldn’t this be required, since the data in 4D would be referenced from 3D data where each 3D data set would comprise a chunk along the 4th dimension? Or perhaps I misinterpreted your question? Having gaps in the data set would be a problem.

Everything can be in HSDS.

For a more concrete example: I have 10,000 data sets with dimension (182, 218, 182). I want to stack these into a “virtual” or “referenced” data set with dimensions (182, 218, 182, 10,000). For my analyses I would access either 1D arrays along the 4th dimension, i.e. [0,0,0,:] or smaller 4D arrays [0:16, 0:16, 0:16, :] in parallel.

jreadey · July 19, 2022, 4:33pm

With VDS I think you can organize the datasets regardless of the chunk layout. With HSDS the “sub-datasets” need to line up on the chunk boundaries. Since in your case you are creating a 4-d virtual dataset from many 3-d datasets, the 4th chunk dimension will just be 1, so that will be fine.
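One consequence of a chunk extent of 1 in the 4th dimension is that any read spanning the whole 4th axis touches at least one chunk per source dataset. The sketch below counts the chunks intersected by a rectangular selection. The chunk shape `(64, 64, 64, 1)` is an assumption chosen for illustration (the thread only establishes that the last chunk dimension is 1), and `chunks_touched` is a hypothetical helper, not an HSDS or h5pyd function.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a hyper-rectangular selection.

    selection is a list of (start, stop) pairs, one per dimension,
    with stop exclusive as in Python slicing.
    """
    count = 1
    for (start, stop), chunk in zip(selection, chunk_shape):
        first = start // chunk       # index of the first chunk hit
        last = (stop - 1) // chunk   # index of the last chunk hit
        count *= last - first + 1
    return count


chunk_shape = (64, 64, 64, 1)  # assumed; only the trailing 1 comes from the thread

# The two access patterns mentioned earlier: [0,0,0,:] and [0:16, 0:16, 0:16, :]
line = [(0, 1), (0, 1), (0, 1), (0, 10_000)]
block = [(0, 16), (0, 16), (0, 16), (0, 10_000)]
print(chunks_touched(line, chunk_shape))
print(chunks_touched(block, chunk_shape))
```

Under this assumed chunk shape both selections touch 10,000 chunks, one per 3-D dataset, so the small 4-D block costs no more chunk reads than the 1-D line.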

Are the 10,000 datasets all in one group or do you have them organized in sub-groups?

malonetc · July 19, 2022, 4:51pm

My plan is to have them in different sub-groups.

jreadey · July 19, 2022, 5:00pm

Interesting - let me see if I can write some code along the lines of what you are looking for. The H5D_CHUNKED_REF_INDIRECT format was created some time ago, but I haven’t had time yet to add an illustrative program using it, so this might be useful to others as well.

jreadey · August 11, 2022, 3:30am

Sorry for the delay - I haven’t forgotten about this!
Tried out using the indirect chunk ref format but will need some fixes for it to work with internal links. Stay tuned!

malonetc · August 11, 2022, 7:54pm

@jreadey No worries, thank you for the update. I’m looking forward to trying this out! Let me know if there’s anything I can help with.

malonetc · September 27, 2022, 11:24pm

@jreadey I just wanted to see if you’ve had any luck getting this to work? I’m looking forward to seeing it in action. Thanks again for your help.

jreadey · September 28, 2022, 6:54am

@malonetc - After some thought I think it would be best to add full VDS support to HSDS (and h5pyd). It might take some time, but I’ll let you know when it’s ready to try out.
