Hi guys, I am new to hdf5 and HSDS but I’ve read quite a substantial amount of topics and documentation.
From what I gather, to load a netCDF file into HSDS without copying or changing its contents, you use hsload --link, and the file's datasets become available in HSDS with the layout class 'H5D_CHUNKED_REF_INDIRECT'.
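For anyone else reading along, a minimal linked load might look like this (the bucket name and domain path below are just placeholders, not from the thread):

```shell
# Load an HDF5/netCDF file into HSDS without copying chunk data;
# the resulting HSDS domain references chunks in the original file.
hsload --link s3://mybucket/climate/data.nc /home/myuser/climate/data.nc

# List the folder to verify the domain was created
hsls /home/myuser/climate/
```

Note that with --link the source file has to stay in place (and unmodified), since HSDS reads chunk data from it on demand.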
My question is: how does this compare with loading without the --link flag in terms of performance to read the data afterwards?
If I use the data in a cluster with lots of nodes, would reading be faster with the native HSDS representation (i.e., chunks stored as separate objects vs. a single .nc file)? I guess what I am trying to figure out is the performance degradation or other downsides of using the --link option with a single .nc file.
Yes, that's correct… with the --link option the chunk data in the source file is not copied; instead, the HSDS datasets keep track of where the chunks are located in the original source files.
Running hsload with --link is quite a bit faster than without (very little data needs to be copied) and you’ll save on storage costs (if you intend to keep the original HDF5 files around anyway).
Performance is usually a bit better without using linking. It depends a lot on how the HDF5 files are structured though. Best thing would be to try hsloading with and without the link option and see how the performance differs for your intended application.
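To make that comparison concrete, here's a sketch of how you might time an identical selection against the two variants with h5pyd, assuming the file has been loaded both ways into two domains (the domain paths and dataset name are illustrative, not from the thread):

```python
# Sketch: time the same hyperslab read against a linked domain
# and a natively-loaded domain to compare read performance.
import time
import h5pyd

def time_read(domain, dset_name="temperature"):
    """Read a representative selection and return the elapsed time."""
    with h5pyd.File(domain, "r") as f:
        dset = f[dset_name]
        t0 = time.perf_counter()
        _ = dset[0:100, ...]  # representative selection for your workload
        return time.perf_counter() - t0

print("linked:", time_read("/home/myuser/linked/data.nc"))
print("native:", time_read("/home/myuser/native/data.nc"))
```

Run each a few times so you can separate cold-cache (first read) behavior from warm-cache behavior; as noted above, the first linked read can be noticeably slower.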
Let us know what you find out!
Thanks @jreadey, I've reached more or less the same conclusion.
One thing that was sort of a surprise: when using --link, it sometimes seemed that the (local) HSDS server needed to fetch the entire .nc/.h5 file from storage (AWS), resulting in longer first-run times than without --link.
Could it be the case that the server is downloading the entire file to the local cache before serving it? Would it be a problem for really large files that don’t fit in memory?
I am asking because a client has an HSDS server set up using the --link option, and I am trying to assess the pros and cons of migrating to the native HSDS format.
No — with hsload it's the hsload client itself that is pulling information from the HDF5 file. HSDS will only read from the file when a client actually tries to read dataset data.
When hsload needs to access a file on S3, it uses the s3fs driver. So in principle, it shouldn't need to pull the entire file down, though how much of the file gets read depends on the particulars of how metadata is organized within the file and how much pre-fetching s3fs is doing.
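To illustrate the s3fs point: s3fs issues HTTP range requests rather than downloading the whole object, and its block_size parameter controls how much it pre-fetches per read. A minimal sketch (the bucket and key are placeholders, and this assumes AWS credentials are configured):

```python
import s3fs

# s3fs fetches data in blocks via ranged GETs; a smaller block_size
# means smaller requests and less pre-fetching per read.
fs = s3fs.S3FileSystem()  # picks up credentials from the environment
with fs.open("mybucket/data.nc", "rb", block_size=1024 * 1024) as f:
    superblock = f.read(8)  # fetches only one block, not the entire file
```

If the file's metadata is scattered across the file, many small reads like this can add up, which may explain the slow first run you saw.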
BTW, we should have some updates out soon that will improve the performance of hsload with the --link option.
@jreadey Thank you very much! I really appreciate the well-thought-out response. I will report back with our findings and conclusions if our company gets this contract for optimizing their HSDS server setup.