My use case involves highly compressible datasets. I am often dealing with a compression factor of 99%.
When uploading a .h5 file to hsds, it seems to always decompress the data and upload it uncompressed. Is there a way to utilize the existing compression of the .h5 file, or something in the realm of http compression? I saw there is an http compression option in the config.yml, but it doesn’t seem to have an effect on uploading the data.
I saw there were some other people and also a whole article on data upload/ingest issues. Maybe someone has some other tips.
Also, a progress bar would be nice for hsload!
Some things I have tried:
I have already switched to ingesting the data from a virtual machine in the cloud because of the significantly faster connection. I can’t quite saturate it, though: I achieve around 1 Gbit/s of upload speed from the VM to the Kubernetes cluster running the hsds server, about a quarter of what I would expect the connection speed of the VM or the cluster to be. Also, the CPU utilization for decompression and compression is barely measurable with the very cheap blosclz option that I am using. The storage account should also be far from reaching its IOPS limit.
In Azure I have already played with copying the hsds container from one storage account to another with az copy, which is by far the fastest way to migrate, not only because I am only working with the compressed data but also because az copy can reach transfer speeds upwards of 10 Gbit/s even on the small objects created by hsds.
Originally it was gzip, which should also be supported by hsds. Later on I switched to other compression options that are more efficient to decode by setting the --ignorefilters flag in hsload.
Do you know if hsds should just work with the existing compression of an h5 file? I was not able to get it to work so far. I assume the chunk size also needs to be set the correct way in the .h5 file, otherwise hsds will rechunk the data so that it fits within the set chunk size limits.
Using http compression for PUT and POST requests is not commonly done, but it’s supported by the http protocol and makes a lot of sense in your scenario. I’ll take a look at implementing it for h5pyd and hsds.
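For readers unfamiliar with request-side compression, here is a minimal sketch of what a gzip-compressed PUT looks like at the protocol level. It only builds the request and does not send it. The host name, dataset id, and the helper `build_compressed_put` are hypothetical, and nothing here assumes that hsds or h5pyd currently accept a `Content-Encoding` header on uploads.

```python
import gzip
import urllib.request


def build_compressed_put(url: str, payload: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request whose body is gzip-compressed."""
    body = gzip.compress(payload)
    headers = {
        "Content-Type": "application/octet-stream",
        # Tells the receiving server that the body must be decompressed first.
        "Content-Encoding": "gzip",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


# Hypothetical endpoint and payload, for illustration only.
payload = bytes(1_000_000)  # one million zero bytes, i.e. highly compressible
req = build_compressed_put("http://hsds.example.com/datasets/d-xyz/value", payload)
print(len(payload), "bytes uncompressed ->", len(req.data), "bytes on the wire")
```

The server side would need to check the header and decompress the body before interpreting it, which is the part that would have to be implemented.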
When I last did some performance testing with http compression for GET requests, my results showed that it was a bit slower for data that wasn’t very compressible. This was using gzip compression, and as a result I disabled the http_compression option by default. Anyway, I’ll try it with blosclz compression and see how it does.
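The trade-off described here can be checked with a rough sketch like the one below, which compares gzip on highly compressible and on incompressible data. It uses gzip from the Python standard library as a stand-in, since blosclz is not available there. The helper name `gzip_stats` and the 4 MiB buffer size are arbitrary choices, and the timings will vary by machine.

```python
import gzip
import os
import time


def gzip_stats(data: bytes, level: int = 6):
    """Return (compressed size / original size, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(data), elapsed


size = 4 * 1024 * 1024
samples = {
    "compressible (zeros)": bytes(size),
    "incompressible (random)": os.urandom(size),
}

for name, data in samples.items():
    ratio, secs = gzip_stats(data)
    print(f"{name}: ratio={ratio:.4f}, compress time={secs * 1000:.1f} ms")
```

For random data the CPU time is spent without any reduction in bytes sent, which is consistent with compression being a net loss in that case.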
In any case, hsload only sends one request at a time, so it’s difficult to utilize a large percentage of the bandwidth. Have you tried running multiple hscopy instances at the same time? That should help.
Also, have you looked into using the hsload --link option? That will avoid the dataset data copy altogether (but only makes sense if your HDF5 files are in the same storage account as your HSDS container).
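One way to run several loads at once is sketched below, assuming the work can be split per file and that each file is loaded with a plain `hsload <source> <target domain>` invocation. The file paths, the target folder, and the helpers `build_commands` and `run_all` are hypothetical. The call that actually launches the processes is left commented out.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(files, target_folder):
    """Map each local file to an `hsload <source> <target domain>` command line."""
    commands = []
    for path in files:
        name = path.rsplit("/", 1)[-1]
        commands.append(["hsload", path, target_folder.rstrip("/") + "/" + name])
    return commands


def run_all(commands, max_workers=4):
    """Run the commands concurrently; return each exit code, in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


# Hypothetical inputs, for illustration only.
commands = build_commands(
    ["/data/run1.h5", "/data/run2.h5", "/data/run3.h5"],
    "/home/myuser/ingest/",
)
for cmd in commands:
    print(" ".join(cmd))

# exit_codes = run_all(commands)  # uncomment to actually start the uploads
```

Threads are enough here because each worker just waits on a child process. The number of workers is something to tune against the observed bandwidth.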
I’ll look into implementing the progress bar as well (unless anyone else would like to submit a PR!).
Re: using the existing compression of the h5 file…
If you use the --link option, hsds will just read the chunks directly from the h5 file. In this case, the min/max chunk sizes are ignored and hsds will just do the best it can with the given chunk layout (this is where the new feature of “hyperchunking” comes into play to make this more efficient when the h5 file uses lots of small chunks).
Thanks for the suggestions. I have read about the link option. From what I have seen the random read performance will be decreased. I will give it a try and see what the performance difference is.