hsload --link performance

Hello,

We have an HSDS cluster running on AWS EKS (us-west-2).

We are attempting to link to an OpenData H5 dataset (s3://nrel-pds-wtk/canada/v1.1.0bc/wtk_canada_2014.h5) also in us-west-2.

Our cluster is operating normally: we've run the HSDS test suites, and we have been able to load and link to other datasets, etc.

The issue we are seeing is that hsload --link performance is slow when reading chunks from the linked files. In utillib.py, a call to get_chunk_info sometimes takes several minutes; at other times, several hundred calls are processed per second.

The h5 file is 1.7 TB. We can download the file to a local server, so S3 access isn't an issue.

We've set up an EC2 instance in us-west-2 in the same VPC as the cluster, with a VPC gateway endpoint to S3, but we're still seeing the same performance issue.

Oddly, the time to download the entire h5 file to a local server is faster than hsload --link within us-west-2.

Is this expected behaviour, or are we doing something wrong?

Thanks,
Jeff

Hey,

Thanks for trying out HSDS!

Yes, hsload --link performance can be quite slow. The HDF5 library call only fetches one chunk location at a time, so it can take a while to iterate through the chunks of a large dataset. We are planning on looking into ways to optimize this.
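
To illustrate what's going on (this is a rough sketch, not the actual hsload code, and the dataset name is just an example), the chunk scan has to make one HDF5 library call per chunk:

import h5py

# Open a local copy of the source file just for illustration
with h5py.File("wtk_canada_2014.h5", "r") as f:
    dset = f["windspeed_100m"]                 # example dataset name
    chunk_locs = []
    for i in range(dset.id.get_num_chunks()):  # one library call per chunk...
        info = dset.id.get_chunk_info(i)       # ...to get its byte offset and size
        chunk_locs.append((info.byte_offset, info.size))
    print(len(chunk_locs), "chunk locations collected")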

It is just a one-time cost at least - once you've created the linked HSDS domain, you should get performance equivalent to what you see with non-linked domains.

I think NREL has already imported that particular file into HSDS. Isn't this the same one that is exported by https://developer.nrel.gov/api/hsds as /nrel/wtk/canada/wtk_canada_2014.h5?

If you'd rather use your own HSDS cluster (for one thing, you should get better performance vs. going through the NREL gateway), you can access the HSDS domain from your server. You just need to use the bucket="nrel-pds-hsds" option when opening the file. There are some examples of this on KitaLab in the examples/NREL folder.
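
Something along these lines should work (a minimal sketch; the domain path is the NREL one mentioned above, and your endpoint/credentials come from your .hscfg file):

import h5pyd

# Open the NREL domain from your own HSDS cluster, pointing at the public bucket
f = h5pyd.File("/nrel/wtk/canada/wtk_canada_2014.h5", "r", bucket="nrel-pds-hsds")
print(list(f))   # top-level groups/datasets in the linked domain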

Let me know if this helps!
John

Hi John,

Thanks for the additional info.

You are correct that we’d like to use our own HSDS cluster for performance reasons vs. the NREL gateway.

I believe I located the KitaLab examples you were referring to, but I'm running into an issue when trying to run them.

I can run hsls against our cluster (credentials set via hsconfig), and I can also access files on our cluster via Python/h5pyd with credentials explicitly set.
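
For context, by "credentials explicitly set" I mean something like the following (the domain path, endpoint, username, and password here are placeholders, not our real values):

import h5pyd

# Open a domain on our own cluster with credentials passed explicitly
f = h5pyd.File("/home/test/tall.h5", "r",
               endpoint="http://hsds.our-vpc.example.com",
               username="test_user1", password="******")
print(list(f))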

If I try to run the KitaLab example
hsls --bucket nrel-pds-hsds /nrel/wtk-us.h5
I get
No permission to read domain: /nrel/wtk-us.h5
and from Python:
OSError: [Errno 403] Forbidden

However, I can use the AWS CLI to list the domain files:
aws s3 ls nrel-pds-hsds/nrel/wtk-us.h5
so it doesn’t look like a permission issue on the bucket.

Is there a step we're missing to access the NREL HSDS domain from our cluster directly through the bucket?

Also, if we're accessing via the bucket (versus hsload --link), do we lose any cluster functionality (such as scaling our readers via Lambda)?

Thanks…
Jeff

Hey,
It looks like an overly strict regex had been added to the code that was rejecting bucket names with hyphens. I've removed the regex and updated the master branch, so hsls should work now.
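
For illustration only (this isn't the exact pattern that was in the code), the problem was along these lines: a bucket-name check that doesn't allow hyphens rejects names like nrel-pds-hsds, even though S3 permits them:

import re

too_strict = re.compile(r"^[a-zA-Z0-9_\.]+$")    # no '-' allowed
print(bool(too_strict.match("nrel-pds-hsds")))   # False -> request rejected with 403
print(bool(too_strict.match("mybucket")))        # True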

Can you rebuild your image from master and confirm?

Interesting that you are thinking about AWS Lambda! In the next release, we'll be moving from Lambda as "accelerators" for the SN nodes to a pure serverless solution. In any case, the --link option shouldn't affect performance.

John

Hi John,
Apologies for the delay – we ran into a couple of issues updating to the new master, but seem to have resolved everything.

Issues:

1. The dn and sn nodes were resolving to localhost. This might be related to this kubernetes-client issue. We tried pinning a couple of different kubernetes-client versions as described in that link, without success. What did work was an explicit call to k8s_client.Configuration().get_default_copy() in util/k8sclient.py (see the sketch after this list).

2. aws_access_key_id and aws_secret_access_key needed to be set. We’re using AWS EKS, and only had an IAM role configured in our override.yml. I suspect it’s a combination of the external bucket permissions and/or the permissions on our role.
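
Here's a rough sketch of the item 1 workaround (the namespace and label selector are placeholders from our setup, not necessarily what HSDS uses internally): forcing a copy of the default kubernetes-client Configuration before building the API client gave us the pod IPs instead of localhost:

from kubernetes import client as k8s_client, config as k8s_config

k8s_config.load_incluster_config()       # running inside the HSDS pod
configuration = k8s_client.Configuration().get_default_copy()
api_client = k8s_client.ApiClient(configuration)
v1 = k8s_client.CoreV1Api(api_client)

# List the sn/dn pods and their IPs (namespace/label are placeholders)
pods = v1.list_namespaced_pod(namespace="default", label_selector="app=hsds")
for pod in pods.items:
    print(pod.metadata.name, pod.status.pod_ip)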

What led us to the access key issue (item 2) is that we were testing on a fresh client where
this worked:
aws s3 ls s3://nrel-pds-hsds/nrel/nsrdb/ --no-sign-request
But this didn’t:
aws s3 ls s3://nrel-pds-hsds/nrel/nsrdb/

Both had worked on a previous client where the access keys had been set.

That mirrored the behavior we were seeing on the cluster. After we configured the access keys in override.yml and redeployed, we can now run hsls against other domains. Sharing in case it's useful for others.

Thanks,
Jeff

Hey Jeff,
Thanks for the update!

Re: kubernetes-client - I've been wondering if it would be less problematic to just remove the logic for HSDS pods to get the IPs of the other pods, and instead set up the HSDS pod to have one SN container and a configurable number of DN containers (so all SN->DN traffic would stay within one pod).

The downside is that if a client did a write to one pod and then happened to read from another pod (assuming the deployment has been scaled up), the client could get a stale version of the data. So you'd effectively be limited to a singleton pod deployment (the same as you'd have with, say, a MySQL pod).

Still, I don't think this would be that limiting in practice. To run clustered analytics apps, you could deploy SN/DN containers with the app as a sidecar. Would welcome input from any Kubernetes experts out there on what the best strategy is!

Re: AWS access keys - using an IAM role is best practice, but it can be a pain to set up. Did you follow the instructions here: https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html?

Hi Jeff,

Another NREL user reported a bug in HSDS that comes up when reading linked data with the bucket parameter. I've put a fix in the master branch, and it looks to be working now.

Here’s an example where the bug would come up:

import h5pyd
f = h5pyd.File("/nrel/nsrdb/v3/nsrdb_2000.h5", bucket="nrel-pds-hsds")
dset = f['wind_speed']
print(dset)  # should show: <HDF5 dataset "wind_speed": shape (17568, 2018392), type "<i2">
print(dset.chunks)
# should show: 
# {'class': 'H5D_CHUNKED_REF_INDIRECT',
# 'file_uri': 's3://nrel-pds-nsrdb/v3/nsrdb_2000.h5',
# 'dims': [2688, 372],
# 'chunk_table': 'd-096b7930-5dc5b556-d184-ffde30-7a0e85'}
arr = dset[0,::]  # read the first row of the dataset
print(f"{arr.min():6.2f}, {arr.max():6.2f}, {arr.mean():6.2f}")
# should print something like: "0.00, 167.00, 17.20"

The dset.chunks output shows that the dataset is linked to the HDF5 file: ‘s3://nrel-pds-nsrdb/v3/nsrdb_2000.h5’.

Before the bug fix, the dataset read would return a 404 error due to the code looking in the wrong bucket (nrel-pds-hsds rather than nrel-pds-nsrdb).

In the read selection above, 5425 chunks are accessed. Since each chunk is about 2 MB, that's roughly 10 GB of data read from S3 for each selection.
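
As a quick sanity check on those numbers (the chunk dims and the "<i2" dtype, i.e. 2 bytes per element, come from the dset/dset.chunks output above):

import math

n_cols, chunk_cols = 2018392, 372             # dataset width, chunk width
chunks_read = math.ceil(n_cols / chunk_cols)  # ~5,400 chunks touched by the selection
chunk_bytes = 2688 * 372 * 2                  # ~2 MB per chunk
print(chunks_read, round(chunks_read * chunk_bytes / 1e9, 1))  # ~10 GB read from S3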

Running a quick benchmark using HSDS with Docker on an m5.8xlarge instance, I got these runtimes for various numbers of DN containers (equivalent to the number of HSDS pods on K8s):

 1 node:    103 s   105 MB/s
 2 nodes:    56 s   193 MB/s
 4 nodes:    35 s   310 MB/s
 8 nodes:    28 s   387 MB/s
16 nodes:    16 s   678 MB/s

Nice near-linear speedup as the number of nodes is increased. Note that you likely don't want to add more nodes than you have cores on the system, as each node hits 100% CPU during the read.

I didn't get around to running on K8s and scaling the number of pods up and down. In that case, I think you'd have even more headroom to increase the number of pods on a multi-machine cluster.