Connecting HSDS to NREL data stored on S3

I would like to run some analysis on the NREL wind speed gridded dataset. The existing REST API, accessed from h5pyd at this end, is slow.

I’m aiming to run my own HSDS server on an EC2 instance in the us-west-2 AWS region, so it is close to the data. I’ll probably run the h5pyd application on the same instance too, at least to start with.

I’ve made good progress, but have now got stuck. Any suggestions of next steps would be appreciated.

So far:

  • EC2 instance with Amazon Linux, hosted in AWS us-west-2
  • virtualenv python environment
  • pip install of hsds and h5pyd, plus all the usual jupyter etc.
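
For the record, the environment setup was roughly this (a sketch; hsds and h5pyd both install from PyPI):

python3 -m venv ~/venv
source ~/venv/bin/activate
pip install hsds h5pyd jupyterlab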

My .hscfg looks like this:
aws_s3_gateway = https://us-west-2.amazonaws.com
hs_endpoint = https://localhost:5101
hs_username = None
hs_password = None
hs_api_key = None

But when I run hsds, the error is:
root_dir not set (and no S3 or Azure connection info)

I’d be grateful if anyone can suggest how to configure the s3 endpoint, bucket etc so that hsds detects this correctly.

Thanks.

It seems like HSDS isn’t detecting your .hscfg file. By default, HSDS expects it to be in the home directory. You can specify a different directory to check with --config_dir.
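
For example (assuming hsds is launched directly, as in your setup):

hsds --config_dir /path/to/config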

My hack has been to add a line to app.py as follows:
config.cfg.update(userConfig)

This copies all of the userConfig dictionary entries to config, thereby making the test on aws_s3_gateway work correctly.

.hscfg now looks like this:
aws_s3_gateway = http://s3.us-west-2.amazonaws.com
bucket_name = nrel-pds-hsds
hs_endpoint = http://localhost:5101
hs_username = None
hs_password = None
hs_api_key = None

This now allows hsds to start up on port 5101 successfully.

The next challenge is to connect to this using h5pyd. If I use the following, it just hangs, and there’s nothing in the hsds stdout/stderr.

f = h5pyd.File("/nrel/wtk-us.h5")

Thoughts?

Hey there,
No need to hack the source code!
All of the configurable settings for HSDS live in hsds/admin/config/config.yml. There are some 100 different config values, but most of these you can ignore. In the yml there’s an "aws_s3_gateway" key that defaults to null; just set it to the aws_s3_gateway value you are using. Since you likely don’t want to accidentally put these changes in a pull request, you can leave config.yml alone and add that line to the "override.yml" file in the same directory instead.
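
For instance, a minimal override.yml could hold just the one key (using the gateway value from your post):

aws_s3_gateway: http://s3.us-west-2.amazonaws.com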

Alternatively, you can define an environment variable “AWS_S3_GATEWAY”, and that will get passed to the containers by the runall.sh script.
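
For example:

export AWS_S3_GATEWAY=http://s3.us-west-2.amazonaws.com
./runall.sh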

These steps are described here: http://github.com/HDFGroup/hsds/docs/docker_install_aws.md. Have you reviewed this?

The .hscfg file is just for h5pyd client side settings - it won’t have any effect on how HSDS works.

We really should have readthedocs documentation for HSDS to make these things easier to find. Hopefully soon!

Hi,
That link is broken for me - is this one the same thing?
https://github.com/HDFGroup/hsds/blob/master/docs/docker_install_aws.md

This is working nicely so far. A couple of notes along the way:

  • I had thought that hsds could be installed through pip. Maybe it can, but the pip package appears to be just the code, without all of the docker and configuration files that arrive with a git clone. I’m going with the latter this time, as per the instructions.

  • Item 5 says to run setup.py but there isn’t a file with this name. I ran ./build.sh, which appeared to do a sensible thing.

  • On Amazon Linux, we need yum rather than apt to install things like docker (see the sketch after this list).

  • I needed pyflakes to be installed for build.sh to run. I had a python venv with pip, so I just added pyflakes to that and activated it before running build.sh.

  • I used this page to install docker-compose on Amazon Linux:
    Amazon Linux 2 - install docker & docker-compose using ‘sudo amazon-linux-extras’ command (github.com)
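
The Amazon Linux setup, roughly (a sketch; the docker install follows the amazon-linux-extras route from the page above):

sudo amazon-linux-extras install docker   # yum-based distro, so no apt here
sudo service docker start
sudo usermod -aG docker ec2-user          # optional: run docker without sudo
pip install pyflakes                      # needed by build.sh (inside the venv)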

HSDS Runs!

I can follow the server log output to see what is happening with:
docker logs --follow hsds-sn-1

The challenge now is to configure my running HSDS server on EC2 so that it reads from the pre-existing S3 bucket as the data back end.

I’ve added the following in hsds/admin/config/override.yml

aws_s3_gateway: https://s3.us-west-2.amazonaws.com
aws_region: us-west-2
default_public: True
bucket_name: nrel-pds-wtk
greeting: If you see this, the override is working!
aws_s3_no_sign_request: True
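
After changing the override, I restart the containers so it takes effect (using the scripts in the repo root; I believe the stop script is called stopall.sh):

./stopall.sh
./runall.sh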

Is it working?

It looks like there are two buckets: nrel-pds-wtk, and nrel-pds-hsds. I had thought that the hsds one would be right, but if I run hsls on this, it just locks up the server. Is this just because the data is divided into a zillion chunks?

The next attempt is to try nrel-pds-wtk as the bucket, as configured in override.yml.

The result from this is that hsls just returns / and nothing else, and even hsls -r doesn’t return any more, so no dice.

However, if I look in the server logs, it is suddenly listing out loads of interesting things: /Great_Lakes, /Hawaii and so on. All of them have a
WARN> fetch result - not found error for: /Offshore_CA
and a
WARN> get_domains - domain: /Great_Lakes not found in crawler dict

Progress report

HSDS now running nicely in a docker setup on EC2 in us-west-2.

If you have any advice on how to configure it to connect to the right bucket, so that it can successfully read e.g. the 100m layer of 2km resolution windspeed grid in the US I’d be hugely appreciative. I’m stumped so far, but feel like I’m close.

Looks like you are almost there!

The bucket you want is nrel-pds-hsds. Unfortunately there’s some legacy cruft in the bucket that causes trouble with hsls /. There’s a “top_level_domains” config to get around it. Also, I think you’ll see better performance with http as the S3 endpoint. I tried this override and everything worked ok:


aws_s3_gateway: http://s3.us-west-2.amazonaws.com
aws_region: us-west-2
default_public: True
bucket_name: nrel-pds-hsds
greeting: If you see this, the override is working!
aws_s3_no_sign_request: True
top_level_domains: [/nrel]
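
With that override in place, a quick sanity check from the client side (hsls is one of the h5pyd command-line tools; the trailing slash asks for a folder listing):

hsls /nrel/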

My final working recipe (from memory at least) for HSDS running on EC2 and connecting to the NREL data on S3 is as follows. With this setup, it rips through data extraction pretty nicely compared with the developer.nrel.gov API endpoint.

  • Set up an EC2 node in us-west-2 with plenty of cores and RAM, using Amazon Linux.
  • Make a virtualenv for python, with pyflakes, h5pyd, jupyterlab and anything else you fancy.
  • Get hsds from git, not from pip, so it comes with all the config structure.
  • Follow the ‘HSDS on Docker’ setup, running build.sh not setup.py, and with docker and docker-compose installed.
  • Configure HSDS in the override.yml file to access the nrel-pds-hsds bucket, and with the top_level_domain configured.
  • In the AWS console, give your user an Access Key with an ID and Secret. Add these to your .bashrc (sketched below) and log out and in again. The S3 bucket is public access, but HSDS seems to need some credentials to work nicely. There’s also the ‘IAM Role’ permissions route, but I didn’t get this to work (and didn’t really need to).
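
The .bashrc additions are just the standard AWS credential variables (placeholder values shown here, not real keys):

export AWS_ACCESS_KEY_ID=AKIA...          # placeholder - use your own key ID
export AWS_SECRET_ACCESS_KEY=...          # placeholder - use your own secret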

From there, the h5pyd client side needs to be configured (~/.hscfg) to point to your local HSDS server at http://localhost:5101; the data is then opened via the domain path /nrel/wtk-us.h5.
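
A minimal client-side sketch, assuming the endpoint and domain above:

# ~/.hscfg:
#   hs_endpoint = http://localhost:5101

import h5pyd
f = h5pyd.File("/nrel/wtk-us.h5", "r")    # mode 'r' for read-only
print(list(f))                            # top-level groups/datasets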

I’ve been using jupyterlab through VSCode over SSH on the EC2 instance, so running h5pyd and HSDS on the same machine.

Sorry this is a partial cut - there’s more detail in the discussion above. Big thanks to John at HDFGroup for his help through this.

Glad to hear you got everything working!

Some final thoughts:

  • Setup can be a bit complicated, partially because there are so many options: Docker, Kubernetes, or just regular processes. Docker and Kubernetes need a docker image, which you can build locally or pull from DockerHub. Running as processes uses the hsds package, which you can build yourself or install from PyPI.
    For your use case I suspect Docker will work best.
  • The runall.sh script has options for how many SN and DN containers to run. More DN containers will generally speed things up, but don’t run more DN containers than you have cores on the machine. Use multiple SN containers if your application is multi-process and you can arrange to hit different HSDS ports from the different processes. If docker stats shows containers regularly hitting 100% CPU, adding more containers should help (see the sketch after this list).
  • Accessing public buckets should work without the AWS auth keys as long as you set "aws_s3_no_sign_request" to true.
  • I’ve fixed the setup.py reference in the doc.
  • I’m a big fan of running VSCode remote to an EC2 instance myself!
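
For example (assuming runall.sh takes the node count as its first argument; check the script if in doubt):

./runall.sh 4    # e.g. 4 nodes on a machine with 4+ cores
docker stats     # sustained 100% CPU per container suggests adding more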