File storage on AWS S3


#1

Hi,

I successfully deployed HSDS on my local machine and connected it to an S3 bucket. I tested reading and writing with h5pyd and both worked. However, one thing that confused me is that on my S3 dashboard I can only see folders (e.g. /home/admin/test.h5) and JSON files, without any data files. Since HSDS stores each chunk as one object, I expected to see as many data files as there are chunks. This is the first time I have used S3, so I’m not sure if anything is wrong here.

Best,
Ruochen


#2

Hey,

Do you see a folder in your bucket named “db”? All the data files will be there.

In the HSDS schema, folders and domain names are stored by their path (e.g. /shared/mydata.h5 will be stored in s3://bucketname/shared/mydata.h5/domain.json). Inside domain.json there will be a key for the root group id which points to an object under the db/ path.

The rationale for this approach is that it makes it easy to move or rename domains without having to move each data object.
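
To make the layout concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for the bucket. The helper names (`domain_json_key`, `rename_domain`), the `root` field name, and the `db/` key shown are illustrative assumptions, not the actual HSDS code or key format; only the `<path>/domain.json` convention comes from the description above.

```python
def domain_json_key(domain_path: str) -> str:
    """Return the object key of domain.json for an absolute domain path."""
    if not domain_path.startswith("/"):
        raise ValueError("domain path must be absolute, e.g. /shared/mydata.h5")
    return domain_path.strip("/") + "/domain.json"


def rename_domain(bucket: dict, old_path: str, new_path: str) -> None:
    """Rename a domain by moving only its domain.json entry."""
    bucket[domain_json_key(new_path)] = bucket.pop(domain_json_key(old_path))


# Toy in-memory "bucket": object key -> content.
# The "root" field and the db/ key are placeholders, not the real HSDS layout.
bucket = {
    "shared/mydata.h5/domain.json": {"root": "g-example-root-id"},
    "db/example-root-id/chunk-0": b"...chunk bytes...",
}

rename_domain(bucket, "/shared/mydata.h5", "/shared/renamed.h5")
print(sorted(bucket))
```

Only the small domain.json entry changes key; everything under db/ stays where it is, which is the point of keeping data objects separate from the domain path.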

If you are curious, this doc: https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.md, describes the storage model in some detail.

BTW, if you run: hsinfo <domain_name>, you’ll get information about the number of json objects, data objects, storage size, etc. for that domain. E.g.:

$ hsinfo /shared/bioconductor/tenx_full.h5
domain: /shared/bioconductor/tenx_full.h5
    owner:           admin
    id:              g-eed60fdd-3e56eab3-665e-8755b6-de623b
    last modified:   2021-06-19 22:22:58
    last scan:       2021-06-19 22:19:22
    md5 sum:         4fafc30a05df174ca5cc8f05e4c6e659
    total_size:      6112211528
    allocated_bytes: 6112210230
    metadata_bytes:  875
    num objects:     2
    num chunks:      105273

#3

Yeah, I see the db folder and the data. Thanks a lot for the help! BTW, I’m curious: does HSDS support multiple data nodes with different configurations, e.g. one using S3 and others using POSIX?

Best,
Ruochen


#4

No, each DN node in a deployment needs to have the same configuration.
There’s nothing to stop you from having two different HSDS deployments on the same machine though. You’d have one endpoint that would serve S3 data and another for POSIX.
Similarly, you can have two HSDS deployments on a Kubernetes cluster. As long as the deployments are in different namespaces, the HSDS pods will only talk to pods in their own deployment.


#5

I see. Thanks a lot!


#6

Hi, I tried to run two different HSDS deployments by using docker-compose with two separate files. I changed the project name (both COMPOSE_PROJECT_NAME and container_name in the yml file) to guarantee that the two deployments have different names. However, the service failed to start (503) after I changed those names, although all containers were created successfully. Is there anything wrong here?


#7

Looks like the compose yml is using some hard-coded container names. I’ve updated the compose files in github and made a code fix. Please try it out and let me know if this resolves the issue.

If you are using the runall.sh script, set the COMPOSE_PROJECT_NAME env var to the desired value (otherwise it will default to “hsds”). You’ll also need to set SN_PORT so the public ports don’t clash.

Grab the latest code from master, or pull the image from Docker Hub: hdfgroup/hsds:v0.7.0beta4
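
As a rough sketch of that setup, the snippet below builds one environment per deployment and checks that neither the project names nor the public SN ports collide. The helper names (`deployment_env`, `check_no_clash`) and the example project names and port numbers are illustrative choices, not part of HSDS.

```python
import os


def deployment_env(project_name: str, sn_port: int, base_env=None) -> dict:
    """Copy the environment and set the two per-deployment variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["COMPOSE_PROJECT_NAME"] = project_name
    env["SN_PORT"] = str(sn_port)
    return env


def check_no_clash(envs) -> None:
    """Raise ValueError if two deployments share a project name or SN port."""
    names = [e["COMPOSE_PROJECT_NAME"] for e in envs]
    ports = [e["SN_PORT"] for e in envs]
    if len(set(names)) != len(names):
        raise ValueError("duplicate COMPOSE_PROJECT_NAME: %s" % names)
    if len(set(ports)) != len(ports):
        raise ValueError("duplicate SN_PORT: %s" % ports)


# Example values only; pick whatever names and free ports suit your machine.
aws_env = deployment_env("aws", 5101)
posix_env = deployment_env("posix", 8080)
check_no_clash([aws_env, posix_env])
print(aws_env["COMPOSE_PROJECT_NAME"], aws_env["SN_PORT"])
print(posix_env["COMPOSE_PROJECT_NAME"], posix_env["SN_PORT"])
```

Each dict can then be passed as the `env=` argument of `subprocess.run()` when launching runall.sh in the way you normally do.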


#10

It works perfectly now. Thank you!


#11

Awesome! Glad to hear it.


#12

Hi John, I just ran into another problem: after starting the two services, I tried hsinfo and hstouch to create folders, but they only work for one service. For the other one, they keep returning Error: [Errno 400] Invalid domain name. I used hsconfigure to change the endpoint before doing this and the connection was OK. For these two services, I changed all the ports (head, sn, dn, rangeget) to be different in the yml files. Is there anything I did wrong here?


#13

It also doesn’t work for AWS, even if it is the first service I started.


#14

Hey,
You shouldn’t need to mess with the head port, etc. It’s only if the port is exposed on the host that there’s a potential for conflict. If you look at the port lines in docker-compose, e.g. for the posix version: https://github.com/HDFGroup/hsds/blob/master/admin/docker/docker-compose.posix.yml, you should see that only the SN_PORT has both external and internal mappings.

(Actually, I goofed in my last update and forgot to remove the external port for the rangeget proxy. I’ve fixed this now.)

In general, if the COMPOSE_PROJECT_NAME is different, two containers can have the same internal port, but they’ll be on different internal networks, so they shouldn’t conflict.

I set up two projects, one using AWS and the other using POSIX. Here’s what my docker ps looks like:

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                             NAMES
66e0e9140ef4        hdfgroup/hsds       "/bin/bash -c 'sourc…"   12 minutes ago      Up 12 minutes       5100-5999/tcp, 0.0.0.0:32777->6101/tcp            posix_dn_1
5f2bc6a08d5b        hdfgroup/hsds       "/bin/bash -c 'sourc…"   13 minutes ago      Up 12 minutes       5100-5999/tcp, 0.0.0.0:32776->6900/tcp            posix_rangeget_1
74a28db22b8e        hdfgroup/hsds       "/bin/bash -c 'sourc…"   13 minutes ago      Up 12 minutes       5100-5999/tcp, 0.0.0.0:8080->8080/tcp             posix_sn_1
89782f803c1f        hdfgroup/hsds       "/bin/bash -c 'sourc…"   13 minutes ago      Up 12 minutes       5101-5999/tcp, 0.0.0.0:32775->5100/tcp            posix_head_1
9a57450213f5        hdfgroup/hsds       "/bin/bash -c 'sourc…"   36 minutes ago      Up 36 minutes       5100-5999/tcp, 0.0.0.0:32769->6101/tcp            aws_dn_1
6871638c30ba        hdfgroup/hsds       "/bin/bash -c 'sourc…"   36 minutes ago      Up 36 minutes       5100/tcp, 5102-5999/tcp, 0.0.0.0:5101->5101/tcp   aws_sn_1
2d07d7c617b2        hdfgroup/hsds       "/bin/bash -c 'sourc…"   36 minutes ago      Up 36 minutes       5100-5999/tcp, 0.0.0.0:6900->6900/tcp             aws_rangeget_1
6525b17c0f80        hdfgroup/hsds       "/bin/bash -c 'sourc…"   36 minutes ago      Up 36 minutes       5101-5999/tcp, 0.0.0.0:32768->5100/tcp            aws_head_1

By setting HS_ENDPOINT to http://localhost:5101 or http://localhost:8080, I can read/write to AWS S3 or local POSIX storage respectively.

Hope that helps!


#15

I still can’t run hsinfo or hstouch successfully even on a single AWS project. It still returns Error: [Errno 400] Invalid domain name.

My docker ps looks like the following:

CONTAINER ID   IMAGE           COMMAND                  CREATED          STATUS          PORTS                                                                NAMES
40cd68ff1173   hdfgroup/hsds   "/bin/bash -c 'sourc…"   46 seconds ago   Up 45 seconds   5100-5999/tcp, 0.0.0.0:49170->6101/tcp, :::49170->6101/tcp           hsds_dn_1
8ab41e338d9e   hdfgroup/hsds   "/bin/bash -c 'sourc…"   48 seconds ago   Up 47 seconds   5100/tcp, 5102-5999/tcp, 0.0.0.0:5101->5101/tcp, :::5101->5101/tcp   hsds_sn_1
f87c4f497ff1   hdfgroup/hsds   "/bin/bash -c 'sourc…"   48 seconds ago   Up 46 seconds   5100-5999/tcp, 0.0.0.0:49169->6900/tcp, :::49169->6900/tcp           hsds_rangeget_1
123ada60e5b1   hdfgroup/hsds   "/bin/bash -c 'sourc…"   48 seconds ago   Up 47 seconds   5101-5999/tcp, 0.0.0.0:49168->5100/tcp, :::49168->5100/tcp           hsds_head_1

Is HS_ENDPOINT here the hsds_endpoint in the config file? I set that to http://localhost without any port, while I set the endpoint to http://localhost:5101 in hsconfigure.

I’m still not sure if this error is caused by the AWS connection or the local connection, because it doesn’t occur when I set up a single POSIX project.


#16

If you set the env variable HS_ENDPOINT it will override what’s in the config file. Your hsds_sn_1 container is exposed on port 5101, so that is what you should use.
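
The precedence can be pictured with a small sketch like the one below. This is not the h5pyd implementation; `resolve_endpoint` is a hypothetical helper, and the `hs_endpoint` config key is an assumption about how the client config file names the setting.

```python
import os


def resolve_endpoint(config: dict, environ=None):
    """Return the endpoint a client would use: env var first, then config."""
    environ = os.environ if environ is None else environ
    # A non-empty HS_ENDPOINT wins over whatever the config file says.
    return environ.get("HS_ENDPOINT") or config.get("hs_endpoint")


# Config file says port-less localhost, environment points at the SN port.
config = {"hs_endpoint": "http://localhost"}
print(resolve_endpoint(config, {"HS_ENDPOINT": "http://localhost:5101"}))
print(resolve_endpoint(config, {}))
```

With the environment variable set, the first call picks http://localhost:5101; with it unset, the config file value is used instead.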

Try: curl http://localhost:5101/about as a sanity check.
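
If you would rather run the same check from Python, a minimal sketch using only the standard library could look like this. The function names are made up for illustration, and it assumes the /about response body is JSON; adjust the endpoint to whichever port your SN container exposes.

```python
import json
import urllib.request


def about_url(endpoint: str) -> str:
    """Build the /about URL for an endpoint, tolerating a trailing slash."""
    return endpoint.rstrip("/") + "/about"


def fetch_about(endpoint: str, timeout: float = 5.0):
    """GET <endpoint>/about and return the parsed body, or None on failure."""
    url = about_url(endpoint)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # OSError covers connection and HTTP errors; ValueError covers bad JSON.
        print("could not get a valid response from %s: %s" % (url, exc))
        return None


# Example (needs a running service): fetch_about("http://localhost:5101")
```

A return value of None means the service was unreachable or did not answer as expected, which points to a connection problem rather than a problem with the domain name.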