File storage on AWS S3


#1

Hi,

I successfully deployed HSDS on my local machine and connected it to an S3 bucket. I tested reading and writing with h5pyd and both worked. However, one thing that confused me is that on my S3 dashboard I can only see folders (e.g. /home/admin/test.h5) and JSON files, but no data files. Since HSDS stores each chunk as one object, I'd expect to see as many data files as there are chunks. This is the first time I've used S3, so I'm not sure if something is wrong here.
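
In case it helps, this is roughly the round-trip I tested (a minimal sketch; the domain name and values are just my local test setup):

import h5pyd
import numpy as np

# write a small dataset to a test domain (endpoint/credentials come from .hscfg)
with h5pyd.File("/home/admin/test.h5", "w") as f:
    f.create_dataset("dset", data=np.arange(10))

# read it back to verify
with h5pyd.File("/home/admin/test.h5", "r") as f:
    print(f["dset"][...])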

Best,
Ruochen


#2

Hey,

Do you see a folder in your bucket named “db”? All the data files will be there.

In the HSDS schema, folders and domain names are stored by their path (e.g. /shared/mydata.h5 will be stored in s3://bucketname/shared/mydata.h5/domain.json). Inside domain.json there will be a key for the root group id which points to an object under the db/ path.
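
As a rough sketch (key names follow the schema doc linked below; the exact contents vary by HSDS version), streaming a domain object out of the bucket shows something like:

$ aws s3 cp s3://bucketname/shared/mydata.h5/domain.json -
{
    "owner": "admin",
    "acls": {...},
    "created": 1624141378.2,
    "lastModified": 1624141378.2,
    "root": "g-eed60fdd-3e56eab3-665e-8755b6-de623b"
}

The "root" value is the root group id, and that group's metadata and chunk objects live under the db/ path.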

The rationale for this approach is that it makes it easy to move or rename a domain without having to move each data object.

If you are curious, this doc describes the storage model in some detail: https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.md

BTW, if you run hsinfo <domain_name>, you’ll get information about the number of JSON objects, data objects, storage size, etc. for that domain. E.g.:

$ hsinfo /shared/bioconductor/tenx_full.h5
domain: /shared/bioconductor/tenx_full.h5
    owner:           admin
    id:              g-eed60fdd-3e56eab3-665e-8755b6-de623b
    last modified:   2021-06-19 22:22:58
    last scan:       2021-06-19 22:19:22
    md5 sum:         4fafc30a05df174ca5cc8f05e4c6e659
    total_size:      6112211528
    allocated_bytes: 6112210230
    metadata_bytes:  875
    num objects:     2
    num chunks:      105273

#3

Yeah, I see the db folder and the data. Thanks a lot for the help! BTW, I’m curious: does HSDS support multiple datanodes with different configurations, e.g. one using S3 and the others using POSIX?

Best,
Ruochen


#4

No, each DN node in a deployment needs to have the same configuration.
There’s nothing to stop you from having two different HSDS deployments on the same machine though. You’d have one endpoint that serves S3 data and another for POSIX.
Similarly, you can have two HSDS deployments on a Kubernetes cluster. As long as the deployments are in different namespaces, the HSDS pods will just talk to pods in their own deployment.
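
For the same-machine case, a sketch of what this could look like (env var names per the HSDS install docs; all values here are illustrative):

# deployment 1: S3-backed
$ export AWS_S3_GATEWAY=http://s3.amazonaws.com
$ export BUCKET_NAME=mybucket
$ ./runall.sh

# deployment 2: POSIX-backed, from a second shell
# (it also needs its own compose project name and SN port -- see post #7 below)
$ unset AWS_S3_GATEWAY
$ export ROOT_DIR=/mnt/hsds_data
$ ./runall.sh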


#5

I see. Thanks a lot!


#6

Hi, I tried to create two different HSDS deployments by running docker-compose on two files separately. I changed the project name (both COMPOSE_PROJECT_NAME and container_name in the yml files) to guarantee that the two deployments have different names. However, the service failed to start (503) after I changed those names, even though all the containers were created successfully. Is there anything wrong here?


#7

Looks like the compose yml was using some hard-coded container names. I’ve updated the compose files on GitHub and made a code fix. Please try it out and let me know if this resolves the issue.

If you are using the runall.sh script, set the COMPOSE_PROJECT_NAME env var to the desired value (otherwise it will default to “hsds”). You’ll also need to set SN_PORT so the public ports don’t clash.
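
E.g. (the project name and port here are arbitrary picks for the second deployment):

$ export COMPOSE_PROJECT_NAME=hsds2   # default would be "hsds"
$ export SN_PORT=5102                 # any free port for the second public endpoint
$ ./runall.sh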

Grab the latest code from master, or pull the image from Docker Hub: hdfgroup/hsds:v0.7.0beta4


#10

It works perfectly now. Thank you!


#11

Awesome! Glad to hear it.


#12

Hi John, I just ran into another problem: after starting the two services, I tried hsinfo and hstouch to create folders, but they only work for one service. For the other one, they keep returning Error: [Errno 400] Invalid domain name. I used hsconfigure to change the endpoint before doing this, and the connection was OK. For the two services, I changed all the ports (head, sn, dn, rangeget) to be different in the yml files. Is there anything I did wrong here?
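
In case it helps, this is roughly the check I run against each service; instead of re-running hsconfigure each time, the endpoint can also be passed directly with -e (the ports are just my local values):

$ hsinfo -e http://localhost:5101
$ hstouch -e http://localhost:5101 /home/admin/test1/
$ hsinfo -e http://localhost:5102
$ hstouch -e http://localhost:5102 /home/admin/test2/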


#13

It also doesn’t work for the AWS-backed service, even when it is the first service I start.