Example hsload using s3link


#1

Is there an example of loading an h5 file that’s already in s3 using the s3link option?

It looks like hsload uses pycurl to retrieve the file from s3. If so, what environment variables need to be set up for the pycurl GET to succeed?


#2

Since posting, I’ve updated hdf5, h5py, and h5pyd to their latest versions. I also installed the latest version of s3fs. I now see that the option is no longer --s3link, but just --link. I am also using the s3:// path to point to the h5 files in s3.

Unfortunately, it still doesn’t work. We are using IAM roles instead of access keys. We have hsds running and working and we can do things like hsls using the roles.

Are there special environment variables or .hscfg settings that need to be set in order to use hsload with the --link option?


#3

I did a fresh install on a different machine. From the log file it now looks like it is finding the credentials, but when it builds the url it uses the wrong AWS region (us-east-1 instead of us-west-2).

Is there a way to specify the region as a part of the hsload command?

Also, maybe related, the program errors out with the following:
  File "/home/ubuntu/miniconda3/envs/hsds/lib/python3.8/site-packages/botocore-1.15.15-py3.8.egg/botocore/utils.py", line 1257, in get_bucket_region
    headers = response['ResponseMetadata']['HTTPHeaders']
TypeError: 'coroutine' object is not subscriptable
sys:1: RuntimeWarning: coroutine 'AioBaseClient._make_api_call' was never awaited
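
The TypeError above is what Python raises when synchronous code subscripts the result of an async call that was never awaited, which would be consistent with mismatched s3fs/aiobotocore/botocore versions (an assumption, not confirmed in this thread). A minimal sketch of the pattern, where get_bucket_region is a hypothetical stand-in and not the botocore function:

```python
import asyncio


async def get_bucket_region():
    # Hypothetical stand-in for an async client call; the dict shape
    # mirrors the keys seen in the traceback.
    return {"ResponseMetadata": {"HTTPHeaders": {"x-amz-bucket-region": "us-west-2"}}}


def sync_caller():
    # Bug pattern: the coroutine object is created but never awaited.
    response = get_bucket_region()
    try:
        return response["ResponseMetadata"]["HTTPHeaders"]
    except TypeError as exc:
        return str(exc)
    finally:
        # Close the coroutine so this demo does not emit "never awaited".
        response.close()


async def async_caller():
    # Correct pattern: await the coroutine before subscripting the result.
    response = await get_bucket_region()
    return response["ResponseMetadata"]["HTTPHeaders"]


error_message = sync_caller()
headers = asyncio.run(async_caller())
print(error_message)
print(headers)
```

If this is the cause, the fix lies in installing library versions that agree on sync vs. async clients rather than in hsload's arguments.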


#4

I’ve made some more progress… It looks like you need to use very specific versions of the libraries and tools in order to run the hsload command with the --link option:

  1. You must build/install your own version of h5py that links against version 1.10.6 of the hdf5 library. The pip h5py package bundles an older version of the hdf5 library.
  2. When building/installing the hdf5 library, the threadsafe option cannot be used, as it causes the h5py package to fail on import.
  3. The latest s3fs code cannot be used, as it errors out when the hsload script calls into it. The prebuilt conda package worked fine.
  4. With the latest version of hsload in h5pyd, note that the s3 link option has changed from --s3link to --link.
  5. With the latest version of hsload in h5pyd, note that the accepted hdf5 library version is hardcoded: only 1.10.x releases at 1.10.6 or later are allowed, so 1.12.x will not work.
  6. You may need to run hsload multiple times on the same file, because it fails intermittently.
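
The version rule in item 5 can be checked before running hsload. This is only a sketch of the rule as described in this post, not a copy of the check inside h5pyd; hdf5_version_ok is a hypothetical helper, and h5py.version.hdf5_version_tuple is the h5py attribute that reports the linked library version.

```python
def hdf5_version_ok(version):
    """Return True for HDF5 1.10.x with x >= 6, per the rule described above."""
    major, minor, release = version[:3]
    return (major, minor) == (1, 10) and release >= 6


if __name__ == "__main__":
    try:
        import h5py
    except ImportError:
        print("h5py is not installed in this environment")
    else:
        linked = h5py.version.hdf5_version_tuple
        print("linked HDF5:", linked, "ok for --link:", hdf5_version_ok(linked))
```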

Following all of the above, I am able to get hsload to mostly work. It begins loading a file into S3, creating groups, datasets, attributes, etc (at least according to the output of the tool and what I can see in S3). But after running for a while, it crashes with the following output:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/requests-2.24.0-py3.8.egg/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 817, in urlopen
    return self.urlopen(
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 817, in urlopen
    return self.urlopen(
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 817, in urlopen
    return self.urlopen(
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 807, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='hsds.sliderule.beta', port=80): Max retries exceeded with url: /?domain=%2Fdata%2FATL03_20181019065445_03150111_003_01.h5&bucket=slideruledemo (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/hsds1/bin/hsload", line 33, in <module>
    sys.exit(load_entry_point('h5pyd==0.7.3', 'console_scripts', 'hsload')())
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/h5pyd-0.7.3-py3.8.egg/h5pyd/_apps/hsload.py", line 366, in main
    load_file(fin, fout, verbose=verbose, dataload=dataload, s3path=s3path, compression=compression, compression_opts=compression_opts)
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/h5pyd-0.7.3-py3.8.egg/h5pyd/_apps/utillib.py", line 725, in load_file
    fout.close()
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/h5pyd-0.7.3-py3.8.egg/h5pyd/_hl/files.py", line 568, in close
    self.PUT(req, body=body)
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/h5pyd-0.7.3-py3.8.egg/h5pyd/_hl/base.py", line 912, in PUT
    rsp = self._id._http_conn.PUT(req, body=body, params=params, format=format)
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/h5pyd-0.7.3-py3.8.egg/h5pyd/_hl/httpconn.py", line 357, in PUT
    rsp = s.put(self._endpoint + req, data=data, headers=headers, params=params, auth=auth, verify=self.verifyCert())
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/requests-2.24.0-py3.8.egg/requests/sessions.py", line 590, in put
    return self.request('PUT', url, data=data, **kwargs)
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/requests-2.24.0-py3.8.egg/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/requests-2.24.0-py3.8.egg/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/miniconda3/envs/hsds1/lib/python3.8/site-packages/requests-2.24.0-py3.8.egg/requests/adapters.py", line 507, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPConnectionPool(host='hsds.sliderule.beta', port=80): Max retries exceeded with url: /?domain=%2Fdata%2FATL03_20181019065445_03150111_003_01.h5&bucket=slideruledemo (Caused by ResponseError('too many 500 error responses'))
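
The URL in the error above is percent-encoded. A minimal sketch using only the Python standard library decodes it to show which domain and bucket the final PUT (issued from fout.close()) was addressed to; the variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# Request target copied from the RetryError message above.
failing_url = "/?domain=%2Fdata%2FATL03_20181019065445_03150111_003_01.h5&bucket=slideruledemo"

# parse_qs percent-decodes the values and returns a list per key.
params = parse_qs(urlsplit(failing_url).query)
domain = params["domain"][0]
bucket = params["bucket"][0]

print("domain:", domain)
print("bucket:", bucket)
```

Since the client only reports that the server returned repeated 500 responses, the HSDS server logs for requests on this domain are the likely place to find the underlying error.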


#5

Hey JP,

I did a bunch of code updates/image pushes yesterday, so make sure you have the latest stuff. hsinfo should show server version 0.6 and h5pyd version 0.8.0. I tested this on AWS using an IAM role and it worked (if your IAM_ROLE is not “hsds_role”, you need to set the aws_iam_role config in your hsds/admin/config/override.yml).

In hsload, the --s3link option has been replaced by --link (since it now supports POSIX or Azure paths in addition to S3). It requires recent hdf5lib and h5py versions, so the easiest thing to do is to use the hdf5lib docker images, e.g.: docker run --rm -v ~/.hscfg:/root/.hscfg -v ~/data:/data -it hdfgroup/hdf5lib:1.10.6 bash
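
For illustration, here is a sketch that assembles an hsload invocation using --link. The build_hsload_command helper is hypothetical; the source URI and target domain are the ones that appear elsewhere in this thread, and -v is hsload's verbose flag.

```python
import shlex


def build_hsload_command(source_uri, domain, verbose=True):
    """Assemble the argument list for an hsload run that links to existing data."""
    cmd = ["hsload", "--link"]
    if verbose:
        cmd.append("-v")
    # hsload takes the source file first and the target domain second.
    cmd += [source_uri, domain]
    return cmd


cmd = build_hsload_command(
    "s3://slideruledemo/atl03samples/ATL03_20181019065445_03150111_003_01.h5",
    "/data/ATL03_20181019065445_03150111_003_01.h5",
)
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run(cmd, check=True) inside the container, assuming ~/.hscfg already holds the endpoint and credentials.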

Try loading the file using the docker image and see if that works. If the file is on a public S3 bucket, let me know where it is and I can try myself.

Also, you can pass the region to the server by setting the AWS_REGION environment variable.


#6

I ran the docker container and it started out like my previous runs, creating groups, datasets, attributes, etc., but then it crashed, and in a different place than before: it doesn’t get as far into the file. With my own build (which still crashes in the same place, even after updating the code), hsload appeared to load most of the file before crashing. With the docker container, it crashes after loading only a small part of the file. Here is the verbose output from the end of the run:

creating dataset /gt1l/bckgrd_atlas/tlm_top_band2, shape: (68639,), type: float32
dataset created, uuid: d-ca26b318-8f33e0ce-062e-f9f070-a01020, chunk_size: {'class': 'H5D_CHUNKED_REF', 'file_uri': 's3://slideruledemo/atl03samples/ATL03_20181019065445_03150111_003_01.h5', 'dims': [10000], 'chunks': {'0': [2215933435, 7557], '1': [2215940992, 7154], '2': [2215948146, 6796], '3': [2215954942, 5185], '4': [2215960127, 6503], '5': [2215966630, 7203], '6': [2215973833, 5892]}}
creating group /gt1l/geolocation
creating dataset /gt1l/geolocation/altitude_sc, shape: (121754,), type: float64
Traceback (most recent call last):
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/usr/local/lib/python3.7/site-packages/h5py-2.10.0-py3.7-linux-x86_64.egg/h5py/_hl/group.py", line 589, in proxy
    return func(name, self[name])
  File "/usr/local/lib/python3.7/site-packages/h5pyd-0.7.2-py3.7.egg/h5pyd/_apps/utillib.py", line 654, in object_create_helper
    create_dataset(obj, ctx)
  File "/usr/local/lib/python3.7/site-packages/h5pyd-0.7.2-py3.7.egg/h5pyd/_apps/utillib.py", line 392, in create_dataset
    logging.debug("annon_values: {}".format(anon_dset[...]))
  File "/usr/local/lib/python3.7/site-packages/h5pyd-0.7.2-py3.7.egg/h5pyd/_hl/dataset.py", line 794, in __getitem__
    rsp = self.GET(req, params=params, format="binary")
  File "/usr/local/lib/python3.7/site-packages/h5pyd-0.7.2-py3.7.egg/h5pyd/_hl/base.py", line 893, in GET
    rsp.headers['Content-Length'])
  File "/usr/local/lib/python3.7/site-packages/requests-2.24.0-py3.7.egg/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-length'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/hsload", line 11, in <module>
    load_entry_point('h5pyd==0.7.2', 'console_scripts', 'hsload')()
  File "/usr/local/lib/python3.7/site-packages/h5pyd-0.7.2-py3.7.egg/h5pyd/_apps/hsload.py", line 355, in main
    load_file(fin, fout, verbose=verbose, dataload=dataload, s3path=s3path, deflate=deflate,)
  File "/usr/local/lib/python3.7/site-packages/h5pyd-0.7.2-py3.7.egg/h5pyd/_apps/utillib.py", line 702, in load_file
    fin.visititems(object_create_helper)
  File "/usr/local/lib/python3.7/site-packages/h5py-2.10.0-py3.7-linux-x86_64.egg/h5py/_hl/group.py", line 590, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
SystemError: returned a result with an error set
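
The chunk_size entry printed for tlm_top_band2 in the verbose output above can be sanity-checked by hand. The sketch below assumes each 'chunks' value is an [offset, size] pair in bytes within the linked file and that the keys are one-dimensional chunk indices; both are readings of the output, not documented guarantees. The helper names are hypothetical.

```python
import math

# Layout and shape copied from the verbose output above.
layout = {
    "class": "H5D_CHUNKED_REF",
    "file_uri": "s3://slideruledemo/atl03samples/ATL03_20181019065445_03150111_003_01.h5",
    "dims": [10000],
    "chunks": {
        "0": [2215933435, 7557], "1": [2215940992, 7154],
        "2": [2215948146, 6796], "3": [2215954942, 5185],
        "4": [2215960127, 6503], "5": [2215966630, 7203],
        "6": [2215973833, 5892],
    },
}
shape = (68639,)


def expected_chunk_count(shape, chunk_dims):
    # Number of chunks needed to tile the dataset along every dimension.
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_dims))


def byte_ranges(layout):
    # Inclusive (first_byte, last_byte) range per chunk, ordered by index.
    ranges = []
    for index in sorted(layout["chunks"], key=int):
        offset, size = layout["chunks"][index]
        ranges.append((index, offset, offset + size - 1))
    return ranges


ranges = byte_ranges(layout)
total_stored = sum(size for _, size in layout["chunks"].values())
print(expected_chunk_count(shape, layout["dims"]), len(ranges), total_stored)
```

Here the seven listed chunks match the seven expected for 68639 elements in chunks of 10000, and the stored total is well below the 68639 * 4 bytes an uncompressed float32 dataset would need, which suggests the chunks are compressed in the source file.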


#7

Please install the latest h5pyd package (v 0.8.0). The server now supports http compression and one of the consequences is that the response headers may not include ‘Content-Length’, so the h5pyd code had to be updated to reflect that.
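
As an illustration of the kind of change this implies on the client side (a sketch only, not the actual h5pyd patch), code that needs the response size can treat the header as optional. payload_size is a hypothetical helper and a plain dict stands in for the response headers.

```python
def payload_size(headers, body):
    """Return Content-Length when present, else the length of the received body."""
    value = headers.get("Content-Length")
    if value is not None:
        # With HTTP compression this is the size on the wire,
        # which may differ from the decoded body length.
        return int(value)
    return len(body)


print(payload_size({"Content-Length": "12"}, b"hello world!"))
print(payload_size({}, b"abcd"))
```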

Let us know how it goes!


#8

Updating h5pyd to v0.8.0 worked for hsload! The code now completes without reporting any errors. The problem now is that when I go to read the data, it returns garbage. I checked the docker logs for the data node and all of the messages it reports look fine; it seems to think it read the data without problems.

A couple things that may be causing trouble:

  • The HSDS code we are running is from the end of July. If we update to the latest checked-in code, the HSDS service no longer works with our data and returns errors when we try to read certain datasets.
  • We are using the rest-vol connector.

#9

Is h5pyd able to read the data, or is it just the rest-vol connector that has problems?

With h5pyd, can you see links and attributes OK, with only the dataset reads failing?


#10

With h5pyd the same garbage data is returned. So it looks like the problem is not related to rest-vol, but lies in either hsload or HSDS.
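
One way to pin down "garbage" is to compare values read through h5pyd against the same dataset read with h5py from a local copy of the original file. This is a sketch under assumptions: a local copy exists, both libraries are installed, and first_mismatch and compare_dataset are hypothetical helpers.

```python
def first_mismatch(expected, actual):
    """Return the index of the first differing element, or None if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:  # note: NaN values compare unequal to themselves
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


def compare_dataset(local_path, domain, dset_path, count=1000):
    # Imported here so the helper above works without either library.
    import h5py
    import h5pyd

    with h5py.File(local_path, "r") as fin, h5pyd.File(domain, "r") as fout:
        expected = fin[dset_path][:count].tolist()
        actual = fout[dset_path][:count].tolist()
    return first_mismatch(expected, actual)
```

For example, compare_dataset("ATL03_20181019065445_03150111_003_01.h5", "/data/ATL03_20181019065445_03150111_003_01.h5", "/gt1l/geolocation/altitude_sc") would report where the values first diverge; a mismatch at index 0 would point at the chunk locations or decompression rather than at isolated values.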


#11

Could you copy the file at s3://slideruledemo/atl03samples/ATL03_20181019065445_03150111_003_01.h5 to a public bucket? I can try loading it myself.


#12

Thanks for looking into this. I replied to you via email.