Introduction
I’m new to the HDF community and just wanted to say that I really appreciate the clear and extensive documentation!
In particular, the following references and repositories have been really helpful:
Background & Requirements
I’ve recently been working on a project where I have a ~2 GB pandas DataFrame, from which I read small subsets of data and to which I periodically (daily) append small batches on the order of MBs. I expect the dataset to remain in the single-digit GB range for the foreseeable future.
Two of the major reasons I’m using HDF5 with pandas are the following APIs:
- HDFStore.append
- HDFStore.select

The two functions above make data analysis very quick and efficient when the underlying datastore changes, and they avoid the need to read everything into memory.
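For concreteness, here is roughly how those two calls get used today; the store key, column names, and values below are just illustrative:

```python
import pandas as pd

# A small batch of new rows, standing in for the real daily data
new_rows_df = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-04"]), "ticker": ["AAPL"], "close": [129.41]}
)

# Open the local store; "table" format is what makes append/select possible
store = pd.HDFStore("daily_data.h5")

# Append the day's rows without rewriting the data already in the file
store.append("prices", new_rows_df, format="table", data_columns=["date", "ticker"])

# Read back only the subset I need instead of loading the whole file
subset = store.select("prices", where="ticker == 'AAPL' & date >= '2021-01-01'")

store.close()
```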
Purpose of Question
While everything works smoothly and quickly locally, moving this to the cloud introduces a few complications and design decisions that need to be considered.
Note: assume that the .h5 file was uploaded to an S3-compatible object store.
If I horizontally scale the service that reads from my HDFStore, I need to make the .h5 file available on every node. The problems I foresee are:
- I would need to download the 2GB+ file on every node when the service starts up. This is slow and expensive even if they’re in the same VPC.
- I considered packaging the 2GB+ file into the Docker image, but that creates a lot of bloat and makes regular updates to the underlying data troublesome.
- Given that I need to append small amounts of data to the 2GB file on a daily basis, even if I only have a single writer, that writer will need to re-upload the entire 2GB file every time.
Current State
I configured my Kubernetes cluster on DigitalOcean and am using Spaces, which is an S3-compatible object store. I followed the instructions at kubernetes_install_aws and got things working with only a few modifications.
I used the test script from this thread and got the data to upload as expected.
For example, the output of: $ s3cmd ls s3://market-navigator-space --recursive | awk '{print $4}'
is:
s3://market-navigator-space/daily_data/
s3://market-navigator-space/daily_data/daily_data.h5
s3://market-navigator-space/db/62b3ec8a-add6e50b/.info.json
s3://market-navigator-space/db/66b22a72-82527446/.group.json
s3://market-navigator-space/db/66b22a72-82527446/.info.json
s3://market-navigator-space/db/66f8d441-e255d8c8/.group.json
s3://market-navigator-space/db/66f8d441-e255d8c8/.info.json
s3://market-navigator-space/db/66f8d441-e255d8c8/d/dec9-fb1267-319972/.dataset.json
s3://market-navigator-space/db/66f8d441-e255d8c8/d/dec9-fb1267-319972/0_0
s3://market-navigator-space/db/66f8d441-e255d8c8/d/dec9-fb1267-319972/100_0
...
s3://market-navigator-space/db/66f8d441-e255d8c8/d/dec9-fb1267-319972/9_0
s3://market-navigator-space/db/bc84ba97-b793c7f4/.group.json
s3://market-navigator-space/db/bc84ba97-b793c7f4/d/4837-1bd7b6-461a4d/.dataset.json
s3://market-navigator-space/db/bc84ba97-b793c7f4/d/4837-1bd7b6-461a4d/0_0
...
s3://market-navigator-space/home/.domain.json
s3://market-navigator-space/home/olshansky/.domain.json
s3://market-navigator-space/home/olshanskytestFile_fromPython.h5/.domain.json
Questions
1. Data Migration of existing .h5 files
The first two lines in the s3cmd ls call above are:
s3://market-navigator-space/daily_data/
s3://market-navigator-space/daily_data/daily_data.h5
This is the 2GB file I’m referring to that I uploaded manually to my object store.
In order to make use of HSDS, my understanding is that the data needs to be chunked and the corresponding .json metadata files need to be created. Is there a way to migrate the data?
Alternatively, I could potentially read everything into memory and do a large one-time upload using the h5pyd interface.
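Concretely, the one-time upload I have in mind would look something like the sketch below, assuming the frame is all-numeric; the endpoint, credentials, domain path, and dataset key are placeholders rather than my real config:

```python
import h5pyd  # pip install h5pyd -- talks to HSDS over HTTP
import pandas as pd

# Placeholder endpoint/credentials and target domain
ENDPOINT = "http://hsds.mycluster.local:5101"
DOMAIN = "/home/olshansky/daily_data.h5"

# One-time: read the existing ~2 GB file fully into memory
df = pd.read_hdf("daily_data.h5", key="prices")

with h5pyd.File(DOMAIN, "w", endpoint=ENDPOINT, username="admin", password="admin") as f:
    # Store the values as a chunked dataset with an unlimited first axis,
    # so future daily batches can be appended by resizing
    dset = f.create_dataset(
        "prices",
        data=df.to_numpy(),
        maxshape=(None, df.shape[1]),
    )
    # Keep the column names around so the frame can be rebuilt on read
    dset.attrs["columns"] = [str(c) for c in df.columns]
```

Is something like that a reasonable approach, or is there a supported migration path I’m missing?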
2. Pandas Support
There is an unanswered thread on HSDS Pandas integration.
If I migrate to HSDS, is there a way to continue making use of the pandas append or select APIs?
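If there is no direct integration, the fallback I’m imagining is to emulate append with a resizable dataset and select with a slice read through h5pyd, roughly like the sketch below (same placeholder endpoint and domain as above; the random rows stand in for a real daily batch):

```python
import h5pyd
import numpy as np
import pandas as pd

# Same placeholder endpoint/domain as in the upload sketch above
with h5pyd.File("/home/olshansky/daily_data.h5", "a",
                endpoint="http://hsds.mycluster.local:5101") as f:
    dset = f["prices"]
    columns = [c.decode() if isinstance(c, bytes) else c for c in dset.attrs["columns"]]

    # "append": grow the unlimited axis and write only the new rows
    new_rows = np.random.rand(10, dset.shape[1])  # stand-in for the daily batch
    n_old = dset.shape[0]
    dset.resize(n_old + new_rows.shape[0], axis=0)
    dset[n_old:] = new_rows

    # "select": read only a row range; HSDS should only serve the chunks touched
    start = max(0, dset.shape[0] - 1000)
    recent_df = pd.DataFrame(dset[start:], columns=columns)
```

That loses the where-style predicate filtering that HDFStore.select gives me, so I’d love to hear if there’s a better pattern.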
Final Thoughts
Before we dive into the details, I was wondering whether my intuition is off here: is this an appropriate use case for HSDS at all?
I understand that this is a very open-ended and long question and appreciate any support from the community!
Thanks,
Daniel