Migrating pandas and local HDF5 to HSDS


#1

Introduction

I’m new to the HDF community and just wanted to say that I really appreciate the clear and extensive documentation!

In particular, the following references and repositories have been really helpful:

Background & Requirements

I’ve recently been working on a project where I have a ~2 GB pandas DataFrame from which I read small subsets of data and to which I periodically (daily) append small sets of data on the order of megabytes. I expect the dataset to remain in the single-digit gigabytes for the foreseeable future.

Two of the major reasons I’m using HDF5 with pandas are the append and select APIs of its HDFStore.

The two functions above make data analysis very quick and efficient when the underlying datastore changes, and they avoid the need to read everything into memory.

Purpose of Question

While everything works smoothly and quickly locally, moving this to the cloud has a few complications and design decisions that need to be taken into account.

Note: assume that the .h5 file was uploaded to an S3 compatible object store.

If I horizontally scale the service that reads from my HDFStore, I need to make the .h5 file available on every node. The problems I foresee are:

  • I would need to download the 2GB+ file on every node when the service starts up. This is slow and expensive even if they’re in the same VPC.
  • I considered packaging the 2GB+ file into the Docker image, but that creates a lot of bloat and makes regular updates to the underlying data troublesome.
  • Given that I need to append small amounts of data to the 2GB file on a daily basis, even if I only have a single writer, that writer will need to re-upload the entire 2GB file every time.

Current State

I configured my Kubernetes cluster on DigitalOcean and am using Spaces, which is an S3-compatible object store. I followed the instructions at kubernetes_install_aws and got things working with only a few modifications.

I used the test script from this thread and got the data to upload as expected.

For example, the output of: $ s3cmd ls s3://market-navigator-space --recursive | awk '{print $4}' is:

s3://market-navigator-space/daily_data/
s3://market-navigator-space/daily_data/daily_data.h5
s3://market-navigator-space/db/62b3ec8a-add6e50b/.info.json
s3://market-navigator-space/db/66b22a72-82527446/.group.json
s3://market-navigator-space/db/66b22a72-82527446/.info.json
s3://market-navigator-space/db/66f8d441-e255d8c8/.group.json
s3://market-navigator-space/db/66f8d441-e255d8c8/.info.json
s3://market-navigator-space/db/66f8d441-e255d8c8/d/dec9-fb1267-319972/.dataset.json
s3://market-navigator-space/db/66f8d441-e255d8c8/d/dec9-fb1267-319972/0_0
s3://market-navigator-space/db/66f8d441-e255d8c8/d/dec9-fb1267-319972/100_0
...
s3://market-navigator-space/db/66f8d441-e255d8c8/d/dec9-fb1267-319972/9_0
s3://market-navigator-space/db/bc84ba97-b793c7f4/.group.json
s3://market-navigator-space/db/bc84ba97-b793c7f4/d/4837-1bd7b6-461a4d/.dataset.json
s3://market-navigator-space/db/bc84ba97-b793c7f4/d/4837-1bd7b6-461a4d/0_0
...
s3://market-navigator-space/home/.domain.json
s3://market-navigator-space/home/olshansky/.domain.json
s3://market-navigator-space/home/olshanskytestFile_fromPython.h5/.domain.json

Questions

1. Data Migration of existing .h5 files

The first two lines of the s3cmd ls output above are:

s3://market-navigator-space/daily_data/
s3://market-navigator-space/daily_data/daily_data.h5

This is the 2GB file I’m referring to that I uploaded manually to my object store.

In order to make use of HSDS, my understanding is that the data needs to be chunked, and the corresponding .json metadata files need to be created. Is there a way to migrate the data?

Alternatively, I could potentially read everything into memory and do a large one-time upload using the h5pyd interface.

2. Pandas Support

There is an unanswered thread on HSDS Pandas integration.

If I migrate to HSDS, is there a way to continue making use of pandas append or select APIs?

Final Thoughts

Before we dive into the details, I was wondering whether HSDS is an appropriate fit for this use case at all, or whether my intuition is wrong.

I understand that this is a very open-ended and long question and appreciate any support from the community!

Thanks,
Daniel


#2

Hi,

Thanks for your questions!

To answer the second one first, Pandas doesn’t currently support HSDS as a data store, but it shouldn’t be too hard to add HSDS support to Pandas. There are two primary Python packages for reading and writing HDF5 data, h5py and pytables, with Pandas using the latter. A few years back, there was an attempt made to merge h5py and pytables (see: https://www.hdfgroup.org/2015/09/python-hdf5-a-vision/), but it ended up being more challenging than anticipated and we are still left with the two hdf packages.

For a python HSDS client, we created the h5pyd package that mirrors the h5py api. Over time, functionality was added to h5pyd that provides some of the features in Pytables: a table class, append operations, queries, etc. Therefore it might not be too much of a stretch to extend Pandas to support HSDS stores. I’ll look into this a bit more and get back to you.

For the first question, I presumed you used the “hsload” utility to upload the data to S3 (spaces). That should be it - hsload will automatically chunkify any non-chunked datasets. You can use “hsls -H -v /daily_data/” and you should see the /daily_data/ folder and the /daily_data/daily_data.h5 file (which really consists of the different .json files and chunk objects in the db folder). “hsls -r /daily_data/daily_data.h5” will list the HDF objects within daily_data.h5 (similar to h5ls for regular HDF5 files), and “hsinfo /daily_data/daily_data.h5” will show total size, number of objects, etc. Also, you might want to try some simple h5pyd scripts that read content from the file.
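As a rough illustration of the “simple h5pyd scripts” idea, the sketch below walks a file and lists its groups and datasets. It assumes h5pyd is installed and configured for your HSDS endpoint; the helper names walk and list_domain are made up for this example, and the domain path you pass in should be whatever hsls reports for your file.

```python
def walk(group, prefix=""):
    """Yield (path, kind) for every object below a group-like mapping.

    Works on anything that exposes .items(), as h5py/h5pyd groups do.
    Objects without .items() are treated as datasets (an assumption that
    holds for plain groups and datasets).
    """
    for name, obj in group.items():
        path = f"{prefix}/{name}"
        if hasattr(obj, "items"):
            yield path, "group"
            yield from walk(obj, path)
        else:
            yield path, "dataset"


def list_domain(domain):
    """Open an HSDS domain read-only and print what it contains."""
    import h5pyd  # imported lazily so walk() can be tried without a server

    with h5pyd.File(domain, "r") as f:
        for path, kind in walk(f):
            print(kind, path)
```

Calling list_domain("/daily_data/daily_data.h5") would then print one line per object, assuming that domain exists on the server.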

You might notice that the way pytables organizes content within HDF5 is a bit obscure. Since hsload faithfully copies what it finds in the HDF5 file to the S3 format, that’s what you end up with there as well. Therefore you might be better off writing directly from your data loaded into memory using h5pyd. You’ll need to write the data in batches of ~100MB though - h5pyd doesn’t currently support writing super-large selections.
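One way to keep each write near the ~100MB mark is to split the rows into ranges by byte size. This is only a sketch: batch_bounds and write_in_batches are hypothetical helpers, and dset can be any object that accepts slice assignment (an h5pyd or h5py dataset, for instance).

```python
def batch_bounds(n_rows, row_nbytes, max_bytes=100 * 1024 * 1024):
    """Yield (start, stop) row ranges of at most max_bytes each.

    row_nbytes must be positive. If a single row is larger than
    max_bytes, each batch holds exactly one row.
    """
    rows_per_batch = max(1, max_bytes // row_nbytes)
    for start in range(0, n_rows, rows_per_batch):
        yield start, min(start + rows_per_batch, n_rows)


def write_in_batches(dset, data, row_nbytes, max_bytes=100 * 1024 * 1024):
    """Copy data into dset one row range at a time; return the batch count."""
    batches = 0
    for start, stop in batch_bounds(len(data), row_nbytes, max_bytes):
        dset[start:stop] = data[start:stop]
        batches += 1
    return batches
```

For a one-dimensional NumPy structured array, row_nbytes would be data.dtype.itemsize, and dset would need to have been created with at least len(data) rows.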

For updating the data, HSDS should work really well since readers will be able to read the data while the writer is writing it. Again, you might find the h5pyd Table append method useful. And yes, I think using HSDS is very appropriate, esp. if we can figure out how to get Pandas to work with HSDS.

Let me know if I missed anything or you have additional questions.


#3

Hey John,

I really appreciate the detailed response. I find that historical context into how/why software evolved in some way is even more informative than how it functions today.

A few years back, there was an attempt made to merge h5py and pytables (see: https://www.hdfgroup.org/2015/09/python-hdf5-a-vision/), but it ended up being more challenging than anticipated and we are still left with the two hdf packages.

Makes sense. I did come across that doc and was struggling to figure out where it fits in with the status quo.

Over time, functionality was added to h5pyd that provides some of the features in Pytables: a table class, append operations, queries, etc. Therefore it might not be too much of a stretch to extend Pandas to support HSDS stores. I’ll look into this a bit more and get back to you.

The fact that it supports append is great to know. Just wanted to say that I’d be happy and interested to contribute to this effort after you look into it!

I presumed you used the “hsload” utility to upload the data to S3 (spaces).

I did not… I used the python package to upload my test data, but I uploaded my large HDF5 file manually. This makes sense and I’ll make sure to simply change it.

Therefore you might be better off writing directly from your data loaded into memory using h5pyd.

I don’t completely understand this part. Is the suggestion here to download the whole blob to local disk, load it into memory, write, and reupload?

Again, you might find the h5pyd Table append method useful. And yes, I think using HSDS is very appropriate, esp. if we can figure out how to get Pandas to work with HSDS.

I’ll definitely play around with the h5pyd append function; it seems very appropriate. Please keep me posted on what you find related to pandas!

I’ll report back on the thread once I try out all the suggestions and let you know how it worked!

Thanks,
Daniel


#4

Hey Daniel,

What I was getting at with my comment “better off writing directly from your data loaded in memory” is that if you create an HDF5 file using the pandas HDFStore, you’ll find the contents a bit confusing. Just creating datasets and writing content with h5py(d) works well (though this is not built into Pandas).

This Stack Overflow post outlines a straightforward approach to reading and writing dataframes to HDF5: https://stackoverflow.com/questions/30773073/save-pandas-dataframe-using-h5py-for-interoperabilty-with-other-hdf5-readers. It should work equally well with h5pyd (you can just use the “import h5pyd as h5py” trick). For the h5pyd version, you can add an append method to avoid having to rewrite the entire dataset.
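An append helper along those lines might look like the sketch below. It is not taken from the Stack Overflow post or from h5pyd itself: append_rows is a hypothetical name, and it assumes the dataset was created resizable along its first axis (for example with maxshape=(None,) in an h5py-style create_dataset call) and that its resize method accepts the h5py-style (size, axis) arguments.

```python
def append_rows(dset, rows):
    """Grow dset along axis 0 and write rows at the end; return the new length.

    Only the new rows are sent; existing data is left untouched.
    Assumes dset is resizable along axis 0.
    """
    old_len = dset.shape[0]
    new_len = old_len + len(rows)
    dset.resize(new_len, axis=0)   # extend the first dimension
    dset[old_len:new_len] = rows   # write only the appended range
    return new_len
```

With a DataFrame, rows could be something like df.to_records(index=False), provided its dtype matches the dataset's dtype.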

Btw, I came across this interesting blog comparing performance for different Pandas output formats: https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d. It would be interesting to see how both h5py and h5pyd do here.