Writing large datasets with h5pyd

Hi,

I am working on a project that involves uploading large datasets to Azure Blob Storage as HDF5 files. For testing, I am using a dataset stored in a CSV file with 500,000 rows and 1,024 columns. The column data types include strings of size 20, float32, float64, and signed int16, int32, and int64.

I am using the h5pyd library to perform the tests. One requirement is to be able to “query” the dataframe similarly to what PyTables makes possible for local HDF5 files, hence I am using the h5pyd Table class to write the dataset.

from pathlib import Path
import h5pyd as h5py
import pandas as pd
import numpy as np

DOMAIN_NAME = "hdf5://home/test/dataset.h5"
DATASET_PATH = Path.home() / 'Desktop' / 'data' / 'dataset.csv'
NUM_ROWS = 500_000
CHUNK_SIZE = 50
NB_CHUNKS = NUM_ROWS//CHUNK_SIZE
TABLE_NAME = "datatable"

with h5py.File(DOMAIN_NAME, "a") as file:
    for i, chunk in enumerate(pd.read_csv(DATASET_PATH, iterator=True, chunksize=CHUNK_SIZE)):
        # preprocessing: convert object (string) columns to bytes
        for col in chunk:
            if chunk[col].dtype == 'object':
                chunk.loc[:, col] = chunk[col].astype('|S')
        array = chunk.to_records(index=False)
        dt = np.dtype([(col, chunk[col].dtype) for col in chunk])

        if i == 0:
            table = file.create_table(TABLE_NAME, dtype=dt)
        else:
            table = file.get(TABLE_NAME)
        table.append(array)

However, I am constrained to write the data in very small chunks, as you can see in my code above. When I attempt to write more than 75 rows per chunk, I get an “OSError: Request Entity Too Large:413” error.
I have tried writing 50 rows per chunk, and it works at first, but after about 400 chunks the server seems to be overwhelmed and I get “Max retries exceeded … (Caused by ResponseError(‘too many 503 error responses’))”.
Furthermore, even if it did work, it would in theory take about 3 hours to write the whole dataset, which is very slow.

In the HSDS server config, I’ve tried changing the max_request_size parameter but it doesn’t seem to have any effect.

My first question is: how can I increase this limit?
I would also like to know whether there is a more efficient way to achieve what I’m trying to do.

Thank you in advance for your help.

Hi Theo,
I must admit that I am not familiar with the tables in h5pyd. Do they support more advanced features than a standard dataset, and are those features required?
Otherwise, I would suggest opening the CSV file in pandas and exporting it to a local .h5 file using this method. The dataframe should still be small enough to fit into memory. Then upload the .h5 file to HSDS via the hsload command from the h5pyd command line interface (see the Command Line Apps chapter in the readme).
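
Something along these lines might work (just a sketch; the local file name is made up, and to_hdf’s “table” format needs PyTables installed locally):

import pandas as pd

# read the CSV and write it to a local HDF5 file in "table" format
df = pd.read_csv("dataset.csv")
df.to_hdf("dataset_local.h5", key="datatable", format="table", mode="w")

# then push the local file to HSDS with the h5pyd command line tool,
# e.g. something like:
#   hsload dataset_local.h5 hdf5://home/test/dataset.h5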

Regarding the limits, there seem to be multiple settings in the config.yml for HSDS that limit request size. I haven’t fully grasped them all yet; max_chunks_per_request might be another one to look at.

Also this topic should probably be moved to the hsds forum :slight_smile:

Hi leo, thank you for your answer. Storing datasets as tables allows us to select and retrieve specific parts of the dataset using PyTables-style conditions, like in the example below. This seems very useful for large datasets, because we would rather not have to download the whole file and process it locally afterwards.
The dataframe is very large, though (the CSV file is about 6 GB), which is why we have to upload it in chunks.

with h5pyd.File(DOMAIN_NAME, "r") as file:
    table = file.get(TABLE_NAME)
    results = table.read_where("(col_B == b'A') & (col_D >= 1000)", limit=10)

I’m taking a look at this. It seems the problem is that pandas is returning a variable-length string type for the string fields. When HSDS gets this, it has to guess the chunk shape (since it doesn’t know how large the variable-length string elements will actually be), and that guess turns out to be wildly off the mark (each chunk ends up being a few hundred MB). I’ll try fixing up the script to use fixed-length types and see how that goes.
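
Here’s a minimal sketch of what I mean (using the column names from the query example above):

import numpy as np
import pandas as pd

df = pd.DataFrame({"col_B": ["A", "B"], "col_D": [1000, 2000]})
recs = df.to_records(index=False)
print(recs.dtype)  # col_B comes through as 'O', i.e. a variable-length string as far as HSDS is concerned
print(np.dtype([("col_B", "S20"), ("col_D", np.int64)]))  # fixed-width equivalent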

Converting the datatype to use fixed-width strings seemed to work. Testing on my laptop, the load took 140 seconds. Let me know if this works for you.

Here’s the revised script:


from pathlib import Path
import h5pyd as h5py
import pandas as pd
import numpy as np
import time
import logging

DOMAIN_NAME = "hdf5://home/test/dataset.h5"
DATASET_PATH = Path.home() / 'Desktop' / 'data' / 'dataset.csv'
NUM_ROWS = 500_000
CHUNK_SIZE = 400
NB_CHUNKS = NUM_ROWS//CHUNK_SIZE
TABLE_NAME = "datatable"

def convert_dt(dt, max_str_len=20):
    # convert any variable-width string types to fixed-width strings
    if len(dt) == 0:
        if dt.kind == 'O':
            # assume this is a string type
            return np.dtype(f"S{max_str_len}")
        return dt
    else:
        dt_out = []
        for name in dt.names:
            sub_dt = dt[name]
            name = name.strip()  # pandas can add some extra spaces
            sub_dt = convert_dt(sub_dt, max_str_len=max_str_len)
            dt_out.append((name, sub_dt))
        return np.dtype(dt_out)
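
# example: convert_dt(np.dtype([("name", "O"), ("x", "<f4")]))
#          returns dtype([('name', 'S20'), ('x', '<f4')])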

#
# main
#

loglevel = logging.WARNING
logging.basicConfig(format='%(asctime)s %(message)s', level=loglevel)
                    
with h5py.File(DOMAIN_NAME, "w") as file:
    for i, chunk in enumerate(pd.read_csv(DATASET_PATH, iterator=True, chunksize=CHUNK_SIZE)):
        # preprocessing: convert object (string) columns to bytes
        for col in chunk:
            if chunk[col].dtype == 'object':
                chunk.loc[:, col] = chunk[col].astype('|S')
        array = chunk.to_records(index=False)
        dt = np.dtype([(col, chunk[col].dtype) for col in chunk])
        if i == 0:
            h5_dt = convert_dt(dt)
            table = file.create_table(TABLE_NAME, dtype=h5_dt)
        else:
            table = file.get(TABLE_NAME)
        
        ts = time.time()
        table.append(array)
        te = time.time()
        row_num = i * CHUNK_SIZE
        print(f"rows {row_num} - {row_num+CHUNK_SIZE}: {(te-ts):6.2f} s")

Hi @jreadey, thank you very much for your help. I ran your code and was able to upload my dataset in 16 minutes. Can I ask what the characteristics of your computer and your HSDS configuration are?

Glad to hear that it worked for you!

I was just running on my laptop with a local HSDS. Writing to Azure Blob Storage is quite a bit slower than writing to an SSD drive, so I wouldn’t expect it to be as fast for you.

Try playing around with the CHUNK_SIZE value and see what works best. If you can run your code and HSDS on an Azure instance, that will be faster than running outside the data center.
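
For example, a rough way to compare settings before kicking off the real load (a sketch only; the scratch domain name and the simplified two-column dtype below are made up):

import time
import numpy as np
import h5pyd as h5py

SCRATCH_DOMAIN = "hdf5://home/test/chunk_size_test.h5"  # hypothetical scratch domain
dt = np.dtype([("col_B", "S20"), ("col_D", np.int64)])  # simplified stand-in dtype

for chunk_size in (200, 400, 800):
    rows = np.zeros(chunk_size, dtype=dt)  # synthetic rows of the right width
    with h5py.File(SCRATCH_DOMAIN, "w") as f:
        table = f.create_table("tune", dtype=dt)
        start = time.time()
        for _ in range(10):  # time ten append requests per setting
            table.append(rows)
        print(f"chunk_size={chunk_size}: {(time.time() - start) / 10:.2f} s per append")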