Hi,
I am working on a project that involves uploading large datasets to Azure Blob Storage as HDF5 files. For testing, I am using a dataset stored in a CSV file with 500,000 rows and 1,024 columns. The column data types include strings of length 20, float32, float64, and signed int16, int32 and int64.
I am using the h5pyd library to run the tests. One requirement is to be able to “query” the dataframe, similar to what PyTables makes possible for local HDF5 files. Hence I am using the h5pyd Table class to write the dataset.
from pathlib import Path

import h5pyd as h5py
import pandas as pd
import numpy as np

DOMAIN_NAME = "hdf5://home/test/dataset.h5"
DATASET_PATH = Path.home() / 'Desktop' / 'data' / 'dataset.csv'
NUM_ROWS = 500_000
CHUNK_SIZE = 50
NB_CHUNKS = NUM_ROWS//CHUNK_SIZE
TABLE_NAME = "datatable"

with h5py.File(DOMAIN_NAME, "a") as file:
    for i, chunk in enumerate(pd.read_csv(DATASET_PATH, iterator=True, chunksize=CHUNK_SIZE)):
        # preprocessing
        for col in chunk:
            if chunk[col].dtype == 'object':
                chunk.loc[:, col] = chunk[col].astype('|S')
        array = chunk.to_records(index=False)
        dt = np.dtype([(col, chunk[col].dtype) for col in chunk])
        if i == 0:
            table = file.create_table(TABLE_NAME, dtype=dt)
        else:
            table = file.get(TABLE_NAME)
        table.append(array)
However, I am forced to write the data in very small chunks, as the code above shows. When I try to write more than 75 rows per chunk, I get an “OSError: Request Entity Too Large:413” error.
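To put the 413 error in perspective, the raw size of one append can be bounded from the column types alone. This is only a rough sketch: the exact mix of column types is not listed here, so it computes the two extremes (every column int16, or every column a 20-byte string), and it assumes the request body is roughly the raw record bytes, ignoring any HTTP or encoding overhead. The names TYPE_SIZES and ROWS_PER_APPEND are illustrative.

```python
# Size in bytes of each column type present in the dataset.
TYPE_SIZES = {
    "S20": 20,
    "float32": 4,
    "float64": 8,
    "int16": 2,
    "int32": 4,
    "int64": 8,
}
NUM_COLS = 1024
ROWS_PER_APPEND = 75  # appends larger than this fail with 413

# Two extremes, since the real column mix is unknown (assumption).
min_row = NUM_COLS * min(TYPE_SIZES.values())  # every column int16
max_row = NUM_COLS * max(TYPE_SIZES.values())  # every column a 20-byte string

min_append = min_row * ROWS_PER_APPEND
max_append = max_row * ROWS_PER_APPEND

print(f"bytes per row: {min_row} to {max_row}")
print(f"bytes per {ROWS_PER_APPEND}-row append: {min_append} to {max_append}")
```

Under these assumptions a 75-row append carries somewhere between about 150 KB and 1.5 MB of raw data.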
I have tried writing 50 rows per chunk. It works at first, but after about 400 chunks the server seems to be overwhelmed and I get “Max retries exceeded … (Caused by ResponseError(‘too many 503 error responses’))”.
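For illustration, here is a minimal sketch of retrying a failed append with exponential backoff. The helper append_with_backoff and all of its parameters are hypothetical and not part of h5pyd. It assumes the failure reaches the caller as an OSError, as the 413 error above does. It would only give an overloaded server time to recover, not remove the cause of the 503 responses.

```python
import time


def append_with_backoff(append_fn, rows, max_attempts=5, base_delay=1.0,
                        retry_on=(OSError,), sleep=time.sleep):
    """Call append_fn(rows), retrying with exponentially growing delays.

    Hypothetical helper: retry_on is an assumption about which exception
    types signal a transient server error.
    """
    for attempt in range(max_attempts):
        try:
            return append_fn(rows)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

In the loop above, this would replace the direct call with append_with_backoff(table.append, array). Note that retrying on OSError would also retry a 413 error, which cannot succeed on a retry, so a real version would need to tell the two cases apart.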
Furthermore, even if it did work, writing the whole dataset this way would in theory take about 3 hours, which is very slow.
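Taking the 3-hour estimate at face value, the implied cost per append can be worked out backwards from the chunk size. This is plain arithmetic on the numbers above. ESTIMATED_HOURS is the rough estimate, not a measurement, and the variable names are illustrative.

```python
NUM_ROWS = 500_000
CHUNK_SIZE = 50
ESTIMATED_HOURS = 3  # rough estimate, not a measured value

total_seconds = ESTIMATED_HOURS * 3600

# One append request per chunk of CHUNK_SIZE rows.
num_requests = NUM_ROWS // CHUNK_SIZE
seconds_per_request = total_seconds / num_requests
rows_per_second = NUM_ROWS / total_seconds

print(f"{num_requests} requests, {seconds_per_request:.2f} s each, "
      f"{rows_per_second:.1f} rows/s")
```

That is 10,000 append requests at roughly one second each, so the total time is dominated by the number of requests rather than by the volume of data.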
In the HSDS server config, I’ve tried changing the max_request_size parameter, but it doesn’t seem to have any effect.
My first question is how can I increase this limit?
I would also like to know whether there is a more efficient way to achieve what I’m trying to do.
Thank you in advance for your help.