Writing dataset chunk by chunk + monitoring physical size via h5py

mdorier · August 28, 2021, 3:23pm

I would like to create a fairly large dataset in an HDF5 file using h5py. The Python program that generates the data for this dataset will do so chunk by chunk, using NumPy arrays (chunks of shape (128, 344064) for an extensible dataset of shape (None, 344064)). How can I (1) append these chunks one by one to the dataset, and (2) make sure they are flushed to the file? Otherwise the memory usage will end up growing to several GB.

Also, the data in this dataset is highly compressible, so I plan to use compression='gzip'. Using h5ls -v on a file, I can see the compression efficiency, e.g.: Storage: 3607166976 logical bytes, 5787435 allocated bytes, 62327.56% utilization. In Python I can get the size of the data from a dataset using dataset.nbytes, which is the memory that would be needed to load the data. So, bonus question: is there a way to know, in Python and while I’m writing data chunk by chunk, the actual size on disk (or equivalently the % utilization)?

Thanks!

ajelenak · August 31, 2021, 12:51am

Hi @mdorier,

How can I (1) append these chunks one by one to the dataset, and (2) make sure they are flushed to the file (otherwise the memory usage will end up growing to several GB).

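(1) Something like:
```
dset.resize((n + 1) * 128, axis=0)
dset[n * 128:(n + 1) * 128, :] = data
```
where n is the chunk number (first chunk n=0) and data is a NumPy array of shape (128, 344064). The rows are addressed with a slice covering all 128 rows of the chunk, and the extensible dataset is grown with resize() before writing past its current extent.

(2) Call the h5py.File.flush() method, although this only instructs the HDF5 library to flush its buffers. The rest is up to the operating system.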
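
Putting both steps together, a minimal sketch could look like the following. It assumes the dataset is created empty along the first axis with maxshape=(None, cols); the helper name append_rows, the dataset name "data", the float32 dtype and the small demo sizes are illustrative choices and are not from the original post.

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(dset, block):
    """Grow dset along axis 0 and write block into the new rows (hypothetical helper)."""
    start = dset.shape[0]
    stop = start + block.shape[0]
    dset.resize(stop, axis=0)      # extend the first (unlimited) dimension
    dset[start:stop, :] = block    # write the new rows as a slice


# Small demo sizes; the thread uses 128 rows x 344064 columns.
rows, cols = 128, 1024
path = os.path.join(tempfile.mkdtemp(), "demo_append.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(0, cols),           # start empty along axis 0
        maxshape=(None, cols),     # None makes axis 0 extensible
        chunks=(rows, cols),       # one HDF5 chunk per appended block
        dtype="f4",
        compression="gzip",
    )
    for n in range(4):
        block = np.full((rows, cols), n, dtype="f4")
        append_rows(dset, block)
        f.flush()                  # ask HDF5 to flush its buffers
    final_shape = dset.shape
```

Flushing after every block is the simplest option; flushing every few blocks would also work if the per-call overhead turns out to matter.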

So, bonus question: is there a way to know, in python and while I’m writing data chunk by chunk, the actual size on disk (or equivalently the % utilization)?

See "How to compute and display compression ratios of HDF5 datasets in an HDF5 file using Python" on GitHub for an example of how this is done.
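
As a rough sketch of that idea (not the linked code itself), the low-level dataset ID in h5py exposes get_storage_size(), which reports the bytes allocated in the file for the dataset. Dividing the logical size by it gives the same kind of figure as the h5ls "utilization" line. The helper name storage_report and the demo sizes are assumptions for illustration; the file is flushed first because chunks still held in the HDF5 chunk cache may not be counted yet.

```python
import os
import tempfile

import h5py
import numpy as np


def storage_report(dset):
    """Return (logical bytes, allocated bytes, utilization %) for a dataset (hypothetical helper)."""
    logical = dset.size * dset.dtype.itemsize   # uncompressed size in memory
    allocated = dset.id.get_storage_size()      # bytes allocated in the file
    percent = 100.0 * logical / allocated if allocated else float("nan")
    return logical, allocated, percent


rows, cols = 128, 1024                          # small demo sizes
path = os.path.join(tempfile.mkdtemp(), "demo_storage.h5")
reports = []

with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "data", shape=(0, cols), maxshape=(None, cols),
        chunks=(rows, cols), dtype="f4", compression="gzip",
    )
    for n in range(4):
        start = dset.shape[0]
        dset.resize(start + rows, axis=0)
        dset[start:start + rows, :] = np.zeros((rows, cols), dtype="f4")
        f.flush()                               # write out cached chunks before measuring
        reports.append(storage_report(dset))

for logical, allocated, percent in reports:
    print(f"{logical} logical bytes, {allocated} allocated bytes, {percent:.2f}% utilization")
```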

Aleksandar
