Writing dataset chunk by chunk + monitoring physical size via h5py


#1

I would like to create a fairly large dataset in an HDF5 file using h5py. The Python program that generates the data for this dataset will do so chunk by chunk (chunks of shape (128, 344064) for an extensible dataset of dimensions (None, 344064)) using NumPy arrays. How can I (1) append these chunks one by one to the dataset, and (2) make sure they are flushed to the file (otherwise the memory usage will end up growing to several GB).

Also, the data in this dataset is highly compressible, so I plan to use compression='gzip'. When I run h5ls -v on a file, I can see the efficiency of the compression, e.g.: Storage: 3607166976 logical bytes, 5787435 allocated bytes, 62327.56% utilization. In Python I can get the size of the data from a dataset using dataset.nbytes, which is the amount of memory needed to load the data. So, bonus question: is there a way to know, in python and while I’m writing data chunk by chunk, the actual size on disk (or equivalently the % utilization)?

Thanks!


#2

Hi @mdorier,

How can I (1) append these chunks one by one to the dataset, and (2) make sure they are flushed to the file (otherwise the memory usage will end up growing to several GB).

  1. Something like:

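    dset.resize((n + 1) * 128, axis=0)
    dset[n * 128:(n + 1) * 128, :] = data

    where n is the chunk number (first chunk n=0) and data is a NumPy array of shape (128, 344064). The resize call grows the extensible dataset along its first axis so that the 128 new rows exist before they are written.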

  2. Call the h5py.File.flush() method, although this only instructs the HDF5 library to flush its buffers. The rest is up to the operating system.
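The two steps above can be put together roughly as follows. This is only a minimal sketch, not a tested recipe for the full-size case: the function name append_blocks, the dataset name "data", the float32 dtype, and the small demo dimensions are all assumptions made for illustration. For the real data the shape would be (128, 344064) per block.

```python
import os
import tempfile

import h5py
import numpy as np


def append_blocks(path, blocks, rows_per_block, ncols, dtype="f4"):
    """Append equally sized 2-D blocks to an extensible, gzip-compressed dataset.

    Hypothetical helper: the name and the dataset name "data" are illustrative.
    """
    with h5py.File(path, "w") as f:
        # Start with zero rows; maxshape=(None, ncols) makes axis 0 extensible.
        dset = f.create_dataset(
            "data",
            shape=(0, ncols),
            maxshape=(None, ncols),
            chunks=(rows_per_block, ncols),
            dtype=dtype,
            compression="gzip",
        )
        for n, block in enumerate(blocks):
            # Grow the dataset by one block, then write into the new rows.
            dset.resize((n + 1) * rows_per_block, axis=0)
            dset[n * rows_per_block:(n + 1) * rows_per_block, :] = block
            # Ask the HDF5 library to flush its buffers after each block.
            f.flush()
        return dset.shape


# Small demo with assumed dimensions (3 blocks of 128 x 256 values).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
demo_blocks = (np.full((128, 256), i, dtype="f4") for i in range(3))
final_shape = append_blocks(demo_path, demo_blocks, rows_per_block=128, ncols=256)
```

Passing the blocks as a generator means only one block needs to be held in memory at a time, which is the point of writing chunk by chunk.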

So, bonus question: is there a way to know, in python and while I’m writing data chunk by chunk, the actual size on disk (or equivalently the % utilization)?

See https://gist.github.com/ajelenak/997c9ab4879a3a66fb5720e90d36c79e for an example of how this is done.
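Independently of the gist (whose contents are not reproduced here), one way to get these numbers is through h5py's low-level dataset handle, whose get_storage_size() method reports the bytes allocated in the file for the dataset. The sketch below assumes that approach; the helper name storage_stats and the demo file are hypothetical, and the utilization figure is computed as logical / allocated, which matches the h5ls output quoted in the question.

```python
import os
import tempfile

import h5py
import numpy as np


def storage_stats(dset):
    """Return (logical_bytes, allocated_bytes, utilization_percent) for a dataset.

    Hypothetical helper; utilization follows the h5ls convention
    logical / allocated * 100.
    """
    logical = dset.size * dset.dtype.itemsize
    allocated = dset.id.get_storage_size()  # bytes allocated in the file
    utilization = 100.0 * logical / allocated if allocated else float("nan")
    return logical, allocated, utilization


# Small demo with assumed dimensions and highly compressible (all-zero) data.
stats_path = os.path.join(tempfile.mkdtemp(), "stats_demo.h5")
with h5py.File(stats_path, "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(0, 256),
        maxshape=(None, 256),
        chunks=(128, 256),
        dtype="f4",
        compression="gzip",
    )
    for n in range(3):
        dset.resize((n + 1) * 128, axis=0)
        dset[n * 128:(n + 1) * 128, :] = np.zeros((128, 256), dtype="f4")
        # Flush first so that chunks still held in the cache are written out
        # and counted in the allocated size.
        f.flush()
        logical, allocated, utilization = storage_stats(dset)
        print(f"block {n}: {logical} logical bytes, "
              f"{allocated} allocated bytes, {utilization:.2f}% utilization")
```

Calling the helper after each flush gives a running view of the on-disk size while the dataset is still being written.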

Aleksandar