I would like to create a fairly large dataset in an HDF5 file using h5py. The Python program that generates the data will do so chunk by chunk (chunks of size (128, 344064) for an extensible dataset of dimensions (None, 344064)), as numpy arrays. How can I (1) append these chunks to the dataset one by one, and (2) make sure they are flushed to the file as I go (otherwise memory usage will grow to several GB)?
Also, the data in this dataset is highly compressible, so I plan to use compression='gzip'. When running h5ls -v on a file I can see the compression efficiency, e.g.:

Storage: 3607166976 logical bytes, 5787435 allocated bytes, 62327.56% utilization

In Python I can get the size of the data in a dataset using dataset.nbytes, which represents the memory that would be needed to load the data. So, bonus question: is there a way to know, in Python and while I am writing data chunk by chunk, the actual size on disk (or equivalently the % utilization)?
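For context, this is the kind of progress check I was hoping to do inside the writing loop (dset.id.get_storage_size() is a low-level h5py call I came across, but I am not sure whether it is the intended way, or whether its value is up to date while the file is still open for writing):

```python
# hypothetical check after writing each block
allocated = dset.id.get_storage_size()  # bytes actually allocated on disk
logical = dset.nbytes                   # bytes needed to hold the data in memory
print(f"utilization: {100 * logical / max(allocated, 1):.2f}%")
```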