Store a group in a contiguous file space


#1

Can one store all datasets in a group into a contiguous file space?
I have an application that reads groups as units and all the datasets
use the contiguous data layout. I believe this option, if available,
can yield good read performance.


#2

AFAIK contiguous is a layout property of a dataset. As I understand it, you wish to store some complex data with a contiguous layout: yes, this is supported.

What kind of data can be stored? homogeneous | struct | blob. What remains is whether you can repack the content of an HDF5 group into homogeneous | struct | blob.

  • blob: yes, any part of a disk is a blob
  • struct: yes, a group has datasets, which can be modelled as a C/C++ struct
  • homogeneous: maybe; provided all datasets are the same type, you can always extend an existing space

best: steve


#3

You may try the following trick: first, create all datasets in a group, and then write datasets one by one. Make sure allocation time is set to H5D_ALLOC_TIME_LATE.

Then you can use the H5Dget_offset function to check the offset of the raw data for each dataset. The offsets should increase in creation order, and the offset of dataset N should exceed the offset of dataset N-1 (N = 2, …) by the size of the raw data of dataset N-1.
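
A minimal sketch of that offset check, in plain C so it can be read without the HDF5 library. The helper names group_is_contiguous and group_span are hypothetical. In a real program the offsets would come from H5Dget_offset and the sizes from H5Dget_storage_size, one pair per dataset in creation order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if every dataset starts exactly where the previous one ends,
 * i.e. offsets[i] == offsets[i-1] + sizes[i-1] for i = 1..n-1; otherwise 0.
 * offsets[i]: file address of dataset i's raw data (as from H5Dget_offset).
 * sizes[i]:   raw data size of dataset i in bytes (as from H5Dget_storage_size).
 */
int group_is_contiguous(const uint64_t *offsets, const uint64_t *sizes, size_t n)
{
    assert(offsets != NULL && sizes != NULL);
    for (size_t i = 1; i < n; i++) {
        if (offsets[i] != offsets[i - 1] + sizes[i - 1])
            return 0; /* gap or reordering between dataset i-1 and i */
    }
    return 1;
}

/*
 * Number of bytes from the start of the first dataset to the end of the
 * last one. For a contiguous group this is the size of the single read
 * that would fetch all of its raw data.
 */
uint64_t group_span(const uint64_t *offsets, const uint64_t *sizes, size_t n)
{
    if (n == 0)
        return 0;
    return offsets[n - 1] + sizes[n - 1] - offsets[0];
}
```

For example, three datasets of 800, 1600 and 400 bytes at offsets 2048, 2848 and 4448 pass the check and span 2800 bytes. If the second one sat at 4096 instead, the check would fail.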

I would also suggest using paged allocation and page buffering to keep all metadata together and avoid small metadata I/O.

I have never tried this suggestion myself, so no guarantee :) But a long time ago there was an experiment done with the same strategy for writing a dataset by chunks, and it did help with performance. Maybe someone still has Albert Cheng’s paper from the beginning of the century :)

Good luck!

Elena


#4

Thanks Steve and thanks Elena.
I will give your suggestions a try.


#5

Another suggestion would be to use H5Pset_meta_block_size to enlarge the metadata blocks to a size large enough to describe the entire group. The default is 2048 bytes.

To keep a large dataset of chunks contiguous in a file, I recently set the meta block size to be 4 MB (4194304 bytes).


#6

I guess H5Pset_meta_block_size is for metadata.
But it is also useful for my case, as I have more than 1000
groups in the file. Retrieving the complete list of group
names can take a very long time.
Thanks for the good suggestion.


#7

Yes, it influences how much space is allocated for metadata each time more is needed. It also sets how much space is set aside for metadata at the beginning of the file.

Setting the value higher helps to consolidate both the data and the metadata. Your metadata is grouped into a large block and does not need to be interspersed between your datasets.