Hello,
I have been reading posts on this forum about chunking in HDF5 and ways of optimizing reads on large datasets. My emphasis is on read time rather than write time because of the way the code works. I have a large 3D dataset of native double datatype. The dimensions are m by n by p (row-major storage), where m is the number of time steps, n is the number of grid points, and p is the number of field quantity components (always fixed).
The first access use case is as follows:
1. The graphics visualization requests data for the whole grid at a given time index and component index. It would seem that the best case here is to read a single hyperslab with count 1 along the time and field component dimensions, and count n with stride 1 along the grid point dimension. In the row-major layout, consecutive selected values are then p elements apart.
2. When the time index changes, the start value for the 1st dimension (time index) is changed while all other parameters stay constant.
3. When the field component changes, the start value for the 3rd dimension (field component index) is changed.
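For concreteness, the selection in steps 1 to 3 could be sketched as below. This is plain Python with no HDF5 calls; the function names and the small example sizes are illustrative assumptions, and the tuples are only meant to mirror the per-dimension start, stride, count and block arguments that a hyperslab selection takes.

```python
# Hypothetical helpers (not part of HDF5): parameters for one
# (time, component) slice of an m x n x p row-major dataset.

def grid_slice_selection(m, n, p, t, c):
    """Return (start, stride, count, block) selecting every grid point
    at time index t and component index c."""
    if not (0 <= t < m and 0 <= c < p):
        raise IndexError("time or component index out of range")
    start = (t, 0, c)
    stride = (1, 1, 1)  # count is 1 in dims 0 and 2, so stride is unused there
    count = (1, n, 1)
    block = (1, 1, 1)
    return start, stride, count, block


def flat_offsets(n, p, start, count):
    """Element offsets (in units of one double) that a stride-1 selection
    touches in the row-major layout."""
    t0, g0, c0 = start
    ct, cg, cc = count
    return [((t0 + i) * n + (g0 + j)) * p + (c0 + k)
            for i in range(ct) for j in range(cg) for k in range(cc)]


# Small assumed sizes, chosen only so the offsets are easy to read.
start, stride, count, block = grid_slice_selection(m=4, n=5, p=3, t=2, c=1)
offsets = flat_offsets(5, 3, start, count)
print("start", start, "count", count)
print("flat offsets", offsets)
```

Changing the time index only changes start[0], and changing the component only changes start[2]. The gap of p elements between consecutive values appears in the flat offsets, not in the stride of the selection itself.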
The second use case is as follows:
1. The algorithm chooses certain grid points of interest (not necessarily adjacent in memory) and the code requests a "time history" for those grid points, which includes all field components for each node. So the request is for data over all time indices and all components for a small subset (a negligible fraction of the total) of grid points. The best case here would seem to be to read a union of hyperslabs, where the union is over the 2nd dimension indices.
2. Step 1 is repeated for various locations in the grid many more times. Since multiple threads are running and making these requests independently, there is no possibility of further union of hyperslabs.
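The union of hyperslabs for one such request could be built along the following lines. This is a sketch in plain Python under assumed sizes; the function name is hypothetical, and the printed operator names only indicate where H5S_SELECT_SET (first hyperslab) and H5S_SELECT_OR (each later one) would be used in the real selection calls. Merging consecutive grid indices into one run is an optional choice made here to reduce the number of hyperslabs.

```python
# Hypothetical helper (not part of HDF5): one hyperslab per run of
# consecutive grid point indices, covering all times and all components.

def time_history_selections(m, n, p, grid_points):
    """Return a list of (start, count) pairs for the requested grid points."""
    pts = sorted(set(grid_points))
    if pts and not (0 <= pts[0] and pts[-1] < n):
        raise IndexError("grid point index out of range")
    runs = []  # each entry is [first_index, run_length]
    for g in pts:
        if runs and g == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1      # extends the current run
        else:
            runs.append([g, 1])   # starts a new run
    return [((0, g0, 0), (m, length, p)) for g0, length in runs]


# Assumed example sizes and grid points of interest.
sels = time_history_selections(m=1000, n=50000, p=6,
                               grid_points=[17, 42, 18, 9000])
for i, (start, count) in enumerate(sels):
    op = "H5S_SELECT_SET" if i == 0 else "H5S_SELECT_OR"
    print(op, "start", start, "count", count)

# Total number of doubles the union selects.
total_elements = sum(c[0] * c[1] * c[2] for _, c in sels)
print("elements selected:", total_elements)
```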
It would seem that the two read access patterns have somewhat conflicting needs, since the first holds the 1st dimension constant whereas the second holds the 2nd dimension constant for each hyperslab. If pushed to make a choice, I would optimize the second read pattern over the first, since it critically affects execution time. I also intend to use the 'H5S_SELECT_OR' selection operator to form the union of hyperslabs before the h5dread call. As previously mentioned, write time is not very critical but read access time is. So my questions are:
1. What chunk sizes would work best? I am planning on using an m by 1 by p chunk size when writing the dataset, provided m is large enough to push the chunk size over 1 MB. If the chunk is smaller than that, I would increase the 2nd dimension size. Is this the right strategy?
2. Can I set the cache size for the dataset to optimize read time? I read about using h5pset_chunk_cache. Since I know how many bytes each requested chunk occupies, should I set the cache size to the number of grid points times the data size for each grid point? Also, is this function needed only when reading (since write time optimization is not an issue)?
3. What compression method, if any, should be used? I read in a tutorial on chunking (https://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/Chunking_Tutorial_EOS13_2009.pdf) that if compression is used, the cache size does not matter. Is this correct, and if so, why? I did not understand the tutorial's explanation that the cache size does not matter because the entire chunk is always read for each h5dread call when compression is used. Any clarification on how compression helps optimize read time would be helpful.
4. Also, while writing the dataset, can I force chunks to be contiguous in memory to reduce seek times?
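To put rough numbers on questions 1 and 2, the arithmetic behind the proposed m by k by p chunk shape can be sketched as follows. This is plain Python with assumed sizes (m = 1000, n = 50000, p = 6) and hypothetical helper names; it only counts chunks and bytes and makes no claim about how the library schedules its reads.

```python
import math

DOUBLE_BYTES = 8  # size of a native double


def chunk_shape(m, p, target_bytes=1 << 20):
    """m x k x p chunk with the smallest k that reaches target_bytes."""
    per_point = m * p * DOUBLE_BYTES
    k = max(1, math.ceil(target_bytes / per_point))
    return (m, k, p)


def chunk_bytes(shape):
    """Uncompressed size of one chunk in bytes."""
    return shape[0] * shape[1] * shape[2] * DOUBLE_BYTES


def chunks_for_grid_slice(n, shape):
    """Use case 1: one time index, one component, all n grid points."""
    return math.ceil(n / shape[1])


def chunks_for_time_history(grid_points, shape):
    """Use case 2: all times and components for a few grid points."""
    return len({g // shape[1] for g in grid_points})


# Assumed example sizes.
m, n, p = 1000, 50000, 6
shape = chunk_shape(m, p)
slice_chunks = chunks_for_grid_slice(n, shape)
history_chunks = chunks_for_time_history([17, 18, 42, 9000], shape)

print("chunk shape:", shape, "bytes per chunk:", chunk_bytes(shape))
print("chunks touched by one grid slice:", slice_chunks)
print("chunks touched by one time history:", history_chunks)
# Bytes a chunk cache would need to hold every chunk of one time history.
print("cache bytes for one time history:", history_chunks * chunk_bytes(shape))
```

With these assumed sizes, a time history touches only a handful of chunks, while a single grid slice touches every chunk along the grid point dimension. That is the conflict between the two access patterns expressed in chunk counts.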
Thank you,
Vikram