Chunking size and compression for large datasets

Hello,

I have been reading some posts in this forum about chunking in HDF5 and ways of optimizing reads on large datasets. The emphasis is on read-time optimization rather than write-time optimization because of the way the code works. I have a large 3D dataset of native double datatype. The dimensions are m by n by p (in row-major storage), where m is the number of time steps, n is the number of grid points, and p is the number of field quantity components (always fixed).

The first access use case is as follows:

1. The graphics visualization requests data for the whole grid at a given time index and component index. It would seem that the best approach here is to read a single hyperslab with a stride equal to the number of field components in the 3rd dimension and equal to 1 in the grid point dimension; the count is 1 along the time and field component dimensions (see the sketch after this list).

2. When the time index changes, the start value for the 1st dimension (time index) changes while all other parameters stay constant.

3. When the field component changes, the start value for the 3rd dimension (field component index) changes.
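
In case it helps, here is roughly what I have in mind for this read, as a minimal C sketch; dset, t, c, n and buf are placeholders for my actual handles, indices and buffer, and I have left the stride at its default of 1 since the count is 1 in the strided dimensions:

#include "hdf5.h"

/* Read the whole grid for one time index t and one component c.
 * Assumes dset is an open dataset of shape m x n x p (doubles)
 * and buf has room for n values. Names are illustrative. */
herr_t read_grid_slice(hid_t dset, hsize_t t, hsize_t c, hsize_t n, double *buf)
{
    hsize_t start[3] = { t, 0, c };   /* fixed time, all grid points, fixed component */
    hsize_t count[3] = { 1, n, 1 };   /* one time step, n grid points, one component */

    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* Memory dataspace: a flat array of n doubles */
    hsize_t mdims[1] = { n };
    hid_t mspace = H5Screate_simple(1, mdims, NULL);

    herr_t status = H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    return status;
}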

The second use case is as follows:

1. The algorithm chooses certain grid points of interest (not necessarily adjacent in memory) and the code requests a "time history" for those grid points, which includes all field components for each point. So the request is for data over all time indices and all components for a small subset (a negligible fraction of the total) of grid points. The best approach here would seem to be to read a union of hyperslabs, where the union is over indices in the 2nd dimension (see the sketch after this list).

2. Step 1 is repeated many more times for various locations in the grid. Since multiple threads are running and making these requests independently, there is no possibility of further unioning of hyperslabs.
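
Roughly, this is what I have in mind for the time-history read, using H5S_SELECT_OR to build the union (a minimal sketch; points, npts and buf are placeholders for my actual inputs):

#include "hdf5.h"

/* Time history for a set of non-adjacent grid points: a union of hyperslabs
 * over the 2nd dimension, covering all times and all components.
 * Assumes dset is m x n x p and buf has room for npts * m * p doubles. */
herr_t read_time_history(hid_t dset, hsize_t m, hsize_t p,
                         const hsize_t *points, size_t npts, double *buf)
{
    hid_t fspace = H5Dget_space(dset);

    for (size_t i = 0; i < npts; i++) {
        hsize_t start[3] = { 0, points[i], 0 };  /* all times, one grid point, all components */
        hsize_t count[3] = { m, 1, p };
        /* The first slab sets the selection; the rest are OR'ed in */
        H5Sselect_hyperslab(fspace, (i == 0) ? H5S_SELECT_SET : H5S_SELECT_OR,
                            start, NULL, count, NULL);
    }

    hsize_t mdims[1] = { (hsize_t)npts * m * p };
    hid_t mspace = H5Screate_simple(1, mdims, NULL);

    herr_t status = H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    return status;
}

My understanding is that H5Dread returns the selected elements in file-space (row-major) order, so for each time step the chosen grid points appear in ascending index order; please correct me if that is wrong.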

It would seem that the two read access patterns have somewhat conflicting needs, since the first access pattern holds the 1st dimension constant whereas the second holds the 2nd dimension constant for each hyperslab. If pushed to make a choice, I would optimize the second read pattern over the first, since it critically affects execution time. I also intend to use the 'H5S_SELECT_OR' selection operator to build the union of hyperslabs before the h5dread call. As previously mentioned, write time is not very critical but read time is. So my questions are:

1. What chunk sizes would work best? I am planning on using an m by 1 by p chunk size when writing the dataset, provided m is large enough to push the chunk size over 1 MB; if it is smaller, I would increase the size of the 2nd dimension. Is this the right strategy?

2. Can I set the cache size for the dataset to optimize read time? I read about using h5pset_chunk_cache. Since I know how many bytes each chunk I am going to request is, should I set the cache size to the number of grid points times the data size for each grid point? Also, is this function needed only during reads (since write-time optimization is not an issue)?

3. What compression method, if any, should be used? I read in a tutorial on chunking (https://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/Chunking_Tutorial_EOS13_2009.pdf) that if compression is used it does not matter what the cache size is. Is this correct? Why? I did not understand the tutorial's explanation that, since the entire chunk is always read (when compression is used) for each h5dread call, the cache size does not matter. Any clarification on how compression helps optimize read time would be helpful.

4. Also, while writing the dataset, can I force chunks to be contiguous in memory to reduce any seek times?

Thank you,
Vikram

Vikram,

1. That depends very much on whether you use compression. If you don't use compression, then it may be faster to disable the chunk cache and use something like m x 1 x 1 or m x 1 x p chunks (or some value between 1 and p for the third dimension). This will cause reads in the first case to fetch n single elements (not ideal, but at least not bandwidth intensive), and reads in the second case to fetch between 1 and p whole chunks per grid point. If you use the chunk cache with this scheme, and it is set large enough to hold all the chunks for a single component, it will greatly improve the first case when the time value changes, but greatly increase bandwidth when the component changes, because whole chunks must then be read in rather than single elements as without the cache.

If you are using compression, then you most likely want to use the chunk cache (at least it won't hurt). I would think you'd want more squarish chunks here, with the size determined by whether you prioritize bandwidth (smaller chunks) or latency (larger chunks), and the shape determined by how much you prioritize one case over the other: the chunks should "flatten" to resemble the read pattern you are prioritizing, and flatten more the more you favor it at the expense of the other pattern. If you can set the chunk cache large enough to hold all the chunks involved in an operation (or more than one operation), that will greatly improve performance when subsequent reads touch the same chunks.
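
For illustration, creating the dataset might look something like this (a rough sketch, not a recommendation of specific values; the dataset name, chunk dimensions and deflate level are placeholders you would tune along the lines above):

#include "hdf5.h"

/* Create the m x n x p double dataset with a chunked layout and,
 * optionally, gzip compression. Chunk shapes here are illustrative only. */
hid_t create_chunked_dataset(hid_t file, hsize_t m, hsize_t n, hsize_t p,
                             int use_compression)
{
    hsize_t dims[3] = { m, n, p };
    hid_t fspace = H5Screate_simple(3, dims, NULL);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);

    hsize_t chunk[3] = { m, 1, p };          /* uncompressed: long, thin chunks */
    if (use_compression) {
        /* arbitrary, more "squarish" shape for the filtered case */
        chunk[0] = (m > 64) ? 64 : m;
        chunk[1] = (n > 64) ? 64 : n;
        chunk[2] = p;
        H5Pset_deflate(dcpl, 6);             /* gzip, level 6 */
    }
    H5Pset_chunk(dcpl, 3, chunk);

    hid_t dset = H5Dcreate2(file, "/fields", H5T_NATIVE_DOUBLE, fspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Pclose(dcpl);
    H5Sclose(fspace);
    return dset;
}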

2. As above, the optimal chunk cache setting may be to disable it entirely if no compression is used. If it is not disabled, then generally the larger the better, though there is a point of diminishing returns. Ideally it should be at least as large as all the chunks involved in an operation.
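
For example, when opening the dataset for reading, something like this sets a per-dataset cache (a rough sketch; the slot count and byte size are placeholders, and passing 0 bytes effectively disables the cache):

#include "hdf5.h"

/* Open a dataset with a chunk cache sized to hold all chunks touched by
 * one read operation (cache_bytes). Values here are illustrative. */
hid_t open_with_chunk_cache(hid_t file, const char *name, size_t cache_bytes)
{
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);

    /* nslots: ideally a prime, roughly 10-100x the number of cached chunks.
     * w0 = 1.0 evicts fully read chunks first, reasonable for read-only use. */
    size_t nslots = 10007;
    H5Pset_chunk_cache(dapl, nslots, cache_bytes, 1.0);

    hid_t dset = H5Dopen2(file, name, dapl);
    H5Pclose(dapl);
    return dset;
}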

3. That slide only referred to the fact that the chunk cache size had no effect on the results of that specific test (unless the cache were set large enough to hold the entire dataset). I agree those slides by themselves don't do a good job of explaining exactly what's going on. The chunk cache size definitely can affect performance with compression.

4. I'm not sure what you mean by this. Individual chunks are always contiguous both in memory and on disk. Do you mean placing all the chunks next to each other?

Thanks,

-Neil
