It seems that there have been some changes in the calculation of the hash value for chunk caching in 1.10.3, and I’m not sure whether these were intentional or not. To motivate this discussion, let’s say that I have a 100-by-200000 dataset (of doubles, but this isn’t important right now). The chunk dimensions are 22-by-44721, so in terms of chunks, the data set is 5-by-5 after rounding up.
Now, in H5Dchunk.c, there is a calculation of "scaled dimensions", which I presume to be the data set dimensions in terms of chunks. This is performed on line 952:
rdcc->scaled_dims[u] = dset->shared->curr_dims[u] / dset->shared->layout.u.chunk.dim[u];
Both values on the right-hand side seem to be integral types, so the division results in truncation. Indeed, inserting printf statements confirms that the computed value stored in scaled_dims[u] is 4 rather than 5 in the example I've described above. As such, upon rounding up to the nearest power of 2, the encoding value stored in rdcc->scaled_encode_bits[u] (on line 961) is simply 2.
This has some implications for the calculation of the hash value for each chunk when the code enters H5D__chunk_hash_val (line 2795). The relevant lines are:
unsigned u; /* Local index variable */
val = scaled[0];
for(u = 1; u < ndims; u++) {
val <<= shared->cache.chunk.scaled_encode_bits[u];
val ^= scaled[u];
} /* end for */
I believe scaled holds the scaled coordinates of each chunk (e.g., the top-left chunk would be [0, 0] while the bottom-right chunk would be [4, 4]). The intention seems to be to (i) multiply the slowest-changing scaled coordinate by the next scaled dimension, rounded up to the nearest power of 2 (i.e., left-shift by the encode bits); and then (ii) add the next scaled coordinate for this chunk, to obtain an index for computing the hash value. This works because the left shift should zero all bits that could be set in the next scaled coordinate, so the xor behaves like addition.
Of course, this logic assumes that the next scaled dimension was correctly rounded up to the nearest power of 2. In my example, the scaled dimension is 5, so the next power of 2 should be 8 (and the encoding should be 3). However, the truncation in the definition of scaled_dims[u] understates the amount of left-shifting required. This means the shift can leave non-zero bits in positions that get xor'd with the next scaled coordinate, resulting in non-additive behaviour and unpredictable hash values.
One can see how this affects the hashing by manually calculating these indices for my data set; notice that the index at the end of the second row suddenly drops to zero.
0 1 2 3 4
4 5 6 7 0
8 9 10 11 12
12 13 14 15 8
16 17 18 19 20
Ordinarily, for a 5-by-5 data set, one might think that an nslots of 5 would be sufficient to avoid hash conflicts when loading a row of chunks into the cache. However, this is not the case due to the unpredictability of the indices, which yield hash values of:
0 1 2 3 4
4 0 1 2 0
3 4 0 1 2
2 3 4 0 3
1 2 3 4 0
You can see that a few rows have hash conflicts. In fact, it is only possible to avoid hash conflicts with an nslots of 8 or higher:
0 1 2 3 4
4 5 6 7 0
0 1 2 3 4
4 5 6 7 0
0 1 2 3 4
This is borne out by empirical testing of row-level access with different values of nslots, which only achieves optimal speed when nslots is set to 8 or higher. Similarly, column-level access requires an nslots of 21 or higher. Neither of these is a particularly intuitive choice for nslots.
So, my question is: is the truncation in the calculation of rdcc->scaled_dims intentional? In 1.8.20, it was possible to compute an nslots that guaranteed unique hash indices for all chunks overlapping a row (or column) of the data set. This was very useful for applications requiring strictly row- and column-level access to data, as I could tune the chunk cache to perform at its best while using as little memory as possible. However, the (needlessly?) unpredictable nature of the hash calculation in 1.10.3 makes it much more difficult to choose nslots.
Disclaimer: I should add that I’ve been testing this behaviour on a version of the HDF5 library that’s been made accessible for use in R (https://github.com/grimbough/Rhdf5lib). The relevant source code was taken verbatim from the official HDF5 repository, so I believe that the issues I’ve raised are still relevant, but I’m happy to be told otherwise.