Unintended behaviour for hash values during chunk caching?


#1

It seems that there have been some changes in the calculation of the hash value for chunk caching in 1.10.3, and I’m not sure whether these were intentional or not. To motivate this discussion, let’s say that I have a 100-by-200000 dataset (of doubles, but this isn’t important right now). The chunk dimensions are 22-by-44721, so in terms of chunks, the data set is 5-by-5 after rounding up.

Now, in H5Dchunk.c, there is a calculation of “scaled dimensions”, which I presume to be the data set dimensions in terms of chunks. This is performed on line 952:

        rdcc->scaled_dims[u] = dset->shared->curr_dims[u] / dset->shared->layout.u.chunk.dim[u];

Both values on the right hand side seem to be integral types, so division results in truncation. Indeed, insertion of printf statements confirms that the computed value that is stored in scaled_dims[u] is 4 rather than 5 in the example I’ve described above. As such, upon rounding up to the nearest power of 2, the encoding value stored in rdcc->scaled_encode_bits[u] (on line 961) is simply 2.

This has some implications for the calculation of the hash value for each chunk when the code enters H5D__chunk_hash_val (line 2795). The relevant lines are:

        unsigned u;          /* Local index variable */

        val = scaled[0];
        for(u = 1; u < ndims; u++) {
            val <<= shared->cache.chunk.scaled_encode_bits[u];
            val ^= scaled[u];
        } /* end for */

I believe scaled holds the scaled coordinates of each chunk (e.g., the top-left chunk would be [0, 0] while the bottom-right chunk would be [4, 4]). The intention seems to be to (i) multiply the slowest-changing scaled coordinate by the next scaled dimension, rounded up to the nearest power of 2; and then (ii) add the next scaled coordinate for this chunk, to obtain an index for computing the hash value. This works because the left-shift should leave zeroes in all of the low bits that could be set in the next scaled coordinate, so the xor behaves like addition.

Of course, this logic assumes that the next scaled dimension was correctly rounded up to the nearest power of 2. In my example, the scaled dimension is 5, so the next power of 2 should be 8 (and the encoding should be 3). However, the truncation when defining scaled_dims[u] understates the amount of left-shifting required. This means that the left-shift can leave non-zero bits that get xor’d with the next scaled coordinate, resulting in non-additive behaviour and unpredictable hash values.

One can see how this affects the hashing by manually calculating these indices for my data set. Note how the index at the end of the second row suddenly drops to zero, and how the fourth row repeats indices (8 and 12) that already appear in the rows above.

   0    1    2    3    4
   4    5    6    7    0
   8    9   10   11   12
  12   13   14   15    8
  16   17   18   19   20

Ordinarily, for a 5-by-5 data set, one might think that an nslots of 5 would be sufficient to avoid hash conflicts when loading a row of the chunks into the cache. However, this is not the case due to the unpredictability of the indices, yielding hash values of:

   0    1    2    3    4
   4    0    1    2    0
   3    4    0    1    2
   2    3    4    0    3
   1    2    3    4    0

You can see that a few rows have hash conflicts. In fact, it is only possible to avoid hash conflicts with an nslots of 8 or higher:

   0    1    2    3    4
   4    5    6    7    0
   0    1    2    3    4
   4    5    6    7    0
   0    1    2    3    4

This is borne out by empirical testing of row-level access with different values of nslots, which only achieves optimal speed when nslots is set to 8 or higher. Similarly, column-level access requires an nslots of 21 or higher. Neither of these is a particularly intuitive choice for nslots.

So, my question is: is the truncation upon calculation of rdcc->scaled_dims intentional? In 1.8.20, it was possible to compute an nslots that guaranteed unique hash indices for all chunks overlapping a row (or column) of the data set. This was very useful for applications requiring strictly row- and column-level access to data, as I could tune the chunk cache to perform at its best while using as little memory as possible. However, the (needlessly?) unpredictable nature of the hash calculation in 1.10.3 makes it much more difficult to choose nslots.

Disclaimer: I should add that I’ve been testing this behaviour on a version of the HDF5 library that’s been made accessible for use in R (https://github.com/grimbough/Rhdf5lib). The relevant source code was taken verbatim from the official HDF5 repository, so I believe that the issues I’ve raised are still relevant, but I’m happy to be told otherwise.


#2

I’m not familiar with the HDF5 code at all, but just like you, I think this looks strange. If this is really supposed to be the data set dimensions in terms of chunks like you say, I would expect something like

rdcc->scaled_dims[u] = (dset->shared->curr_dims[u] + dset->shared->layout.u.chunk.dim[u] - 1) /
    dset->shared->layout.u.chunk.dim[u];

Let’s wait to see what the HDF5 devs think. I think you’ve done a really nice analysis.


#3

This looks to be a problem. Thanks for finding and reporting it! I will file a bug report for it.


#4

Thanks Neil. I should also point out that the documentation for HDF5 chunk caching is somewhat ambiguous:

https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html
https://support.hdfgroup.org/HDF5/doc/_topic/Chunking/

In particular, there is a line that says:

    To determine the hash value for a chunk, the chunk is first assigned a unique index that is the linear index into a hypothetical array of the chunks. That is, the upper-left chunk has an index of 0, the one to the right of that has an index of 1, and so on.

This seems to suggest that the index would increment by 1 from the right-most chunk on one row (of chunks) to the left-most chunk on the following row. So, for the 3-by-3 example in the documentation, one might think:

 0 1 2
 3 4 5
 6 7 8

However, the source code indicates that the index actually jumps by the nearest power of 2 across rows:

 0  1  2 
 4  5  6
 8  9 10

I guess this information won’t make much difference to most users, but the kind of people who would read these documentation files are also the type of people who would want to know the (correct) details.