Is there any way to view (not just recompute) the Fletcher32 checksum in a file?
My use case is basically this: I have a database of my H5 files, and I want to be sure that none of the files have changed without the database knowing about it. Computing the MD5 sum of the entire file takes much too long, and I actually don’t care if — say — a new attribute has been added, or the file’s mtime has changed; but I do want to know if the datasets themselves have the same data. For example, I might realize that some scaling was off by a factor of 2, go in and change those bits in one dataset without changing the shape, or possibly even the number of bytes in the file — and then fail to update the database for one reason or another. I need to be able to detect such a change. And since Fletcher32 is already being computed by HDF5 with every change to our data, it would be very convenient.
Did you see this post from 2018? It doesn't answer your question, but it explains the error-checking behavior. My impression is that you can't get the checksum value: Data corruption detection and/or correction
I figured out how to do it. The necessary functions were introduced in HDF5 1.10.5 and will be exposed in h5py 3.0 — specifically, H5Dget_num_chunks and H5Dget_chunk_info, available in h5py as get_num_chunks and get_chunk_info on the low-level dataset ID.
Here's a simple example showing how to use these functions to get the checksums for every chunk in the data dataset of test.h5. Note that we need both h5py capabilities and seek/read capabilities on the raw file, which is why I open the file in this somewhat unusual way: first as a binary stream, then as an h5py File on top of it.
import h5py
import numpy as np

with open('test.h5', 'rb') as stream:
    with h5py.File(stream, 'r') as f:
        ds = f['data']
        assert ds.fletcher32, 'Dataset does not have Fletcher-32 checksums'
        checksums = np.zeros(ds.id.get_num_chunks(), dtype=np.uint32)
        for i in range(checksums.size):
            chunk_info = ds.id.get_chunk_info(i)
            # The stored checksum occupies the last 4 bytes of each chunk
            offset = chunk_info.byte_offset + chunk_info.size - 4
            stream.seek(offset)
            checksums[i] = np.frombuffer(stream.read(4), dtype=np.uint32)[0]
This code works with the current master branch of h5py. I also tried another method using read_direct_chunk, but that's more cumbersome and reads all of each chunk's data, only to use the last 4 bytes. Other than the chunk info, the code above reads exactly 4 bytes per chunk, and is thus probably about as efficient as it can be.
Thanks, I didn’t see that. But I get the opposite impression.
Obviously, the checksum is stored in the file somewhere, so that it can be read and compared to the checksum computed directly from the data. (It's just that first part that I want to do.) Looking at the source for the h5check program they mention was enough to help me find these lines in the HDF5 source, which appear to be where the stored checksum bytes are read.
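For context, the Fletcher-32 algorithm itself is simple: two running 16-bit sums over the data. A generic pure-Python version over big-endian 16-bit words might look like the sketch below. Note this is illustrative only — HDF5's internal H5_checksum_fletcher32 has its own conventions for word order and odd-length tails, so this is not guaranteed to be bit-identical to the values stored in the file.

```python
def fletcher32(data: bytes) -> int:
    """Generic Fletcher-32 over big-endian 16-bit words (illustrative sketch;
    HDF5's variant may differ in word order and odd-byte handling)."""
    # Pad an odd-length input with a zero byte
    if len(data) % 2:
        data += b'\x00'
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = (data[i] << 8) | data[i + 1]
        sum1 = (sum1 + word) % 65535   # simple sum of words
        sum2 = (sum2 + sum1) % 65535   # sum of running sums
    return (sum2 << 16) | sum1
```

The second sum is what makes Fletcher-32 order-sensitive, unlike a plain additive checksum — swapping two words changes sum2 even though sum1 stays the same.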
So basically, I just want to know the best (reliable, user-friendly, and future-proof) way to get those bytes from a given dataset. Are there high-level functions to do this or at least get me closer to my goal?