The HDF5 library uses Bob Jenkins’ lookup3 for metadata hashing:
https://burtleburtle.net/bob/c/lookup3.c
@koziol selected it over crc-32 and fletcher-32 in 2006 for the following reason:
Add ‘loookup3’ checksum routine and switch to using it for metadata checksums - it’s just as “strong” as the CRC32 and about 40% faster in general (with some compiler optimizations, it’s nearly as fast as the fletcher-32 algorithm).
Recent interest in lookup3 seems mainly spurred by HDF5.
Since 2006, crc32 has become hardware accelerated:
crc32c is also commonly available as a package for many languages, spurred in part by the hardware acceleration:
Is lookup3 actually still much faster than crc-32c?
Given the hardware acceleration and availability of crc-32c, should we allow crc-32c as an alternative to Bob Jenkins’ lookup3 in HDF5 2.0?
This looks interesting. I believe using hardware acceleration in the core library would be a first for us, but not a deal breaker. It looks like the lookup3 algorithm is implemented directly in the HDF5 library so there’s no need to be concerned about availability of external software. Is the motivation for the change mainly for performance or for interacting with the checksum with third party software (i.e. outside of the HDF5 library)? Do you know of a use case where the time spent in the checksum algorithm is a significant fraction of the overall time?
Interestingly, it looks like crc is actually also implemented in the HDF5 library, though it doesn’t seem to be used anywhere except in a test (that uses private functions).
My initial motivation is (meta)data transparency. If I’m looking at the bytes of an HDF5 file, I want to understand what they mean and where they came from. I can mostly navigate the bytes by looking at the file format specification, but the checksum is the most mysterious part. I would need an implementation of Jenkins’ lookup3 to calculate the checksum. This can be a challenge if I’m not using C. In 2025, it seems that crc32c is much easier to find.
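To make the "not using C" case concrete, here is a minimal sketch of a pure-Python port of `hashlittle()` from Bob Jenkins' lookup3.c. The function and helper names are my own, and I am assuming that HDF5's metadata checksum behaves like `hashlittle()` with an initial value of 0; that assumption should be verified against the library source before relying on it.

```python
MASK = 0xFFFFFFFF  # keep all arithmetic within 32 bits


def _rot(x, k):
    """Rotate a 32-bit value left by k bits."""
    return ((x << k) | (x >> (32 - k))) & MASK


def _mix(a, b, c):
    """Reversible mixing step applied to each full 12-byte block."""
    for k in (4, 6, 8, 16, 19, 4):
        a = (a - c) & MASK
        a ^= _rot(c, k)
        c = (c + b) & MASK
        # Rotate roles so the next round acts on (b, c, a), as in lookup3.c.
        a, b, c = b, c, a
    return a, b, c


def _final(a, b, c):
    """Final mixing of the last block (at most 12 bytes)."""
    c ^= b
    c = (c - _rot(b, 14)) & MASK
    a ^= c
    a = (a - _rot(c, 11)) & MASK
    b ^= a
    b = (b - _rot(a, 25)) & MASK
    c ^= b
    c = (c - _rot(b, 16)) & MASK
    a ^= c
    a = (a - _rot(c, 4)) & MASK
    b ^= a
    b = (b - _rot(a, 14)) & MASK
    c ^= b
    c = (c - _rot(b, 24)) & MASK
    return a, b, c


def hashlittle(data, initval=0):
    """Port of lookup3.c hashlittle(): returns a 32-bit hash of data."""
    data = bytes(data)
    length = len(data)
    a = b = c = (0xDEADBEEF + length + initval) & MASK
    pos = 0
    # Consume all but the last block, 12 bytes at a time.
    while length > 12:
        a = (a + int.from_bytes(data[pos:pos + 4], "little")) & MASK
        b = (b + int.from_bytes(data[pos + 4:pos + 8], "little")) & MASK
        c = (c + int.from_bytes(data[pos + 8:pos + 12], "little")) & MASK
        a, b, c = _mix(a, b, c)
        pos += 12
        length -= 12
    if length == 0:
        return c  # empty input: no final mixing
    # Zero-padding the tail is equivalent to adding only the bytes present.
    tail = data[pos:] + bytes(12 - length)
    a = (a + int.from_bytes(tail[0:4], "little")) & MASK
    b = (b + int.from_bytes(tail[4:8], "little")) & MASK
    c = (c + int.from_bytes(tail[8:12], "little")) & MASK
    a, b, c = _final(a, b, c)
    return c
```

The rotation of variable roles inside `_mix` is a compact way of writing the six lines of the original `mix` macro; after six rounds the variables are back in their original positions. A quick sanity check is that `hashlittle(b"")` returns `0xDEADBEEF`, matching the behaviour documented in lookup3.c for empty input.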
From there, my interest extends to using third-party software to interact with HDF5. In particular, I would like to write a simple HDF5 file without needing to depend on the HDF5 library. If I’m just writing HDF5, I do not necessarily need to understand all of the HDF5 specification. I just need to understand the part that I’m writing. The particular scenario where this comes up for me is when I’m writing from a highly performance-sensitive detector device. I may be using LabView, or I may be using an embedded system. Often I may need to use particular I/O APIs rather than the generic ones that the HDF5 C library uses. In this scenario, it may be easier for me to write a particular sequence of bytes than to load the HDF5 library.
Performance of the checksum does not seem to be a major bottleneck for me. However, it seems to have been a criterion for choosing lookup3 over crc32. I’m wondering if that selection criterion still holds true in 2025.
The choice of 32-bit checksum does not seem to be a critical part of the design of HDF5. While introducing new checksum variants may make it difficult for old programs to read new HDF5 files, it may make it easier to integrate HDF5 into new environments.
In summary, my primary interest here is transparency. The choice of lookup3 over crc32c makes the format seem opaque since lookup3 implementations are harder to find than crc32c implementations. If we can increase transparency and understanding of the format, that creates opportunities for others to interact and perhaps even experiment with HDF5 as a format, allowing its use in novel places.
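For comparison, CRC-32C is small enough to write out in a few lines even where no package is available. The following is a minimal bit-at-a-time sketch (the function name `crc32c` and its signature are my own choices); it is a slow software version using the reflected Castagnoli polynomial, not the hardware-accelerated path.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF  # initial inversion
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # final inversion


# Standard check value for CRC-32C over the ASCII digits 1 through 9.
assert crc32c(b"123456789") == 0xE3069283
```

Passing a previous result as `crc` continues the checksum over additional data, so `crc32c(b"6789", crc32c(b"12345"))` gives the same value as a single call over the whole input.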