I am writing some string data (S75) to datasets, using the core driver with LZF compression. I am using Python with h5py 3.15 and HDF5 2.0.0.
Occasionally, a chunk in the h5 file overwrites another chunk, and when we read the overwritten chunk back, the LZF filter complains. I am not doing anything fancy: I open the file with the core driver, write the data using ds[index] = some_data, and then the file is flushed back to disk. The data is moved around depending on the order that needs to be maintained.
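Roughly, the write pattern looks like this (a simplified sketch; the names, shapes, and chunk sizes are illustrative, not the actual job code):

```python
import h5py

# Simplified sketch of the write pattern; names/shapes are illustrative.
with h5py.File("data.h5", "a", driver="core", backing_store=True) as f:
    ds = f.require_dataset(
        "records",
        shape=(1000,),
        maxshape=(None,),    # datasets are resized explicitly elsewhere
        dtype="S75",         # fixed-length 75-byte strings
        chunks=(64,),
        compression="lzf",
    )
    index, some_data = 42, b"example payload"
    ds[index] = some_data    # element-wise writes, order-dependent
    f.flush()                # in-memory image written back to disk
```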
Chunks 11 and 12 have the same offset but different sizes; reading chunk 11 fails, while reading chunk 12 works. I have not been able to reproduce this outside the jobs; it happens once in a while in jobs that write to the h5 file.
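For reference, the per-chunk offsets and stored sizes were inspected along these lines (a sketch using h5py's low-level chunk query API; the dataset name is illustrative):

```python
import h5py

# Print each chunk's logical coordinates, byte offset in the file,
# and stored (compressed) size, via the low-level chunk query API.
with h5py.File("data.h5", "r") as f:
    dsid = f["records"].id
    for i in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(i)
        print(i, info.chunk_offset, info.byte_offset, info.size)
```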
I wanted to check: is something like this ever possible as a valid scenario? I think not. What could the possible reasons be?
We have not been able to reproduce this, but we have seen memory-related errors in the jobs where it happens. We captured the exact data written in the job and tried to replay it outside the job, but that did not reproduce the problem either.
That definitely doesn’t sound like a valid or expected scenario—two chunks ending up with the same offset (but different sizes) strongly suggests some kind of corruption rather than normal HDF5 behavior.
A few things come to mind based on your setup:
The core driver keeps everything in memory and only flushes at the end, so if there’s any issue with how memory is being managed (especially with resizing or reallocation), it could lead to inconsistencies like overlapping chunk addresses.
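If the in-memory image is suspect, one cheap experiment is to change how that image grows, or write it through to disk, via the core driver's standard options (a hedged sketch; the 4 MiB value is an arbitrary choice):

```python
import h5py

# Vary the core driver's allocation increment: a larger block_size changes
# how the in-memory file image is grown/reallocated, which can shift or
# expose memory-management problems. The default increment is 64 KiB.
f = h5py.File(
    "data.h5", "w",
    driver="core",
    backing_store=True,           # persist the image on flush/close
    block_size=4 * 1024 * 1024,   # arbitrary 4 MiB increments
)
f.close()
```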
Since you mentioned data being “moved around depending upon order,” I’d double-check whether there’s any implicit dataset resizing, chunk reallocation, or rewriting happening that might interact badly with compression.
LZF itself is pretty simple and usually not the cause—it’s more likely exposing the corruption when decompression fails.
Also worth checking: are there any concurrent writes (even indirectly, like multiprocessing or reused file handles)? HDF5 isn’t thread/process safe unless explicitly configured, and that can lead to exactly this kind of intermittent corruption.
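A classic variant is a handle opened before a fork and then reused in the children; each process should open its own handle, and writers should still be serialized (a hedged sketch of the safe pattern, with illustrative names):

```python
import multiprocessing as mp
import h5py

def worker(path, index, payload):
    # Each process opens its OWN handle after the fork. Reusing a handle
    # opened in the parent can silently corrupt the file, since HDF5 is
    # not fork/thread safe unless built and configured for it.
    with h5py.File(path, "a") as f:
        f["records"][index] = payload

if __name__ == "__main__":
    for i in range(4):
        p = mp.Process(target=worker, args=("data.h5", i, b"row-%d" % i))
        p.start()
        p.join()   # run writers one at a time; no concurrent access
```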
The fact that you can’t reproduce it easily makes me think it could be timing-related or dependent on memory pressure.
One thing I’ve found helpful in tricky cases like this is isolating patterns or edge cases (chunk sizes, write order, dataset resizing, etc.) and stressing different layouts until the rare failure reproduces.
If you haven’t already, you might also try the following (a read-back sketch follows this list):
Disabling compression temporarily to see if the issue persists
Writing with the default (sec2) driver instead of core
Enabling HDF5 debug logs (if possible)
Verifying file integrity with h5dump or h5check after write
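For the last point, a minimal read-back verifier can be run right after the writer closes the file: it forces every chunk through the filter pipeline by reading each dataset in full (a sketch; in h5py, filter failures surface as OSError):

```python
import h5py

def verify(path):
    """Read every dataset in full, forcing all chunks through the filters."""
    bad = []
    with h5py.File(path, "r") as f:        # default sec2 driver, read-only
        def check(name, obj):
            if isinstance(obj, h5py.Dataset):
                try:
                    obj[...]               # full read -> decompress all chunks
                except OSError as exc:     # filter failures surface as OSError
                    bad.append((name, str(exc)))
        f.visititems(check)
    return bad

for name, err in verify("data.h5"):
    print("corrupt dataset:", name, "->", err)
```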
If chunk offsets are truly colliding, that’s almost certainly a bug or misuse somewhere—HDF5 should never assign the same address to two different chunks in a valid file.
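To confirm a collision programmatically, the allocated byte ranges of all chunks can be scanned for overlaps (a sketch built on h5py's low-level chunk API; the dataset path is illustrative):

```python
import h5py

def find_overlaps(path, dataset):
    """Report pairs of chunks whose allocated byte ranges overlap."""
    with h5py.File(path, "r") as f:
        dsid = f[dataset].id
        spans = []
        for i in range(dsid.get_num_chunks()):
            info = dsid.get_chunk_info(i)
            spans.append((info.byte_offset, info.byte_offset + info.size, i))
    spans.sort()
    overlaps = []
    for (s1, e1, i1), (s2, e2, i2) in zip(spans, spans[1:]):
        if s2 < e1:   # next chunk starts before the previous one ends
            overlaps.append((i1, i2))
    return overlaps

print(find_overlaps("data.h5", "records"))   # [] means no colliding chunks
```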
The algorithm that writes the data works with chunked datasets, and explicit resizing of the datasets is definitely done, depending on the data to be written. Under some conditions, the data may also get rewritten.
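Given that, a targeted stress test of the resize-then-rewrite pattern might be the quickest way to corner the bug (a hedged sketch; the growth step, rewrite order, and payloads are all illustrative):

```python
import random
import h5py

# Stress the suspect pattern: grow the dataset, rewrite elements in a
# shuffled order, and flush between rounds. All parameters illustrative.
with h5py.File("stress.h5", "w", driver="core", backing_store=True) as f:
    ds = f.create_dataset("records", shape=(0,), maxshape=(None,),
                          dtype="S75", chunks=(64,), compression="lzf")
    for round_no in range(20):
        ds.resize((ds.shape[0] + 128,))      # explicit resize, as in the job
        order = list(range(ds.shape[0]))
        random.shuffle(order)                # rewrite in a random order
        for i in order:
            ds[i] = b"round-%d-row-%d" % (round_no, i)
        f.flush()
# Then run the read-back verifier and overlap scanner from above on stress.h5.
```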