Checking about corruption issues with the HDF5 core driver

I am writing string data (S75) to datasets, using the core driver with LZF compression. I am using Python with h5py 3.15 and HDF5 2.0.0.
Occasionally I see that a chunk in the .h5 file overwrites another chunk, so when the overwritten chunk is read back, the LZF filter complains. I am not doing anything fancy: I open the file with the core driver, write the data using ds[index] = some_data, and then the file is flushed back to disk. The data does get moved around depending on the order that needs to be maintained.

An example of offset and sizes for some chunks:

chunk_index: 8 dtype: |S75
offset: 3333932 size: 70786
chunk_index: 9 dtype: |S75
offset: 3404718 size: 73350
chunk_index: 10 dtype: |S75
offset: 3478068 size: 72137
chunk_index: 11 dtype: |S75
offset: 3550205 size: 70829
chunk_index: 12 dtype: |S75
offset: 3550205 size: 37129

Chunks 11 and 12 have the same offset but different sizes; reading chunk 11 fails, but reading chunk 12 works. I have not been able to reproduce this outside of the jobs. It happens once in a while as part of jobs that write to the .h5 file.

I wanted to check whether something like this is possible as a valid scenario? I think not. What could be the possible reasons for it?
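For reference, the overlap can be confirmed mechanically from the (offset, size) pairs in the listing above by sorting on offset and checking that each chunk ends before the next begins. A minimal sketch in plain Python (the pairs are copied from the listing; this is just an illustration, not the tool I used to dump them):

```python
# Chunk allocations copied from the listing above (chunk_index: (offset, size)).
chunks = {
    8: (3333932, 70786),
    9: (3404718, 73350),
    10: (3478068, 72137),
    11: (3550205, 70829),
    12: (3550205, 37129),
}

def find_overlaps(chunks):
    """Return (index_a, index_b) pairs whose byte ranges intersect."""
    items = sorted(chunks.items(), key=lambda kv: kv[1][0])
    overlaps = []
    for (ia, (oa, sa)), (ib, (ob, sb)) in zip(items, items[1:]):
        if ob < oa + sa:  # next chunk starts before the previous one ends
            overlaps.append((ia, ib))
    return overlaps

print(find_overlaps(chunks))  # → [(11, 12)]
```

Note that chunks 8 through 11 are exactly contiguous (each starts where the previous one ends), which is what makes the 11/12 collision stand out.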

Hi,

Is it possible for you to give us a reproducer in C?

We have not been able to reproduce this, but we have seen memory-related errors in the jobs where it happens. We actually captured the data that was written by the job and tried to replay it outside the job, but that did not reproduce it either.

That definitely doesn’t sound like a valid or expected scenario. Two chunks ending up with the same offset (but different sizes) strongly suggests some kind of corruption rather than normal HDF5 behavior.

A few things come to mind based on your setup:

  • The core driver keeps everything in memory and only flushes at the end, so if there’s any issue with how memory is being managed (especially with resizing or reallocation), it could lead to inconsistencies like overlapping chunk addresses.

  • Since you mentioned data being “moved around depending upon order,” I’d double-check whether there’s any implicit dataset resizing, chunk reallocation, or rewriting happening that might interact badly with compression.

  • LZF itself is pretty simple and is usually not the cause; more likely it is just exposing the corruption when decompression fails.

  • Also worth checking: are there any concurrent writes (even indirectly, like multiprocessing or reused file handles)? HDF5 isn’t thread/process safe unless explicitly configured, and that can lead to exactly this kind of intermittent corruption.

The fact that you can’t reproduce it easily makes me think it could be timing-related or dependent on memory pressure.

One thing I’ve found helpful in tricky cases like this is isolating patterns or edge cases (chunk sizes, write order, dataset resizing, etc.) and writing small stress scripts that randomize those parameters across many layouts to try to provoke the rare failure.

If you haven’t already, you might also try:

  • Disabling compression temporarily to see if the issue persists

  • Writing with the default (sec2) driver instead of core

  • Enabling HDF5 debug logs (if possible)

  • Verifying file integrity with h5dump or h5check after write
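Along the same lines, it can help to dump every chunk’s (offset, size) right after a write so a collision is caught immediately rather than at read time. A rough sketch of such a helper, assuming an h5py version whose low-level dataset id exposes `get_num_chunks()` / `get_chunk_info()` (h5py ≥ 2.10 on HDF5 ≥ 1.10.5; the function is duck-typed so it only relies on those two methods):

```python
def chunk_locations(dset):
    """Yield (chunk_index, byte_offset, size) for each allocated chunk.

    `dset` is expected to behave like an h5py Dataset: its low-level
    `dset.id` must provide get_num_chunks() and get_chunk_info(i),
    where get_chunk_info returns an object with .byte_offset and .size
    attributes (h5py's StoreInfo).
    """
    for i in range(dset.id.get_num_chunks()):
        info = dset.id.get_chunk_info(i)
        yield i, info.byte_offset, info.size

# Hypothetical usage (requires h5py and a real file; names are placeholders):
#   import h5py
#   with h5py.File("data.h5", "r") as f:
#       for i, off, size in chunk_locations(f["my_dataset"]):
#           print(i, off, size)
```

Feeding the resulting (offset, size) pairs into a simple overlap check after each flush would let you pinpoint exactly which write first produces the collision.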

If chunk offsets are truly colliding, that is almost certainly a bug or misuse somewhere; HDF5 should never assign the same address to two different chunks in a valid file.

The algorithm that writes the data deals with chunked datasets, and explicit resizing of datasets is definitely done, depending on the data to be written. Under some conditions the data may also get rewritten.