I provide an implementation of a method to iterate over only the allocated chunks below. The underlying C function, H5Dchunk_iter, visits allocated chunks exclusively, so unallocated chunks are never touched. Here is the method, get_allocated_chunk_info:
def get_allocated_chunk_info(dataset):
    # chunk_iter invokes the callback once per allocated chunk,
    # passing a StoreInfo record; collect them all in a list.
    chunk_info = []
    dataset.id.chunk_iter(lambda v: chunk_info.append(v))
    return chunk_info
Here is how to use it.
In [93]: with h5py.File("pytest.h5", "w") as f:
    ...:     dset = f.create_dataset("test", (1024*4, 1024*4), chunks=(16, 16))
    ...:     dset[0, 0] = 1
    ...:     dset[16, 16] = 1
    ...:     print(dset.id.get_num_chunks())
    ...:
2
In [94]: def get_allocated_chunk_info(dataset):
    ...:     chunk_info = []
    ...:     dataset.id.chunk_iter(lambda v: chunk_info.append(v))
    ...:     return chunk_info
    ...:
In [95]: with h5py.File("pytest.h5", "r") as f:
    ...:     out = get_allocated_chunk_info(f["test"])
    ...:
In [96]: out
Out[96]:
[StoreInfo(chunk_offset=(0, 0), filter_mask=0, byte_offset=4016, size=1024),
 StoreInfo(chunk_offset=(16, 16), filter_mask=0, byte_offset=5040, size=1024)]
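Each StoreInfo record carries the chunk's logical offset, its filter mask, and its physical byte offset and size within the file. As a quick sketch of what you can do with those fields (assuming the file created above), you can read each chunk's raw bytes back with read_direct_chunk and check them against the reported size:

with h5py.File("pytest.h5", "r") as f:
    dset = f["test"]
    for info in get_allocated_chunk_info(dset):
        # read_direct_chunk returns the filter mask plus the raw
        # (possibly still filtered/compressed) bytes of that chunk.
        filter_mask, raw = dset.id.read_direct_chunk(info.chunk_offset)
        assert len(raw) == info.size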
This method is highly efficient: as shown below, it queries 65,536 chunks in 0.272 seconds, and the resulting list costs less than 9 bytes per chunk (note that sys.getsizeof measures only the list's own pointer storage, not the StoreInfo records it references).
In [100]: %%time
     ...: with h5py.File("pytest.h5", "w") as f:
     ...:     dset = f.create_dataset("test", (1024*4, 1024*4), chunks=(16, 16))
     ...:     dset[:] = 1
     ...:     print(dset.id.get_num_chunks())
     ...:
65536
CPU times: user 4.03 s, sys: 399 ms, total: 4.43 s
Wall time: 4.48 s
In [101]: %%time
     ...: with h5py.File("pytest.h5", "r") as f:
     ...:     print(f["test"].id.get_num_chunks())
     ...:     out = get_allocated_chunk_info(f["test"])
     ...:
65536
CPU times: user 255 ms, sys: 20 ms, total: 275 ms
Wall time: 272 ms
In [103]: sys.getsizeof(out) / 1024
Out[103]: 549.3046875
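For comparison, here is a sketch of the older index-based approach using DatasetID.get_chunk_info, which performs one chunk-index lookup per call and is typically far slower than a single chunk_iter pass once the chunk count gets large:

def get_allocated_chunk_info_by_index(dataset):
    # One get_chunk_info call (and one index lookup) per chunk,
    # versus chunk_iter's single pass over the chunk index.
    dsid = dataset.id
    return [dsid.get_chunk_info(i) for i in range(dsid.get_num_chunks())]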