Efficient access to sparse, chunked datasets

I have a sparse dataset in chunked storage written to HDF5 using h5py. Only the chunks that have data are written to disk, so there must be some metadata that identifies the non-empty chunks.

I want to open the file and operate on the non-empty chunks. However, dset.iter_chunks() will loop over all chunks. There must be a list of non-empty chunks somewhere. How do I access it?

Hi @rtkeskitalo - getting the allocated chunk details is part of h5py’s low-level API. You get to the low level via a .id attribute on the dataset object. So you can do something like this:

[ds.id.get_chunk_info(i) for i in range(ds.id.get_num_chunks())]

Each chunk info object has a chunk_offset attribute which tells you where in the dataset that chunk starts.
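For example, a rough sketch along these lines (untested; the file name and dataset path are placeholders) would visit only the allocated chunks and read each one via its offset and the chunk shape:

import h5py

with h5py.File("sparse.h5", "r") as f:           # placeholder file name
    ds = f["data"]                                # placeholder dataset path
    chunk_shape = ds.chunks
    for i in range(ds.id.get_num_chunks()):
        info = ds.id.get_chunk_info(i)
        start = info.chunk_offset
        # Clip at the dataset edge in case the trailing chunks are partial.
        sel = tuple(slice(s, min(s + c, n))
                    for s, c, n in zip(start, chunk_shape, ds.shape))
        block = ds[sel]                           # reads just this chunk's region
        # ... operate on block ...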

Thanks! That was exactly what I was looking for.

Iterating over the non-empty chunks this way is efficient but a little clunky. I wonder if dset.iter_chunks() should have a keyword argument to skip empty chunks or if there should be another method to iterate over them?

That sounds reasonable. I’ve opened an issue about it, and you’re welcome to make a PR. h5py has contributor docs that might help.

Note that unallocated isn’t precisely the same as empty. You can have allocated but empty chunks if you write empty data to them, or if you change the allocation time. And the dataset can have a fill value, so unallocated space might logically be full of -1 or whatever you’ve set the fill value to, rather than zeros. In a lot of scenarios this distinction won’t matter, of course, but if it’s an option in h5py we’ll need to ensure it’s clearly described. 🙂
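For example (a minimal sketch with made-up names, just to illustrate the fill-value point):

import h5py

with h5py.File("fill_demo.h5", "w") as f:         # hypothetical example file
    # Unallocated chunks read back as the fill value, not as zeros.
    dset = f.create_dataset("x", (32, 32), chunks=(16, 16), fillvalue=-1)
    dset[0, 0] = 7                                 # allocates only the first chunk
    print(dset.id.get_num_chunks())                # 1
    print(dset[31, 31])                            # -1.0: unallocated, but not zero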

I provided an implementation of a method to iterate over only allocated chunks here:

The C function H5Dchunk_iter only iterates over allocated chunks.

Here is the method get_allocated_chunk_info:

def get_allocated_chunk_info(dataset):
    chunk_info = []
    dataset.id.chunk_iter(lambda v: chunk_info.append(v))
    return chunk_info

Here is how to use it.

In [93]: with h5py.File("pytest.h5", "w") as f:
    ...:     dset = f.create_dataset("test", (1024*4,1024*4), chunks=(16,16))
    ...:     dset[0,0] = 1
    ...:     dset[16,16] = 1
    ...:     print(dset.id.get_num_chunks())
    ...: 
2

In [94]: def get_allocated_chunk_info(dataset):
    ...:     chunk_info = []
    ...:     dataset.id.chunk_iter(lambda v: chunk_info.append(v))
    ...:     return chunk_info
    ...: 

In [95]: with h5py.File("pytest.h5", "r") as f:
    ...:     out = get_allocated_chunk_info(f["test"])
    ...: 

In [96]: out
Out[96]: 
[StoreInfo(chunk_offset=(0, 0), filter_mask=0, byte_offset=4016, size=1024),
 StoreInfo(chunk_offset=(16, 16), filter_mask=0, byte_offset=5040, size=1024)]

This method is highly efficient: it enumerates all 65,536 allocated chunks in about 0.272 seconds, and the resulting list takes less than 9 bytes per chunk.

In [100]: %%time
     ...: with h5py.File("pytest.h5", "w") as f:
     ...:     dset = f.create_dataset("test", (1024*4,1024*4), chunks=(16,16))
     ...:     dset[:] = 1
     ...:     print(dset.id.get_num_chunks())
     ...: 
65536
CPU times: user 4.03 s, sys: 399 ms, total: 4.43 s
Wall time: 4.48 s

In [101]: %%time
     ...: with h5py.File("pytest.h5", "r") as f:
     ...:     print(f["test"].id.get_num_chunks())
     ...:     out = get_allocated_chunk_info(f["test"])
     ...: 
65536
CPU times: user 255 ms, sys: 20 ms, total: 275 ms
Wall time: 272 ms

In [103]: sys.getsizeof(out) / 1024
Out[103]: 549.3046875
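
One caveat on that last number: sys.getsizeof only counts the list’s own pointer array, not the StoreInfo tuples it references. If the footprint matters for very large datasets, one option (a sketch, not something h5py provides) is to keep just the chunk offsets in a compact NumPy array:

import numpy as np

def allocated_chunk_offsets(dataset):
    # Hypothetical helper: collect the offsets of allocated chunks
    # into a compact (n_chunks, ndim) int64 array.
    offsets = []
    dataset.id.chunk_iter(lambda info: offsets.append(info.chunk_offset))
    return np.asarray(offsets, dtype=np.int64)

For the 65,536-chunk example above that works out to 16 bytes per chunk all-in, with no per-object Python overhead.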