Read only part of large compound dataset member


#1

Hello,

My h5 file has a dataset with a simple table structure with three columns and about 8000 rows. I believe it was created using the high-level H5TB set of functions.
The members of the compound dataset (i.e. the columns of the table) are:

  • 32 bit integer
  • 64 bit integer
  • Array [102400] of 8-bit unsigned integer

Here’s the problem: For each row, I only need to read the first ~100 bytes of the array.

With H5TBread_fields_index I am able to read the whole column; however, that basically means reading the whole dataset, and performance-wise that isn’t what I’m looking for. I tried defining my own compound datatype, defining my own opaque datatype, and reading via hyperslabs etc., but to no avail: H5Dread would always throw an error.
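To put numbers on the cost, here is a rough sketch of the arithmetic. It assumes the three members are stored packed in the file, in the order listed above, with no padding; the real layout may differ, and all function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN 102400u  /* length of the uint8 array member, from the post */

/* ASSUMPTION: packed in-file record of int32, int64, uint8[102400]. */
static size_t array_offset(void)
{
    /* the array starts right after the two integer members */
    return sizeof(int32_t) + sizeof(int64_t);
}

static size_t row_size(void)
{
    return array_offset() + ARRAY_LEN;
}

/* Bytes touched when every full row is read. */
static unsigned long long bytes_whole_dataset(size_t nrows)
{
    return (unsigned long long)nrows * row_size();
}

/* Bytes actually needed when only the first hdr_len bytes of the array matter. */
static unsigned long long bytes_headers_only(size_t nrows, size_t hdr_len)
{
    return (unsigned long long)nrows * hdr_len;
}
```

Under these assumptions, 8000 rows come to roughly 819 MB for the whole dataset, against about 800 KB for 100-byte headers alone.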

Is there a way to read this array only partially for each row?


#2

If by ‘the array’ you mean the “Array [102400] of 8-bit unsigned integer”-field of the compound, that’s currently not supported. What’s in those magic “first ~100 bytes”? Can you define another adjacent field with just that signature content?

G.


#3

Thanks for the quick support. I feared that reading a field only partially wasn’t supported.

The first ~100 bytes are a header for the payload data that follows. I need to look at all the headers in a file to know what’s in it (basically index the file), so I can do “random” access to the rows later on.

I know that this would be much easier if the file structure had been designed better, e.g. by putting all the headers in their own column. But I wasn’t responsible for that part and now I have to work with what already exists.


#4

It’s a little fiddly, but you can certainly read those bytes directly from the file (e.g., using pread or readv).
Depending on the layout (contiguous, chunked), you can obtain the data offsets in the file and read what you need. Not pretty, but effective. If you did this repeatedly, you could keep that map on the side for reuse.
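As a minimal sketch of the contiguous case, assuming the raw data is stored uncompressed as fixed-size rows back to back: the function name and all parameters below are hypothetical, and the two offsets (start of the dataset's raw data in the file, and start of the array field within a row) are assumed to be already known.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the first hdr_len bytes of one field from each of nrows rows.
 *   data_offset  : file offset of row 0 (ASSUMED known, contiguous layout)
 *   row_size     : size in bytes of one row as stored in the file
 *   field_offset : offset in bytes of the field within a row
 *   out          : buffer of at least nrows * hdr_len bytes
 * Returns 0 on success, -1 on a read error or unexpected end of file.
 */
static int read_field_prefixes(int fd, off_t data_offset, size_t row_size,
                               size_t field_offset, size_t hdr_len,
                               size_t nrows, unsigned char *out)
{
    for (size_t i = 0; i < nrows; i++) {
        off_t pos = data_offset + (off_t)(i * row_size + field_offset);
        size_t done = 0;

        /* pread may return fewer bytes than requested, so loop until done */
        while (done < hdr_len) {
            ssize_t n = pread(fd, out + i * hdr_len + done,
                              hdr_len - done, pos + (off_t)done);
            if (n <= 0)
                return -1;
            done += (size_t)n;
        }
    }
    return 0;
}
```

For a chunked dataset the same loop would have to run per chunk with each chunk's own file offset, and it would not work at all if a compression filter is applied, since the bytes on disk are then not the raw rows.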

G.


#5

Hi, thanks for the suggestion. Is there a way to get the offset into the file for data inside dataset fields? In my example, that would be the offset in bytes of the start of my array.


#6

G.