Get Object Header size


#1

Hi,

I’m trying to create a Compound Dataset with several hundred fields and I get the error:

H5Oalloc.c line 1312 in H5O__alloc(): object header message is too large

I read there is a limit (in HDF5-1.8) for the object header (64 KB): https://portal.hdfgroup.org/pages/viewpage.action?pageId=48808714

First Question: Does this apply to versions higher than HDF5-1.8? I’m using 1.10.4.

Second Question: How can I get the size of the object header?
I tried it with .getSize() of my CompType (C++), but with this I get about 16 KB with a working number of fields. When I add one more field at this stage, I get the error above. So…

Third Question: What is stored in the object header?
Are the field names responsible for the gap to 64 KB?

Thank you very much!

Edit: I tried H5O_GET_INFO1, but I can’t interpret the result. As the documentation says:

Please be aware that the information held by H5O_hdr_info_t may only be useful to developers with extensive HDF5 experience.

https://portal.hdfgroup.org/display/HDF5/H5O_GET_INFO1


#2

Yes. To the best of my knowledge, this restriction is still there in HDF5 1.12.0 with respect to the HDF5 file format. If you look at the file format specification, you will find in sections IV.A.1.a. and IV.A.1.b. that the size of header message data has to fit into two bytes, i.e., the current maximum is 65,535.

H5Oget_info2 fills in an H5O_info1_t. Be sure to specify H5O_INFO_HDR among the fields to be retrieved. (HDF5 1.10)

H5Oget_native_info fills in an H5O_native_info_t, whose hdr member is an H5O_hdr_info_t, which I believe also has the information you are asking for. (HDF5 1.12+)

Quoting IV.A.

The header information of an object is designed to encompass all of the information about an object, except for the data itself. This information includes the dataspace, the datatype, information about how the data is stored on disk (in external files, compressed, broken up in blocks, and so on), as well as other information used by the library to speed up access to the data objects or maintain a file’s integrity. Information stored by user applications as attributes is also stored in the object’s header. The header of each object is not necessarily located immediately prior to the object’s data in the file and in fact, may be located in any position in the file. The order of the messages in an object header is not significant.

Object headers are composed of a prefix and a set of messages. The prefix contains the information needed to interpret the messages and a small amount of metadata about the object, and the messages contain the majority of the metadata about the object.

In the case of a datatype, the field names are part of the message encoding of a compound datatype (see section IV.A.2.d.). So, yes, they count against the 64 KB.

Can you tell us more about your use case? To be honest, I have yet to see a compelling use case for compound types with hundreds of fields. The implications for tools (interoperability) and performance are profound, and I’m sure you’ve thought about that. Are you really accessing whole records (all fields) most of the time? While on paper you can do partial I/O (accessing only a subset of fields), in practice you’ll lose a lot of read performance. Have you thought about at least breaking up your datatype monolith into column groups, i.e., use an (HDF5) group of datasets of compound types, and in the compounds really group only the fields that you are certain will always be accessed together?

Also, what are the field types involved?

Best, G.


#3

I will try H5Oget_info2 tomorrow, when I’m back at my computer.

To my use case:

My compound type consists of one timestamp field (double) and a varying number of double fields (sensor signals) and variable-length string fields (information like current state/phase).
The number of fields varies from file to file, depending on how many sensors and other pieces of information are recorded.
At the beginning I expected about 50-100; now the number keeps increasing…

You are right, I’m reading the dataset column-wise most of the time. I can’t tell which fields will be accessed together, except for the time field, which I mostly need.

I thought about splitting the compound dataset into one double dataset and one string dataset. Until now, I didn’t want to change the structure because our HDF5-reading applications depend on it. But maybe now is the time…

Also because of performance issues with the increasing number of fields.

For me the write performance is a little more important than the read performance. Is the write performance better with the split approach? And how about one dataset per signal? Is it possible to make a general statement?

Do you have some advice?
Thank you very much.


#4

The hit on write performance is not as severe if all signals/fields are acquired in unison. If that changes in the future, you’d have a problem, because the channels/signals are artificially coupled in the compound. Having different types (e.g., the vlen string) will reduce performance and certainly, compression performance (if that applies) will be reduced with larger records.

It’s difficult to give general recommendations. Compound and an HDF5 group of individual columns are, if you wish, two ends or extremes of a spectrum. They perform best in a very narrow set of circumstances. You probably want to stay in the middle of the road and have perhaps column families (compounds of signals always read together) that are blocked in time, i.e., you want to partition your data horizontally and vertically. That will give you what I would call “best performance on average.” It’s not gonna outperform the extremes in their sweet spots, but pretty much everywhere else.
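
To make "partition horizontally and vertically" a bit more concrete, here is a minimal sketch of the index arithmetic only, with no HDF5 calls. It assumes a hypothetical layout in which every signal is assigned to one column family and samples are cut into fixed-length time blocks. The names Location, locate, family_of, and block_len are illustrative and not part of any HDF5 API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Where one sample of one signal lives in the partitioned layout.
struct Location {
    std::size_t family;  // column family (vertical partition)
    std::size_t block;   // time block (horizontal partition)
    std::size_t row;     // row inside that time block
};

// family_of[s] is the column family assigned to signal s. This grouping is an
// assumption: it has to come from knowledge of which signals are read together.
inline Location locate(const std::vector<std::size_t>& family_of,
                       std::size_t signal, std::size_t sample,
                       std::size_t block_len) {
    assert(block_len > 0);
    assert(signal < family_of.size());
    return Location{family_of[signal], sample / block_len, sample % block_len};
}
```

With block_len = 1000, sample 2500 of a signal in family 1 would land in block 2, row 500 of that family. How the families and blocks map onto datasets or chunks is a separate design decision that this sketch does not cover.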

G.


#5

Okay, now I could read the size of the object header.
The field “total” in the struct “space” in the struct “H5O_hdr_info_t” was what I needed.
I can see how the size changes when I add fields to the compound but also when I change the length of the names.
The limit in my case is about 800 fields, depending on the length of the field names.
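
As a rough back-of-the-envelope check of these numbers, the sketch below divides the 65,535-byte message limit mentioned earlier in the thread by the number of fields. It is plain arithmetic, not a model of the actual datatype encoding: bytes_per_field is a hypothetical lump sum covering the field name and whatever else is stored per field, and the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Largest message-data size that fits into a two-byte size field.
constexpr std::size_t kMaxMessageBytes = 65535;

// Average number of bytes available per field if n_fields share the budget.
inline std::size_t average_bytes_per_field(std::size_t n_fields) {
    assert(n_fields > 0);
    return kMaxMessageBytes / n_fields;
}

// Largest field count that fits if every field costs bytes_per_field bytes
// (name plus per-field encoding; the exact split is not modelled here).
inline std::size_t max_fields(std::size_t bytes_per_field) {
    assert(bytes_per_field > 0);
    return kMaxMessageBytes / bytes_per_field;
}
```

For 800 fields this gives an average budget of 81 bytes per field, which fits the observation that the limit moves with the length of the field names.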

Again about the structure of the file:
I’m still thinking about creating a normal dataset for each signal, i.e., a 1D float array, with the time in a separate dataset and all signals grouped in one group.
Would this still be an extreme solution if I could not group signals in a meaningful way in one compound or 2D float dataset? I could group them by data type, but that would not correspond to later use. I can’t tell which signals will be read together later for plotting or exporting, for example.

Sometimes signals have to be converted to a text file and therefore read and written line by line, but if it takes longer there, I can live with it. For this I would also (try to) choose an optimal chunk size.
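
A minimal sketch of such a line-by-line export from column-oriented data could look like the following. It works on in-memory vectors only and makes no HDF5 calls. The assumption is that a real exporter would read one block of each signal dataset (ideally aligned with the chunk size) where the comment indicates. Column, format_block, and write_rows are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

using Column = std::vector<double>;

// Format rows [start, stop) as "t,s0,s1,...\n" lines.
inline std::string format_block(const Column& time,
                                const std::vector<Column>& signals,
                                std::size_t start, std::size_t stop) {
    std::ostringstream text;
    for (std::size_t row = start; row < stop; ++row) {
        text << time[row];
        for (const Column& s : signals) {
            text << ',' << s[row];
        }
        text << '\n';
    }
    return text.str();
}

// Walk the columns block by block. In a real exporter, each iteration would
// first read one block of every signal from the file (assumption).
inline void write_rows(std::ostream& out, const Column& time,
                       const std::vector<Column>& signals,
                       std::size_t block_len) {
    assert(block_len > 0);
    for (const Column& s : signals) {
        assert(s.size() == time.size());
    }
    for (std::size_t start = 0; start < time.size(); start += block_len) {
        const std::size_t stop = std::min(start + block_len, time.size());
        out << format_block(time, signals, start, stop);
    }
}
```

Processing in blocks keeps memory use bounded no matter how long the recording is; the block length is a tuning parameter and would need to be measured against the actual chunk layout.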


#6

No, nothing wrong with that. If the signals are typically read individually, by all means, put them into separate datasets. I would be worried if you had 100,000s of signals in the same group.

G.


#7

I would be worried if you had 100,000s of signals in the same group.

No, there will be a maximum of 1000 to 2000 signals.

Thank you again for your advice!