Yes. To the best of my knowledge, this restriction is still there in HDF5 1.12.0 with respect to the HDF5 file format. If you look at the file format specification, you will find in sections IV.A.1.a and IV.A.1.b that the size of header message data is stored as a two-byte value, i.e., the current maximum is 65,535 bytes.
H5Oget_info2 returns an H5O_info1_t. Be sure to specify H5O_INFO_HDR among the fields to be retrieved. (HDF5 1.10)
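A minimal sketch of that call against a 1.10 build (the file and dataset names are hypothetical):

```c
#include <hdf5.h>
#include <stdio.h>

int main(void)
{
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "dset", H5P_DEFAULT);

    H5O_info_t info; /* spelled H5O_info1_t in later releases */
    /* Ask only for the object header fields. */
    if (H5Oget_info2(dset, &info, H5O_INFO_HDR) >= 0)
        printf("header: %u messages, %llu bytes of message data\n",
               info.hdr.nmesgs,
               (unsigned long long)info.hdr.space.mesg);

    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```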
H5Oget_native_info returns an H5O_native_info_t, which I believe also carries the information you are asking for, in its hdr member (an H5O_hdr_info_t). (HDF5 1.12+)
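On 1.12 the equivalent query would look something like this (again a sketch, reusing dset from above):

```c
/* HDF5 1.12+: the header info lives in H5O_native_info_t.hdr. */
H5O_native_info_t ninfo;
if (H5Oget_native_info(dset, &ninfo, H5O_NATIVE_INFO_HDR) >= 0)
    printf("header: %u messages, %llu bytes of message data\n",
           ninfo.hdr.nmesgs,
           (unsigned long long)ninfo.hdr.space.mesg);
```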
Quoting IV.A.
The header information of an object is designed to encompass all of the information about an object, except for the data itself. This information includes the dataspace, the datatype, information about how the data is stored on disk (in external files, compressed, broken up in blocks, and so on), as well as other information used by the library to speed up access to the data objects or maintain a file’s integrity. Information stored by user applications as attributes is also stored in the object’s header. The header of each object is not necessarily located immediately prior to the object’s data in the file and in fact, may be located in any position in the file. The order of the messages in an object header is not significant.
Object headers are composed of a prefix and a set of messages. The prefix contains the information needed to interpret the messages and a small amount of metadata about the object, and the messages contain the majority of the metadata about the object.
In the case of a datatype, the field names are part of the message encoding of a compound datatype (see section IV.A.2.d.). So, yes, they count against the 64 KB limit.
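If you want to gauge how close a compound type gets to the limit before writing anything, H5Tencode can serve as a rough proxy for the size of the on-disk datatype message (a sketch with made-up field names; the encoded size isn't guaranteed to be byte-identical to the header message, but it's in the right ballpark):

```c
/* Each field name's bytes count toward the datatype message size. */
hid_t ctype = H5Tcreate(H5T_COMPOUND, 2 * sizeof(double));
H5Tinsert(ctype, "a_rather_long_descriptive_field_name", 0, H5T_NATIVE_DOUBLE);
H5Tinsert(ctype, "another_field", sizeof(double), H5T_NATIVE_DOUBLE);

size_t nbytes = 0;
H5Tencode(ctype, NULL, &nbytes);  /* NULL buffer: just report the size */
printf("encoded datatype: %zu bytes (limit 65,535)\n", nbytes);
H5Tclose(ctype);
```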
Can you tell us more about your use case? To be honest, I have yet to see a compelling use case for compound types with hundreds of fields. The implications for tools (interoperability) and performance are profound, and I'm sure you've thought about that. Are you really accessing whole records (all fields) most of the time? While on paper you can do partial I/O (accessing only a subset of fields), in practice you'll lose a lot of read performance. Have you thought about at least breaking up your datatype monolith into column groups, i.e., using an (HDF5) group of datasets of compound types, where each compound groups only the fields that you are certain will always be accessed together? (There's a sketch of that layout below.)
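To make the column-group idea concrete, here's a rough sketch (all names are hypothetical): one group per record type, with one narrow compound dataset per set of fields that travel together.

```c
/* "Column group" layout: a group of narrow compound datasets
 * instead of one monolithic compound. */
hid_t grp = H5Gcreate2(file, "records", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

/* One compound per set of fields that are always accessed together. */
hid_t pos = H5Tcreate(H5T_COMPOUND, 3 * sizeof(double));
H5Tinsert(pos, "x", 0,                  H5T_NATIVE_DOUBLE);
H5Tinsert(pos, "y", sizeof(double),     H5T_NATIVE_DOUBLE);
H5Tinsert(pos, "z", 2 * sizeof(double), H5T_NATIVE_DOUBLE);

hsize_t dims[1] = {1000};
hid_t fspace = H5Screate_simple(1, dims, NULL);
hid_t dpos   = H5Dcreate2(grp, "position", pos, fspace,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
/* ...repeat for "velocity", "bookkeeping", etc. ... */
```

Readers that only need the position columns then touch only that dataset, instead of dragging hundreds of unrelated fields through the I/O path.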
Also, what are the field types involved?
Best, G.