Get Object Header size

Hi,

I’m trying to create a Compound Dataset with several hundred fields and I get the error:

H5Oalloc.c line 1312 in H5O__alloc(): object header message is too large

I read there is a limit (in HDF5-1.8) for the object header (64 KB): https://portal.hdfgroup.org/pages/viewpage.action?pageId=48808714

First Question: Does this apply to versions higher than HDF5-1.8? I’m using 1.10.4.

Second Question: How can I get the size of the object header?
I tried .getSize() on my CompType (C++), but with that I get only about 16 KB with a number of fields that still works. When I add one more field at that point, I get the
said error. So…

Third Question: What is stored in the object header?
Are the field names responsible for the gap to 64 KB?

Thank you very much!

Edit: I tried H5O_GET_INFO1, but I can’t interpret the result. As the documentation says:

Please be aware that the information held by H5O_hdr_info_t may only be useful to developers with extensive HDF5 experience.

https://portal.hdfgroup.org/display/HDF5/H5O_GET_INFO1

Yes. To the best of my knowledge, this restriction is still there in HDF5 1.12.0 with respect to the HDF5 file format. If you look at the file format specification, you will find in sections IV.A.1.a and IV.A.1.b that the size of header message data has to fit into two bytes, i.e., the current maximum is 65,535 bytes.

H5Oget_info2 fills in an H5O_info1_t. Be sure to specify H5O_INFO_HDR among the fields to be retrieved. (HDF5 1.10)

H5Oget_native_info fills in an H5O_native_info_t, whose hdr member is an H5O_hdr_info_t, which I believe also has the information you are asking for. (HDF5 1.12+)
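For illustration, a minimal sketch of the 1.10 route (this assumes HDF5 1.10.3+ for H5Oget_info2; “example.h5” and “my_dataset” are placeholder names, and error checking is omitted):

#include "hdf5.h"
#include <iostream>

int main() {
   hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
   hid_t dset = H5Dopen2(file, "my_dataset", H5P_DEFAULT);

   // Request only the object-header portion of the info struct.
   H5O_info_t oinfo;
   if (H5Oget_info2(dset, &oinfo, H5O_INFO_HDR) >= 0) {
      std::cout << "header total: " << oinfo.hdr.space.total << " bytes, "
                << "message data: " << oinfo.hdr.space.mesg << ", "
                << "free: " << oinfo.hdr.space.free << std::endl;
   }

   H5Dclose(dset);
   H5Fclose(file);
   return 0;
}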

Quoting IV.A.

The header information of an object is designed to encompass all of the information about an object, except for the data itself. This information includes the dataspace, the datatype, information about how the data is stored on disk (in external files, compressed, broken up in blocks, and so on), as well as other information used by the library to speed up access to the data objects or maintain a file’s integrity. Information stored by user applications as attributes is also stored in the object’s header. The header of each object is not necessarily located immediately prior to the object’s data in the file and in fact, may be located in any position in the file. The order of the messages in an object header is not significant.

Object headers are composed of a prefix and a set of messages. The prefix contains the information needed to interpret the messages and a small amount of metadata about the object, and the messages contain the majority of the metadata about the object.

In the case of a datatype, the field names are part of the message encoding of a compound datatype (see section IV.A.2.d.). So, yes, they count against the 64 KB.

Can you tell us more about your use case? To be honest, I have yet to see a compelling use case for compound types with hundreds of fields. The implications for tools (interoperability) and performance are profound, and I’m sure you’ve thought about that. Are you really accessing whole records (all fields) most of the time? While on paper you can do partial I/O (accessing only a subset of fields), in practice you’ll lose a lot of read performance. Have you thought about at least breaking up your datatype monolith into column groups, i.e., use an (HDF5) group of datasets of compound types, and in the compounds really group only the fields that you are certain will always be accessed together?

Also, what are the field types involved?

Best, G.

I will try H5Oget_info2 tomorrow, when I’m back at my computer.

To my use case:

My compound type consists of one timestamp field (double), a varying number of double fields (sensor signals), and variable-length string fields (information such as the current state/phase).
The number of fields varies from file to file, depending on how many sensors and other pieces of information are recorded.
At the beginning I reckoned with about 50-100 fields; now the number keeps increasing…
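Roughly sketched, the compound is built along these lines (the struct and the field names below are only illustrative placeholders, not my real ones):

#include "H5Cpp.h"

struct Record {
   double time;     // timestamp
   double sensor1;  // one of many sensor signals (double)
   char*  state;    // variable-length string, e.g. current state/phase
};

H5::CompType makeRecordType() {
   // Variable-length string type for the text fields.
   H5::StrType vlstr(H5::PredType::C_S1, H5T_VARIABLE);

   H5::CompType ct(sizeof(Record));
   ct.insertMember("time",    HOFFSET(Record, time),    H5::PredType::NATIVE_DOUBLE);
   ct.insertMember("sensor1", HOFFSET(Record, sensor1), H5::PredType::NATIVE_DOUBLE);
   ct.insertMember("state",   HOFFSET(Record, state),   vlstr);
   // ...in the real file there are tens to hundreds of further members.
   return ct;
}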

You are right, I read the dataset column-wise most of the time. I can’t tell which fields will be accessed together, except for the time field, which I need almost always.

I thought about splitting the compound dataset into one double dataset and one string dataset. Until now I didn’t want to change the structure because of the dependencies of our HDF5-reading applications. But maybe now is the time…

Also because of performance issues with the increasing number of fields.

For me, write performance is a little more important than read performance. Is write performance better with the split approach? And how about one dataset per signal? Is it possible to make a general statement?

Do you have some advice?
Thank you very much.

The hit on write performance is not as severe if all signals/fields are acquired in unison. If that changes in the future, you’d have a problem, because the channels/signals are artificially coupled in the compound. Having mixed field types (e.g., the vlen strings) will reduce performance, and compression (if you use it) will certainly work less well with larger records.

It’s difficult to give general recommendations. A compound dataset and an HDF5 group of individual columns are, if you wish, the two ends or extremes of a spectrum. Each performs best in a very narrow set of circumstances. You probably want to stay in the middle of the road and have perhaps column families (compounds of signals that are always read together) that are blocked in time, i.e., you want to partition your data horizontally and vertically. That will give you what I would call “best performance on average.” It won’t outperform the extremes in their sweet spots, but it will pretty much everywhere else.
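To make that concrete, here is a rough sketch of such a middle-ground layout (all names, the record struct, and the chunk size are invented for illustration):

#include "H5Cpp.h"

// One "column family": fields that are always read together, chunked in time.
struct TempFamily { double time; double temp1; double temp2; };

void createColumnFamilies(H5::H5File& file) {
   H5::Group group = file.createGroup("/signals");

   H5::CompType tempType(sizeof(TempFamily));
   tempType.insertMember("time",  HOFFSET(TempFamily, time),  H5::PredType::NATIVE_DOUBLE);
   tempType.insertMember("temp1", HOFFSET(TempFamily, temp1), H5::PredType::NATIVE_DOUBLE);
   tempType.insertMember("temp2", HOFFSET(TempFamily, temp2), H5::PredType::NATIVE_DOUBLE);

   // Extendible 1-D dataspace, so records can be appended as time goes on.
   hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED};
   H5::DataSpace space(1, dims, maxdims);

   // Chunking along the time axis gives the "horizontal" partitioning.
   H5::DSetCreatPropList dcpl;
   hsize_t chunk[1] = {4096};
   dcpl.setChunk(1, chunk);

   group.createDataSet("temperatures", tempType, space, dcpl);
   // ...one dataset per column family (pressures, states, ...), which gives
   // the "vertical" partitioning.
}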

G.

Okay, now I was able to read the size of the object header.
The field “total” in the struct “space” inside “H5O_hdr_info_t” was what I needed.
I can see how the size changes when I add fields to the compound but also when I change the length of the names.
The limit in my case is about 800 fields, depending on the length of the field names.

Again about the structure of the file:
I’m still thinking about creating a normal dataset for each signal, i.e., a 1D float array, with the time in a separate dataset and all signal datasets collected in one group.
Would this still be an extreme solution if I cannot group signals in a meaningful way into one compound or 2D float dataset? I could group them by data type, but that would not correspond to later use: I can’t tell which signals will be read together later, for plotting or exporting, for example.

Sometimes signals have to be converted to a text file and therefore read and written line by line, but if that takes longer, I can live with it. For this I would also try to choose an optimal chunk size.

No, nothing wrong with that. If the signals are typically read individually, by all means, put them into separate datasets. I would be worried if you had 100,000s of signals in the same group.
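If it helps, the per-signal layout could look roughly like this (the names, the float element type, and the chunk size are placeholders):

#include "H5Cpp.h"
#include <string>
#include <vector>

void createPerSignalDatasets(H5::H5File& file, const std::vector<std::string>& signalNames) {
   H5::Group group = file.createGroup("/signals");

   // Extendible 1-D dataspace shared by all signal datasets.
   hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED};
   H5::DataSpace space(1, dims, maxdims);

   H5::DSetCreatPropList dcpl;
   hsize_t chunk[1] = {8192};
   dcpl.setChunk(1, chunk);

   // Time in its own (double) dataset, one float dataset per signal.
   group.createDataSet("time", H5::PredType::NATIVE_DOUBLE, space, dcpl);
   for (const auto& name : signalNames)
      group.createDataSet(name, H5::PredType::NATIVE_FLOAT, space, dcpl);
}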

G.

I would be worried if you had 100,000s of signals in the same group.

No, there will be a maximum of 1000 to 2000 signals.

Thank you again for your advice!

I have the same issue. I am using a compound datatype with the C++ API, and I have not figured out how to calculate the object header size. For some datasets I run into this exact error at the bottom of the error stack. Here is a snippet of my code:
try
{
   H5::Exception::dontPrint();
   H5::CompType value = *newCompoundType;
   hid_t err_stck_id = H5Eget_current_stack();
   std::cout << "current error stack id: " << err_stck_id << std::endl;
   newDataSet = new H5::DataSet(file->createDataSet(dataSetName.c_str(),
                                                    value,
                                                    space,
                                                    createParameters));
   charArray = new char[dataSz];
   copyCacheDataToCharArrayTable(cache, charArray);
   newDataSet->write(charArray, *newCompoundType);
}
catch (const H5::Exception& error)
{
   error.printErrorStack();
   // get header size
   std::cout << "in the catch\n";
   // H5O_info_t dsetInfo;
   // if (newDataSet)
   //    std::cout << "dataset is not empty\n";
   // H5Oget_info(newDataSet->getId(), &dsetInfo); //, H5O_INFO_HDR);
   // std::cout << "dataset " << dataSetName << " header size: " << dsetInfo.hdr.space.total << "\n"
   //           << dsetInfo.fileno << "\t" << dsetInfo.addr << "\t" << dsetInfo.type << "\n"
   //           << dsetInfo.rc << "\t" << dsetInfo.ctime << "\t" << dsetInfo.num_attrs << "\n"
   //           << dsetInfo.hdr.version << "\t" << dsetInfo.hdr.nmesgs << "\t" << dsetInfo.hdr.nchunks << "\n"
   //           << dsetInfo.hdr.space.total << "\t" << dsetInfo.hdr.space.mesg << "\t" << dsetInfo.hdr.space.meta << "\t" << dsetInfo.hdr.space.free << "\n"
   //           << dsetInfo.meta_size.attr.heap_size << "\t" << dsetInfo.meta_size.attr.index_size << "\n"
   //           << dsetInfo.meta_size.obj.heap_size << "\t" << dsetInfo.meta_size.obj.index_size << "\n";
   //File *stream;
   //H5::Exception::printErrorStack();
   H5E_minor_t min_num = 0;
   //std::cout << error_stack_id << std::endl;
   //H5Ewalk2(error_stack_id, H5E_WALK_UPWARD, visitErrorStackLevel, &min_num);
   //H5::Exception::walkErrorStack(H5E_WALK_DOWNWARD, visitErrorStackLevel, &min_num);
   H5Ewalk2(H5E_DEFAULT, H5E_WALK_UPWARD, visitErrorStackLevel, NULL);
   error.walkErrorStack(H5E_WALK_DOWNWARD, &visitErrorStackLevel, &min_num);

   if (newDataSet)
      delete newDataSet;
   if (charArray)
      delete [] charArray;
   space.close();
   delete newCompoundType;
   cleanUpAndCloseAll();
   std::string errMsg = "\n===================================ERROR========================================\n"
                        "Error generating the following dataset: " + dataSetName + " in file: " + filename
                        + "\n" + "while doing: " + error.getFuncName() + " with error message: " + error.getDetailMsg()
                        + "\n" + error.getCDetailMsg() + "\n" + error.getCFuncName()
                        + "\n========================END OF ERROR REPORT=====================================\n";
   std::throw_with_nested(std::runtime_error(errMsg));
}

I am using a try{} catch{} block. My initial goal is to use error.walkErrorStack(…) to walk to the bottom of the stack and detect “object header message is too large”; if I find it, I will create a group with each column of the compound datatype as a separate dataset in that group. But error.walkErrorStack(…) does nothing. So I tried error.printErrorStack(), and nothing is printed by this statement in the catch block. Then I commented out H5::Exception::dontPrint() in the try block; still no error stack is printed to stdout. That makes me think the error stack has already been cleared in the catch{} block.
Then I tried @gheber’s suggestion to get the header size using H5Oget_info. H5Oget_info2(dataSetId…) needs newDataSet->getId(), but since the exception was thrown because the object header message is too large, my newDataSet is never assigned. If I uncomment the block of code for H5Oget_info, I am sure to get another segmentation fault.

I am completely stuck now on how to solve this problem.

My questions are:

  1. Once an HDF5 exception is caught in the catch{} block, is the error stack already cleared?
  2. How can I preserve the error stack in the catch block so that I can walk it and check whether the error is due to “object header message is too large”?
  3. I am using HDF5-1.10.6. If this is a known bug in HDF5 exception handling, how can I estimate the header size from the list of column headers (a list of strings) and the datatype of each column?
    I don’t know the number of columns, or how many characters each column header will have, before I run my program; this can vary from run to run.

@ltz_07866 Have you tried to comment out this statement: H5::Exception::dontPrint()?

After constructing your compound datatype, you can use H5Tencode to estimate the size of the underlying datatype message. (I don’t know if that function is exposed in the C++ wrapper.) OK?
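For example, something along these lines should give you the encoded size of the datatype before you attempt to create the dataset (a sketch using the C API on the id of the C++ object; the helper name is mine):

#include "H5Cpp.h"

// Returns the size, in bytes, that the serialized datatype would occupy.
size_t encodedTypeSize(const H5::CompType& ct) {
   size_t nalloc = 0;
   // With a NULL buffer, H5Tencode only reports the required buffer size.
   if (H5Tencode(ct.getId(), NULL, &nalloc) < 0)
      return 0;
   return nalloc;
}

If encodedTypeSize(yourType) gets anywhere near 64 KB, creating a dataset with that type will fail with “object header message is too large.”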

G.

@ltz_07866 I’m sorry for the late reply. Yes, H5Tencode’s C++ counterpart is H5::DataType::encode(), but you probably already figured that out.