Reference Bit Field

Hello,

I am continuing my work on creating an HDF5 parser in C#. I am able to parse files and am now on the task of extracting the data they contain.

While extracting a Dataset whose datatype is a Compound Bit Field, I am having trouble with one of the members, which is a Reference Bit Field. The problem I am having is that the documentation does not specify where to access the data associated with the field, i.e., an address in the file. All I am able to ascertain is that it is a reference to a location in the Dataset (Version 1/Type 1 for the Reference Bit Field).

There are two other similarly named members in the Compound Bit Field that appear to be associated with the reference (or maybe it is a coincidence). The three members are “VarDataSize”, “VarDataOffset”, and “VarDataRef” (the Reference Bit Field). “VarDataSize” and “VarDataOffset” do contain data, which is why I am assuming they are somehow associated with the Reference Bit Field.

Can I get a more detailed explanation of where the actual referenced data is located? When looking at the Dataset in the HDFView application, there is associated data; however, I cannot figure out where in the Dataset to retrieve it.

Also, out of curiosity, are the two other members that look like they are related to the Reference Bit Field member automatically added when a Reference Bit Field is added to a Dataset?

As always, thanks for any information that can be provided.

I’m not sure I understand what kind of datatype that is. Can you send us the output of

h5dump -pH -d <DATASET> <FILE>

of the dataset in question?

G.

Good Morning,

I ran the command you recommended and it did not show the actual Datatype of the Dataset, so I took a screenshot of the HDFView application with the corresponding Datatype information for the Dataset. The Datatype in the Compound Datatype is the last one displayed, “VarDataRef”, with the Type “Reference”.

Hope this helps, and if there is anything else needed, please reach out. Thanks.

ReferenceBitField_Dump.zip (431.6 KB)

I see. The dataset uses a so-called committed or named datatype, which is linked at /DataTypes/Events/Data. Would you mind sending the output of

h5dump -t /DataTypes/Events/Data <FILE>

?

G.

Thanks for the quick response. Attached is the output from the command you sent. Thanks again.

DataTypes.txt (1.8 KB)

Thanks. OK, here’s our datatype:

DATATYPE "/DataTypes/Events/Data" H5T_COMPOUND {
   H5T_IEEE_F64LE "frameTime_msecSinceEpoch";
   H5T_IEEE_F64LE "simulationTime_msecSinceEpoch";
   H5T_STD_I32LE "DatumCount";
   H5T_STD_I32LE "PDUCount";
   H5T_STD_I32LE "RequestId";
   H5T_STRING {
      STRSIZE 256;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   } "OriginatorId";
   H5T_STRING {
      STRSIZE 256;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   } "ReceiverId";
   H5T_STRING {
      STRSIZE 256;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   } "DatumType";
   H5T_STD_I32LE "DatumId";
   H5T_STD_I32LE "FixedData";
   H5T_STD_I32LE "VarDataSize";
   H5T_STD_U64LE "VarDataOffset";
   H5T_REFERENCE { H5T_STD_REF_DSETREG } "VarDataRef";
}

Perhaps PDU stands for “protocol data unit”? Maybe the interesting part is the last six fields:

   H5T_STRING {
      STRSIZE 256;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   } "DatumType";
   H5T_STD_I32LE "DatumId";
   H5T_STD_I32LE "FixedData";
   H5T_STD_I32LE "VarDataSize";
   H5T_STD_U64LE "VarDataOffset";
   H5T_REFERENCE { H5T_STD_REF_DSETREG } "VarDataRef";

Presumably, DatumType tells us how to interpret the other fields, although it looks odd. What does the Var prefix stand for: the noun or adjective ‘variable,’ or something else (e.g., variance)?

The VarDataRef field is a so-called dataset region reference. Logically, think of it as a pair

dataset region reference "=" (dataset object reference, selection in referenced dataset)

A typical 2D use case (image) could be a subset of pixels or a bookmark for a feature in the image; a 1D example could be a subset of samples, e.g., peaks in a time series. The selection can be a combination of hyperslab selections, or a point selection (of any dimension).

For your low-level considerations, it is again important to understand that you are dealing with non-fixed-size data, and that means the dataset region reference (not necessarily the referenced data!) is stored in a global heap, where the current implementation puts all such data. It’s unclear what the VarDataSize and VarDataOffset fields represent. Perhaps, assuming the destination dataset is one-dimensional and the selection a simple contiguous range of samples, a dataset region reference boils down to an offset and a range (plus the path to the dataset).
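To make the logical picture concrete, here is a minimal C# sketch of that pair; the type and member names are illustrative (mine), not from the HDF5 specification or any library:

using System.Collections.Generic;

// Inclusive bounds of one hyperslab block in a 1D dataspace (illustrative).
public readonly record struct HyperslabBlock(ulong Start, ulong End);

// The logical view of a dataset region reference: which dataset is referenced,
// plus a selection (here, a list of blocks) within that dataset's dataspace.
public sealed record DatasetRegionReference(
    ulong ObjectHeaderAddress,
    IReadOnlyList<HyperslabBlock> Blocks);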

OK?
G.

Hi,

For some clarification, Protocol Data Units (PDUs) are used by Distributed Interactive Simulation (DIS). Basically, the organizations we support pass these along the network, and various tools process/display the information. I know, alphabet soup.

The “VarDataSize” and “VarDataOffset” members are probably of no use, since I processed another Dataset that has a Compound Bit Field Datatype with a Reference Bit Field as one of its members, and it does not have the Size/Offset information.

I have all the chunks for the Dataset, since I need to go through all the data and save it to a PostgreSQL database. The main problem I am experiencing is that I do not know where in the Dataset/Global Heap to access the data for the Reference Bit Field (Class 7 of the Datatype classes). The HDF5 specification documentation does not list any Properties for this datatype class that point to where the data is located, just that it references either an object or a dataset region.

The good news is that it looks like I am able to process the Datasets correctly except for the Reference Bit Fields. If you should come across where this undocumented information is, please forward it.

In the meantime, I will step through my debug build of h5dump and see if I can find out what it is doing.

Thanks for all your assistance and hope to hear back from you.

The storage of values of reference types is described in sections VIII.[A,B,C] of the specification. What’s the superblock version of your files? In the simplest case, the value of an attribute or dataset element of type H5T_STD_REF_DSETREG is just a tuple of an offset (object address) and the Global Heap ID of the heap where the encoded dataspace selection is stored. In later format specifications, object references can refer to objects in other files as well as to attributes, and the referent of a dataset region reference can be in another HDF5 file. In other words, the value of the VarDataRef field is just that: a tuple of two fixed-size components, where the second component refers to a global heap where the actual selection is stored.

Does that make sense? Maybe I’m not answering your question… Apologies.

G.

Hi,

I think the referenced sections are exactly what I was looking for. I am seeing in my processing that, although the Datatype explanation of the Reference Bit Field does not have any Properties, the sections you gave seem to lay out how to access the data. I also just checked, and I am passing 12 bytes to my routine that would normally extract the data like the other classes in the Datatype. I will code up the Reference Bit Field to parse the passed-in 12 bytes and see if I can extract the actual data from the parsed information.
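For illustration, a minimal C# sketch of such a parse, assuming the 12 bytes are a Global Heap ID as those sections describe, i.e., an 8-byte little-endian global heap collection address followed by a 4-byte object index (type and method names are mine):

using System;
using System.Buffers.Binary;

public readonly record struct GlobalHeapId(ulong CollectionAddress, uint ObjectIndex);

public static class RegionReferenceElement
{
    // Interpret a 12-byte H5T_STD_REF_DSETREG dataset element, assuming a
    // version 0 superblock with 8-byte, little-endian file offsets.
    public static GlobalHeapId Parse(ReadOnlySpan<byte> raw)
    {
        if (raw.Length < 12)
            throw new ArgumentException("expected 12 bytes", nameof(raw));
        ulong address = BinaryPrimitives.ReadUInt64LittleEndian(raw);   // global heap collection address
        uint index = BinaryPrimitives.ReadUInt32LittleEndian(raw[8..]); // object index within the collection
        return new GlobalHeapId(address, index);
    }
}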

If this works, the documentation should probably link to the sections you referenced, similar to how the explanation of Chunks in the Data Layout message links to the various indexing methods.

Thanks for all the help, at least now I have a better understanding about how data for Reference Bit Fields is addressed.

Will update you on the results. Thanks Again!

P.S.

Forgot, the Superblock is Version 0.

Hello Again,

Parsing the 12 bytes of data that I mentioned in the previous post, the address does reference a valid Global Heap with a valid index. The heap object at the referenced index contains 40 bytes of data.

The next question is how this should be parsed. It looks like there are possibly three pieces of data: the first is 8 bytes, the second is 16 bytes, and the third is 16 bytes.

I tried to create an Object Header and Datatype from the data, but incorrect objects were produced. It is almost as if the information is like the Variable Bit Field attribute case, where instead of the actual data there were 16 bytes split into three pieces that reference someplace else.

Any ideas on what the data in the Global Heap is pointing to, i.e., how to parse that block of 40 bytes the Global Heap ID resolves to?

As a final comment, the documentation seems incorrect for section VIII.C. For a reference to a dataset region, per what you mentioned and what I am seeing, the stored information is the Address/Index of a Global Heap ID, whereas the documentation’s Layout/Fields state that it is the address of the object being referenced and a Dataspace Selection.

Hope to hear from you about how the data referenced by the Global Heap should be parsed.

Thanks as always; it looks like I am almost there with processing the Reference Bit Field.

Hi,

One last post after clearing my head and somewhat understanding what is going on. The 12 bytes passed in do reference the Global Heap ID. After looking things over, the data in the Global Heap object looks to be the Object Address and the Dataspace Selection Information mentioned in the Layout for the Dataset Region Reference.
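For illustration, here is one hedged reading of that 40-byte heap object in C#: an 8-byte object header address followed by a serialized version 1 hyperslab selection. The field widths follow my reading of the format specification, only the rank 1, single-block case from this thread is handled, and all names are mine:

using System;
using System.Buffers.Binary;

public static class RegionReferenceHeapObject
{
    public static (ulong ObjectHeaderAddress, uint Start, uint End) Parse(ReadOnlySpan<byte> data)
    {
        if (data.Length < 40)
            throw new ArgumentException("expected 40 bytes", nameof(data));
        ulong ohdrAddress = BinaryPrimitives.ReadUInt64LittleEndian(data);       // referenced dataset's object header
        uint selectionType = BinaryPrimitives.ReadUInt32LittleEndian(data[8..]); // 2 = hyperslab
        uint version = BinaryPrimitives.ReadUInt32LittleEndian(data[12..]);      // 1
        // data[16..20]: reserved; data[20..24]: length of the selection info
        uint rank = BinaryPrimitives.ReadUInt32LittleEndian(data[24..]);         // 1 in this thread
        uint numBlocks = BinaryPrimitives.ReadUInt32LittleEndian(data[28..]);    // 1 in this thread
        uint start = BinaryPrimitives.ReadUInt32LittleEndian(data[32..]);        // starting offset, in elements
        uint end = BinaryPrimitives.ReadUInt32LittleEndian(data[36..]);          // ending offset, inclusive
        if (selectionType != 2 || version != 1 || rank != 1 || numBlocks != 1)
            throw new NotSupportedException("only the simple case from this thread is handled");
        return (ohdrAddress, start, end);
    }
}

With rank 1 and one block, the byte count works out: 8 + 6 * 4 + 2 * 4 = 40.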

As I go through the records, the Dataspace Selection Start/End offsets increment correctly (at least it appears that way). So I have the address of the Object being referenced and the starting offset within the object. Now comes the last bit of information that I am missing.

I have the starting address. I have the offset. I do not know what type of data is being pointed at. Any ideas where I can find this last piece of information?

As I stated in my last post, the documentation is somewhat off, in that I had to use the chunked data to get the Global Heap ID, then fetch the heap object, which provided the Object Address and Dataspace Selection Information. I am just glad I was able to figure out most of what is going on.

Thanks for steering me in the right direction. If you have any ideas about what type of data is actually saved at that location, so I can parse it, that would be much appreciated. Sorry for all the confusion I am introducing, and many thanks for all the assistance!

Glad to hear that you are making progress.

The starting address should be the OHDR address. The datatype information is in the object header of the referenced dataset, i.e., there should be a datatype message (0x0003) in that object header. (There are no dataset region references to objects of other types or attributes.)
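As a rough illustration (not production code), scanning a version 1 object header for that message might look like the following C#. It assumes the header bytes have already been read starting at the OHDR address, it ignores continuation messages (0x0010), and the layout follows my reading of the specification:

using System;
using System.Buffers.Binary;

public static class V1ObjectHeader
{
    // Return the raw body of the first datatype message (0x0003), or an
    // empty span if none is found among the messages stored inline.
    public static ReadOnlySpan<byte> FindDatatypeMessage(ReadOnlySpan<byte> header)
    {
        ushort messageCount = BinaryPrimitives.ReadUInt16LittleEndian(header[2..]); // after version + reserved bytes
        int pos = 16; // the 12-byte prefix is padded so messages start on an 8-byte boundary
        for (int i = 0; i < messageCount && pos + 8 <= header.Length; i++)
        {
            ushort type = BinaryPrimitives.ReadUInt16LittleEndian(header[pos..]);
            ushort size = BinaryPrimitives.ReadUInt16LittleEndian(header[(pos + 2)..]);
            // one flags byte and three reserved bytes precede the message body
            if (type == 0x0003)
                return header.Slice(pos + 8, size);
            pos += 8 + size; // message bodies are padded to a multiple of eight bytes
        }
        return default;
    }
}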

OK?

G.

Hi,

I tried to create an Object Header using the address plus the offset from the Dataspace Selection, and for the first item a Datatype of the Opaque class was found. Upon continuing, all the Object Headers were blank.

I am almost there (somewhat), but I think the problem lies in how I am using the Dataspace Selection. For all the items, the Dataspace Selection is a Version 1 Hyperslab with Rank 1 and Number of Blocks 1. What I am doing is taking the Object Address, adding the Starting Offset for Rank #1, Block #1, and using that address to try to parse an Object Header. I am obviously interpreting things incorrectly, since all the Object Headers generated are blank except the first one, which has a Dataspace (unexpected) and an Opaque Datatype, which does not look correct.

Does my logic for getting the address of the Object Header look correct or am I (as I believe) interpreting the Dataspace Selection incorrectly?

Looking forward to your next thoughts about this. Seems more difficult than expected. Thanks for the feedback.

You need to answer two kinds of questions (in that order).

Logical

  1. What is the element type (datatype) of the dataset referenced in the dataset region reference?
  2. What is the shape (dataspace) of the dataset?

Opaque is a datatype class. An instance of this class, a specific opaque datatype, is the set of byte sequences of a fixed length. Optionally, such a datatype is tagged with an ASCII (0-127) label, e.g., a MIME type. Assuming your datatype is in the opaque class, what is its size, and does it have a label?

There should be a dataspace message (0x0001). What is the dataset shape? Presumably 1D, but what is the extent (number of elements)?

Physical

How is the dataset laid out? There should be a data layout message (0x0008). What is the layout class?

The (hyperslab) selection is expressed in dataset elements (not in bytes). If your opaque type consists of 15-byte blocks, a dataset element is of that size, and we need to take that into account when translating hyperslab offsets, strides, counts, and blocks (all measured in elements) into the storage layout; see the sketch below. This is straightforward for compact and contiguous layouts. For a chunked layout, we must first identify the chunks (sized in elements) covered by the selection and then locate those chunks in the file via the dataset’s chunk index (typically a B-tree).
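As a toy illustration of the element-versus-byte point for a contiguous 1D layout, one inclusive block of selected elements maps to a single byte range (the names are mine, and elementSize would come from the datatype message):

public static class ContiguousLayout
{
    // Map an inclusive block of selected elements to a byte range.
    public static (long ByteOffset, long ByteCount) BlockToBytes(
        long startElement, long endElementInclusive, int elementSize) =>
        (startElement * elementSize, (endElementInclusive - startElement + 1) * elementSize);
}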

Do you have a small sample file you can share with us? Otherwise, this is a rather abstract discussion.

G.

Hi,

The dataset that has the references has a Compound Bit Field datatype whose last member is a Reference Bit Field. The dataset has a total of 8320 elements.

When I process the first record of the dataset and its first Reference Bit Field, it appears to access an Object Header that has a Dataspace of dimensionality 1 and size 2098086. The Datatype is an Opaque Bit Field, Version 1, Size 1, with ASCII Tag value “\u0005”. Those are the only two messages in the Object Header, i.e., there is no Data Layout message.

As I process further records to get the referenced Object Headers, they are empty, since the Object Address and Dataspace Selection data in the file point to null bytes.

The data that appears in the HDFView application for the Dataset enclosing the Compound Bit Field appears to be integers. Also, from my previous post, the “VarDataOffset” value that I mentioned seems to track with the Dataspace Selection Start Offset value for the records.

I have sent an email to query the classification of the document I am processing. Hopefully, I will be able to forward it to you. By the looks of things, from what I am seeing in my processing and what HDFView is displaying, the Dataspace Selection is being processed correctly. Thanks for your assistance.

Hi Again,

Made progress. I found out that the problem I was having was in processing the Opaque Datatype message. The Size was one byte, and the ASCII Tag was zero length. I was processing the ASCII Tag as a null-terminated string with null padding to the eight-byte boundary; that is why the ASCII Tag looked like “\u0005”. That value was actually the start of a Fill Value message. The documentation should probably describe the ASCII Tag as an optional property.

Given that, I have five messages: Dataspace, Datatype, Fill Value, Data Layout, and Modification Time.

I am now able to pull all the data for the associated chunks. Now come the further questions. Given that the Size of the Opaque Datatype is one byte, how am I supposed to interpret that byte, i.e., is it a character, an unsigned byte, or a signed byte? What if it has a size greater than one byte: is it an integer, floating point, string, etc.? Also, what is the ASCII Tag used for? Given that the ASCII Tag for my Opaque Datatype is blank, does that have some significance?

Finally, I have some questions as far as the Hyperslab and associated chunks go. As I stated above, I have all the chunk data loaded sequentially, one chunk after another. Given the progressive starting offset values for each Reference, I am seeing the same value in those offsets (24, the same value as the first record). Any recommendations on how to apply the Dataspace Selection starting offset, given that I have all the data loaded sequentially one chunk after the next, or should the data be loaded in a different way? These are chunks of rank 1, BTW.

Thanks for your input; it continuously gets me thinking and closer to working out the access. At least I now have the data (what looks like the correct data) and just need to apply the selection starting index correctly. Thanks for the assistance.

It’s impossible to tell what the data producer had in mind. Opaque, in this case, means just bytes. Bytes don’t have signs. (We are not talking about an integer type.) A common use case could be a byte stream coming from some measurement or recording device. If it were a block device, the block size (> 1) might be a good size for the opaque type. The tag is intended to leave a clue, such as a MIME type, but that’s the data producer’s choice; leaving it blank is a lost opportunity to be helpful.

I’m not sure I’m following, so let’s try an example.

Assume we have a dataset of an opaque datatype of size one, and the 1D dataspace has an extent of 100 elements (“a byte array of 100 elements”). Let’s also assume that each chunk has 15 elements. That means our dataset is stored across 7 chunks, where the last chunk contains only 10 elements. (The full chunk is allocated in the file, but the library ensures that only data “in range” is accessed.)

Consider a hyperslab selection that covers blocks of 3 elements adjacent to every 20th element. In hyperslab notation (start, stride, count, block), this could be represented as (20, 20, 4, 3). The elements selected would be at positions (20,21,22,40,41,42,60,61,62,80,81,82). Only 4 of the 7 chunks contain relevant data, the second (15-29), third (30-44), fifth (60-74), and sixth (75-89) chunks. The offsets within the chunks vary: 5, 10, 0, 5, respectively, i.e., we would read three elements at offset 5 on the second chunk, three elements at offset 10 on the third chunk, etc.
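This arithmetic is easy to check mechanically; here is a small, runnable C# sketch of the same example, with the chunk size and hyperslab parameters taken from the text:

using System;
using System.Collections.Generic;
using System.Linq;

class RegularHyperslabDemo
{
    static void Main()
    {
        const int chunkSize = 15;
        var positions = new List<long>();
        for (long i = 0; i < 4; i++)            // count
            for (long j = 0; j < 3; j++)        // block
                positions.Add(20 + i * 20 + j); // start + i * stride + j

        foreach (var g in positions.GroupBy(p => p / chunkSize))
            Console.WriteLine($"chunk {g.Key + 1}: local offsets {string.Join(",", g.Select(p => p % chunkSize))}");
        // chunk 2: local offsets 5,6,7
        // chunk 3: local offsets 10,11,12
        // chunk 5: local offsets 0,1,2
        // chunk 6: local offsets 5,6,7
    }
}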

Does that make sense?
G.

Hi,

Thanks for the clarification of the Opaque type. I am only guessing that the one byte should be treated as an unsigned integer, since that is how HDFView presents the data. All the values fit within the -128 through 127 range. Given your explanation of how the data producer could have left a clue on how to interpret the data, I am wondering if HDFView made an arbitrary choice to display the byte in that fashion.

Your example for the Hyperslab is almost exactly what I have. The Dataspace is 2098086 elements, and the V1 B-tree contains three internal nodes, where the first two nodes’ children contain 57 chunks and the third’s contain 15 chunks.

Each of the Reference Bit Fields has a Dataspace Selection that is a Version 1 Hyperslab Selection, which looks like the Version 3 Irregular Selection. If you can provide an example with an Irregular Hyperslab Selection, it would be much appreciated. All I really have of use is the Rank, Number of Blocks, and Start/End Offsets for the Rank/Number of Blocks combination.

Since I am seeing a Rank of 1 and a Number of Blocks of 1, I have been applying the single Start Offset value to the chunked data. By the looks of things, that is incorrect.

Thanks for your time on this and for the Hyperslab example; the Regular Hyperslab one looks reasonable. If I can get an Irregular example, that would be great. Thanks again!

Blocks are not chunks (except by some weird coincidence). Blocks are logical and independent of the chunked layout. In other words, you can re-chunk a dataset, and all the dataset region references remain intact. That’s because the selections are expressed against the dataspace, not the chunk layout.

The offsets, etc., encoded in the hyperslab selection apply to the dataspace, not the chunk. That’s why you must calculate 1) which chunks overlap the hyperslabs (blocks) and 2) how the global offsets translate to local offsets.

Let’s stick with our previous example and assume my irregular hyperslab has three blocks: [12,22], [36,56], and [90,91]. The chunks overlapping that selection are the first, second, third, fourth, and seventh. The offset in the first chunk would be 12, 0 in the second, 6 in the third, 0 in the fourth, and 0 in the seventh. Note that we would read only parts of the respective chunks, i.e., the last three elements of the first chunk, the eight leading elements of the second chunk, etc.
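The same chunk arithmetic, as a runnable C# sketch for the irregular case; it computes the overlapping chunks and chunk-local offsets for the three inclusive blocks above:

using System;

class IrregularHyperslabDemo
{
    static void Main()
    {
        const int chunkSize = 15;
        var blocks = new (long Start, long End)[] { (12, 22), (36, 56), (90, 91) };
        foreach (var (start, end) in blocks)
            for (long c = start / chunkSize; c <= end / chunkSize; c++)
            {
                long first = Math.Max(start, c * chunkSize);              // first selected element in this chunk
                long last = Math.Min(end, c * chunkSize + chunkSize - 1); // last selected element in this chunk
                Console.WriteLine($"chunk {c + 1}: local offset {first - c * chunkSize}, {last - first + 1} element(s)");
            }
        // chunk 1: local offset 12, 3 element(s)
        // chunk 2: local offset 0, 8 element(s)
        // chunk 3: local offset 6, 9 element(s)
        // chunk 4: local offset 0, 12 element(s)
        // chunk 7: local offset 0, 2 element(s)
    }
}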

OK?
G.

Hi Again,

Thanks for the clarification of Blocks vs Chunks. I am still rather lost on how to apply the Dataspace Selection Starting Offset to the data I have.

Some clarification on what I am currently doing. As I process the HDF5 file, when I come across a V1 B-tree containing data, I go through and process each chunk (that is what the Data Layout is using); using the address and Size/Length contained in each chunk record, I copy that data into an array of bytes. I then process that array of bytes using the Datatype specific to the Dataset until the size of the buffer is reached and no other elements can be parsed.

In my current example, the Datatype is a Compound Datatype that happens to have a Reference Datatype as one of its members. In the end, for each chunk, I end up with a list of items, where each item is a parsed Compound Datatype value.

As I am processing, I combine the lists generated by each chunk, so I end up with a contiguous list of Datatype elements that represents the Dataspace.

Everything works fine, and all the Datasets are generated except for the Reference elements. Those go through the same process via the referenced Object Header, and I end up with a list of single-byte elements, since the Datatype being used is the Opaque type and is one byte in size.

The chunks are contiguous, i.e., each chunk’s offset follows the previous chunk’s, across all child nodes that contain chunks.

Currently, I have not had to worry about indexing since all the data in the file is being sent to a PostgreSQL database.

My problem is getting the correct byte value within the Dataspace, given that the chunks for the referenced data are parsed into a list of the Opaque type, which is just a single byte.

Given this example of a list of single-byte values that represents the Datatype values in the Dataset, how would you use the single Starting Offset to get the single byte? All I can think of is that I am not arranging the parsed data correctly when going between the separate chunks.

Since I am just referencing one byte each time a Reference type is used in the Compound Type elements, can I get a quick example of this situation?

Forgot to mention that everything has a Dimensionality of one. Thanks as always.