Writing Variable Length Data To Compound Dataset

Hello,

I am writing a program that converts Google Protocol Buffer messages into the HDF5 file format via the C++ interface, and I am hitting an issue when attempting to copy strings and other variable-length fields.

I have a compound type compType that I add a member to as follows:

// Note: I am calculating the offset by hand since I do not have the necessary structures at compile time
H5::DataType v = H5::StrType(H5::PredType::C_S1, H5T_VARIABLE);
compType.insertMember(fieldName, currentOffset, v);
currentOffset += sizeof(char *);

I write the structured data to a buffer containing the rest of my compound type as follows:

char * value = calloc(1, src.length()+1);
memcpy(value, src, src.length());
buffer.replace(offset, sizeof(char *), &value, sizeof(char *));

Then I write the buffers containing my compound data to a dataset:

dataSet.write(buffer, compType);

This process works for all of the fixed-length data types I want to support, but for strings and other variable-length types I get either bad data or segfaults on write, no matter what approach I use. What is the correct approach for something like this: which type should I use, and what goes in the binary buffer?

NOTE: please don’t suggest HDFql; it’s not available to me.

EDIT: the approach above segfaults inside some vlen conversion function that I don’t have debug symbols for. Another approach I have tried is as follows:

Create the compound type

// Note: I am calculating the offset by hand since I do not have the necessary structures at compile time
H5::VarLenType v = H5::VarLenType(H5::PredType::C_S1);
compType.insertMember(fieldName, currentOffset, v);
currentOffset += sizeof(hvl_t);

Copy the data from std::string

hvl_t vlInfo;
vlInfo.len = value.length() + 1;
vlInfo.p = calloc(1, value.length() + 1);
memcpy(vlInfo.p, value.c_str(), value.length());

buffer.replace(offset, sizeof(hvl_t), (char *)&vlInfo, sizeof(hvl_t));

Write to file

dataSet.write(buffer, compType);

This approach does not segfault, but all of my strings show up as ERROR in HDFView.

Hi Nicholas,

I don’t think you want sizeof(char *), that’s only the size of the pointer.
Also, the last argument of insertMember is the datatype of the new member.

Hi,

It was my understanding that for variable-length types you use either a pointer or an hvl_t item. Regardless, I don’t find this answer all that helpful, since I already know that what I was doing was wrong.

I was really hoping the C++ lib has the ability to put variable-length types inside a compound type, and that someone could give advice on how to do that, not just tell me that the code I said isn’t working doesn’t work.
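For reference, the two in-memory representations mentioned above take up different amounts of space in each record: a variable-length string is held as a char *, while a generic variable-length sequence is held as an hvl_t (a length plus a pointer). The following is a minimal sketch of computing packed offsets at runtime under that assumption. It does not use the HDF5 headers: VlenSlot is a stand-in that mirrors hvl_t, and FieldKind, Layout, slotSize and computeLayout are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in mirroring HDF5's hvl_t (a length plus a data pointer). The real
// definition comes from the HDF5 headers; it is used here only for sizeof.
struct VlenSlot {
    std::size_t len;
    void *p;
};

// Hypothetical field kinds that a schema known only at runtime might contain.
enum class FieldKind { UInt32, VlenString, VlenSequence };

// Number of bytes each kind occupies inside one in-memory record.
std::size_t slotSize(FieldKind kind) {
    switch (kind) {
        case FieldKind::UInt32:       return sizeof(std::uint32_t);
        case FieldKind::VlenString:   return sizeof(char *);    // pointer to NUL-terminated text
        case FieldKind::VlenSequence: return sizeof(VlenSlot);  // length + pointer
    }
    assert(false && "unknown field kind");
    return 0;
}

struct Layout {
    std::vector<std::size_t> offsets;  // byte offset of each field
    std::size_t recordSize = 0;        // total packed size of one record
};

// Packed layout (no padding), like the hand-computed offsets in this thread.
Layout computeLayout(const std::vector<FieldKind> &fields) {
    Layout layout;
    for (FieldKind kind : fields) {
        layout.offsets.push_back(layout.recordSize);
        layout.recordSize += slotSize(kind);
    }
    return layout;
}
```

Calling computeLayout with { UInt32, VlenString, UInt32 } gives the same offsets as the hand-written sizeof arithmetic for a record of that shape. The resulting offsets and recordSize would then be passed to insertMember and to the compound type's size.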

I understand that not having a minimal example to work off of can make this process more difficult so I’ve cobbled together a basic example of what I’m trying to do:

#include <H5Cpp.h>
#include <vector>
#include <string>

// NOTE: THIS CODE SEGFAULTS SO LONG AS THIS IS TRUE
bool recreateIssue = true;

constexpr uint32_t dataspaceRank = 1;
constexpr uint32_t datapointCount = 10;
constexpr hsize_t dataspaceDims[] = { 10 };

std::vector<std::string> stringData = 
{
    "1",
    "22",
    "333",
    "4444",
    "55555",
    "666666",
    "7777777",
    "88888888",
    "999999999",
    "0000000000"
};

std::string empty = "";

std::vector<uint32_t> uintData = 
{
    1,2,3,4,5,6,7,8,9,0
};

int main()
{
    // Open file
    H5::H5File file("test.hdf", H5F_ACC_TRUNC);

    // Create compound type
    // PLEASE NOTE: I am not using HOFFSET since in my real program I don't know the structure of the data until runtime
    H5::CompType compType( (size_t)100 );

    compType.insertMember("uint1", 0, H5::PredType::NATIVE_UINT32);
    compType.insertMember("str", sizeof(uint32_t), H5::StrType(0, H5T_VARIABLE));
    compType.insertMember("uint2", sizeof(uint32_t) + sizeof(char *), H5::PredType::NATIVE_UINT32);
    compType.setSize(sizeof(uint32_t) + sizeof(char *) + sizeof(uint32_t));

    // Create dataset/space
    H5::DataSpace dataSpace(dataspaceRank, dataspaceDims);

    H5::DataSet dataSet = file.createDataSet("test", compType, dataSpace);

    // Organize data into buffer

    std::string databuffer(compType.getSize() * datapointCount, ' ');

    uint32_t offset = 0;
    for(uint32_t i = 0; i < datapointCount; i++)
    {
        databuffer.replace(offset, sizeof(uint32_t), (char *)&uintData[i], sizeof(uint32_t));
        offset += sizeof(uint32_t);

        if(recreateIssue)
            databuffer.replace(offset, sizeof(char *), stringData[i].c_str(), sizeof(char *));
        else
            databuffer.replace(offset, sizeof(char *), empty.c_str(), sizeof(char *));
        
        offset += sizeof(char *);

        databuffer.replace(offset, sizeof(uint32_t), (char *)&uintData[i], sizeof(uint32_t));
        offset += sizeof(uint32_t);
    }

    // Write buffer to file

    dataSet.write(databuffer, compType);

    file.close();

    return 0;
}

This was the only approach I could find online that people said worked, but it segfaults as long as the strings actually have data in them. I’ve also attached a copy of the backtrace for additional information.

#0  0x00007ffff699ce57 in __strlen_avx2 () from /lib64/libc.so.6
#1  0x00007ffff7a85812 in H5T__conv_vlen () from /lib64/libhdf5.so.103
#2  0x00007ffff7a7a29b in H5T_convert () from /lib64/libhdf5.so.103
#3  0x00007ffff7a840df in H5T__conv_struct_opt () from /lib64/libhdf5.so.103
#4  0x00007ffff7a7a29b in H5T_convert () from /lib64/libhdf5.so.103
#5  0x00007ffff790e931 in H5D__scatgath_write () from /lib64/libhdf5.so.103
#6  0x00007ffff78f6426 in H5D__contig_write () from /lib64/libhdf5.so.103
#7  0x00007ffff790a260 in H5D__write () from /lib64/libhdf5.so.103
#8  0x00007ffff790a9aa in H5Dwrite () from /lib64/libhdf5.so.103
#9  0x00007ffff7621673 in H5::DataSet::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, H5::DataType const&, H5::DataSpace const&, H5::DataSpace const&, H5::DSetMemXferPropList const&) const () from /lib64/libhdf5_cpp.so.103
#10 0x0000000000402bc4 in main () at /home/nicholas.desmarais/Documents/hdf_example/src/main.cpp:75

Not a C++ expert, but from my reading of std::string::replace, it looks like the line

databuffer.replace(offset, sizeof(char *), stringData[i].c_str(), sizeof(char *));

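will copy the first sizeof(char *) bytes of the string's characters into the compound data buffer, when what it needs to copy is the address of the string data. That would be consistent with the strlen frame at the top of the backtrace, since the library would then follow those character bytes as if they were a pointer. Maybe something like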

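// c_str() returns const char *, so the temporary must be const as well
const char *tmp_c_str = stringData[i].c_str();
// Copy the bytes of the pointer itself, not the characters it points to
databuffer.replace(offset, sizeof(char *), (const char *)&tmp_c_str, sizeof(char *));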

would work?
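The difference between the two replace calls can be checked without HDF5 at all. The sketch below is only an illustration of the buffer mechanics, not a tested fix for the program above; storeCharacters, storeAddress and loadAddress are hypothetical helper names. It packs a slot both ways and reads it back the way a consumer expecting a char * would.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Buggy variant (mirrors the line quoted above): copies the first
// sizeof(char *) characters of the text into the slot.
void storeCharacters(std::string &buffer, std::size_t offset, const std::string &text) {
    assert(text.size() >= sizeof(char *));  // keep this demo from reading past the text
    buffer.replace(offset, sizeof(char *), text.c_str(), sizeof(char *));
}

// Intended variant: copies the bytes of the pointer itself into the slot.
void storeAddress(std::string &buffer, std::size_t offset, const char *text) {
    buffer.replace(offset, sizeof(char *),
                   reinterpret_cast<const char *>(&text), sizeof(char *));
}

// Reads a slot back as a pointer, which is how the slot is interpreted
// when the member type is a variable-length string.
const char *loadAddress(const std::string &buffer, std::size_t offset) {
    const char *p = nullptr;
    std::memcpy(&p, buffer.data() + offset, sizeof(p));
    return p;
}
```

After storeAddress, loadAddress returns the original c_str() pointer. After storeCharacters, it returns whatever address the text bytes happen to spell, which is not a valid pointer to follow. Note that the buffer only stores addresses, so the strings being pointed to have to stay alive and unmodified until the write call has returned.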