Group Overhead - File size problem for hierarchical data


#1

We want to use HDF5 as the file format for a scientific device that records data. Besides the data, we want to store all the necessary metadata in the file, including program configurations. Our idea was to serialize our object tree, with every object storing its members as subgroups. Elementary members like ints or doubles would also be groups, with their value stored as an attribute inside that group.

We found that a group has quite a lot of overhead (was it 2 or 20 KB?). As our object tree has several thousand members, this is problematic: a file that should be 5 MB gets blown up to 120 MB. Is there a way to reduce this overhead? Would compression via HDF5 help here?
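For illustration, here is a minimal h5py sketch of the layout described (group and member names are made up): each elementary member becomes its own group, with the value stored as an attribute inside it, so the per-group overhead is paid once for every scalar.

```python
import h5py

# Hypothetical sketch of the layout described above: every member of the
# object tree becomes a group, and scalar values live as attributes.
with h5py.File("tree.h5", "w") as f:
    cfg = f.create_group("config")
    for name, value in {"gain": 1.5, "channels": 16}.items():
        g = cfg.create_group(name)   # one group per elementary member
        g.attrs["value"] = value     # the value itself is an attribute

with h5py.File("tree.h5", "r") as f:
    print(f["config/gain"].attrs["value"])
```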


#2

Group entry size

I found the average size to be about 800 bytes per entry. The following program generates N = 400'000 random strings and accumulates the total space they need. It then creates N groups, measures the disk space taken up, and calculates the difference. Eyeballing it, I got 700–800 bytes per entry; did I miss something? The project may be downloaded from this GitHub page

#include "include/h5cpp/all"

#include <string>
#include <iostream>
#include <filesystem>

int main(int argc, char **argv) {
    std::string path = "groups.h5";
    size_t N = 400'000, strings_size=0;
    auto names = h5::utils::get_test_data<std::string>(N);
    for(const auto& a : names) strings_size += a.size() * sizeof(std::string::value_type);

    { // code block will enforce RAII, as we need file closed to measure size
    h5::fd_t fd = h5::create(path, H5F_ACC_TRUNC);

    h5::gr_t root{H5Gopen(fd, "/", H5P_DEFAULT)}; // using H5CPP RAII
    for(size_t n=0; n < N; n++)
        h5::gr_t{H5Gcreate(root, names[n].data(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT)}; 
    } // HDF5 file is closed, grab size
    namespace fs = std::filesystem;
    size_t file_size = fs::file_size(
        fs::current_path() / fs::path(path));
    std::cout << file_size <<"-"<< strings_size <<"=" << (file_size - strings_size) / 1'000'000 << "MB\n\n";
    std::cout << "avg: " << (file_size - strings_size) / N <<"bytes/group entry\n\n";
    return 0;
}

results:

g++ -I./include -o group-test.o   -std=c++17 -DFMT_HEADER_ONLY -c group-test.cpp
g++ group-test.o -lhdf5  -lz -ldl -lm  -o group-test
./group-test
79466744-1748345=77MB

avg: 777bytes/group entry

#3

I like @steven's example. There's also a dependence on the library version + file format spec:

import h5py, os

def doit(fpath, libver):
    print('FMT: %s'%libver)
    with h5py.File(fpath,'w',libver=libver) as f:
        f.flush()
        print('hdf5 file size: %d bytes'%os.path.getsize(fpath))
        f.create_group("/0/1/2/3/4/5/6/7/8/9")
        f.flush()
        print('hdf5 file size: %d bytes'%os.path.getsize(fpath))

doit('earliest.h5', 'earliest')
doit('latest.h5', 'latest')

yields

FMT: earliest
hdf5 file size: 800 bytes
hdf5 file size: 11120 bytes
FMT: latest
hdf5 file size: 195 bytes
hdf5 file size: 1665 bytes
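Scaling that comparison up a bit (a sketch of my own; group count and names are arbitrary), we can estimate the per-group cost of many flat groups under each format spec:

```python
import h5py, os

def per_group_bytes(fpath, libver, n=1000):
    # Create n empty flat groups, then estimate the average cost per group
    # from the final file size.
    with h5py.File(fpath, "w", libver=libver) as f:
        for i in range(n):
            f.create_group("g%06d" % i)
    return os.path.getsize(fpath) / n

earliest = per_group_bytes("flat-earliest.h5", "earliest")
latest = per_group_bytes("flat-latest.h5", "latest")
print(earliest, latest)  # 'latest' should come out well below 'earliest'
```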

I’m not sure where your overhead is coming from/how you measured that.

G.


#4

Maybe not relevant to the poster’s question, but I was curious to see how the storage size increased with the number of groups in HSDS.

Here’s a python program that creates an HDF5 file or HSDS domain with lots of empty groups:

import h5pyd
import h5py
import sys

if len(sys.argv) < 3 or sys.argv[1] in ('-h', '--help'):
    print("usage: python make_lots_o_groups.py filepath cnt")
    sys.exit(1)

file_path = sys.argv[1]

group_count = int(sys.argv[2])

print("group_count:", group_count)

if file_path.startswith("hdf5://"):
    f = h5pyd.File(file_path, 'w')
else:
    f = h5py.File(file_path, 'w')

for i in range(group_count):
    name = f"grp_{i:08d}"
    f.create_group(name)
f.close()

Runtime with HSDS is about 18x slower than with HDF5. That’s overhead of all those out-of-process calls!

Anyway, with HSDS the storage size comes out to ~320 bytes/group. That's not unexpected, considering an empty group's JSON will look like this (note: HSDS stores metadata as JSON objects):

{"id": "g-c3fc44a1-77d0841c-4b74-cc29ff-580c94", "root": "g-c3fc44a1-77d0841c-4b74-cc29ff-580c94", "created": 1649693958.0479171, "lastModified": 1649693958.0479171, "links": {}, "attributes": {}}

196 bytes per group.
We also need to store the link to the group which would be something like:

{"grp_00000000": {"class": "H5L_TYPE_HARD", "id": "g-b65ac838-41fed07f-0748-d69cff-4ad588", "created": 1649692590.3088968}}

123 bytes per link.
Note that HSDS always stores timestamps with the objects (compare with the HDF5 library, where I think timestamps are not stored by default).

There are some tricks we could do to reduce the storage size (e.g. storing the metadata compressed), but I suspect for most users 99.9% of the data will be chunk data, not metadata, so this wouldn't be too useful.
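For a rough sense of what that compression trick could buy, here's a sketch (my own, not HSDS code) that gzip-compresses a batch of group-metadata records shaped like the JSON shown above; the ids and timestamps are made up, and the repetitive keys are what compress well:

```python
import json, time, uuid, zlib

# Build a batch of group-metadata records shaped like the HSDS example above
# (ids are random placeholders); the repeated key structure compresses well.
records = []
for _ in range(1000):
    gid = "g-" + uuid.uuid4().hex
    records.append({"id": gid, "root": gid, "created": time.time(),
                    "lastModified": time.time(), "links": {}, "attributes": {}})

raw = json.dumps(records).encode()
packed = zlib.compress(raw, 9)
print(len(raw), "->", len(packed))  # compressed size is a fraction of the raw size
```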


#5

It sounds like you might be over-structuring. I don't understand what you are doing with the subgroups. I suggest not using subgroups for the sole purpose of storing metadata. Simple metadata is most efficient as attributes attached directly to the dataset or parent group it is associated with. For example, in pseudocode, where ":" means an attached attribute:

dataset pressure (ntimes)

pressure:instrument = “Klystron 9000 #B00731”
pressure:units = “kg m^-1 s^-2”
pressure:bias = 0.034
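The pseudocode above might look like this with h5py (dataset name and attribute values taken from the example; the dataset length and file name are arbitrary):

```python
import h5py
import numpy as np

ntimes = 100  # arbitrary length for the sketch
with h5py.File("pressure.h5", "w") as f:
    dset = f.create_dataset("pressure", data=np.zeros(ntimes))
    # simple metadata attached directly as attributes, no subgroups
    dset.attrs["instrument"] = "Klystron 9000 #B00731"
    dset.attrs["units"] = "kg m^-1 s^-2"
    dset.attrs["bias"] = 0.034
```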