Maybe not relevant to the poster’s question, but I was curious to see how the storage size increased with the number of groups in HSDS.
Here’s a python program that creates an HDF5 file or HSDS domain with lots of empty groups:
import sys

import h5py
import h5pyd

if len(sys.argv) < 3 or sys.argv[1] in ('-h', '--help'):
    print("usage: python make_lots_o_groups.py filepath cnt")
    sys.exit(1)

file_path = sys.argv[1]
group_count = int(sys.argv[2])
print("group_count:", group_count)

# hdf5:// paths are HSDS domains; anything else is a regular HDF5 file
if file_path.startswith("hdf5://"):
    f = h5pyd.File(file_path, 'w')
else:
    f = h5py.File(file_path, 'w')

for i in range(group_count):
    name = f"grp_{i:08d}"
    f.create_group(name)
f.close()
Runtime with HSDS is about 18x slower than with HDF5. That's the overhead of all those out-of-process calls!
Anyway, with HSDS the storage size comes out to ~320 bytes/group. That's not unexpected, considering that the JSON for an empty group looks like this (note: HSDS stores metadata as JSON objects):
{"id": "g-c3fc44a1-77d0841c-4b74-cc29ff-580c94", "root": "g-c3fc44a1-77d0841c-4b74-cc29ff-580c94", "created": 1649693958.0479171, "lastModified": 1649693958.0479171, "links": {}, "attributes": {}}
196 bytes per group.
We also need to store the link to the group which would be something like:
{"grp_00000000": {"class": "H5L_TYPE_HARD", "id": "g-b65ac838-41fed07f-0748-d69cff-4ad588", "created": 1649692590.3088968}}
123 bytes per link.
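As a sanity check, the two byte counts (and the ~320 bytes/group total) can be reproduced with the standard library's json.dumps, using the example IDs and timestamps shown above:

```python
import json

# Example group object, as shown above (IDs/timestamps copied from the post)
group_json = {
    "id": "g-c3fc44a1-77d0841c-4b74-cc29ff-580c94",
    "root": "g-c3fc44a1-77d0841c-4b74-cc29ff-580c94",
    "created": 1649693958.0479171,
    "lastModified": 1649693958.0479171,
    "links": {},
    "attributes": {},
}

# Example hard link entry in the parent group's links collection
link_json = {
    "grp_00000000": {
        "class": "H5L_TYPE_HARD",
        "id": "g-b65ac838-41fed07f-0748-d69cff-4ad588",
        "created": 1649692590.3088968,
    }
}

group_bytes = len(json.dumps(group_json))  # 196
link_bytes = len(json.dumps(link_json))    # 123
print(group_bytes + link_bytes)            # 319, i.e. ~320 bytes/group
```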
Note that HSDS always stores timestamps with the objects (compare with the HDF5 library, where I think timestamps are not stored by default).
There are some tricks we could do to reduce the storage size (e.g. storing the objects compressed), but I suspect for most users 99.9% of the data will be chunk data rather than metadata, so this wouldn't be too useful.