Flat vs Nested Best Practices


#1

I’m experimenting with HDF5 / h5py and looking for advice on how deep to nest my data.

My typical lab workflow is to collect data in a session, during which I’ll test a device in several configs. Typically I’ll repeat measurements for each config. If I use this natural division of data, I end up with very deep nesting:

/session/device/config/repetition/meas_type

This seems onerous to traverse, and also makes me wonder which level I’d attach attrs to (always at the end, or a mixture of levels?)

Any advice on this stuff? Flatter is better? Nested is better? I’m leaning towards some encoding, like:

/session/device-config/repetition/meas_type

…and always attaching attrs to the device-config level.
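For concreteness, here is a minimal h5py sketch of that flattened layout. The helper name write_measurement, the group and dataset names, and the bias_v attribute are all made up for illustration. The file is opened in memory (driver="core", backing_store=False), so nothing is written to disk.

```python
import h5py
import numpy as np


def write_measurement(f, session, device, config, repetition, meas_type, data, **config_attrs):
    """Store one dataset at /session/device-config/repNNN/meas_type.

    Hypothetical helper: all attrs are attached to the device-config group.
    """
    # require_group creates the group (and missing parents) or returns the existing one
    dc = f.require_group(f"{session}/{device}-{config}")
    dc.attrs["device"] = device
    dc.attrs["config"] = config
    for key, value in config_attrs.items():
        dc.attrs[key] = value
    rep = dc.require_group(f"rep{repetition:03d}")
    return rep.create_dataset(meas_type, data=data)


# In-memory file; names and values below are placeholders
with h5py.File("flat_sketch.h5", "w", driver="core", backing_store=False) as f:
    for rep in range(2):
        write_measurement(f, "session01", "devA", "cfg1", rep, "iv_curve",
                          np.arange(5.0), bias_v=1.5)
    paths = []
    f.visit(paths.append)  # collect every group/dataset path in the file
    dc_attrs = dict(f["session01/devA-cfg1"].attrs)
```

One thing this makes visible: the device and config names end up encoded both in the group name and in the attrs, so a reader never has to parse "devA-cfg1" to recover them.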

Any tips appreciated!

–John Brodie


#2

“looking for advice on how deep to nest my data” …

Here is some generic advice that is not specific to HDF5. For data collection in general, I recommend many small separate files, rather than a complex file structure with deep nesting. This lets you rely on well-established system tools for file integrity, backup, cataloging, and performance. Design a file naming convention with appropriate hierarchical identification. If downstream applications or archiving require it, you can always aggregate the collection files into larger units at a later time. On the other hand, remember that utilities like tar and zip are excellent aggregators, in their own way.

I suggest, at minimum, a separate file for each device in each session. Going all the way to a separate file for each repetition may or may not be overkill.
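As a rough sketch of what such a naming convention could look like, the snippet below builds one file path per device per session and parses the name back into its parts. The "__" separator, the directory layout, the .h5 extension and the function names are assumptions chosen for illustration, not an established standard.

```python
from pathlib import Path

SEP = "__"  # assumed separator; pick one that never occurs in session or device names


def session_device_file(root, session, device, ext=".h5"):
    """Return <root>/<session>/<session>__<device><ext> (hypothetical convention)."""
    return Path(root) / session / f"{session}{SEP}{device}{ext}"


def parse_file_name(path):
    """Recover the session and device identifiers from a file name."""
    session, device = Path(path).stem.split(SEP)
    return {"session": session, "device": device}


example = session_device_file("data", "2024-01-15", "devA")
parsed = parse_file_name(example)
```

Repeating the session in the file name is deliberate: the file stays identifiable even if it is copied out of its session directory.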

In general, I suggest attributes at multiple levels, each on the level where it applies collectively. Put device-related attributes at the device level, config-related attributes at the config level, and so on. Try to standardize names of important attributes at each level, to aid future aggregation.
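If you do stay with HDF5, a small sketch of this in h5py might look like the following. Attributes are written at the level they describe, and a hypothetical helper, collect_attrs, walks from a dataset up to the root and merges them so that deeper levels override shallower ones. All group names, attribute names and values are placeholders, and the file lives only in memory.

```python
import h5py
import numpy as np


def collect_attrs(obj):
    """Merge attrs from the root down to obj; deeper levels win on name clashes."""
    chain = []
    node = obj
    while True:
        chain.append(node)
        if node.name == "/":  # reached the root group
            break
        node = node.parent
    merged = {}
    for node in reversed(chain):  # root first, obj last
        for key, value in node.attrs.items():
            merged[key] = value
    return merged


# In-memory file; nothing is written to disk
with h5py.File("attrs_sketch.h5", "w", driver="core", backing_store=False) as f:
    session = f.create_group("session01")
    session.attrs["operator"] = "jb"            # session-level attribute
    device = session.create_group("devA")
    device.attrs["device_serial"] = "SN-0001"   # device-level attribute
    config = device.create_group("cfg1")
    config.attrs["bias_v"] = 1.5                # config-level attribute
    dset = config.create_group("rep000").create_dataset("iv_curve", data=np.zeros(4))
    dset.attrs["units"] = "A"                   # dataset-level attribute
    merged = collect_attrs(dset)
```

With standardized attribute names at each level, a merge like this gives every dataset a complete, flat description, which should make later aggregation easier.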


#3

Thanks, this sounds like good advice. I’ll give it a shot.