Working with split hdf5 files

Paul_Anton_Letnes · November 8, 2016, 8:42pm

Hi!

First, thanks for creating hdf5, which is incredibly helpful for so many people!

I'm currently working on hyperspectral images. We've got a camera that writes one frame at a time into a rank-3 hdf5 dataset; the slowest varying index of the dataset is the frame number. To avoid corrupt files, we currently split each recording (think something along a video recording) into separate hdf5 files, approx. 1 GB in size (configurable). Working with the split files/dataset is doable, but less elegant than putting one big dataset into one big file, obviously.

- Is there a way to ensure the integrity of partial recordings (against power loss, software crashes, you name it) without splitting them into all these small files?
- Is there a way to create a "master file" that uses symbolic/external links to "link together" all the datasets (one per file) into something that looks like a dataset from a hdf5 user (h5py, matlab, ...)? I've noticed the "drivers" [1] that talk about split files, but I'm uncertain whether each sub-file is a valid hdf5 file? H5FD_MULTI superficially looks like what we need.
- Can virtual datasets [2] be used from older (1.8.x) clients? Will it work for this purpose?
- Or are we missing some great idea or feature in HDF5?

Cheers,
Paul

[1] <https://support.hdfgroup.org/HDF5/Tutor/filedrvr.html#predef>
[2] <https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesVirtualDatasetDocs.html>

miller86 · November 8, 2016, 9:08pm

Hi!

First, thanks for creating hdf5, which is incredibly helpful for so many people!

I'm currently working on hyperspectral images. We've got a camera that writes one frame at a time into a rank-3 hdf5 dataset; the slowest varying index of the dataset is the frame number. To avoid corrupt files, we currently split each recording (think something along a video recording) into separate hdf5 files, approx. 1 GB in size (configurable). Working with the split files/dataset is doable, but less elegant than putting one big dataset into one big file, obviously.

- Is there a way to ensure the integrity of partial recordings (against power loss, software crashes, you name it) without splitting them into all these small files?

Won't judicious use of H5Fflush() do the trick?

- Is there a way to create a "master file" that uses symbolic/external links to "link together" all the datasets (one per file) into something that looks like a dataset from a hdf5 user (h5py, matlab, ...)? I've noticed the "drivers" [1] that talk about split files, but I'm uncertain whether each sub-file is a valid hdf5 file? H5FD_MULTI superficially looks like what we need.

Based on what you've written, I was going to suggest the 'family' driver, https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFaplFamily. It'd probably be best of you could size things such that an integral number of frames goes into a given file. But, I suspect in general, that won't be possible.

- Can virtual datasets [2] be used from older (1.8.x) clients? Will it work for this purpose?

Virtual Data Sets are new to HDF5. I don't think 1.8 series supports them, or ever will. But, they will solve your desire to create a "single dataset" view of the data.

Given how big these seem to be, are you really thinking any sequential (e.g. non-parallel) app like matlab is really going to be able to do much with a single dataset view of this data?

- Or are we missing some great idea or feature in HDF5?

Depending what you need to do, mounting one file within another, https://support.hdfgroup.org/HDF5/doc/RM/RM_H5F.html#File-Mount, external links, https://support.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-CreateExternal, or object references, https://support.hdfgroup.org/HDF5/doc/RM/RM_H5R.html#Reference-Create might offser some of what you need.

Cheers,
Paul

[1] https://support.hdfgroup.org/HDF5/Tutor/filedrvr.html#predef
[2] https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesVirtualDatasetDocs.html

···

From: Hdf-forum <hdf-forum-bounces@lists.hdfgroup.org> on behalf of Paul Anton Letnes <pa@letnes.com>
Reply-To: Paul Anton Letnes <pa@letnes.com>, HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Date: Tuesday, November 8, 2016 at 12:42 PM
To: "hdf-forum@lists.hdfgroup.org" <hdf-forum@lists.hdfgroup.org>
Subject: [Hdf-forum] Working with split hdf5 files

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

Working with split hdf5 files