HDFds - proposed conventions for using HDF5 for data sharing

I'm part of a group which is working to develop standards for sharing
neuroscience electrophysiology data using HDF5. As part of this effort we
developed proposed conventions for using HDF5 for data sharing which are
independent of any domain. A main goal of these conventions is to provide
a standard way of specifying schemata that describe data and metadata
within an HDF5 file. We named these proposed conventions HDFds (ds for
data sharing).

I'm concerned that some of what is proposed might overlap with previous
work or that there might be better ways of achieving the desired
functionality. I would very much appreciate any feedback about these
proposed conventions, either to the mailing list or to me directly.

Thanks,
Jeff

hdfds-v0_5.pdf (175 KB)

Hi Jeff,

I had a look at your document and have some comments:

(page 2) It looks like you intend for your "schema" to be a human-readable
description of how the data are organized and not a formal specification
that can be used to validate an HDF5 file via a check tool of some sort (as
in an XML schema). Although I'm sure this is informally useful to you,
this lack of a machine-verifiable formal specification would be a major
weakness of HDFds.
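
To make the distinction concrete, here is a minimal sketch of what a machine-checkable schema could look like. It is not part of HDFds or HDF5: the schema layout, the `validate` function, and the nested-dict stand-in for a file's hierarchy are all assumptions made for illustration. In practice the tree would be built by walking a real HDF5 file.

```python
# Hypothetical schema: each key is a required node. A dict means "group",
# a string names the expected element type of a dataset.
SCHEMA = {
    "recording": {
        "samples": "float64",
        "timestamps": "float64",
    },
    "subject_id": "int32",
}


def validate(tree, schema, path=""):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected in schema.items():
        here = f"{path}/{name}"
        if name not in tree:
            problems.append(f"missing node: {here}")
        elif isinstance(expected, dict):
            if isinstance(tree[name], dict):
                problems.extend(validate(tree[name], expected, here))
            else:
                problems.append(f"expected group at {here}")
        elif isinstance(tree[name], dict):
            problems.append(f"expected dataset at {here}")
        elif tree[name] != expected:
            problems.append(f"wrong type at {here}: {tree[name]} != {expected}")
    return problems


# Stand-in for the structure read from a file (dataset name -> dtype name).
example_tree = {
    "recording": {"samples": "float64", "timestamps": "int64"},
}
```

Calling `validate(example_tree, SCHEMA)` reports the mistyped `timestamps` dataset and the missing `subject_id` node, which is the kind of check a purely human-readable description cannot provide.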

(page 3) HDF5 not specifying how user metadata should be structured is not
really a "limitation" of HDF5. Different users will have differing ideas
about what metadata is important so we don't lock people into a particular
arrangement.

(page 3) You can store mixed types in an attribute using a compound type.
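
As a rough illustration of this point, the sketch below stores one mixed-type record as a single attribute using h5py and a NumPy structured dtype. The field names, dataset name, and file name are hypothetical, and the file is kept in memory only.

```python
import h5py
import numpy as np

# Hypothetical field names; any mix of numeric and fixed-length string
# fields can go into one compound (structured) type.
ACQ_DTYPE = np.dtype([("channel", "i4"), ("gain", "f8"), ("label", "S16")])


def write_and_read_compound_attr():
    """Store one mixed-type record as an attribute and read it back."""
    record = np.array([(1, 2.5, b"probe-A")], dtype=ACQ_DTYPE)
    # driver="core" with backing_store=False keeps the file in memory only.
    with h5py.File("compound_demo.h5", "w", driver="core",
                   backing_store=False) as f:
        dset = f.create_dataset("signal", data=np.zeros(4))
        dset.attrs["acquisition"] = record
        return dset.attrs["acquisition"]
```

The value read back is a NumPy structured array, so fields can be accessed by name, for example `write_and_read_compound_attr()["gain"]`.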

(page 3) Encoding your metadata as a JSON object is similar to storing
parseable strings in database tables - you aren't leveraging the strength
of the platform. In the grand scheme of things, it probably isn't a big
deal to store your metadata as JSON strings (especially if they are small
and infrequently accessed) and maybe that fits well into your code, but the
more HDF5-centered way to store that metadata would be as several
independent attributes.
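
One possible way to move from a JSON string to independent attributes is to flatten the JSON object into name/value pairs first. The sketch below uses only the standard library; the function name, the dotted naming scheme for nested keys, and the sample metadata are assumptions, not anything defined by HDFds.

```python
import json


def flatten_metadata(json_text, sep="."):
    """Turn a JSON object into a flat {name: value} mapping."""
    flat = {}

    def walk(value, prefix):
        if isinstance(value, dict):
            # Nested objects contribute their keys to the attribute name.
            for key, child in value.items():
                walk(child, f"{prefix}{sep}{key}" if prefix else key)
        else:
            flat[prefix] = value

    walk(json.loads(json_text), "")
    return flat


# Hypothetical metadata of the kind that might be stored as one JSON string.
metadata_json = (
    '{"sampling_rate": 30000.0, "probe": {"model": "A1x32", "sites": 32}}'
)
attributes = flatten_metadata(metadata_json)
```

Each resulting pair could then be written as its own attribute, for example with `obj.attrs[name] = value` in h5py, so that tools can read a single value without parsing a string.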

(page 4) HDF5 specifies references to nodes as absolute paths. You can use
region references to refer to subsets of a dataset. HDF5 also supports
external links to other files.

(page 4) The term "settings" is probably too experimentally oriented for
general HDF5 use. What does "settings" mean in a file that stores
phylogenetic tree data or patient history data?

(page 5) The idea of associating attributes to collections of objects in
the HDF5 file is an interesting one, though I'm not sure how to cleanly
handle that off the top of my head. Definitely something to keep in mind.
I would want to handle that inside the library, though, and not via easily
broken parseable string attributes.

Unfortunately, I can't really weigh in on what sort of similar work has
been done in this area (others at THG will have to do that) but there's
clearly a need for some sort of formal, verifiable HDF5 schema.

Cheers,

Dana

On Tue, Jan 22, 2013 at 1:42 PM, Jeff Teeters <jteeters@berkeley.edu> wrote:

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Thanks for the answer. I would also like to add some suggestions.

Hi Jeff,

I had a look at your document and have some comments:

(page 2) It looks like you intend for your "schema" to be a human-readable
description of how the data are organized and not a formal specification
that can be used to validate an HDF5 file via a check tool of some sort (as
in an XML schema). Although I'm sure this is informally useful to you, this
lack of a machine-verifiable formal specification would be a major weakness
of HDFds.

Perhaps one could convert the HDF5 file to XML and then validate the XML.

(page 3) HDF5 not specifying how user metadata should be structured is not
really a "limitation" of HDF5. Different users will have differing ideas
about what metadata is important so we don't lock people into a particular
arrangement.

I think this is a strength of HDF5.
Instead of specifying the metadata structure, it might make more sense to
let everyone use whatever data structure best represents their data, and
instead specify a structure for the meta-metadata.
What I mean is that each company can describe its metadata inside the file
itself using standard nodes; remember that HDF5 also supports symbolic (soft)
links, i.e. shortcuts.
For example, I can record my channels in "/channels/channel01/dataset"
while another company may record them in "/recording/1".
Now if the standard specifies something like
"/experiment/info/channel/continuous/1", each company only needs to
create a shortcut from that path to its own data.
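
The shortcut idea can be sketched with h5py soft links. The two paths are the ones from the example above; the function name, file name, and sample data are hypothetical, and the file is kept in memory only.

```python
import h5py
import numpy as np


def link_vendor_data_to_standard_path():
    """Expose vendor-specific data under an agreed, standard path."""
    with h5py.File("softlink_demo.h5", "w", driver="core",
                   backing_store=False) as f:
        # Vendor-specific location of the recording.
        f.create_dataset("/channels/channel01/dataset", data=np.arange(5))
        # Standard location: a soft link that points at the vendor's data.
        std_group = f.require_group("/experiment/info/channel/continuous")
        std_group["1"] = h5py.SoftLink("/channels/channel01/dataset")
        # Readers only need to know the standard path.
        return f["/experiment/info/channel/continuous/1"][...]
```

A reader that knows only the standard path gets the same array the vendor wrote, without the data being copied.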

(page 3) You can store mixed types in an attribute using a compound type.

Yes, and the nice thing about the mixed type [1] is that it will appear
as a table when viewed in tools such as HDFView.

(page 4) The term "settings" is probably too experimentally oriented for
general HDF5 use. What does "settings" mean in a file that stores
phylogenetic tree data or patient history data?

Yes, maybe something like /experiment/info can be standardized (with
the link approach).

(page 5) The idea of associating attributes to collections of objects in
the HDF5 file is an interesting one, though I'm not sure how to cleanly
handle that off the top of my head. Definitely something to keep in mind.
I would want to handle that inside the library, though, and not via easily
broken parseable string attributes.

I think this can also be achieved using soft links: you can record the
attribute once and reference it from multiple nodes.
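
HDF5 links point at objects (groups, datasets, named datatypes) rather than at individual attributes, so one way to read this suggestion is to keep the shared attributes on a single group and soft-link that group from every node that needs them. The sketch below assumes that interpretation; all group, attribute, and file names are hypothetical.

```python
import h5py
import numpy as np


def share_metadata_between_groups():
    """Record shared attributes once and reference them from several nodes."""
    names = ("trial01", "trial02")
    with h5py.File("shared_meta_demo.h5", "w", driver="core",
                   backing_store=False) as f:
        # The shared metadata lives in exactly one place.
        shared = f.create_group("/shared/amplifier")
        shared.attrs["gain"] = 2.5
        for name in names:
            grp = f.create_group(name)
            grp.create_dataset("data", data=np.zeros(3))
            # Each trial refers to the shared metadata through a soft link.
            grp["amplifier"] = h5py.SoftLink("/shared/amplifier")
        return [float(f[f"{name}/amplifier"].attrs["gain"]) for name in names]
```

Changing the attribute on `/shared/amplifier` would then be visible from every linked node, though a dangling link results if the shared group is removed.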

dashesy
Unless explicitly stated, the opinions expressed in this email do not
represent the official position of the company I work for.

[1] One reference implementation (using structured attributes):

On Wed, Jan 30, 2013 at 6:31 PM, Dana Robinson <derobins@hdfgroup.org> wrote: