Hi Matt,
I find Werner's and George's remarks disappointing. As Werner says,
HDF5 cannot be sufficient and applications must agree on what to do
with the contents of a datafile: the HDF5 format supplies syntax for
data containers, but the applications need to agree on the semantics.
That is not my concern.
OK, good. Sometimes people stumble at this point and expect "magic" to happen at the "semantic" level just because they are using HDF5 at the "syntax" level.
If I read George's comments (and the comments of others in this
thread) correctly, the reason for "non-robust workflows" (in the sense
that an HDF5 file created somewhere may not be usable everywhere) is
taken to be that different applications use different versions of the
library, which have different APIs. But the HDF Group releases
different versions at a rate faster than one per year, and claims to
support a 1.6.* series and a 1.8.* series. It would seem that
differences in library versions are to be expected. Indeed, when I
read the Compatibility Considerations pages, I see that forward and
backward compatibility are (as far as possible) design goals, though
the tables marked "may create forward-compatibility conflicts"
suggest that nearly everything may create a conflict.
I completely agree with you; it's unfortunate that introducing new features that require file format changes creates this issue. We have worked hard to minimize the effects by always defaulting to the most backward-compatible version of the file format that can encode the features an application uses. So, an application that uses a particular set of API routines and is migrated to a new version of the HDF5 library will continue to create files that use the same file format version, as long as no new API routines are used (modulo bugs, of course).
Also, we have spaced out the major releases (1.6.0, 1.8.0, the upcoming 1.10.0, etc.) by several years, with the interim releases only adding supporting/minor API routines and providing bug fixes. Hopefully, that's not a concern that you'll need to worry about.
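To make that "default to the most backward-compatible format" policy concrete, here is a minimal sketch of the idea: given the set of features an application actually used, pick the earliest file format version that can encode all of them. The feature names and version numbers below are purely illustrative, not the library's real internal tables.

```python
# Hypothetical sketch of the backward-compatibility policy described
# above. Feature names and version numbers are illustrative only and
# do not correspond to HDF5's actual internal encoding tables.
FEATURE_MIN_VERSION = {
    "contiguous_dataset": 0,  # encodable in the oldest format
    "chunked_dataset": 0,
    "external_link": 2,       # a 1.8-era feature, as an example
    "compact_group": 2,
}

def format_version_for(features):
    """Return the earliest format version that encodes every feature used."""
    return max((FEATURE_MIN_VERSION[f] for f in features), default=0)
```

So a program that only creates contiguous and chunked datasets keeps writing the oldest (most widely readable) format, and the format version is bumped only when a newer feature is actually used.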
The suggested remedies for problems from different versions are a)
contact the vendor to support a specific version of the API, and b)
choose carefully which versions of the API (and applications) to use.
These suggestions do, then, acknowledge that an HDF5 file created
somewhere may not be usable everywhere. The suggestions are not very
practical solutions for files intended to exchange and store data such
that they can be shared widely across platforms, applications, and
time. Neither are "support contracts": If I create a data file and
two years later colleague A on continent B cannot read that file into
analysis application C written with Programming language D by person
E, which person needs the support contract, and from whom?
You have outlined one of the difficulties, yes. However, we have designed the file format and library so that later versions can always read objects written with earlier versions of the format; all files produced with earlier versions of the library will therefore always be readable by later versions of the library.
So, the main problem for application developers arises when they send a file produced with new features on a new library to someone who is using an older version of the library. Again, there's not much we can do in this case: one user wanted and used a feature that isn't available in the earlier version of the library that the second user has.
No one has explained a simple method to repair a file that has been
corrupted by HDFView2.5 with HDF5 1.8.2 or to even detect the
corruption: none of the h5 utilities detected it. It is pretty clear
to me that there is not much testing of file compatibility across the
**supported** API versions. I apologize for sounding unappreciative
of the work done, and I feel there is no better alternative to HDF5,
but is this really considered excellent?
Maybe I missed this in your earlier messages, but was the file actually corrupted by HDFView? Is it no longer readable? If so, could you provide access to a copy of it, so we can work with you to address the issue?
You are correct about testing between supported versions, though; I think that's an area where we could stand to improve.
I believe my other question ("Can one detect which version of the API
was used to write a file") and its response ("No") actually has some
importance here. There are non-trivial differences between the 1.6
and 1.8 formats and APIs and, apparently, no practical way to test
for these changes or whether an object in a file can be read by a
specific library. This is described as a feature, not a bug, which I
find astonishing. If version information existed for files or
objects, many of the problems of version mismatch would at least be
detectable, so that an application could at least say "sorry I can't
read a link object". That there were format differences between 1.6
and 1.8 that may create conflicts suggests that this will happen
again. That a version system does not exist suggests that there
were no lessons learned from the introduction of non-detectable
conflicts.
The version information is contained in the files, that's not a problem. What I think you want is a utility that will check the format of each file to verify the version of the objects within it, which I mentioned is on our development path. Is there something else you'd like to see?
Of course, version information would not solve the corruption of
data files, but any significant testing would have detected that.
This is all somewhat disappointing. I believe that using HDF5 is
probably the only viable format for exchanging large scientific data
sets, but I am not excited about the idea of having to continually
fight version problems due to poor design and sloppy practices.
I appreciate your feedback here and hope you stay active on the forum and continue to provide it. Here are some of the things I think you are asking for:
- A utility which checks a file to determine if it can be read by a particular version of the HDF5 library.
- Better compatibility testing, reading files produced with one version of the library using another version.
Are there other suggestions you would add?
Quincey
On Dec 2, 2009, at 11:42 AM, Matt Newville wrote: