HDFView altering files such that they cannot be read with IDL

newville · December 1, 2009, 3:45pm

Hi,

I'm investigating how and whether to use HDF5 for storing and
exchanging data from synchrotron x-ray microprobes. Actually, several
people have been working on this for awhile at
http://groups.google.com/group/mahid -- I'm a late-comer to the group.
Many of the people interested in exchanging these types of data have
applications written either in Python or IDL. So, one of my first
tests was to see how well IDL (which uses HDF5 v1.6) and Python could
exchange HDF5 data. Even given the known features of HDF5 1.8 that
are not supported in version 1.6, the results were not very
encouraging.

First, a minor issue and complaint: it appears that HDF5 files are
NOT tagged with the minor version number. That is, an application
using the v1.6 library (IDL) cannot tell whether a valid HDF5 file
might have features beyond v1.6. Is that correct? Not having version
tags seems like a drawback for a file format designed and advertised
to be portable across different applications. If this is indeed the
case, is there any hope of having this remedied? If not, what is the
recommended way to check for "future" features in a file that are
beyond the scope of the current API?

Unfortunately, there is a much bigger issue, which turned out to be
due to the HDFView program itself. Upon reading an HDF5 file created
with HDF5 1.8.4 (Using Python and h5py v 1.21), HDFView 2.5 (on both
Linux and Windows) alters the HDF5 file. The data contents of the
file are not changed (h5dump gives identical results), and the size of
the file does not change, but the file header (between bytes 160 and
192) does change. The alteration appears to happen on reading the
file -- no attempt is made to plot or view anything more than the tree
structure of the data. The small change in the header does prevent
IDL from opening the file:

~> idl
IDL Version 7.0 (linux x86 m32). (c) 2007, ITT Visual Information Solutions
Installation number: 94742-1.
Licensed for use by: Dept License

> fh = h5f_open('altered.h5')
% H5F_OPEN: unable to open file: Object Name:"altered.h5"
% Execution halted at: $MAIN$

whereas the unaltered file opens and all the data in it can be read
just fine. The simple test file does not contain any v1.8 specific
features.

For earlier versions of IDL (version 6.3), this altered file causes
IDL to give a segmentation fault on h5f_open(). Since IDL appears to
use thin wrappers around the HDF5 API, I would not be surprised if
these altered files prevented them from being read with other
applications. More details on this, including example files are at

http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes/HDF5AndIDL

Has anyone here seen similar problems or know of a way around this?

Thanks,

--Matt Newville <newville at cars.uchicago.edu>

Quincey_Koziol · December 1, 2009, 6:30pm

Hi Matt,

Hi,

I'm investigating how and whether to use HDF5 for storing and
exchanging data from synchrotron x-ray microprobes. Actually, several
people have been working on this for awhile at
http://groups.google.com/group/mahid -- I'm a late-comer to the group.
Many of the people interested in exchanging these types of data have
applications written either in Python or IDL. So, one of my first
tests was to see how well IDL (which uses HDF5 v1.6) and Python could
exchange HDF5 data. Even given the known features of HDF5 1.8 that
are not supported in version 1.6, the results were not very
encouraging.

First, a minor issue and complaint: it appears that HDF5 files are
NOT tagged with the minor version number. That is, an application
using the v1.6 library (IDL) cannot tell whether a valid HDF5 file
might have features beyond v1.6. Is that correct? Not having version
tags seems like a drawback for a file format designed and advertised
to be portable across different applications. If this is indeed the
case, is there any hope of having this remedied? If not, what is the
recommended way to check for "future" features in a file that are
beyond the scope of the current API?

This is a "feature", not a "bug". A file can be modified by multiple different versions of the HDF5 library and each object in the file might or might not be able to be read by a given version of the library. That said, we have planned two features that will help address the issue:
- The H5Pset_libver_bounds() routine will be able to constrain the objects created/modified to conform to versions that can be read by different versions of the library.

- We are planning to add a feature to the h5check tool that can accept low & high bounds of the library version and check if a file has versions of the data structures outside of that range.

Quincey

···

On Dec 1, 2009, at 9:45 AM, Matt Newville wrote:

Unfortunately, there is a much bigger issue, which turned out to be
due to the HDFView program itself. Upon reading an HDF5 file created
with HDF5 1.8.4 (Using Python and h5py v 1.21), HDFView 2.5 (on both
Linux and Windows) alters the HDF5 file. The data contents of the
file are not changed (h5dump gives identical results), and the size of
the file does not change, but the file header (between bytes 160 and
192) does change. The alteration appears to happen on reading the
file -- no attempt is made to plot or view anything more than the tree
structure of the data. The small change in the header does prevent
IDL from opening the file:

~> idl
IDL Version 7.0 (linux x86 m32). (c) 2007, ITT Visual Information Solutions
Installation number: 94742-1.
Licensed for use by: Dept License

> fh = h5f_open('altered.h5')
% H5F_OPEN: unable to open file: Object Name:"altered.h5"
% Execution halted at: $MAIN$

whereas the unaltered file opens and all the data in it can be read
just fine. The simple test file does not contain any v1.8 specific
features.

For earlier versions of IDL (version 6.3), this altered file causes
IDL to give a segmentation fault on h5f_open(). Since IDL appears to
use thin wrappers around the HDF5 API, I would not be surprised if
these altered files prevented them from being read with other
applications. More details on this, including example files are at

http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes/HDF5AndIDL

Has anyone here seen similar problems or know of a way around this?

Thanks,

--Matt Newville <newville at cars.uchicago.edu>

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

gnwiii · December 1, 2009, 5:44pm

Hi,

I'm investigating how and whether to use HDF5 for storing and
exchanging data from synchrotron x-ray microprobes. Actually, several
people have been working on this for awhile at
http://groups.google.com/group/mahid -- I'm a late-comer to the group.
Many of the people interested in exchanging these types of data have
applications written either in Python or IDL. So, one of my first
tests was to see how well IDL (which uses HDF5 v1.6) and Python could
exchange HDF5 data. Even given the known features of HDF5 1.8 that
are not supported in version 1.6, the results were not very
encouraging.

First, a minor issue and complaint: it appears that HDF5 files are
NOT tagged with the minor version number. That is, an application
using the v1.6 library (IDL) cannot tell whether a valid HDF5 file
might have features beyond v1.6. Is that correct? Not having version
tags seems like a drawback for a file format designed and advertised
to be portable across different applications. If this is indeed the
case, is there any hope of having this remedied? If not, what is the
recommended way to check for "future" features in a file that are
beyond the scope of the current API?

Unfortunately, there is a much bigger issue, which turned out to be
due to the HDFView program itself. Upon reading an HDF5 file created
with HDF5 1.8.4 (Using Python and h5py v 1.21), HDFView 2.5 (on both
Linux and Windows) alters the HDF5 file. The data contents of the
file are not changed (h5dump gives identical results), and the size of
the file does not change, but the file header (between bytes 160 and
192) does change. The alteration appears to happen on reading the
file -- no attempt is made to plot or view anything more than the tree
structure of the data. The small change in the header does prevent
IDL from opening the file:

Does the file change if you use "Open Read Only" in HDFView? I never
use the "Open", and most of the files I use have "read-only" permissions.

I can confirm that using "Open" with hdfview on Mac OS X 10.5.8 does
indeed change the file:

gala:R-2.10 gwhite$ cmp *.hdf
ex1-orig.hdf ex1-hdfview.hdf differ: char 73, line 3

decimal 73 is octal 111:

gala:R-2.10 gwhite$ od -x ex1-hdfview.hdf |grep ^0000100
0000100 0060 0000 0000 0000 0000 0000 0000 0000
gala:R-2.10 gwhite$ od -x ex1-orig.hdf |grep ^0000100
0000100 0060 0000 0000 0000 0001 0000 0000 0000

Note: ex1.hdf is the file created by running the example for hdf5load
in R's hdf5
library.

~> idl
IDL Version 7.0 (linux x86 m32). (c) 2007, ITT Visual Information Solutions
Installation number: 94742-1.
Licensed for use by: Dept License

> fh = h5f_open('altered.h5')
% H5F_OPEN: unable to open file: Object Name:"altered.h5"
% Execution halted at: $MAIN$

whereas the unaltered file opens and all the data in it can be read
just fine. The simple test file does not contain any v1.8 specific
features.

For earlier versions of IDL (version 6.3), this altered file causes
IDL to give a segmentation fault on h5f_open(). Since IDL appears to
use thin wrappers around the HDF5 API, I would not be surprised if
these altered files prevented them from being read with other
applications. More details on this, including example files are at

http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes/HDF5AndIDL

Has anyone here seen similar problems or know of a way around this?

HDF-based workflows are not robust. It helps to have current versions and
support contracts so you can share the pain with your vendor, and most
non-trivial applications seem to end up using a private hdf library because
you can't rely on anyone else's compile to have the right combination of
options, 3rd party libs, etc.

···

On Tue, Dec 1, 2009 at 11:45 AM, Matt Newville <newville@cars.uchicago.edu> wrote:

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

Peter_Cao · December 1, 2009, 6:19pm

Hi,

This is a known issue for HDF5 1.8.2 and HDFView 2.5 was built on HDF5 1.8.2.
The problem is fixed with the current HDF5 release (version 1.8.4).

We just put a beta release for HDFView 2.6, which is built on HDF5 1.8.4.
The problem should be fixed. Also, in HDFView 2.6, there is an option under
"Tools" --> "User Options" to allow users to set default file access mode.

Thanks
--pc

George N. White III wrote:

···

On Tue, Dec 1, 2009 at 11:45 AM, Matt Newville > <newville@cars.uchicago.edu> wrote:


Hi,

I'm investigating how and whether to use HDF5 for storing and
exchanging data from synchrotron x-ray microprobes. Actually, several
people have been working on this for awhile at
http://groups.google.com/group/mahid -- I'm a late-comer to the group.
Many of the people interested in exchanging these types of data have
applications written either in Python or IDL. So, one of my first
tests was to see how well IDL (which uses HDF5 v1.6) and Python could
exchange HDF5 data. Even given the known features of HDF5 1.8 that
are not supported in version 1.6, the results were not very
encouraging.

First, a minor issue and complaint: it appears that HDF5 files are
NOT tagged with the minor version number. That is, an application
using the v1.6 library (IDL) cannot tell whether a valid HDF5 file
might have features beyond v1.6. Is that correct? Not having version
tags seems like a drawback for a file format designed and advertised
to be portable across different applications. If this is indeed the
case, is there any hope of having this remedied? If not, what is the
recommended way to check for "future" features in a file that are
beyond the scope of the current API?

Unfortunately, there is a much bigger issue, which turned out to be
due to the HDFView program itself. Upon reading an HDF5 file created
with HDF5 1.8.4 (Using Python and h5py v 1.21), HDFView 2.5 (on both
Linux and Windows) alters the HDF5 file. The data contents of the
file are not changed (h5dump gives identical results), and the size of
the file does not change, but the file header (between bytes 160 and
192) does change. The alteration appears to happen on reading the
file -- no attempt is made to plot or view anything more than the tree
structure of the data. The small change in the header does prevent
IDL from opening the file:

Does the file change if you use "Open Read Only" in HDFView? I never
use the "Open", and most of the files I use have "read-only" permissions.

I can confirm that using "Open" with hdfview on Mac OS X 10.5.8 does
indeed change the file:

gala:R-2.10 gwhite$ cmp *.hdf
ex1-orig.hdf ex1-hdfview.hdf differ: char 73, line 3

decimal 73 is octal 111:

gala:R-2.10 gwhite$ od -x ex1-hdfview.hdf |grep ^0000100
0000100 0060 0000 0000 0000 0000 0000 0000 0000
gala:R-2.10 gwhite$ od -x ex1-orig.hdf |grep ^0000100
0000100 0060 0000 0000 0000 0001 0000 0000 0000

Note: ex1.hdf is the file created by running the example for hdf5load
in R's hdf5
library.

~> idl
IDL Version 7.0 (linux x86 m32). (c) 2007, ITT Visual Information Solutions
Installation number: 94742-1.
Licensed for use by: Dept License

> fh = h5f_open('altered.h5')
% H5F_OPEN: unable to open file: Object Name:"altered.h5"
% Execution halted at: $MAIN$

whereas the unaltered file opens and all the data in it can be read
just fine. The simple test file does not contain any v1.8 specific
features.

For earlier versions of IDL (version 6.3), this altered file causes
IDL to give a segmentation fault on h5f_open(). Since IDL appears to
use thin wrappers around the HDF5 API, I would not be surprised if
these altered files prevented them from being read with other
applications. More details on this, including example files are at

http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes/HDF5AndIDL

Has anyone here seen similar problems or know of a way around this?

HDF-based workflows are not robust. It helps to have current versions and
support contracts so you can share the pain with your vendor, and most
non-trivial applications seem to end up using a private hdf library because
you can't rely on anyone else's compile to have the right combination of
options, 3rd party libs, etc.

newville · December 1, 2009, 7:16pm

George, Peter,

Thanks for the confirmation, and for the notice that the problem has
been seen before.

I don't think expecting users to ALWAYS make files read-only or to
ALWAYS open them in the non-default mode of "read-only" in a program
called ***View (it's not called HDFEdit!!!) is likely to be a
completely robust way to make sure that files are not altered in such
a way that they can no longer be read.

George N. White III wrote:

HDF-based workflows are not robust. It helps to have current versions and
support contracts so you can share the pain with your vendor, and most
non-trivial applications seem to end up using a private hdf library because
you can't rely on anyone else's compile to have the right combination of
options, 3rd party libs, etc.

Yikes. I thought having HDFView alter files was disturbing, but
hearing that "workflows are not robust" is not exactly encouraging....
I read the "Why use HDF?" statement on the hdfgroup web page:

    Many HDF adopters have very large datasets, very fast access
requirements, or
    very complex datasets. Others turn to HDF because it allows them
to easily share
    data across a wide variety of computational platforms using
applications written in
    different programming languages. Some use HDF to take advantage of the many
    open-source and commercial tools that understand HDF.

To mean that HDF5 claims to be designed to be able to act as a base
file format for exchanging complex datasets between different computer
platforms and applications. So that, for example, one should be able
to write an HDF5 file on linux with language X and read that file on
Windows with applications Y and Z. This seems to work, at least
mostly. Perhaps I have optimistically interpreted the
advertisement??

--Matt Newville <newville at cars.uchicago.edu>

Quincey_Koziol · December 1, 2009, 7:25pm

Hi Matt,

George, Peter,

Thanks for the confirmation, and for the notice that the problem has
been seen before.

I don't think expecting users to ALWAYS make files read-only or to
ALWAYS open them in the non-default mode of "read-only" in a program
called ***View (it's not called HDFEdit!!!) is likely to be a
completely robust way to make sure that files are not altered in such
a way that they can no longer be read.

It was a bug in the underlying HDF5 library which was re-writing some metadata in the file, so it wasn't HDFView's fault...

George N. White III wrote:

HDF-based workflows are not robust. It helps to have current versions and
support contracts so you can share the pain with your vendor, and most
non-trivial applications seem to end up using a private hdf library because
you can't rely on anyone else's compile to have the right combination of
options, 3rd party libs, etc.

Yikes. I thought having HDFView alter files was disturbing, but
hearing that "workflows are not robust" is not exactly encouraging....
  I read the "Why use HDF?" statement on the hdfgroup web page:

   Many HDF adopters have very large datasets, very fast access
requirements, or
   very complex datasets. Others turn to HDF because it allows them
to easily share
   data across a wide variety of computational platforms using
applications written in
   different programming languages. Some use HDF to take advantage of the many
   open-source and commercial tools that understand HDF.

To mean that HDF5 claims to be designed to be able to act as a base
file format for exchanging complex datasets between different computer
platforms and applications. So that, for example, one should be able
to write an HDF5 file on linux with language X and read that file on
Windows with applications Y and Z. This seems to work, at least
mostly. Perhaps I have optimistically interpreted the
advertisement??

Hmm, I'm curious why George would say that "HDF-based workflows are not robust" also... We _do_ have a strong commitment to making them robust and have heard generally good reports on that front. In the cases where people have had concerns or issues, we have worked very hard to address them.

Quincey

···

On Dec 1, 2009, at 1:16 PM, Matt Newville wrote:

werner · December 1, 2009, 7:38pm

I just can throw in in defense of HDF5 that I can still happily read
and deal with large HDF5 files that I've created 10+ years ago
(>tens of GB, which was immense at this time).

A problem might be though with commercial applications that ship
ancient HDF5 versions with them and have a slow update release
cycles or let their customers pay for such updates...

Werner

···

On Tue, 01 Dec 2009 13:25:23 -0600, Quincey Koziol <koziol@hdfgroup.org> wrote:

George N. White III wrote:

HDF-based workflows are not robust. It helps to have current versions and
support contracts so you can share the pain with your vendor, and most
non-trivial applications seem to end up using a private hdf library because
you can't rely on anyone else's compile to have the right combination of
options, 3rd party libs, etc.

Yikes. I thought having HDFView alter files was disturbing, but
hearing that "workflows are not robust" is not exactly encouraging....
  I read the "Why use HDF?" statement on the hdfgroup web page:

   Many HDF adopters have very large datasets, very fast access
requirements, or
   very complex datasets. Others turn to HDF because it allows them
to easily share
   data across a wide variety of computational platforms using
applications written in
   different programming languages. Some use HDF to take advantage of the many
   open-source and commercial tools that understand HDF.

To mean that HDF5 claims to be designed to be able to act as a base
file format for exchanging complex datasets between different computer
platforms and applications. So that, for example, one should be able
to write an HDF5 file on linux with language X and read that file on
Windows with applications Y and Z. This seems to work, at least
mostly. Perhaps I have optimistically interpreted the
advertisement??

  Hmm, I'm curious why George would say that "HDF-based workflows are not robust" also... We _do_ have a strong commitment to making them robust and have heard generally good reports on that front. In the cases where people have had concerns or issues, we have worked very hard to address them.

--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

newville · December 1, 2009, 7:59pm

Hi Werner,

I just can throw in in defense of HDF5 that I can still happily read
and deal with large HDF5 files that I've created 10+ years ago
(>tens of GB, which was immense at this time).

Yes, I was surprised to see the "non-robust workflow" comment. I'm not
convinced by it.

A problem might be though with commercial applications that ship
ancient HDF5 versions with them and have a slow update release
cycles or let their customers pay for such updates...

I would agree with that, but unfortunately this is not a case of
dealing with ancient HDF5 versions. The current version of IDL (7.1)
was released in May 2009. It ships with HDF5 1.6.7, which was
released in Jan 2008. HDF5 1.6.8 was released in Nov 2008, so that
IDL is between six and twelve months behind, although it did not jump
to the HDF5 1.8 series. For a vendor, I would hardly call that "way
behind" or ancient.

In any event, as we have seen, the cause of the problem in question
was actually the HDF5 1.8.* library, not IDL or the HDF5 1.6 library.
Perhaps in this case the vendor's conservative position was justified.

--Matt Newville <newville at cars.uchicago.edu>

···

On Tue, Dec 1, 2009 at 1:38 PM, Werner Benger <werner@cct.lsu.edu> wrote:

dsdale24 · December 1, 2009, 7:39pm

It was a pretty strong assertion, too strong in my opinion based on
the provided context and subjective qualifications. No offense meant
to George.

Darren

···

On Tue, Dec 1, 2009 at 2:25 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:

George N. White III wrote:

HDF-based workflows are not robust. It helps to have current versions and
support contracts so you can share the pain with your vendor, and most
non-trivial applications seem to end up using a private hdf library because
you can't rely on anyone else's compile to have the right combination of
options, 3rd party libs, etc.

Hmm, I'm curious why George would say that "HDF-based workflows are not
robust" also... We _do_ have a strong commitment to making them robust and
have heard generally good reports on that front. In the cases where people have
had concerns or issues, we have worked very hard to address them.

Daniel_Kahn1 · December 1, 2009, 8:03pm

Yikes. I thought having HDFView alter files was disturbing, but
hearing that "workflows are not robust" is not exactly encouraging....

I’d be interested to know what a “workflow” is and what it means to be
“robust”. I’m not trained in software engineering. Is it a technical
term? Anyone?


    I read the "Why use HDF?" statement on the hdfgroup web page:
Many HDF adopters have very large datasets, very fast access
requirements, or
very complex datasets. Others turn to HDF because it allows them
to easily share
data across a wide variety of computational platforms using
applications written in
different programming languages. Some use HDF to take advantage of the many
open-source and commercial tools that understand HDF.
To mean that HDF5 claims to be designed to be able to act as a base
file format for exchanging complex datasets between different computer
platforms and applications. So that, for example, one should be able
to write an HDF5 file on linux with language X and read that file on
Windows with applications Y and Z. This seems to work, at least
mostly. Perhaps I have optimistically interpreted the
advertisement??

I don’t think so. NASA has been using HDF (4 and 5) for a number of
missions. The cost (and, I hope, value) of the data should be measured
in the billions of $. I’m not of sure of the volume of data, but it is
significant. That said, there have been bumps in the road, most
recently going from version HDF5 version 1.6.x to 1.8.x, presumably
because of some additions to the object model in the internals of the
files. Of course, problems can only be fixed if they are discovered
and this discovery is sometimes by users. You will likely see more if
you follow this forum–it is the nature of the business–though I
expect problems become increasingly esoteric as the library matures.

Data producers will need to evaluate different data storage methods
based on their own needs. For many of us that will be HDF5, as its
flexibility and ubiquity outweigh its defects. Dealing with the
different permutations of OS’s, compilers, libraries, analysis programs
and data formats will be a chore regardless of your choice.

With specific reference to IDL, we found that one of those bumps
required us to go to at least IDL version 7.0.6.

Cheers,

–dan

···

Hdf-forum@hdfgroup.org http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

-- Daniel Kahn
Science Systems and Applications Inc.
301-867-2162

gnwiii · December 2, 2009, 1:25pm

I think thg does excellent work, and the lack of robustness is a
reflection of the
difficulty of the goals in "Why use HDF?" in today's computing
environments. HDF is
being used, so my answer to "Why use HDF?" is that a) many data
providers supply that
format and b) using hdf makes software easier to write/maintain than
ad-hoc or application-specific formats.

In practice applications and OS distros are often very slow to update,
so unless you build the applications yourself, you get older
libraries. My work often involves collaboration with
people scattered around the globe at different organizations with
different IT standards
and update schedules.

To me, a robust workflow means: I can send a recipe to a colleague
and expect it to work the same way it did on my system. Many of the
other users have access to only one
platform (Mac OS X or linux). In practice, it is difficult to run
the workflow entirely on one platform. ATM I have some hdf4 files
that I can't open using hdfview on Ubuntu Linux (amd64, reported here
previously) and some that fail using some (widely available) versions
of h4toh5 on Mac OS X. Many open source tools implement only the
parts of the hdf formats that the
author needed, so not all hdf5 files can be read using, e.g., octave
or R. This means a workflow that works for one dataset may fail for
another apparently similar dataset.

Some key applications:

NASA SeaDAS -- large, freely available, system with sources, using IDL
runtime. Includes hdf4 and hdf5 libraries in the IDL runtime as well
as C+Fortran with different hdf4 and hdf5 libraries.
SeaDAS supports batch processing using C and Fortran programs, and
there is sometimes
a need to build locally modified versions. NASA uses Intel compilers
on linux and Mac OS X,
but not all sites have these, and there are sometimes problems getting
the programs to work using free compilers.

Macports for open source tools, including gdl, h4toh5, octave, ...
some tools still use HDF 1.6, and you can't install HDF 1.8
together with 1.6:

$ port info hdf5
hdf5 @1.6.9 (science)
Variants: fortran, g95, gcc42, gcc43, optimized, szip, threadsafe

Description: HDF5 general purpose library and file format for storing
scientific data
Homepage: The HDF5® Library & File Format - The HDF Group - ensuring long-term access and usability of HDF data and supporting users of HDF technologies

Library Dependencies: zlib
Conflicts with: hdf5-18
Platforms: darwin
License: unknown
Maintainers: openmaintainer@macports.org, jochen@macports.org

···

On Tue, Dec 1, 2009 at 3:25 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:

Hi Matt,

On Dec 1, 2009, at 1:16 PM, Matt Newville wrote:

George, Peter,

Thanks for the confirmation, and for the notice that the problem has
been seen before.

I don't think expecting users to ALWAYS make files read-only or to
ALWAYS open them in the non-default mode of "read-only" in a program
called ***View (it's not called HDFEdit!!!) is likely to be a
completely robust way to make sure that files are not altered in such
a way that they can no longer be read.
   It was a bug in the underlying HDF5 library which was re\-writing some metadata in the file, so it wasn&#39;t HDFView&#39;s fault\.\.\.
George N. White III wrote:

HDF-based workflows are not robust. It helps to have current versions and
support contracts so you can share the pain with your vendor, and most
non-trivial applications seem to end up using a private hdf library because
you can't rely on anyone else's compile to have the right combination of
options, 3rd party libs, etc.

Yikes. I thought having HDFView alter files was disturbing, but
hearing that "workflows are not robust" is not exactly encouraging....
I read the "Why use HDF?" statement on the hdfgroup web page:

Many HDF adopters have very large datasets, very fast access
requirements, or
very complex datasets. Others turn to HDF because it allows them
to easily share
data across a wide variety of computational platforms using
applications written in
different programming languages. Some use HDF to take advantage of the many
open-source and commercial tools that understand HDF.

To mean that HDF5 claims to be designed to be able to act as a base
file format for exchanging complex datasets between different computer
platforms and applications. So that, for example, one should be able
to write an HDF5 file on linux with language X and read that file on
Windows with applications Y and Z. This seems to work, at least
mostly. Perhaps I have optimistically interpreted the
advertisement??
   Hmm, I&#39;m curious why George would say that &quot;HDF\-based workflows are not robust&quot; also\.\.\.  We \_do\_ have a strong commitment to making them robust and have heard generally good reports on that front\.  In the cases where people have had concerns or issues, we have worked very hard to address them\.

--------------------------------
I think this is because both hdf5-1.6 and hdf5-18 provide binaries
with the same
names. I get around it, e.g., for h4toh5, using a local Portfile that adds
-DH5_USE_16_API, but the resulting program fails on some of the hdf files
we use, although the linux binaries from hdfgroup work.

BEAM (ESA remote sensing) requires Java 1.6, which is only 64-bit on Mac OS X,
The developers have not been able to come up with working hdf.
Note that hdfview
on Mac OS X uses java 1.5 (which is 32-bit).

In summary, the reason hdf workflows are not robust is mainly an issue of
system management and funding (to get support contracts so you can ask
suppliers to fix bugs in commercial tools). Where things fall apart is that
many users don't have neither time nor expertise to deal with the problems.
I often end up looking for a workaround, but I know that a whack of work is
needed to understand a glitch if I want others to be able to follow my recipe.

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

epourmal · December 1, 2009, 8:39pm

I am sorry. I should post information we got from the IDL people about two weeks ago a little bit earlier
Unfortunately the patch ITTVis released still doesn't address the issue since the patch is based on 1.8.3 and bug was fixed in 1.8.4.

I would suggest to contact IDL people and ask them for another patch. They should be able to address the issue pretty quickly.

Elena

···

========================
Subject: Patch for IDL 7.1 HDF5, netCDF4, CDF is now available

Dear IDL user,

In the past you have expressed interest in the IDL scientific data format support. A patch update to IDL 7.1/7.1.x is now available on the download page of our website at http://www.ittvis.com/Downloads/ProductDownloads.aspx\. The patch updates support for HDF5 1.8.3, netCDF4 and CDF 3.3. More details about this patch can be found in the release notes that are available with the download files.

We appreciate your continued use of IDL. Please let us know if you have any comments about this patch release. Thank you.

============================
On Dec 1, 2009, at 1:59 PM, Matt Newville wrote:

Hi Werner,

On Tue, Dec 1, 2009 at 1:38 PM, Werner Benger <werner@cct.lsu.edu> wrote:

I just can throw in in defense of HDF5 that I can still happily read
and deal with large HDF5 files that I've created 10+ years ago
(>tens of GB, which was immense at this time).

Yes, I was surprised to see the "non-robust workflow" comment. I'm not
convinced by it.

A problem might be though with commercial applications that ship
ancient HDF5 versions with them and have a slow update release
cycles or let their customers pay for such updates...

I would agree with that, but unfortunately this is not a case of
dealing with ancient HDF5 versions. The current version of IDL (7.1)
was released in May 2009. It ships with HDF5 1.6.7, which was
released in Jan 2008. HDF5 1.6.8 was released in Nov 2008, so that
IDL is between six and twelve months behind, although it did not jump
to the HDF5 1.8 series. For a vendor, I would hardly call that "way
behind" or ancient.

In any event, as we have seen, the cause of the problem in question
was actually the HDF5 1.8.* library, not IDL or the HDF5 1.6 library.
Perhaps in this case the vendor's conservative position was justified.

--Matt Newville <newville at cars.uchicago.edu>

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

dsdale24 · December 1, 2009, 8:10pm

I think of a basic hdf5 workflow as loading data from an hdf5 file,
manipulating or processing it, and perhaps saving data back to an hdf5
file. I interpret "robust" to mean "can be relied upon".

···

On Tue, Dec 1, 2009 at 3:03 PM, Daniel Kahn <daniel_kahn@ssaihq.com> wrote:

Yikes. I thought having HDFView alter files was disturbing, but
hearing that "workflows are not robust" is not exactly encouraging....

I'd be interested to know what a "workflow" is and what it means to be
"robust". I'm not trained in software engineering. Is it a technical term?
Anyone?

werner · December 2, 2009, 2:52pm

This is true, in the same sense as .avi movies or .tiff files don't
provide a robust workflow. Not every file in these formats can be
read by any application supporting these formats, as they are rather
containers, not full specifications of the contents. It is indeed
often misleading that many people think that by specifying data in
HDF5 format, all problems are solved. Rather, it does also depend in
which layout they are stored in HDF5, and if this specific layout
is supported by the concrete applications. In this sense, a workflow
can only be robust among "compatible" applications, which includes
agreement on a common data layout scheme among them. For which HDF5
is a great essential pre-requisite, but not yet the final solution.

Werner

···

On Wed, 02 Dec 2009 07:25:16 -0600, George N. White III <gnwiii@gmail.com> wrote:

On Tue, Dec 1, 2009 at 3:25 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:

Hi Matt,

On Dec 1, 2009, at 1:16 PM, Matt Newville wrote:

George N. White III wrote:

HDF-based workflows are not robust. It helps to have current versions and
support contracts so you can share the pain with your vendor, and most
non-trivial applications seem to end up using a private hdf library because
you can't rely on anyone else's compile to have the right combination of
options, 3rd party libs, etc.

Yikes. I thought having HDFView alter files was disturbing, but
hearing that "workflows are not robust" is not exactly encouraging....
  I read the "Why use HDF?" statement on the hdfgroup web page:

   Many HDF adopters have very large datasets, very fast access
requirements, or
   very complex datasets. Others turn to HDF because it allows them
to easily share
   data across a wide variety of computational platforms using
applications written in
   different programming languages. Some use HDF to take advantage of the many
   open-source and commercial tools that understand HDF.

To mean that HDF5 claims to be designed to be able to act as a base
file format for exchanging complex datasets between different computer
platforms and applications. So that, for example, one should be able
to write an HDF5 file on linux with language X and read that file on
Windows with applications Y and Z. This seems to work, at least
mostly. Perhaps I have optimistically interpreted the
advertisement??

       Hmm, I'm curious why George would say that "HDF-based workflows are not robust" also... We _do_ have a strong commitment to making them robust and have heard generally good reports on that front. In the cases where people have had concerns or issues, we have worked very hard to address them.

I think thg does excellent work, and the lack of robustness is a
reflection of the
difficulty of the goals in "Why use HDF?" in today's computing
environments. HDF is
being used, so my answer to "Why use HDF?" is that a) many data
providers supply that
format and b) using hdf makes software easier to write/maintain than
ad-hoc or application-specific formats.

In practice applications and OS distros are often very slow to update,
so unless you build the applications yourself, you get older
libraries. My work often involves collaboration with
people scattered around the globe at different organizations with
different IT standards
and update schedules.

To me, a robust workflow means: I can send a recipe to a colleague
and expect it to work the same way it did on my system. Many of the
other users have access to only one
platform (Mac OS X or linux). In practice, it is difficult to run
the workflow entirely on one platform. ATM I have some hdf4 files
that I can't open using hdfview on Ubuntu Linux (amd64, reported here
previously) and some that fail using some (widely available) versions
of h4toh5 on Mac OS X. Many open source tools implement only the
parts of the hdf formats that the
author needed, so not all hdf5 files can be read using, e.g., octave
or R. This means a workflow that works for one dataset may fail for
another apparently similar dataset.

--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

gnwiii · December 2, 2009, 1:44pm

Hi Werner,

I just can throw in in defense of HDF5 that I can still happily read
and deal with large HDF5 files that I've created 10+ years ago
(>tens of GB, which was immense at this time).

Yes, I was surprised to see the "non-robust workflow" comment. I'm not
convinced by it.

A problem might be though with commercial applications that ship
ancient HDF5 versions with them and have a slow update release
cycles or let their customers pay for such updates...

I would agree with that, but unfortunately this is not a case of
dealing with ancient HDF5 versions. The current version of IDL (7.1)
was released in May 2009. It ships with HDF5 1.6.7, which was
released in Jan 2008. HDF5 1.6.8 was released in Nov 2008, so that
IDL is between six and twelve months behind, although it did not jump
to the HDF5 1.8 series. For a vendor, I would hardly call that "way
behind" or ancient.

There are also lags at any sites getting/installing updates, which may
be tied to fiscal periods, etc. In my field (ocean remote sensing) NASA
SeaDAS and ESA BEAM are two key applications. End users are tied
to the release schedules of those development groups. An update to
SeaDAS has been expected "real soon now" for months, but the current
version is using IDL 7.0, although many sites with IDL licenses have updated
to 7.06, considered a "safe" upgrade for an app written for 7.0.

In any event, as we have seen, the cause of the problem in question
was actually the HDF5 1.8.* library, not IDL or the HDF5 1.6 library.
Perhaps in this case the vendor's conservative position was justified.

If everyone waits for someone else to try the new version and find the
bugs progress is slowed. Systems like macports, openpkg, etc should,
in principle, make it simple to apply updates to open souce tools, but in
practice are getting stuck with some apps depending on hdf5-1.6 and
others on hdf5-1.8.

There are times when it would be nice to be able to switch between
static and dynamic linking, e.g., so you can have one tool that
needs a newer library version without waiting for the linux distro to
catch up.

···

On Tue, Dec 1, 2009 at 3:59 PM, Matt Newville <newville@cars.uchicago.edu> wrote:

On Tue, Dec 1, 2009 at 1:38 PM, Werner Benger <werner@cct.lsu.edu> wrote:

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

newville · December 2, 2009, 5:42pm

I find Werner's and George's remarks disappointing. As Werner says,
HDF5 cannot be sufficient and applications must agree on what to do
with the contents of a datafile: the HDF5 format supplies syntax for
data containers, but the applications need to agree on the semantics.
That is not my concern.

If I read George's comments (and the comments of others in this
thread) correctly, the reason for "non-robust workflows" (in the sense
that an HDF5 file created somewhere may not be usable everywhere) is
taken to be that different applications use different versions of the
library, which have different APIs. But the HDF Group releases
different versions at a rate faster than 1 per year, and claim to
support a 1.6* series and a 1.8.* series. It would seem to be that
differences in library versions would be expected. Indeed, when I
read the Compatibility consideration pages, I see that forward and
backward compatibility are (as far as possible) design goals, though
the tables shown with "may create forward-compatibility conflicts"
suggests that nearly everything may create a conflict.

The suggested remedies for problems from different versions are a)
contact the vendor to support a specific version of the API, and b)
choose carefully which versions of the API (and applications) to use.
These suggestion do, then, acknowledge that an HDF5 file created
somewhere may not be usable everywhere. The suggestions are not very
practical solutions for files intended to exchange and store data such
that they can be shared widely across platforms, applications, and
time. Neither are "support contracts": If I create a data file and
two years later colleague A on continent B cannot read that file into
analysis application C written with Programming language D by person
E, which person needs the support contract, and from whom?

No one has explained a simple method to repair a file that has been
corrupted by HDFView2.5 with HDF5 1.8.2 or to even detect the
corruption: None of the h5 utilities detected it. It is pretty clear
to me that there is not much testing that happens for file
compatibilities across the **supported** API versions. I apologize
for sounding unappreciative fo the work done, and feel there is no
better alternative to HDF5, but is this really considered excellent?

I believe my other question ("Can one detect which version of the API
was used to write a file") and its response ("No") actually has some
importance here. There are non-trivial differences between the 1.6
and 1.8 formats and APIs and, apparently, no practical way to test
for these changes or whether an object in a file can be read by a
specific library. This is described as a feature, not a bug, which I
find astonishing. If version information existed for files or
objects, many of the problems of version mismatch would at least be
detectable, so that an application could at least say "sorry I can't
read a link object". That there were format differences between 1.6
and 1.8 that may create conflicts suggests that this will happen
again. That a version system does not not exist, suggests that there
were no lessons learned from the introduction of non-detectable
conflicts.

Of course, version information would not solve the corruption of
data files, but any significant testing would have detected that.

This is all somewhat disappointing. I believe that using HDF5 is
probably the only viable format for exchanging large scientific data
sets, but I am not excited about the idea of having to continually
fight version problems due to poor design and sloppy practices.

--Matt Newville <newville at cars.uchicago.edu>

werner · December 2, 2009, 6:03pm

I did not mean my comment to be disappointing, rather stating out the
fact that HDF5 is as general as a file system (actually, even more than
just that), so you can't expect more from it than what a file system
provides: a platform that provides a syntax to describe data. And HDF5
does that in the best known way (in my view). It however does not solve
the semantics problem across applications that have different expectations
on the properties of the data which they want to deal with, in the same
manner as an application that can read a specific file, can't necessarily
read any kind of file generically. This I would rather consider to be an
issue of awareness, not of design problems. Maybe it would help to not
call all HDF5 files with ".h5" extension to be more clear on differences.

Version issues, this could possibly be improved, by adding some versioning
information to each object, or group. I'm doing such in my layer on top
of HDF5, such that some groups can be in an "old" data layout, but data
in a new layout can be added, and it's identifiable which parts have been
written by which version of the library. I like this approach as well rather
than a "global" version number, but it could well be mixed, but computing
a "global version range" based on such "local version numbers".

Software bugs, can't comment on that. It just happens...

Werner

···

On Wed, 02 Dec 2009 11:42:24 -0600, Matt Newville <newville@cars.uchicago.edu> wrote:

I find Werner's and George's remarks disappointing. As Werner says,
HDF5 cannot be sufficient and applications must agree on what to do
with the contents of a datafile: the HDF5 format supplies syntax for
data containers, but the applications need to agree on the semantics.
That is not my concern.

If I read George's comments (and the comments of others in this
thread) correctly, the reason for "non-robust workflows" (in the sense
that an HDF5 file created somewhere may not be usable everywhere) is
taken to be that different applications use different versions of the
library, which have different APIs. But the HDF Group releases
different versions at a rate faster than 1 per year, and claim to
support a 1.6* series and a 1.8.* series. It would seem to be that
differences in library versions would be expected. Indeed, when I
read the Compatibility consideration pages, I see that forward and
backward compatibility are (as far as possible) design goals, though
the tables shown with "may create forward-compatibility conflicts"
suggests that nearly everything may create a conflict.

The suggested remedies for problems from different versions are a)
contact the vendor to support a specific version of the API, and b)
choose carefully which versions of the API (and applications) to use.
These suggestion do, then, acknowledge that an HDF5 file created
somewhere may not be usable everywhere. The suggestions are not very
practical solutions for files intended to exchange and store data such
that they can be shared widely across platforms, applications, and
time. Neither are "support contracts": If I create a data file and
two years later colleague A on continent B cannot read that file into
analysis application C written with Programming language D by person
E, which person needs the support contract, and from whom?

No one has explained a simple method to repair a file that has been
corrupted by HDFView2.5 with HDF5 1.8.2 or to even detect the
corruption: None of the h5 utilities detected it. It is pretty clear
to me that there is not much testing that happens for file
compatibilities across the **supported** API versions. I apologize
for sounding unappreciative fo the work done, and feel there is no
better alternative to HDF5, but is this really considered excellent?

I believe my other question ("Can one detect which version of the API
was used to write a file") and its response ("No") actually has some
importance here. There are non-trivial differences between the 1.6
and 1.8 formats and APIs and, apparently, no practical way to test
for these changes or whether an object in a file can be read by a
specific library. This is described as a feature, not a bug, which I
find astonishing. If version information existed for files or
objects, many of the problems of version mismatch would at least be
detectable, so that an application could at least say "sorry I can't
read a link object". That there were format differences between 1.6
and 1.8 that may create conflicts suggests that this will happen
again. That a version system does not not exist, suggests that there
were no lessons learned from the introduction of non-detectable
conflicts.

Of course, version information would not solve the corruption of
data files, but any significant testing would have detected that.

This is all somewhat disappointing. I believe that using HDF5 is
probably the only viable format for exchanging large scientific data
sets, but I am not excited about the idea of having to continually
fight version problems due to poor design and sloppy practices.

--Matt Newville <newville at cars.uchicago.edu>

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Quincey_Koziol · December 2, 2009, 6:12pm

Hi Matt,

I find Werner's and George's remarks disappointing. As Werner says,
HDF5 cannot be sufficient and applications must agree on what to do
with the contents of a datafile: the HDF5 format supplies syntax for
data containers, but the applications need to agree on the semantics.
That is not my concern.

OK, good, sometimes people stumble at this point and expect "magic" to happen at the "semantic" level just because they are using HDF5 at the "syntax" level.

If I read George's comments (and the comments of others in this
thread) correctly, the reason for "non-robust workflows" (in the sense
that an HDF5 file created somewhere may not be usable everywhere) is
taken to be that different applications use different versions of the
library, which have different APIs. But the HDF Group releases
different versions at a rate faster than 1 per year, and claim to
support a 1.6* series and a 1.8.* series. It would seem to be that
differences in library versions would be expected. Indeed, when I
read the Compatibility consideration pages, I see that forward and
backward compatibility are (as far as possible) design goals, though
the tables shown with "may create forward-compatibility conflicts"
suggests that nearly everything may create a conflict.

I completely agree with you, it's unfortunate that introducing new features that require file format changes generates this issue. We have worked hard to minimize the effects by always defaulting to the most backward compatible version of the file format possible to encode the feature that an application uses. So, an application that used a particular set of API routines and gets migrated to a new version of the HDF5 library will continue to create files that use the same file format version, as long as no new API routines are used. (modulo bugs, of course)

Also, we have spaced out the major releases (1.6.0, 1.8.0, the upcoming 1.10.0, etc.) by several years, with the interim releases only adding supporting/minor API routines and providing bug fixes. Hopefully, that's not a concern that you'll need to worry about.

The suggested remedies for problems from different versions are a)
contact the vendor to support a specific version of the API, and b)
choose carefully which versions of the API (and applications) to use.
These suggestion do, then, acknowledge that an HDF5 file created
somewhere may not be usable everywhere. The suggestions are not very
practical solutions for files intended to exchange and store data such
that they can be shared widely across platforms, applications, and
time. Neither are "support contracts": If I create a data file and
two years later colleague A on continent B cannot read that file into
analysis application C written with Programming language D by person
E, which person needs the support contract, and from whom?

You have outlined one of the difficulties, yes. However, we have designed the file format and library to always support reading earlier versions of objects in the file format and therefore all the files that have been produced with earlier versions of the library will always be able to be read by later versions of the library.

So, the main problem for application developers is when they are sending a file produced with new features on a new library to someone else who is using an older version of the library. Again, there's not much we can do in this case - one user wanted and used a feature that isn't available in the earlier version of the library that the second user has.

No one has explained a simple method to repair a file that has been
corrupted by HDFView2.5 with HDF5 1.8.2 or to even detect the
corruption: None of the h5 utilities detected it. It is pretty clear
to me that there is not much testing that happens for file
compatibilities across the **supported** API versions. I apologize
for sounding unappreciative fo the work done, and feel there is no
better alternative to HDF5, but is this really considered excellent?

Maybe I missed this in your earlier messages, but was the file actually corrupted by HDFView? Is it no longer readable? If so, could you provide access to a copy of it, so we can work with you to address the issue.

You are correct about testing between supported versions though, I think that's an area we could stand to improve in.

I believe my other question ("Can one detect which version of the API
was used to write a file") and its response ("No") actually has some
importance here. There are non-trivial differences between the 1.6
and 1.8 formats and APIs and, apparently, no practical way to test
for these changes or whether an object in a file can be read by a
specific library. This is described as a feature, not a bug, which I
find astonishing. If version information existed for files or
objects, many of the problems of version mismatch would at least be
detectable, so that an application could at least say "sorry I can't
read a link object". That there were format differences between 1.6
and 1.8 that may create conflicts suggests that this will happen
again. That a version system does not not exist, suggests that there
were no lessons learned from the introduction of non-detectable
conflicts.

The version information is contained in the files, that's not a problem. What I think you want is a utility that will check the format of each file to verify the version of the objects within it, which I mentioned is on our development path. Is there something else you'd like to see?

Of course, version information would not solve the corruption of
data files, but any significant testing would have detected that.

This is all somewhat disappointing. I believe that using HDF5 is
probably the only viable format for exchanging large scientific data
sets, but I am not excited about the idea of having to continually
fight version problems due to poor design and sloppy practices.

I appreciate your feedback here and hope you stay active on the forum and continue to provide it. Here's some of the things I think you are asking for:

- A utility which checks a file to determine if it can be read by a particular version of the HDF5 library.

- Better compatibility testing to read files produced with one version of the library with another version.

Are there other suggestions you would add?

Quincey

···

On Dec 2, 2009, at 11:42 AM, Matt Newville wrote:

gnwiii · December 2, 2009, 6:26pm

I find Werner's and George's remarks disappointing. As Werner says,
HDF5 cannot be sufficient and applications must agree on what to do
with the contents of a datafile: the HDF5 format supplies syntax for
data containers, but the applications need to agree on the semantics.
That is not my concern.

If I read George's comments (and the comments of others in this
thread) correctly, the reason for "non-robust workflows" (in the sense
that an HDF5 file created somewhere may not be usable everywhere) is
taken to be that different applications use different versions of the
library, which have different APIs. But the HDF Group releases
different versions at a rate faster than 1 per year, and claim to
support a 1.6* series and a 1.8.* series. It would seem to be that
differences in library versions would be expected. Indeed, when I
read the Compatibility consideration pages, I see that forward and
backward compatibility are (as far as possible) design goals, though
the tables shown with "may create forward-compatibility conflicts"
suggests that nearly everything may create a conflict.

I think these issues are a fact of life with something as complex as hdf.
Some of the problems are bugs, not just in hdf but in compilers and
libraries. Some of the differences are because hdf libraries are
built with different options (e.g., szip, fortran support).

The problem of versions is far from simple (sometimes I think the
md5sum is the only reliable version number)

The version of THG releases is only part of the story, you would need
to capture not just the version of the hdf library, but those of 3rd party
libraries and compilers.

The suggested remedies for problems from different versions are a)
contact the vendor to support a specific version of the API, and b)
choose carefully which versions of the API (and applications) to use.
These suggestion do, then, acknowledge that an HDF5 file created
somewhere may not be usable everywhere. The suggestions are not very
practical solutions for files intended to exchange and store data such
that they can be shared widely across platforms, applications, and
time. Neither are "support contracts": If I create a data file and
two years later colleague A on continent B cannot read that file into
analysis application C written with Programming language D by person
E, which person needs the support contract, and from whom?

No one has explained a simple method to repair a file that has been
corrupted by HDFView2.5 with HDF5 1.8.2 or to even detect the
corruption: None of the h5 utilities detected it. It is pretty clear
to me that there is not much testing that happens for file
compatibilities across the **supported** API versions. I apologize
for sounding unappreciative fo the work done, and feel there is no
better alternative to HDF5, but is this really considered excellent?

THG can only test a small subset of the cases. It takes real applications
and real users to fully explore all the possibilities, but in fact many
developers are very conservative about updating because testing even
for one application is a big job.

I believe my other question ("Can one detect which version of the API
was used to write a file") and its response ("No") actually has some
importance here. There are non-trivial differences between the 1.6
and 1.8 formats and APIs and, apparently, no practical way to test
for these changes or whether an object in a file can be read by a
specific library. This is described as a feature, not a bug, which I
find astonishing. If version information existed for files or
objects, many of the problems of version mismatch would at least be
detectable, so that an application could at least say "sorry I can't
read a link object". That there were format differences between 1.6
and 1.8 that may create conflicts suggests that this will happen
again. That a version system does not not exist, suggests that there
were no lessons learned from the introduction of non-detectable
conflicts.

Of course, version information would not solve the corruption of
data files, but any significant testing would have detected that.

This is all somewhat disappointing. I believe that using HDF5 is
probably the only viable format for exchanging large scientific data
sets, but I am not excited about the idea of having to continually
fight version problems due to poor design and sloppy practices.

I don't think it fair to accuse anyone involved of "poor design and
sloppy practices": pioneering work is crude. I tell people hdf workflows
are fragile because I think it is true and because it alerts them to take
precautions (making important files read-only, using "open with read-access",
checking error returns, etc.). Where "poor design and sloppy practices"
come in is the industry standard practices that result in each vendor
finding it necessary to install "private" versions of widely used libraries.
One Apple system here has hdf5 libraries provided by:

IDL: idl_hdf5.so
Octave: libhdf5.so
macports: libhdf5.so
NASA SeaDAS: libhdf5.a (as well as the IDL version)
Python: libhdf5.a, libhdf5.dylib
THG (hdfview): libjhdf5.jnilib

Anyone who has built the libraries on a range of platforms can tell you
it can be a bit of a struggle (including filing bug reports with your compiler
vendor and then getting the sys. admin to install the fixes when they arrive).

···

On Wed, Dec 2, 2009 at 1:42 PM, Matt Newville <newville@cars.uchicago.edu> wrote:

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

newville · December 2, 2009, 6:48pm

No one has explained a simple method to repair a file that has been
corrupted by HDFView2.5 with HDF5 1.8.2 or to even detect the
corruption: None of the h5 utilities detected it. It is pretty clear
to me that there is not much testing that happens for file
compatibilities across the **supported** API versions. I apologize
for sounding unappreciative fo the work done, and feel there is no
better alternative to HDF5, but is this really considered excellent?

Maybe I missed this in your earlier messages, but was the file actually corrupted by
HDFView? Is it no longer readable? If so, could you provide access to a copy of it, so we
can work with you to address the issue.

Yes. HDFView2.5 corrupts HDF5 data files (whether created with h5py
and v1.8.* or IDL7.0) in such a way that they can not be read in IDL
at all.
   In IDL7.0, h5f_open() on a file that has been opened by HDFViewer fails.
   In IDL 6.3, h5f_open() on such a file crashes IDL.
   In IDL7.0, "x = h5_browser()" on such a file crashes IDL.

The file is altered in "the header" (sorry, I am new to hdf5 so do not
know the layout of the file), between bytes 160 and 192. More details
and small files prior to and after reading with HDFView2.5 to
demonstrate the problem are at
http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes/HDF5AndIDL

The version information is contained in the files, that's not a problem.

Where? I thought that yesterday you said that not being able to
detect version numbers of a file was a feature, as multiple
applications may write to a single file. Perhaps I misunderstood you.
I don't see it, and none of the utilities report file versions or
object versions....

What I think you want is a utility that will check the format of each file to verify the version
of the objects within it, which I mentioned is on our development path. Is there something
else you'd like to see?

Rather than a separate utility, I would prefer to see that as each
object is opened for reading (by h5*_open), that a version number or
unique id tagging the version of that object would be read. If the
object ID is not recognized, it would be detectable as a "future
feature" that the library may not be able to use, so that the h5*_open
could fail gracefully. Perhaps this exists already? If different
APIs and formats are expected to be interchangeable, it seems like
you'd need some sort of check like this, no?

--Matt Newville <newville at cars.uchicago.edu> 630-252-0431

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

HDFView altering files such that they cannot be read with IDL