A number of small problems

I've recently been creating large-ish HDF5 files (~45GiB) from CSV data
with h5py. During the course of this conversion, I've found a few odd
HDF5 issues I thought I should pass on:

1) "long double" / f16 float types are not viewable with HDFView 3.0.
The data type is shown as "Unknown" with all values displayed as
"*unknown*". It would be nice if it could support viewing of such data.

2) h5dump seems to print all floating-point types with only around
"float" / f4 precision by default: double and long double (f8 and f16)
have their extra precision truncated in the output. Is there a flag to
tell h5dump to retain the full precision for each size of float? I can
see there's a "-m" flag, but this affects the display of all float
types irrespective of size. Is there a way to have it dump the full
value of every float by default?
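For example (assuming I have the -m syntax right):

% h5dump -m "%.17g" idr016.h5

does show the full precision of f8 values, but applies the same format
to f4 values as well.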

3) Opening a 45GB (uncompressed) dataset with HDFView 3.0 causes the
program to freeze in a busy state ~indefinitely. I'm not sure why it
needs to read in huge quantities of data when it could restrict itself
to what's viewable on the screen and fetch data as you scroll around.
Whatever it's doing right now doesn't seem very scalable.

4) Worse, killing HDFView while it is hung opening the dataset from (3)
corrupts the file, making it unreadable(!), even though no changes were
made:

% ls -l idr*.h5
-rw-rw-r-- 1 rleigh rleigh 17444765812 Jan 5 09:54 idr016-broken.h5
-rwxrwx--- 1 rleigh rleigh 17444765812 Jan 5 10:55 idr016.h5

% cmp -l idr016.h5 idr016-broken.h5 | gawk '{printf "%08X %02X %02X\n",
$1, strtonum(0$2), strtonum(0$3)}'
0000000C 00 01
0000002D A6 67
0000002E 31 19
0000002F CC 52
00000030 19 92

% h5ls -r idr016.h5
/ Group
/Images Dataset {69120/Inf}
/Objects Dataset {5880380/Inf}

% h5ls -r idr016-broken.h5
idr016-broken.h5: unable to open file

Is this a known issue? It seems quite bad; I wouldn't want to
permanently lose data simply by opening a dataset in a viewer tool.
I'm surprised any changes were made at all, since all I did was open
the dataset. What is the reason for the viewer to make changes to the
file in this case? I would expect it to be robust in the face of
unexpected termination, power loss, network interruption, etc.

I've attached the overall structure of the HDF file in case this is
useful, from "h5dump -BHip". If you want to take a look at the original
and uncorrupted files, I can upload them somewhere.

Thanks,
Roger

Attachment: structure (67 KB)


--
Dr Roger Leigh -- Open Microscopy Environment
Wellcome Trust Centre for Gene Regulation and Expression,
College of Life Sciences, University of Dundee, Dow Street,
Dundee DD1 5EH Scotland UK Tel: (01382) 386364

The University of Dundee is a registered Scottish Charity, No: SC015096

Hi Roger,

I can at least answer your HDFView questions, and will let someone more familiar with h5dump's behavior explain what is going on there.

For your first question: some architectural changes were introduced in HDFView 3 in order to better support very complex datatypes in the future, such as compound of compound, array of compound, and so on. Due to time constraints, I was only able to add support for some of the more basic types, but support for more datatypes is planned when further work can be done on HDFView.
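By "complex datatypes" I mean things like the following (illustrative C
only; a compound type that embeds another compound type):

#include "hdf5.h"

typedef struct { double x, y; } point_t;
typedef struct { point_t centre; double radius; } circle_t;

int main(void)
{
    /* inner compound */
    hid_t pt = H5Tcreate(H5T_COMPOUND, sizeof(point_t));
    H5Tinsert(pt, "x", HOFFSET(point_t, x), H5T_NATIVE_DOUBLE);
    H5Tinsert(pt, "y", HOFFSET(point_t, y), H5T_NATIVE_DOUBLE);

    /* outer compound embedding the inner one */
    hid_t circ = H5Tcreate(H5T_COMPOUND, sizeof(circle_t));
    H5Tinsert(circ, "centre", HOFFSET(circle_t, centre), pt);
    H5Tinsert(circ, "radius", HOFFSET(circle_t, radius),
              H5T_NATIVE_DOUBLE);

    H5Tclose(circ);
    H5Tclose(pt);
    return 0;
}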

As to your question about the 45GB dataset, this is an unfortunate problem that dates back all the way to when HDFView was young. The Java object library that sits between the HDF5 C library and HDFView is responsible for setting up which portions of a dataset to read, and it has always defaulted to reading the entire dataset when no subsetting is requested.

Since the object library is meant to be entirely abstracted from applications, the solution is to add an interface through which an application can ask for only the portions of a dataset that are currently visible, without tying the code to any one application such as HDFView in particular. As this would entail a fairly significant amount of design and development, that work has not yet gone forward.

That said, I'm surprised you don't hit an out-of-memory exception. Some work was done a little while back to strengthen HDFView's handling of these large reads; in theory it should stop trying to load the entire dataset and tell the user what happened. I'm assuming the machine you're running HDFView on does not have upwards of 45GB of memory, so it is rather interesting that you're hitting a case where this check is bypassed.
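Coming back to the subsetting point: in HDF5 terms, fetching just the
visible window is a plain hyperslab read. A sketch of what the object
library would do underneath (all names invented for illustration; error
checking omitted):

#include "hdf5.h"

/* Read only the window of rows currently visible in the table view. */
static void read_visible(hid_t dset, hsize_t first_row, hsize_t nrows,
                         hsize_t ncols, double *buf)
{
    hsize_t start[2] = {first_row, 0};
    hsize_t count[2] = {nrows, ncols};

    /* select just the visible rows in the file dataspace */
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* memory dataspace sized to the window, not the whole dataset */
    hid_t mspace = H5Screate_simple(2, count, NULL);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
}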

As for the file corruption, this is likely a side effect of an elusive bug that has existed in HDFView for a while now. The root cause hasn't been found yet, but it has been noticed that occasionally simply opening a file and closing it later appears to modify the HDF5 file. I imagine the change being made is something metadata-related, which would explain why killing the process corrupts the file. Could you rerun h5ls on the corrupted file with the additional "--enable-error-stack" command-line parameter, so that we can at least capture and keep a record of the specific error that HDF5 runs into when it tries to open the file? That may be very helpful in debugging the root problem.

Unfortunately, all three of these issues are well known to us, but for lack of funding to work on them they have mostly sat on the back burner.

Thanks,

Jordan


Thanks for the detailed explanations, Jordan. All understood and
appreciated.

Rerunning h5ls on the corrupted file with the additional
--enable-error-stack parameter isn't showing any extra detail, I'm
afraid:

% h5ls -r --enable-error-stack idr016-orig.h5
/ Group
/Images Dataset {69120/Inf}
/Objects Dataset {5880380/Inf}

% h5ls -r --enable-error-stack idr016-broken.h5
idr016-broken.h5: unable to open file

% h5ls --enable-error-stack idr016-broken.h5
idr016-broken.h5: unable to open file

Kind regards,
Roger



Ah yes, after taking a quick glance at the h5ls source I see why this doesn't help. h5ls tries to open the file using each of the available file drivers until one succeeds or the list is exhausted. Since it expects failures when attempting to open the file with an incorrect driver, it explicitly disables the error stack during the file open, which is why you don't see the error stack for the true failure. Not a very elegant solution, but I can see why it was done.
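Roughly, the pattern looks like this (heavily simplified; fapls stands
in for the per-driver file access property lists that h5ls builds):

#include "hdf5.h"

static hid_t open_with_any_driver(const char *fname,
                                  const hid_t *fapls, size_t ndrivers)
{
    H5E_auto2_t old_func;
    void       *old_data;
    hid_t       fid = -1;
    size_t      i;

    /* save, then disable, automatic error printing: failures are
       expected while probing with the wrong driver */
    H5Eget_auto2(H5E_DEFAULT, &old_func, &old_data);
    H5Eset_auto2(H5E_DEFAULT, NULL, NULL);

    for (i = 0; i < ndrivers && fid < 0; i++)
        fid = H5Fopen(fname, H5F_ACC_RDONLY, fapls[i]);

    /* restore printing; the stack from the final failed open has
       already been discarded, hence no diagnostics from h5ls */
    H5Eset_auto2(H5E_DEFAULT, old_func, old_data);
    return fid;
}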

If you are familiar with editing the HDF5 source code and rebuilding, I can show you where the issue is in the h5ls source. Otherwise, I may need to think about the best way to upload a large file somewhere so that we can have a look at it.


I just wrote a tiny test program calling H5Fopen and got this:

% ./read idr016-broken.h5
Open HDF5 file 'idr016-broken.h5' for reading
HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) thread 140201515243328:
   #000: ../../../src/H5F.c line 579 in H5Fopen(): unable to open file
     major: File accessibilty
     minor: Unable to open file
   #001: ../../../src/H5Fint.c line 1297 in H5F_open(): file is already
open for write (may use <h5clear file> to clear file consistency flags)
     major: File accessibilty
     minor: Unable to open file
Open dataset 'Simple/2DArray'
HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) thread 140201515243328:
   #000: ../../../src/H5D.c line 286 in H5Dopen2(): not a location
     major: Invalid arguments to routine
     minor: Inappropriate type
   #001: ../../../src/H5Gloc.c line 253 in H5G_loc(): invalid object ID
     major: Invalid arguments to routine
     minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) thread 140201515243328:
   #000: ../../../src/H5D.c line 449 in H5Dget_type(): not a dataset
     major: Invalid arguments to routine
     minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) thread 140201515243328:
   #000: ../../../src/H5T.c line 1723 in H5Tclose(): not a datatype
     major: Invalid arguments to routine
     minor: Inappropriate type
Retrieve type information: Close dataset 'Objects'
HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) thread 140201515243328:
   #000: ../../../src/H5D.c line 334 in H5Dclose(): not a dataset
     major: Invalid arguments to routine
     minor: Inappropriate type
Close HDF5 file
HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) thread 140201515243328:
   #000: ../../../src/H5F.c line 749 in H5Fclose(): not a file ID
     major: Invalid arguments to routine
     minor: Inappropriate type
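In outline, the test does no more than the following; once H5Fopen has
failed, every subsequent call fails with "Invalid arguments to
routine", as seen above:

#include <stdio.h>
#include "hdf5.h"

int main(int argc, char *argv[])
{
    printf("Open HDF5 file '%s' for reading\n", argv[1]);
    hid_t file = H5Fopen(argv[1], H5F_ACC_RDONLY, H5P_DEFAULT);

    printf("Open dataset 'Simple/2DArray'\n");
    hid_t dset = H5Dopen2(file, "Simple/2DArray", H5P_DEFAULT);

    /* retrieve and release the datatype, then tidy up */
    hid_t dtype = H5Dget_type(dset);
    H5Tclose(dtype);
    H5Dclose(dset);

    printf("Close HDF5 file\n");
    H5Fclose(file);
    return 0;
}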

As recommended, calling "h5clear -s" did fix the problem and then
permitted opening of the file. It would be nice if such a problem were
reported properly by the various tools, so that the caller could be
told what the problem is. It would also be nice if HDFView could
explicitly open a file read-only, to avoid any changes being made.
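For the record, the fix was simply:

% h5clear -s idr016-broken.h5

after which h5ls and the test program above could open the file again.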

By the way, looking at tools/src/misc/CMakeLists.txt, the h5clear
binary isn't installed, although it is installed by Makefile.am. Would
it be possible to correct this so that it's installed in all cases?

Thanks,
Roger


Hi Roger,

Does using HDFView's File -> Open Read-Only menu item help with the issue of changes being made to the file? This should cause HDFView to open the file with H5Fopen's H5F_ACC_RDONLY flag, which should at least avoid this particular problem.
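Schematically, the only difference is the access flag passed at open
time (a sketch only; I would expect a read-only open to leave the
superblock status flags untouched):

#include "hdf5.h"

int main(void)
{
    /* File -> Open: read-write, which marks the file as open for
       writing in the superblock status flags */
    hid_t f_rw = H5Fopen("idr016.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    H5Fclose(f_rw);

    /* File -> Open Read-Only: the status flags should be left alone,
       so an unclean exit cannot leave them set */
    hid_t f_ro = H5Fopen("idr016.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    H5Fclose(f_ro);
    return 0;
}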

I've entered a bug report against the tools so that we can keep track of their failure to report file-open errors, and hopefully we will be able to address it sooner rather than later. I'll also go ahead and mention the disparity between CMakeLists.txt and Makefile.am for h5clear; that should be a simple fix.

Thanks,

Jordan


Yes, that should definitely help. It might be worth adding Open
Read-Only to the toolbar to raise its prominence, since the read-write
open is currently the only open action there.

Thanks,
Roger
