Hi Roger,
I can at least answer your HDFView questions and will let someone more familiar with h5dump's behavior explain what is going on there.
For your first question, some architectural changes were introduced in HDFView 3 to better support very complex datatypes in the future, such as compound of compound, array of compound, etc. Due to time constraints, I was only able to add support for some of the more basic types, but support for more datatypes is planned once more work can be done on HDFView.
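For reference, these are the kinds of types I mean. A nested structured dtype in h5py, for example, produces a compound-of-compound with an array-of-compound field (the file and field names here are just an illustration):

    import numpy as np
    import h5py

    # An inner compound nested inside an outer compound, plus an
    # array-of-compound field -- the sorts of types HDFView 3 does not
    # fully display yet.
    inner = np.dtype([("x", "f8"), ("y", "f8")])
    outer = np.dtype([("name", "S16"),
                      ("point", inner),           # compound of compound
                      ("samples", inner, (4,))])  # array of compound

    with h5py.File("nested_example.h5", "w") as f:
        f.create_dataset("nested", shape=(10,), dtype=outer)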
As to your second question, this is an unfortunate problem that dates back to when HDFView was young. The Java object library that sits between the HDF5 C library and HDFView is responsible for setting up which portions of a dataset to read, and it has always defaulted to reading the entire dataset when no subsetting is done. Since the object library is meant to be entirely abstracted from applications, the solution is to add an interface through which the object library can read only the portion of a dataset that is visible to an application, without tying the code to any one application such as HDFView in particular. As this entails a fairly significant amount of design and development, that work has not moved forward yet.

However, I have to say I'm surprised you don't encounter an out-of-memory exception. Some work was done a while back to strengthen HDFView's ability to catch these large data read cases; in theory it should at least stop trying to load the entire dataset and let the user know what happened. I'm assuming the machine you're running HDFView on does not have upwards of 45 GB of memory, so it is rather interesting to me that you're hitting a case where this check is bypassed.
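To make the subsetting idea concrete, below is roughly the kind of partial read the object library would need to perform internally, sketched with h5py since that's what you're already using (the file and dataset names are taken from your h5ls listing):

    import h5py

    with h5py.File("idr016.h5", "r") as f:
        dset = f["/Images"]
        # Slicing a range maps onto an HDF5 hyperslab selection, so only
        # these rows come off disk rather than the full ~45 GB dataset.
        visible_rows = dset[0:100]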
For your last question, this is likely a side effect of an elusive bug that has also existed in HDFView for a while now. The root cause hasn't been discovered yet, but we have noticed that occasionally simply opening a file and closing it later appears to leave the HDF5 file modified. I imagine the change being made to the file is something metadata-related, which would explain why killing the process corrupts the file. Could you possibly rerun h5ls on the corrupted file with the additional "--enable-error-stack" command-line parameter so that we can at least capture and keep a record of the specific error HDF5 runs into when it tries to open the file? That may be very helpful in debugging the root problem.
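That is, something along the lines of:

% h5ls --enable-error-stack -r idr016-broken.h5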
Unfortunately, all three of these issues are well known to us, but for lack of funding to work on them they have mostly sat on the back burner for now.
Thanks,
Jordan
________________________________
From: Hdf-forum <hdf-forum-bounces@lists.hdfgroup.org> on behalf of Roger Leigh <rleigh@dundee.ac.uk>
Sent: Friday, January 5, 2018 6:19:51 AM
To: hdf-forum@lists.hdfgroup.org
Subject: [Hdf-forum] A number of small problems
I've recently been creating large-ish HDF files from CSV data (~45GiB)
with h5py. During the course of this data conversion, I've found a few
odd HDF5 issues I thought I should pass on:
1) "long double" / f16 float types are not viewable with HDFView 3.0.
The data type is shown as "Unknown" with all values displayed as
"*unknown*". It would be nice if it could support viewing of such data.
2) h5dump seems to truncate all floating point types to roughly "float" /
f4 precision by default; double and long double (aka f8 and f16) have the
extra precision truncated. Is there a flag to tell h5dump to retain the
full precision for each size of float? I can see there's a "-m" flag,
but this affects display for all float types irrespective of size. Is
there a way to have it dump the full value of every float by default?
3) Opening a 45GB (uncompressed) dataset with HDFView 3.0 causes the
program to freeze in a busy state ~indefinitely. I'm not sure why it
needs to read in huge quantities of data when it could restrict itself
to what's viewable on the screen and fetch data as you scroll around.
Whatever it's doing right now doesn't seem very scalable.
4) Worse, killing HDFView when hung opening (3) above corrupts the file
making it unreadable(!) even though no changes were made:
% ls -l idr*.h5
-rw-rw-r-- 1 rleigh rleigh 17444765812 Jan 5 09:54 idr016-broken.h5
-rwxrwx--- 1 rleigh rleigh 17444765812 Jan 5 10:55 idr016.h5
% cmp -l idr016.h5 idr016-broken.h5 | gawk '{printf "%08X %02X %02X\n",
$1, strtonum(0$2), strtonum(0$3)}'
0000000C 00 01
0000002D A6 67
0000002E 31 19
0000002F CC 52
00000030 19 92
% h5ls -r idr016.h5
/ Group
/Images Dataset {69120/Inf}
/Objects Dataset {5880380/Inf}
% h5ls -r idr016-broken.h5
idr016-broken.h5: unable to open file
Is this a known issue? This seems quite bad; I wouldn't want to
permanently lose data simply by opening a dataset in the viewer tool.
I'm surprised any changes were made since all I did was to open the
dataset in the viewer. What is the reason for the viewer to be making
changes to the file in this case? I would expect it to be robust in the
face of unexpected termination, power loss, network interruption, etc.
I've attached the overall structure of the HDF file in case this is
useful, from "h5dump -BHip". If you want to take a look at the original
and uncorrupted files, I can upload them somewhere.
Thanks,
Roger
--
Dr Roger Leigh -- Open Microscopy Environment
Wellcome Trust Centre for Gene Regulation and Expression,
College of Life Sciences, University of Dundee, Dow Street,
Dundee DD1 5EH Scotland UK Tel: (01382) 386364
The University of Dundee is a registered Scottish Charity, No: SC015096