On Tue, Jun 9, 2015 at 1:24 PM, Werner Benger <werner@cct.lsu.edu> wrote:
Jason,
the reason would be to keep the complexity of HDF5 as small as
possible. Introducing index reordering into HDF5 increases complexity
and introduces possible sources of errors, especially as there is no
need for HDF5 to do it. HDF5 can just concentrate on storing all
datasets in C order, with the handling of Fortran indexing separated out
into an add-on library similar to the h5lite library that is shipped with
HDF5.
Both the HDF5 tools such as hdfview and h5ls and, of course, the HDF5
Fortran API would have to make use of that add-on library to set and
interpret such a "fortran-order" flag attribute. Using the "bare-bone"
HDF5 would be limited to mere C-order I/O.
Actually I had pretty much the same discussion ten years ago with other
users of HDF5 as well. The arguments were the same: the desire to change
HDF5 to support different index schemes, versus considering HDF5 as
C-only and doing anything else on top of it. Ultimately it's the
decision of the HDF team whether HDF5 should support different indexing
schemes in its core API. But the fact that it has never been done
demonstrates that it's unlikely to happen, and since it can be done via
an add-on library (which needs to be used by both the HDF5 tools and the
HDF5 fortran api, but it would not affect the HDF5 core), this seems to
be the easier and thus more realistic solution.
Werner
On 09.06.2015 19:30, Jason Newton wrote:
Werner,
What is the argument for leaving this to yet another add-on library on
top of HDF5? This strategy would still require the user to check after
reading, for instance, and call another API. I believe this is going to
make it a less than first-class citizen/feature at the least. Ideally we
want most users reading to not even know this is happening, like when
content is chunked or compressed, although the metadata should be there
so the user can infer it will happen in their program.
Also, we want tools like hdfview, h5dump/h5ls to output the content
correctly too.
-Jason
On Tue, Jun 9, 2015 at 3:58 AM, Werner Benger <werner@cct.lsu.edu> wrote:
Basically what it needs is a convention such as an attribute to allow
identifying in which permutation order a dataset is stored...
As they say in
https://www.hdfgroup.org/HDF5/doc/fortran/index.html
"When a C application reads data stored from a Fortran program, the
data will appear to be transposed due to the difference in the C and
Fortran storage orders. For example, if Fortran writes a 4x6
two-dimensional dataset to the file, a C program will read it as a 6x4
two-dimensional dataset into memory. The HDF5 C utilities h5dump and
h5ls will also display transposed data, if data is written from a
Fortran program. "
But there is no way to find out whether data had been stored by a C or
Fortran program. A simple agreement on an attribute would do, even
better shared dataspaces that can hold such an attribute.
All the index-permutation or data transposing (if really required) can
be in some add-on library on top of HDF5 (similar to what F5 does,
though F5 does more than just that).
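To make the transposition described in the quoted documentation concrete,
here is a minimal C++ sketch using only the standard library. It is not
HDF5 code, and the helper names (offset_fortran, offset_c,
make_fortran_buffer, read_as_c) are made up for illustration. It fills a
buffer the way a Fortran program lays out a 4x6 array and then reads the
same buffer as a C program would, as a 6x4 array:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (i, j) in a column-major (Fortran) array with `rows` rows.
std::size_t offset_fortran(std::size_t i, std::size_t j, std::size_t rows) {
    return i + j * rows;
}

// Offset of element (i, j) in a row-major (C) array with `cols` columns.
std::size_t offset_c(std::size_t i, std::size_t j, std::size_t cols) {
    return i * cols + j;
}

// Lay out a rows x cols array as Fortran would, storing the value
// 10*i + j at logical position (i, j) so each element is recognizable.
std::vector<int> make_fortran_buffer(std::size_t rows, std::size_t cols) {
    std::vector<int> buf(rows * cols);
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            buf[offset_fortran(i, j, rows)] = static_cast<int>(10 * i + j);
    return buf;
}

// Read the same buffer as a C program would see it: a cols x rows
// row-major array, i.e. with the dimensions reversed.
int read_as_c(const std::vector<int>& buf, std::size_t r, std::size_t c,
              std::size_t rows, std::size_t cols) {
    assert(r < cols && c < rows);
    return buf[offset_c(r, c, rows)];
}
```

With rows = 4 and cols = 6, read_as_c(buf, 2, 1, 4, 6) returns the value
that the Fortran side stored at (1, 2): the data is intact, only the
index order is swapped, and nothing in the buffer itself says which
interpretation was intended.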
Werner
On 09.06.2015 11:00, Jason Newton wrote:
Was hoping more commentary would have happened but I also had some
timing issues getting back to this, my apologies.
Werner, thank you for your reply, but your case is exactly the proof that
this is an issue that should be dealt with at the specification &
library level, which is what I am talking about. Permuting indices
whenever accessing data is a large burden to put on user code, especially
considering how many different bindings one might use to access the
data. It leads to repetitive and intrusive handling, which is not what
the user should be dealing with. It's tricky, automatable, isolatable (to
the library), difficult outside of C (at least in Python), and not the
kind of task users should be spending time on when using advanced
software like HDF5.
If we look at the examples of Eigen and NumPy, we can see they have flags
for dealing with column/row-major storage [Eigen: Storage orders] and
C/Fortran order [see the order argument of numpy.array, and
http://docs.scipy.org/doc/numpy/reference/c-api.array.html]. This shows
that at least some numerical processing libraries deemed it important
enough to not only deal with the issue, but usually provide seamless
usage or conversion to the user's desired type.
I think defaults can be set so as not to change current behaviour, but
datasets & arrays could now be marked with a flag such as Python's. When
reading/writing, an optional flag would be provided for the memory
space's requested interpretation (defaulting to C or Fortran by language
context). We could potentially put this in the dataset properties and
type properties so we wouldn't have to change the API. And ideally, with
the work handled in C, the library would permute the storage for you as
it does the I/O, for a hopefully negligible performance cost since I/O
is likely the limiting factor.
I brought this up because I'm writing a generalized HDF C++ library and
when trying to support something like Eigen (and more!), which allows
both C and F orders in the same runtime, it gets confusing on how to IO
to/from HDF files as the current approach relies on language level
wrappers to decide what the right thing to do is, and weakly at that.
But the user may genuinely want to IO in/out a fortran or C ordered
dataset/array to/from a C/fortran dataset/array in any combination for
what makes sense to them, and this doesn't really work. I can be left
with baffling scenarios like this failing unless all data written to HDF
files is in C order:
    Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
    A_c.setZero();
    A_c.row(i).setConstant(5);  // i: any valid row index
    Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
    hdf.write("A", A_c);
    hdf.read("A", A_f);
    assert(A_c == A_f);
If in this scenario A was already written by a Fortran program, then
code making the above test case work would apply a conversion where none
is needed for a read like this, making this test case's assertion fail:
    Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
    A_c.setZero();
    A_c.row(i).setConstant(5);  // i: any valid row index
    Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
    hdf.read("A", A_f);  // "A" was written earlier by a Fortran program
    assert(A_c == A_f);
And that's why flags need to be saved in the document... the content
needs to specify its storage layout - guessing based on language cannot
cover all cases, and user-made attributes are not the way because that
would be a standard nobody knows about or will use.
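As a rough sketch of what a stored layout flag buys the reader, the
standard-library-only C++ below converts on read only when the recorded
order differs from the order requested for memory. Everything here is
hypothetical: HDF5 has no such flag today, and the names Order, Stored
and read_as are invented for the example, with a plain struct standing
in for a dataset and its metadata:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout flag recorded alongside the data.
enum class Order { C, Fortran };

// Stand-in for a 2-D dataset plus its (hypothetical) layout flag.
// rows/cols are the logical dimensions, independent of the layout.
struct Stored {
    Order order;
    std::size_t rows, cols;
    std::vector<double> data;
};

// Return the data in the layout the caller asked for. A copy is
// permuted only if the stored and requested orders differ.
std::vector<double> read_as(const Stored& ds, Order wanted) {
    assert(ds.data.size() == ds.rows * ds.cols);
    if (ds.order == wanted) return ds.data;  // layouts agree: no conversion
    std::vector<double> out(ds.data.size());
    for (std::size_t i = 0; i < ds.rows; ++i) {
        for (std::size_t j = 0; j < ds.cols; ++j) {
            const std::size_t c = i * ds.cols + j;  // row-major offset
            const std::size_t f = i + j * ds.rows;  // column-major offset
            if (wanted == Order::C) out[c] = ds.data[f];
            else                    out[f] = ds.data[c];
        }
    }
    return out;
}
```

Because the decision is driven by the flag rather than by the language of
the reading program, both scenarios above come out right: data written in
C order is permuted for a column-major target, and data written by a
Fortran program is passed through untouched.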
-Jason
On Tue, May 12, 2015 at 12:16 AM, Werner Benger <werner@cct.lsu.edu> wrote:
Hi Jason,
I was facing the same issues, as pretty much all use cases I know of and
have in my visualization software and context use and require "fortran"
order of indexing, including OpenGL graphics. It's not really an issue
with HDF5, as the only thing required is to permute the indices when
accessing the HDF5 API. And the HDF5 tools of course will display data
transposed then. This index permutation is supported in the F5 library
via a generic permutation vector that is stored with a group of datasets
sharing the same properties (the F5 library is a C library on top of
HDF5 guiding towards a specific data model for various classes of data
types occurring particularly in scientific visualization):
http://www.fiberbundle.net/doc/structChartDomain__IDs.html
So via the F5 API one would see the Fortran-like indexing convention,
whereas whenever accessing data with the lower-level HDF5 API, it's
the C-like convention (whereby the permutation vector gives the option
of arbitrary permutations).
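The idea of a generic permutation vector can be sketched in a few lines
of standard C++. This is not the F5 API; the function name
offset_permuted and the exact meaning given to the vector (perm[k] names
the storage dimension that the k-th logical index addresses) are
assumptions made for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dims:    extents in storage (C) order, slowest-varying dimension first.
// perm:    perm[k] is the storage dimension addressed by logical index k.
// logical: the index tuple as the application writes it.
// Returns the linear offset into the C-ordered storage.
std::size_t offset_permuted(const std::vector<std::size_t>& dims,
                            const std::vector<std::size_t>& perm,
                            const std::vector<std::size_t>& logical) {
    assert(dims.size() == perm.size() && dims.size() == logical.size());
    // Route each logical index to the storage dimension it belongs to.
    std::vector<std::size_t> storage_idx(dims.size());
    for (std::size_t k = 0; k < perm.size(); ++k) {
        assert(perm[k] < dims.size());
        storage_idx[perm[k]] = logical[k];
    }
    // Ordinary row-major (C) offset computation on the permuted indices.
    std::size_t off = 0;
    for (std::size_t d = 0; d < dims.size(); ++d) {
        assert(storage_idx[d] < dims[d]);
        off = off * dims[d] + storage_idx[d];
    }
    return off;
}
```

For a Fortran 4x6 array that HDF5 stores as 6x4, dims = {6, 4} with
perm = {1, 0} lets the application keep writing (i, j) in Fortran order,
while perm = {0, 1} is the plain C convention. A simple C/Fortran flag is
the special case where perm is either the identity or the full reversal.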
I remember there had been plans by the HDF5 group to introduce "named
dataspaces", similarly to "named datatypes", that could then be stored
in the file as its own entity. Such would be a good place to store
properties of a dataspace as attributes on a dataspace, and to have such
shared among datasets. It would be a natural place to store a
permutation vector, which could be reduced to a simple flag as well to
just distinguish between C and fortran indexing conventions. Of course,
all the related tools would also need to honor such an attribute then.
Until then, one could use an attribute on each dataset and implement
index permutation similar to how the F5 library does it. It may be safer
to use new API functions anyway to not break old code that always
expects C order indexing.
Werner
On 12.05.2015 06:48, Jason Newton wrote:
Hi -
I've been an evangelist for HDF5 for a few years now. It is a noble
and amazing library that solves data storage issues occurring in
scientific applications and beyond - e.g. it can save many developers
from wasting time and money so they can spend that on solving more
original problems. But you guys knew that already. I think there's been
a mistake though - that is the lack of first-class column-vs-row-major
storage. We are split down the middle on which format we work in,
depending on the application, library and language we use, so it is an
ongoing reality that there will never be one true standard to follow.
But HDF5 sought to only support row-major - and I can back that up -
standardizing is a good thing. But then, as time has shown, that really
didn't work for a lot of folks - such as those in Matlab and Fortran -
when they read our data, it looks transposed to them! When HDF5
utils/our code sees their data, it looks transposed to us! These are
arguably the users you do not want to face these difficulties, as it
makes it downright embarrassing at times and hard to work around
within that language (ahem, Matlab again is painful to work with). Not
only that, but it doesn't really scale - it will always take some manual
fixing, and there's no standardized mark for whether a dataset is one of
these column-major masquerading datasets. So let me assure you this is
quite ugly to deal with in Matlab/etc. and doesn't seem to be the path
many people take - and it can require skills many people don't have or
understanding that they can't give.
But then, why did we allow saving column-major data in a row-based
standard in the first place? Well, the answer seems to be performance.
Surely it can't take that long to convert the datasets - most of the
time at least - although there would for sure be some memory-based
limitations on transposing while HDF does its I/O. But alas - the
current state of the library indicates otherwise, and thus it is the
user's job to handle correctly transforming the data back and forth
between application and party. But wait - wasn't this kind of activity
what HDF5 was built to alleviate in the first place?
So then how do we rectify the situation? Well, speaking as a developer
using HDF5 extensively and writing libraries for it - it looks to me
like it should be in the core library, as it is exceedingly messy to
handle on the user side each time. I think the interpretation of the
dataset and its dimensions should be based on dataset creation
properties. This would allow an official marking of what kind of
interpretation the raw storage of the data (and dimensions?) has.
However, this is only half of the battle. We'd need something like the
type conversion system to permute order in all the right places if the
user needs to IO an opposing storage layout. And it should be fast and
light on memory. Perhaps it would merely operate in place as a new
utility subroutine taking in the mem_type and user memory. However, I
can still think of one problem this does not address: compound types
using a mixture of philosophies, with fields being the opposite of the
dataset layout - and this case has me completely stumped, as this
indicates it should be type level as well. The compound part of this is
a sticky situation, but I'd still motion that the dataset creation
property works for most things that occur in practice.
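For the "fast and light on memory" in-place conversion, one possible
approach is cycle-following transposition, sketched below for the 2-D
case in standard C++. This is only an illustration of the technique, not
anything HDF5 provides; the function name is invented, and the sketch
spends one bit per element on bookkeeping instead of allocating a second
full buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder a rows x cols matrix from column-major (Fortran) to row-major
// (C) layout inside the caller's own buffer.
void col_to_row_major_in_place(std::vector<double>& a,
                               std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    const std::size_t n = a.size();
    if (n < 2) return;                 // nothing to move
    std::vector<bool> done(n, false);  // one bit per element
    // The first and last elements never move, so skip them.
    for (std::size_t start = 1; start + 1 < n; ++start) {
        if (done[start]) continue;
        const double first = a[start];
        std::size_t d = start;         // destination slot being filled
        for (;;) {
            done[d] = true;
            // Row-major slot d takes its value from this column-major slot.
            const std::size_t s = (d * rows) % (n - 1);
            if (s == start) { a[d] = first; break; }  // cycle closed
            a[d] = a[s];
            d = s;
        }
    }
}
```

A 2x3 matrix stored column-major as {1, 2, 3, 4, 5, 6} becomes
{1, 3, 5, 2, 4, 6} after the call. A real implementation would also have
to handle higher ranks, chunked I/O and the compound-type case mentioned
above, none of which this sketch attempts.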
So... has the HDF5 group tried to deal with this wart yet? Let me know
if anything is on the drawing board.
-Jason
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Center for Computation & Technology at Louisiana State University
(CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809
Fax.: +1 225 578-5362