RFC: libHDF5 to support row and column major storage?

nevion · May 12, 2015, 4:48am

Hi -

I've been an evangelist for HDF5 for a few years now. It is a noble and
amazing library that solves data storage issues occurring in scientific
applications and beyond - e.g. it can save many developers from wasting
time and money so they can spend that on solving more original problems.
But you guys knew that already. I think there's been a mistake though -
that is the lack of first-class column-vs-row major storage. We are split
down the middle on which format we use, depending on the application,
library and language we work in, so it is an ongoing reality that there
will never be one true standard to follow. HDF5 sought to only support
row-major - and I can get behind that - standardizing is a good thing. But
as time has shown, that really didn't work for a lot of folks - such as
those in Matlab and Fortran - when they read our data, it looks transposed
to them! When HDF5 utils/our code see their data, it looks transposed to
us! These are arguably the users you do not want to face these
difficulties, as it is downright embarrassing at times and hard to work
around within that language (ahem, Matlab again is painful to work with).
Not only that, but it doesn't really scale - it will always take some
manual fixing and there's no standardized mark for whether a dataset is
one of these column-major datasets masquerading as row-major. So let me
assure you this is quite ugly to deal with in Matlab etc. and doesn't seem
to be the path many people take - and it can require skills or
understanding that many people don't have.

But then, why did we allow saving column-major data in a row-based standard
in the first place? Well, the answer seems to be performance. Surely it
can't take that long to convert the datasets - most of the time at least -
although there would for sure be some memory-based limitations on
transposing as HDF does the IO. But alas - the current state of the library
indicates otherwise, and thus it is the user's job to correctly transform
the data back and forth between application and party. But wait - wasn't
this kind of activity what HDF5 was built to alleviate in the first place?

So then how do we rectify the situation? Well, speaking as a developer
using HDF5 extensively and writing libraries for it - it looks to me like
it should be in the core library, as it is exceedingly messy to handle on
the user side each time. I think the interpretation of the dataset and its
dimensions should be based on dataset creation properties. This would
allow an official marking of how the raw storage of the data (and
dimensions?) is to be interpreted. However, this is only half of the
battle. We'd need something like the type conversion system to permute
order in all the right places if the user needs to IO an opposing storage
layout. And it should be fast and light on memory. Perhaps it would merely
operate in place as a new utility subroutine taking in the mem_type and
user memory. However, I can still think of one problem this does not
address: compound types using a mixture of philosophies, with fields whose
layout is the opposite of the dataset layout - this case has me completely
stumped, as it indicates the marking should exist at the type level as
well. The compound part of this is a sticky situation, but I'd still argue
that the dataset creation property works for most things that occur in
practice.

So... has the HDF5 group tried to deal with this wart yet? Let me know if
anything is on the drawing board.

-Jason

werner · May 12, 2015, 7:16am

Hi Jason,

I was facing the same issue, as pretty much all the use cases I know of and have in my visualization software and context use and require "Fortran" order of indexing, including OpenGL graphics. It's not really an issue with HDF5, as the only thing required is to permute the indices when accessing the HDF5 API. And the HDF5 tools will of course display the data transposed then. This index permutation is supported in the F5 library via a generic permutation vector that is stored with a group of datasets sharing the same properties (the F5 library is a C library on top of HDF5 guiding towards a specific data model for various classes of data types occurring particularly in scientific visualization):

http://www.fiberbundle.net/doc/structChartDomain__IDs.html

So via the F5 API one would see the Fortran-like indexing convention, whereas whenever accessing data with the lower-level HDF5 API, it's the C-like convention (whereby the permutation vector gives the option of arbitrary permutations).

I remember there had been plans by the HDF5 group to introduce "named dataspaces", similar to "named datatypes", that could then be stored in the file as their own entity. That would be a good place to store properties of a dataspace as attributes on the dataspace, and to have them shared among datasets. It would be a natural place to store a permutation vector, which could also be reduced to a simple flag that just distinguishes between C and Fortran indexing conventions. Of course, all the related tools would also need to honor such an attribute then. Until then, one could use an attribute on each dataset and implement index permutation similar to how the F5 library does it. It may be safer to use new API functions anyway, so as not to break old code that always expects C order indexing.
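
For illustration only, the permutation-vector idea can be sketched in a few lines of C++. The names `to_file_order` and `fortran_permutation` are hypothetical and are not taken from F5 or the HDF5 API; the sketch only shows how a stored permutation would map an application's index tuple onto the axis order used in the file.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of HDF5 or F5): perm[k] names the logical
// axis that is stored at position k of the file's dataspace.
template <std::size_t N>
std::array<std::size_t, N> to_file_order(const std::array<std::size_t, N>& logical,
                                         const std::array<std::size_t, N>& perm) {
    std::array<std::size_t, N> file{};
    for (std::size_t k = 0; k < N; ++k) {
        assert(perm[k] < N);  // every entry must name a valid axis
        file[k] = logical[perm[k]];
    }
    return file;
}

// The plain C-versus-Fortran case is the permutation that reverses all axes.
template <std::size_t N>
std::array<std::size_t, N> fortran_permutation() {
    std::array<std::size_t, N> perm{};
    for (std::size_t k = 0; k < N; ++k) {
        perm[k] = N - 1 - k;
    }
    return perm;
}
```

With these, a Fortran-style index (1, 2, 3) becomes (3, 2, 1) for the C-ordered HDF5 call, while the identity permutation leaves the index untouched.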

Werner

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: x.com

--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

nevion · June 9, 2015, 9:00am

I was hoping for more commentary, but I also had some timing issues
getting back to this, my apologies.

Werner, thank you for your reply, but your case is exactly the proof that
this is an issue that should be dealt with at the specification & library
level, which is what I am talking about. Permuting indices whenever
accessing data is a large burden to put on user code, especially
considering how many different bindings one might use to access the data.
It leads to repetitive and intrusive handling, which is not what the user
should be dealing with. It's tricky, automatable, isolatable (to the
library), difficult outside of C (at least in Python), and not the kind of
task they should be spending time on when using advanced software like
HDF5.

If we look at the example of Eigen and NumPy, we can see they have flags
for dealing with column/row order [ Eigen: Storage orders ] and C/Fortran
order [ see the order argument: numpy.array - NumPy v2.2 Manual &
http://docs.scipy.org/doc/numpy/reference/c-api.array.html ]. This shows
that at least some numerical processing code deemed it important enough to
not only deal with the issue, but usually provide seamless usage or
conversion to the user's desired type.

I think defaults can be set so as not to change current behaviour, but
datasets & arrays could now be marked with a flag such as Python's. When
reading/writing, an optional flag is provided for the memory space's
requested interpretation (defaulting to C or Fortran by language context).
We could potentially put this in the dataset properties and type
properties so we wouldn't have to change the API. And ideally, with the
work handled in C and performing well, the library permutes the storage
for you as it's IOing it in, for a hopefully negligible performance hit
since IO is likely the limiting factor.
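
To make the idea concrete, here is a minimal sketch of the kind of conversion the library could apply during IO once both the stored order and the requested in-memory order are known. Everything in it (`Order`, `offset`, `convert`) is a hypothetical name for this example; none of it is an existing HDF5 call, and a real implementation would presumably work chunk by chunk rather than on a whole copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout flag; this thread's premise is that HDF5 stores no
// such marker.
enum class Order { RowMajor, ColMajor };

// Linear offset of element (r, c) in a rows x cols buffer of the given order.
inline std::size_t offset(std::size_t r, std::size_t c,
                          std::size_t rows, std::size_t cols, Order order) {
    return order == Order::RowMajor ? r * cols + c : c * rows + r;
}

// Copy a 2-D buffer from the stored order to the requested order. When the
// two orders already agree, the data is returned unchanged.
std::vector<double> convert(const std::vector<double>& src,
                            std::size_t rows, std::size_t cols,
                            Order stored, Order requested) {
    assert(src.size() == rows * cols);
    if (stored == requested) {
        return src;
    }
    std::vector<double> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[offset(r, c, rows, cols, requested)] =
                src[offset(r, c, rows, cols, stored)];
        }
    }
    return dst;
}
```

For a 2x3 matrix stored row-major as 1 2 3 4 5 6, requesting column-major yields 1 4 2 5 3 6. The point of the flag is that the caller never has to guess whether this step is needed.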

I brought this up because I'm writing a generalized HDF C++ library, and
when trying to support something like Eigen (and more!), which allows both
C and F orders in the same runtime, it gets confusing how to IO to/from
HDF files, as the current approach relies on language-level wrappers to
decide what the right thing to do is, and weakly at that. But the user
may genuinely want to IO a Fortran- or C-ordered dataset/array to/from a
C/Fortran dataset/array in any combination that makes sense to them, and
this doesn't really work. I can be left with baffling scenarios like this
one failing unless all data written to HDF files is in C order:

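    // hdf is a file handle from the wrapper library; i is any valid row index.
    Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
    A_c.setZero();
    A_c.row(i).setConstant(5);

    Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
    hdf.write("A", A_c);
    hdf.read("A", A_f);
    assert(A_c == A_f);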

If in this scenario A was already written by a Fortran program, then the
code that makes the above test case work would apply a conversion where
none is needed for a read like this, making this test case's assertion
fail:

    // Same expected contents as above, but "A" was written by Fortran.
    Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
    A_c.setZero();
    A_c.row(i).setConstant(5);
    Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
    hdf.read("A", A_f);
    assert(A_c == A_f);

And that's why flags need to be saved in the document... the content needs
to specify its storage layout - guessing based on language cannot cover
all cases, and user-made attributes are not the way because that would be
a standard nobody knows about or will use.

-Jason


werner · June 9, 2015, 10:58am

Basically what it needs is a convention such as an attribute to allow identifying in which permutation order a dataset is stored...

As they say in

https://www.hdfgroup.org/HDF5/doc/fortran/index.html

"When a C application reads data stored from a Fortran program, the data will appear to be transposed due to the difference in the C and Fortran storage orders. For example, if Fortran writes a 4x6 two-dimensional dataset to the file, a C program will read it as a 6x4 two-dimensional dataset into memory. The HDF5 C utilities h5dump and h5ls will also display transposed data, if data is written from a Fortran program. "

But there is no way to find out whether data has been stored by a C or Fortran program. A simple agreement on an attribute would do, or even better, shared dataspaces that can hold such an attribute.
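
The 4x6 versus 6x4 statement in the quote can be checked with plain offset arithmetic. The sketch below is only an illustration with made-up function names and uses nothing from the HDF5 API: it verifies that element (i, j) of a column-major 4x6 array sits at the same linear offset as element (j, i) of a row-major 6x4 array, which is why the same bytes read back as a transposed matrix.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (i, j) in a Fortran (column-major) array of shape d0 x d1.
inline std::size_t fortran_offset(std::size_t i, std::size_t j,
                                  std::size_t d0, std::size_t d1) {
    assert(i < d0 && j < d1);
    return i + d0 * j;
}

// Offset of element (a, b) in a C (row-major) array of shape d0 x d1.
inline std::size_t c_offset(std::size_t a, std::size_t b,
                            std::size_t d0, std::size_t d1) {
    assert(a < d0 && b < d1);
    return a * d1 + b;
}

// True if every Fortran 4x6 element (i, j) coincides with C 6x4 element (j, i).
bool fortran_4x6_is_c_6x4() {
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 6; ++j) {
            if (fortran_offset(i, j, 4, 6) != c_offset(j, i, 6, 4)) {
                return false;
            }
        }
    }
    return true;
}
```

Nothing in the bytes themselves distinguishes the two readings, which is the reason a stored marker is needed at all.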

All the index-permutation or data transposing (if really required) can be in some add-on library on top of HDF5 (similar to what F5 does, though F5 does more than just that).

Werner


nevion · June 9, 2015, 5:30pm

Werner,

What is the argument for leaving this to yet another add-on library on top
of HDF5? This strategy would still require that the user check after
reading, for instance, and call another API. I believe this is going to
make it a less than first-class citizen/feature at the least. Ideally we
want most users reading data to not even know this is happening, like when
content is chunked or compressed, although the metadata should be there so
the user can infer it will happen in their program.

Also, we want tools like hdfview, h5dump/h5ls to output the content
correctly too.

-Jason

werner · June 9, 2015, 8:24pm

Jason,

The reason would be to keep the complexity of HDF5 as small as possible. Introducing index reordering into HDF5 increases complexity and introduces possible sources of errors, especially as there is no need for HDF5 to do it. HDF5 can just concentrate on storing all datasets in C order, with the handling of Fortran indexing separated out into an add-on library, similar to the h5lite library that is shipped with HDF5.

Both the HDF5 tools, such as hdfview and h5ls, and the HDF5 Fortran API would of course have to make use of that add-on library to set and interpret such a "fortran-order" flag attribute. Using "bare-bones" HDF5 would be limited to mere C-order I/O.
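A minimal sketch of how such an add-on layer might interpret the flag, using only the C++ standard library. Everything here is an assumption for illustration: the `IndexOrder` enum stands in for whatever attribute the convention would agree on, and `logical_dims` and `flat_offset` are hypothetical helpers, not part of HDF5 or F5. No data is moved; only the dimensions and the index tuple are reversed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage-order tag. In a real add-on library this value would
// be read from an agreed-upon attribute on the dataset.
enum class IndexOrder { C, Fortran };

// Dimensions as the writing application meant them, given the dimensions
// the HDF5 C API reports and the (assumed) order flag.
std::vector<std::size_t> logical_dims(std::vector<std::size_t> file_dims,
                                      IndexOrder order) {
    if (order == IndexOrder::Fortran) {
        std::reverse(file_dims.begin(), file_dims.end());
    }
    return file_dims;
}

// Offset into the flat C-order buffer for an index tuple given in the
// writer's own convention (0-based). For Fortran-flagged data the index
// tuple is reversed instead of transposing the buffer.
std::size_t flat_offset(const std::vector<std::size_t>& logical_index,
                        const std::vector<std::size_t>& file_dims,
                        IndexOrder order) {
    assert(logical_index.size() == file_dims.size());
    std::vector<std::size_t> idx = logical_index;
    if (order == IndexOrder::Fortran) {
        std::reverse(idx.begin(), idx.end());
    }
    std::size_t offset = 0;
    for (std::size_t d = 0; d < file_dims.size(); ++d) {
        assert(idx[d] < file_dims[d]);
        offset = offset * file_dims[d] + idx[d];
    }
    return offset;
}
```

With the 4x6 Fortran dataset from the documentation quote, the C API reports 6x4; `logical_dims` gives back 4x6, and element (i, j) in Fortran terms lands at offset i + 4*j, which is where a column-major writer put it.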

Actually, I had pretty much the same discussion ten years ago with other users of HDF5. The arguments were the same: the desire to change HDF5 to support different indexing schemes, versus considering HDF5 as C-only and doing anything else on top of it. Ultimately it's the decision of the HDF team whether HDF5 should support different indexing schemes in its core API. But the fact that it has never been done suggests that it's unlikely to happen, and since it can be done via an add-on library (which would need to be used by both the HDF5 tools and the HDF5 Fortran API, but would not affect the HDF5 core), this seems to be the easier and thus more realistic solution.

Werner

nevion · June 10, 2015, 1:38am

Werner,

Could you point me to the thread you mentioned? I figured this had come up
before and I'd like to read it.

Re: as small as possible - I see the reasoning, but I think the cost just
has to be swallowed here. What is the true amount of complexity introduced?
Surely it is not as bad as the type and conversion system, but I know what
you mean all the same. What drives my position is how often people violate
the very soft C-order-only guideline mentioned in the documentation, and
who those people are. We are going to have violators striving for
performance and for fewer in-memory copies of huge datasets... One common
thorn in industry is MATLAB. The typical MATLAB user doesn't know any of
the APIs involved and expects things to just work with a simple one-liner;
correcting behavior on their side is an intractable problem, and something
I've seen slow the spread of HDF to others I work with. MATLAB itself saves
its data in Fortran order when using the mat serializers. I believe it was
noted in the past that they did this for performance, although I cannot
find references to this now, and that probably extends to not enforcing C
order in the low-level I/O functions. I'll also note that I had a difficult
time supporting generalized corrections in MATLAB when dealing with
multiple common cases, such as nested 3x3 or 4x4 matrices in compound
types. I always had to write preprocessing/postprocessing scripts, which
were very slow since MATLAB was doing the work.

I accept that first-class support of C/Fortran order is likely not
happening, and that saddens me, because in my eyes doing it transparently
in the core libraries, with something like properties, is an investment
that will pay off in support and user satisfaction. On the one hand it
will probably be thankless work, but on the other it will remove a very
ugly wart when sharing data between teams/members. I'd say this wart has
been my biggest barrier in getting scientific (MATLAB) folks at work to use
HDF5 directly.

I guess in my library I will default to converting column-major matrices
to/from row-major on the fly when simply outputting matrix datasets... but
this still doesn't work for column-major nested types inside
compounds/structs. The only solution I can see there needs the type
information of the array types wrapping the matrix fields. Putting the
burden on the struct designer to make HDF save views of compounds before
I/O is not a good option in my experience (it leads to a lot of code), so
the only thing I'm left with saying is: don't store Fortran-order matrices
in structs.
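A minimal sketch of that on-the-fly conversion for a plain matrix dataset, using only the C++ standard library. The function names are hypothetical and this is not the code from my library or from Eigen; it assumes a dense buffer and makes a full copy, so it costs one extra matrix in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a rows x cols matrix stored column-major (Fortran order) into a new
// buffer stored row-major (C order), e.g. just before writing it out.
template <typename T>
std::vector<T> col_major_to_row_major(const std::vector<T>& src,
                                      std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<T> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Element (r, c): column-major offset on the right,
            // row-major offset on the left.
            dst[r * cols + c] = src[c * rows + r];
        }
    }
    return dst;
}

// The inverse copy, e.g. just after reading a row-major dataset into a
// column-major matrix type.
template <typename T>
std::vector<T> row_major_to_col_major(const std::vector<T>& src,
                                      std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<T> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    return dst;
}
```

This covers a top-level matrix dataset only. A matrix nested inside a compound would need the same loop applied per record at the field's offset, which is exactly where the type information mentioned above becomes necessary.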

-Jason

···

On Tue, Jun 9, 2015 at 1:24 PM, Werner Benger <werner@cct.lsu.edu> wrote:

Jason,

the reason would be to keep the complexity of HDF5 as small as possible.
Introducing indexing-reordering into HDF5 increases complexity and
introduces possible sources of errors, especially as there is no need for
HDF5 to do it. HDF5 can just concentrate on storing all datasets in C order
and handling of fortran indexing to be separated out in an add-on library
similar to h5lite library that is shipped with HDF5.

Both the HDF5 tools such as hdfview, h5ls and the HDF5 fortran api of
course would have to make use of that addon-library to set and interpret
such an "fortran-order" flag attribute. Using the "bare-bone" HDF5 would be
limited to mere C-order I/O .

Actually I had pretty much the same discussion ten years ago with other
users of HDF5 as well. It was the same arguments, the desire to change HDF5
to support different index schemes, versus considering HDF5 as C-only and
doing anything else on top of it. Ultimately it's the decision of the HDF
team whether HDF5 should support different indexing schemes in its core
API. But the fact that it has never been done demonstrates that it's
unlikely to happen, and since it can be done via an add-on library (which
needs to be used by both the HDF5 tools and the HDF5 fortran api, but it
would not affect the HDF5 core), this seems to be the easier and thus more
realistic solution.

      Werner

On 09.06.2015 19:30, Jason Newton wrote:

  Werner,

What is the argument for leaving this to yet another add-on library on
top of HDF5? This strategy would still require the user checks after
reading for instance and calls another api. I believe this is going to make
it a less than first-class citizen/feature at the least. Ideally we want
most users reading to not even know this is happening, like when content is
chunked or compressed, although the metadata should be there so the user
can infer it will happen in their program..

Also, we want tools like hdfview, h5dump/h5ls to output the content
correctly too.

-Jason

On Tue, Jun 9, 2015 at 3:58 AM, Werner Benger <werner@cct.lsu.edu> wrote:

Basically what it needs is a convention such as an attribute to allow
identifying in which permutation order a dataset is stored...

As they say in

https://www.hdfgroup.org/HDF5/doc/fortran/index.html

"When a C application reads data stored from a Fortran program, the data
will appear to be transposed due to the difference in the C and Fortran
storage orders. For example, if Fortran writes a 4x6 two-dimensional
dataset to the file, a C program will read it as a 6x4 two-dimensional
dataset into memory. The HDF5 C utilities h5dump and h5ls will also display
transposed data, if data is written from a Fortran program. "

But there is no way to find out whether data had been stored by a C or
Fortran program. A simple agreement on an attribute would do, even better
shared dataspaces that can hold such an attribute.

All the index-permutation or data transposing (if really required) can be
in some add-on library on top of HDF5 (similar to what F5 does, though F5
does more than just that).

     Werner

On 09.06.2015 11:00, Jason Newton wrote:

  Was hoping more commentary would have happened but I also had some
timing issues getting back to this, my apologies.

Werner, thank you for your reply, but your case is exactly the proof that
this is an issue that should be dealt with at the specification & library
level, as I am arguing. Permuting indices whenever accessing data
is a large burden to put on user code, especially considering how many
different bindings one might use to access the data. It leads to repetitive
and intrusive handling, which is not what the user should be dealing with.
It's tricky, automatable, isolatable (to the library), difficult outside of C
(at least in Python), and not the task they should be spending time
on when using advanced software like HDF5.

If we look at the example of Eigen and NumPy we can see they have flags
set for dealing with column/row order [Eigen: Storage orders] and
C/Fortran order [see the order argument: numpy.array, NumPy v2.2 Manual &
http://docs.scipy.org/doc/numpy/reference/c-api.array.html]. This
shows at least some numerical processing code deemed it important enough to
not only deal with the issue, but usually provide seamless usage or
conversion to the user's desired type.

I think defaults can be set to not change current behaviour, but
datasets & arrays could now be marked with a flag such as Python's. When
reading/writing, an optional flag is provided for the memory space's
requested interpretation (defaulting to C or Fortran by language context). We
could potentially put this in the dataset properties and type properties so
we wouldn't have to change the API. And ideally, with the permutation handled
in C and performing well, the library permutes the storage for you as
it's IOing it in, at a hopefully negligible performance cost since IO is
likely the limiting factor.

I brought this up because I'm writing a generalized HDF C++ library and
when trying to support something like Eigen (and more!), which allows both
C and F orders in the same runtime, it gets confusing how to IO to/from
HDF files, as the current approach relies on language-level wrappers to
decide what the right thing to do is, and weakly at that. But the user
may genuinely want to IO in/out a Fortran- or C-ordered dataset/array
to/from a C/Fortran dataset/array in any combination that makes sense
to them, and this doesn't really work. I can be left with baffling
scenarios like this failing unless all data written to HDF files is in C
order:

Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
A_c.setZero();
A_c.row(i).setConstant(5);  // i: any valid row index
Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
hdf.write("A", A_c);
hdf.read("A", A_f);
assert(A_c == A_f);

If in this scenario A was already written by a Fortran program, then
code making the above test case work would apply a conversion where none is
needed for a read like this, making this test case's assertion fail:

Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
A_c.setZero();
A_c.row(i).setConstant(5);  // i: any valid row index
Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
hdf.read("A", A_f);
assert(A_c == A_f);

And that's why flags need to be saved in the document... the content
needs to specify its storage layout - guessing based on language cannot
cover all cases and user-made attributes are not the way because that would
be a standard nobody knows about or will use.
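
To make the argument for a stored flag concrete, here is a minimal sketch, assuming a purely in-memory stand-in for a file in which every dataset carries its own storage-order flag. `Order`, `Matrix`, `MockFile` and `same_values` are hypothetical names for this illustration; none of them is an HDF5 or Eigen type.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class Order { RowMajor, ColMajor };

// A 2-D matrix that knows the storage order of its own buffer.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    Order order;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c, Order o)
        : rows(r), cols(c), order(o), data(r * c, 0.0) {}

    // Map a logical (i, j) to a buffer offset according to the order flag.
    std::size_t offset(std::size_t i, std::size_t j) const {
        return order == Order::RowMajor ? i * cols + j : i + j * rows;
    }
    double& at(std::size_t i, std::size_t j) { return data[offset(i, j)]; }
    double at(std::size_t i, std::size_t j) const { return data[offset(i, j)]; }
};

// Hypothetical stand-in for a file: the order flag is stored with each dataset.
struct MockFile {
    std::map<std::string, Matrix> datasets;

    void write(const std::string& name, const Matrix& m) {
        datasets.erase(name);
        datasets.emplace(name, m);
    }

    // Copy by logical index. When stored and requested orders match this is
    // a plain copy of the buffer; when they differ it transposes the buffer.
    void read(const std::string& name, Matrix& out) const {
        const Matrix& stored = datasets.at(name);
        assert(stored.rows == out.rows && stored.cols == out.cols);
        for (std::size_t i = 0; i < out.rows; ++i)
            for (std::size_t j = 0; j < out.cols; ++j)
                out.at(i, j) = stored.at(i, j);
    }
};

// Compare two matrices element by element, ignoring storage order.
bool same_values(const Matrix& a, const Matrix& b) {
    if (a.rows != b.rows || a.cols != b.cols) return false;
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < a.cols; ++j)
            if (a.at(i, j) != b.at(i, j)) return false;
    return true;
}
```

Because the reader consults the flag stored with the dataset rather than guessing from the calling language, both scenarios above come out right: a row-major dataset read into a column-major matrix is permuted, and a column-major dataset read into a column-major matrix is copied untouched.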

-Jason

On Tue, May 12, 2015 at 12:16 AM, Werner Benger <werner@cct.lsu.edu> wrote:

Hi Jason,

I was facing the same issues, as pretty much all use cases I know of and
have in my visualization software and context use and require "fortran"
order of indexing, including OpenGL graphics. It's not really an issue with
HDF5 as the only thing required is to permute the indices when accessing
the HDF5 API. And the HDF5 tools of course will display data transposed
then. This index permutation is supported in the F5 library via a generic
permutation vector that is stored with a group of datasets sharing the same
properties (the F5 library is a C library on top of HDF5 guiding towards a
specific data model for various classes of data types occurring
particularly in scientific visualization):

FiberBundleHDF5: ChartDomain_IDs Struct Reference

So via the F5 API one would see the fortran-like indexing convention,
whereas whenever accessing data with the lower-level HDF5 API, it's C-like
convention (whereby the permutation vector gives the option of arbitrary
permutations).
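
As a rough illustration of the permutation-vector idea (not the F5 implementation, whose details are not shown in this thread), the sketch below permutes a logical index tuple and the logical extents before computing an ordinary C-order offset. All function names are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major (C-order) linear offset of `index` in an array with extents `dims`.
std::size_t c_order_offset(const std::vector<std::size_t>& dims,
                           const std::vector<std::size_t>& index) {
    std::size_t offset = 0;
    for (std::size_t d = 0; d < dims.size(); ++d)
        offset = offset * dims[d] + index[d];
    return offset;
}

// Reorder `v` so that result[d] = v[perm[d]], i.e. stored axis d takes
// its value from logical axis perm[d].
std::vector<std::size_t> permute(const std::vector<std::size_t>& v,
                                 const std::vector<std::size_t>& perm) {
    std::vector<std::size_t> out(v.size());
    for (std::size_t d = 0; d < v.size(); ++d) out[d] = v[perm[d]];
    return out;
}

// Offset of a logical index in a buffer whose axes are stored in the order
// given by `perm`.
std::size_t permuted_offset(const std::vector<std::size_t>& logical_dims,
                            const std::vector<std::size_t>& logical_index,
                            const std::vector<std::size_t>& perm) {
    return c_order_offset(permute(logical_dims, perm),
                          permute(logical_index, perm));
}
```

With the identity permutation {0, 1, 2} this is plain C indexing; with the reversed permutation {2, 1, 0} it reproduces Fortran indexing, and any other permutation is handled by the same code path. A single C/Fortran flag is the special case that only distinguishes those two vectors.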

I remember there had been plans by the HDF5 group to introduce "named
dataspaces", similarly to "named datatypes", that could then be stored in
the file as its own entity. Such would be a good place to store properties
of a dataspace as attributes on a dataspace, and to have such shared among
datasets. It would be a natural place to store a permutation vector, which
could be reduced to a simple flag as well to just distinguish between C and
fortran indexing conventions. Of course, all the related tools would also
need to honor such an attribute then. Until then, one could use an
attribute on each dataset and implement index permutation similar to how
the F5 library does it. It may be safer to use new API functions anyway to
not break old code that always expects C order indexing.

          Werner

--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: x.com

werner · June 10, 2015, 8:22am

Jason,

that discussion that I just recalled was not in this forum but elsewhere, and it was in German, so probably not too helpful here. Basically it was the same argument: "wait for HDF5 to become better and implement such a feature in its core" versus "do it ourselves on top of HDF5 via some add-on library layer". Technically it's as simple as, e.g., introducing a convention such that all datasets and data structures whose names end with "_f" are considered Fortran-order (in practice I'm using attributes on named data types to store an index permutation vector).
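
A naming convention of this kind needs very little code on the reading side. The following is a minimal sketch of the "_f" suffix rule only, with hypothetical names `IndexOrder` and `order_from_name`; it does not model the attribute-based permutation vector mentioned in the parenthesis.

```cpp
#include <cassert>
#include <string>

enum class IndexOrder { C, Fortran };

// Classify a dataset by name: names ending in "_f" are treated as
// Fortran-order, everything else as C-order.
IndexOrder order_from_name(const std::string& name) {
    const std::string suffix = "_f";
    const bool tagged =
        name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    return tagged ? IndexOrder::Fortran : IndexOrder::C;
}
```

The rule is trivial to implement, which is the point being made: the difficulty lies in getting every tool and binding to apply the same rule, not in the code.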

The main problem is that whatever is introduced here, be it a new HDF5 core feature or a new HDF5 add-on library implementing such a convention, such functionality also needs to be used. It would need to be used by the HDF5 tools, by the HDF5 Fortran API, by MATLAB, by any software that has data in Fortran order in memory and writes it to C-order HDF5. It's just this piece of information that needs to be stored to be able to interpret data correctly; it's not even a performance problem.

If you're going to use type information to store such information, same as I do in F5, then you will probably also face the same burden: transient types cannot hold attributes, only named types, which are bound to a file, can. That is another aspect that would be nice to see improved in HDF5. But "for the time being" it can be handled by "lots of code" on top of HDF5, still sparing applications from doing it themselves if it's done via a reusable add-on library. However, such an add-on library cannot be fully generic as it does introduce certain conventions on how to use HDF5, even if minimalistic. I'd see that like the HDF5 dimension scales or image specification, which are HDF5-approved conventions on top of HDF5 and supported by the HDF5 tools.

Werner


On 10.06.2015 03:38, Jason Newton wrote:

Werner,

Could you point me to the thread you mentioned? I figured this came up before and I'd like to take a read of it.

Re small as possible - I see the reasoning but I think it just has to be swallowed here - what is the true amount of complexity introduced? Surely not as bad as the types and conversion system, but I know what you mean just the same. Driving this strategy is the consideration of how often people are violating this very soft C-order-only guideline mentioned in the documentation, and what type of people they are. And we're going to have violators of this striving for performance and having fewer copies in memory of huge datasets... One common thorn in industry is MATLAB. The common MATLAB user doesn't know any of the APIs involved and expects things to just work with a simple one-liner; correcting behavior on their side is an intractable problem and something I've seen introduce bumps in spreading HDF to others I work with. MATLAB itself saves its data in Fortran order when using the mat serializers; I believe it was noted in the past they did this for performance, although I cannot find references to this now, and that probably extends to not enforcing C order on the low-level function IO. I'll also note I had a difficult time supporting generalized corrections in MATLAB when dealing with multiple common cases, such as nested 3x3 or 4x4 matrices in compound types. I'd always have to write preprocessing/postprocessing scripts that were very slow since MATLAB was doing them.

I accept that first-class support of C/Fortran order is likely not happening, and that saddens me because in my eyes doing it transparently in the core libraries, with something like properties, is an investment that is going to pay off in support and user satisfaction. On the one side, it'll probably be thankless work, but on the other it'll remove a very ugly wart when sharing data between teams/members. I'd say this wart has been my biggest barrier in getting scientific (MATLAB) folks at work to use HDF5 directly.

I guess in my library I will default the column-major matrices to convert to/from row-major on the fly when simply outputting matrix datasets... but this still doesn't work for column-major nested types inside compounds/structs. The only solution I can figure there needs to use type information of the array types wrapping the matrix fields. Putting the burden on the struct designer to make HDF save views of compounds before IO is not a good one in my experience (it leads to a lot of code), so the only thing I'm left with saying is: don't store Fortran-order matrices in structs.
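
The compound case can be sketched as follows, assuming a hypothetical record type `Sample` with one nested 3x3 matrix kept column-major in memory. This is the kind of per-field conversion that has to be hand-written today; it is not code from any existing library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record with a nested 3x3 matrix field that is kept
// column-major in memory: element (i, j) lives at index i + 3*j.
struct Sample {
    int id;
    std::array<double, 9> rotation;
};

// Return a copy whose matrix field is row-major: element (i, j) at 3*i + j.
Sample to_row_major(const Sample& s) {
    Sample out = s;
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            out.rotation[3 * i + j] = s.rotation[i + 3 * j];
    return out;
}

// Convert a whole buffer of records before handing it to a C-order writer.
std::vector<Sample> to_row_major(const std::vector<Sample>& records) {
    std::vector<Sample> out;
    out.reserve(records.size());
    for (const Sample& s : records) out.push_back(to_row_major(s));
    return out;
}
```

Every struct with a matrix member needs its own variant of this, and the whole record buffer is copied once more before IO, which is the extra code and memory the paragraph above is complaining about.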

-Jason

On Tue, Jun 9, 2015 at 1:24 PM, Werner Benger <werner@cct.lsu.edu> wrote:

    Jason,

     the reason would be to keep the complexity of HDF5 as small as
    possible. Introducing indexing-reordering into HDF5 increases
    complexity and introduces possible sources of errors, especially
    as there is no need for HDF5 to do it. HDF5 can just concentrate
    on storing all datasets in C order and handling of fortran
    indexing to be separated out in an add-on library similar to
    h5lite library that is shipped with HDF5.

    Both the HDF5 tools such as hdfview, h5ls and the HDF5 Fortran API
    of course would have to make use of that add-on library to set and
    interpret such a "fortran-order" flag attribute. Using the
    "bare-bone" HDF5 would be limited to mere C-order I/O.

    Actually I had pretty much the same discussion ten years ago with
    other users of HDF5 as well. It was the same arguments, the desire
    to change HDF5 to support different index schemes, versus
    considering HDF5 as C-only and doing anything else on top of it.
    Ultimately it's the decision of the HDF team whether HDF5 should
    support different indexing schemes in its core API. But the fact
    that it has never been done demonstrates that it's unlikely to
    happen, and since it can be done via an add-on library (which
    needs to be used by both the HDF5 tools and the HDF5 fortran api,
    but it would not affect the HDF5 core), this seems to be the
    easier and thus more realistic solution.

          Werner

            _______________________________________________
            Hdf-forum is for HDF software users discussion.
            Hdf-forum@lists.hdfgroup.org <mailto:Hdf-forum@lists.hdfgroup.org>
            http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
            Twitter:x.com


--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Ger_van_Diepen · June 11, 2015, 5:57am

Fortran usually starts index counting at 1 and C at 0, which might be a
problem as well. Certainly MATLAB starts counting at 1. When getting
hyperslabs, such a difference matters a lot.
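
To make the off-by-one concrete, here is a minimal C++ sketch of translating a hyperslab given in Fortran conventions (1-based start, fastest-varying dimension first) into the 0-based, fastest-dimension-last form a C caller would use. The `Hyperslab` struct and `fortran_to_c` are hypothetical names for illustration only; they are not part of the HDF5 API, and the sketch makes no claim about what the HDF5 Fortran wrappers do internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a hyperslab: first element and extent per dimension.
struct Hyperslab {
    std::vector<std::size_t> start;
    std::vector<std::size_t> count;
};

// Convert a Fortran-style selection (1-based start, fastest dimension first)
// into a C-style selection (0-based start, fastest dimension last).
inline Hyperslab fortran_to_c(const Hyperslab& f) {
    assert(f.start.size() == f.count.size());
    Hyperslab c;
    for (std::size_t s : f.start) {
        assert(s >= 1);            // Fortran indices start at 1
        c.start.push_back(s - 1);  // shift to 0-based
    }
    c.count = f.count;             // extents keep their values, only the order changes
    std::reverse(c.start.begin(), c.start.end());
    std::reverse(c.count.begin(), c.count.end());
    return c;
}
```

Forgetting either step (the shift or the reversal) selects a different, but often still valid, region, which is why such mistakes can go unnoticed.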

Cheers,
Ger

Jason Newton <nevion@gmail.com> 6/10/2015 3:38 AM >>>

Werner,

Could you point me to the thread you mentioned? I figured this came up
before and I'd like to take a read of it.

Re small as possible - I see the reasoning but I think it just has to
be swallowed here - what is the true amount of complexity introduced?
Surely not as bad as the types and conversion system, but I know what you
mean just the same. Driving this strategy is the question of how often
people violate the very soft C-order-only guideline mentioned in the
documentation, and who those people are. And we're going to have
violators of it, striving for performance and fewer copies of huge
datasets in memory... One common thorn in industry is MATLAB. The typical
MATLAB user doesn't know any of the APIs involved and expects things to
just work with a simple one-liner; correcting behavior on their side is
an intractable problem and something I've seen introduce bumps in
spreading HDF to others I work with. MATLAB itself saves its data in
Fortran order when using the mat serializers. I believe it was noted in
the past that they did this for performance, although I cannot find
references to this now, and that probably extends to not enforcing C
order in the low-level IO functions. I'll also note I had a difficult
time supporting generalized corrections in MATLAB when dealing with
multiple common cases, such as nested 3x3 or 4x4 matrices in compound
types. I'd always have to write preprocessing/postprocessing scripts
that were very slow since MATLAB was doing them.
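
For illustration, the kind of per-record fix-up described above can be sketched in C++ as follows. The `Pose` record and its `rot` field are hypothetical and not taken from any real file; the point is only that every nested matrix has to be transposed separately after a read, because the compound type carries no marker of its storage order.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical compound record with a nested 3x3 matrix field.
struct Pose {
    int id;
    double rot[3][3];
};

// Transpose the nested matrix of one record in place; for a square matrix this
// is equivalent to switching between row-major and column-major storage.
inline void transpose_in_place(Pose& p) {
    for (std::size_t r = 0; r < 3; ++r)
        for (std::size_t c = r + 1; c < 3; ++c)
            std::swap(p.rot[r][c], p.rot[c][r]);
}

// Apply the fix-up to every record of a dataset that was read into memory.
inline void fix_order(std::vector<Pose>& records) {
    for (Pose& p : records) transpose_in_place(p);
}
```

In compiled code this pass is cheap; the complaint in the post is that the equivalent loop has to be hand-written per type and runs slowly when done in MATLAB.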

I accept that first-class support of C/Fortran order is likely not
happening, and that saddens me, because in my eyes doing it
transparently in the core libraries, with something like properties, is
an investment that is going to pay off in support and user satisfaction.
On the one side, it'll probably be thankless work, but on the other
it'll remove a very ugly wart when sharing data between teams/members.
I'd say this wart has been my biggest barrier in getting scientific
(MATLAB) folks at work to use HDF5 directly.

I guess in my library I will default to converting column-major
matrices to/from row-major on the fly when simply outputting matrix
datasets... but this still doesn't work for column-major nested types
inside compounds/structs. The only solution I can figure there needs to
use the type information of the array types wrapping the matrix fields.
Putting the burden on the struct designer to make HDF save views of
compounds before IO is not a good one in my experience (it leads to a
lot of code), so the only thing I'm left with saying is: don't store
Fortran-order matrices in structs.
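
A minimal sketch of such an on-the-fly conversion for a plain 2-D matrix dataset, assuming the whole matrix fits in memory twice. `col_major_to_row_major` is an illustrative name, not a function of HDF5 or of the library being discussed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a rows x cols matrix stored column-major (Fortran order) into a new
// buffer stored row-major (C order). Element (r, c) keeps its logical position.
template <typename T>
std::vector<T> col_major_to_row_major(const std::vector<T>& src,
                                      std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<T> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[r * cols + c] = src[c * rows + r];
    return dst;
}
```

For example, the 2x3 matrix [[1,2,3],[4,5,6]] held column-major as {1,4,2,5,3,6} comes back as {1,2,3,4,5,6}. The same loop does not reach matrices nested inside a struct, which is the limitation noted above.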

-Jason

On Tue, Jun 9, 2015 at 1:24 PM, Werner Benger <werner@cct.lsu.edu> wrote:

Jason,

the reason would be to keep the complexity of HDF5 as small as
possible. Introducing index reordering into HDF5 increases complexity
and introduces possible sources of errors, especially as there is no
need for HDF5 to do it. HDF5 can just concentrate on storing all
datasets in C order, with the handling of Fortran indexing separated out
into an add-on library, similar to the h5lite library that is shipped
with HDF5.

Both the HDF5 tools such as hdfview and h5ls, and the HDF5 Fortran API,
would of course have to make use of that add-on library to set and
interpret such a "fortran-order" flag attribute. Using the "bare-bone"
HDF5 would be limited to mere C-order I/O.

Actually I had pretty much the same discussion ten years ago with other
users of HDF5 as well. They were the same arguments: the desire to change
HDF5 to support different index schemes, versus considering HDF5 as
C-only and doing anything else on top of it. Ultimately it's the
decision of the HDF team whether HDF5 should support different indexing
schemes in its core API. But the fact that it has never been done
demonstrates that it's unlikely to happen, and since it can be done via
an add-on library (which needs to be used by both the HDF5 tools and the
HDF5 fortran api, but it would not affect the HDF5 core), this seems to
be the easier and thus more realistic solution.

Werner

On 09.06.2015 19:30, Jason Newton wrote:

Werner,

What is the argument for leaving this to yet another add-on library on
top of HDF5? This strategy would still require that the user check after
reading, for instance, and call another API. I believe this is going to
make it a less than first-class citizen/feature at the least. Ideally we
want most users reading data to not even know this is happening, as when
content is chunked or compressed, although the metadata should be there
so the user can infer it will happen in their program.

Also, we want tools like hdfview, h5dump/h5ls to output the content
correctly too.

-Jason

On Tue, Jun 9, 2015 at 3:58 AM, Werner Benger <werner@cct.lsu.edu> wrote:

Basically what it needs is a convention such as an attribute to allow
identifying in which permutation order a dataset is stored...

As they say in

https://www.hdfgroup.org/HDF5/doc/fortran/index.html

"When a C application reads data stored from a Fortran program, the
data will appear to be transposed due to the difference in the C and
Fortran storage orders. For example, if Fortran writes a 4x6
two-dimensional dataset to the file, a C program will read it as a 6x4
two-dimensional dataset into memory. The HDF5 C utilities h5dump and
h5ls will also display transposed data, if data is written from a
Fortran program. "

But there is no way to find out whether data had been stored by a C or
Fortran program. A simple agreement on an attribute would do, even
better shared dataspaces that can hold such an attribute.
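
As a rough sketch of what such an agreement would buy a reader, assume the attribute has already been read into an in-memory flag. `IndexOrder`, `DatasetInfo` and `writer_dims` are hypothetical names; no such attribute is defined by HDF5 itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class IndexOrder { C, Fortran };  // hypothetical flag values

// Hypothetical in-memory view of a dataset's metadata: the dimensions as the
// C API reports them, plus the order flag read from the agreed attribute.
struct DatasetInfo {
    std::vector<std::size_t> file_dims;
    IndexOrder writer_order;
};

// Dimensions as the writing program saw them. A Fortran writer's 4x6 array is
// reported by the C API as 6x4, so the list is reversed when the flag says so.
inline std::vector<std::size_t> writer_dims(const DatasetInfo& info) {
    std::vector<std::size_t> dims = info.file_dims;
    if (info.writer_order == IndexOrder::Fortran)
        std::reverse(dims.begin(), dims.end());
    return dims;
}
```

Without the flag, a 6x4 dataset written from C and a 4x6 dataset written from Fortran are indistinguishable to a reader, which is the gap described above.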

All the index-permutation or data transposing (if really required) can
be in some add-on library on top of HDF5 (similar to what F5 does,
though F5 does more than just that).

Werner

On 09.06.2015 11:00, Jason Newton wrote:

Was hoping more commentary would have happened but I also had some
timing issues getting back to this, my apologies.

Werner, thank you for your reply, but your case is exactly the proof
that this is an issue that should be dealt with at the specification &
library level, as I am arguing. Permuting indices whenever accessing
data is a large burden to put on user code, especially considering how
many different bindings one might use to access the data. It leads to
repetitive and intrusive handling, which is not what the user should be
dealing with. It's tricky, automatable, isolatable (to the library),
difficult outside of C (at least in Python), and not the kind of task
users should be spending time on when using advanced software like
HDF5.

If we look at the example of Eigen and NumPy we can see they have flags
set for dealing with column/row order [Eigen: Storage orders] and
C/Fortran order [see the order argument: numpy.array, NumPy v2.2 Manual,
and http://docs.scipy.org/doc/numpy/reference/c-api.array.html]. This
shows at least some numerical processing code deemed it important enough
to not only deal with the issue, but usually provide seamless usage or
conversion to the user's desired type.

I think defaults can be set so as not to change current behaviour, but
datasets & arrays could now be marked with a flag such as Python's. When
reading/writing, an optional flag is provided for the memory space's
requested interpretation (defaulting to C or Fortran by language
context). We could potentially put this in the dataset properties and
type properties so we wouldn't have to change the API. And ideally, with
the work handled in C, the library permutes the storage for you as it
does the IO, at a hopefully negligible performance cost since IO is
likely the limiting factor.
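
A sketch of the read path this paragraph proposes, for arrays of any rank. The names `Order`, `offset` and `read_as` are assumptions made for illustration, and the "file" is just an in-memory buffer; nothing here is an existing HDF5 call. The data is permuted only when the order recorded for the file differs from the order requested for memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Order { C, Fortran };

// Linear offset of a multi-index under the given storage order.
inline std::size_t offset(const std::vector<std::size_t>& idx,
                          const std::vector<std::size_t>& dims, Order order) {
    std::size_t off = 0;
    if (order == Order::C) {
        for (std::size_t d = 0; d < dims.size(); ++d)
            off = off * dims[d] + idx[d];           // last index varies fastest
    } else {
        for (std::size_t d = dims.size(); d-- > 0;)
            off = off * dims[d] + idx[d];           // first index varies fastest
    }
    return off;
}

// Hypothetical read path: `file` holds the logical array `dims` in `file_order`;
// the caller asks for `mem_order`. Data is permuted only when the two differ.
template <typename T>
std::vector<T> read_as(const std::vector<T>& file,
                       const std::vector<std::size_t>& dims,
                       Order file_order, Order mem_order) {
    std::size_t total = 1;
    for (std::size_t d : dims) total *= d;
    assert(file.size() == total);
    if (file_order == mem_order) return file;       // plain copy, no permutation

    std::vector<T> mem(total);
    std::vector<std::size_t> idx(dims.size(), 0);
    for (std::size_t n = 0; n < total; ++n) {
        mem[offset(idx, dims, mem_order)] = file[offset(idx, dims, file_order)];
        // Advance idx like an odometer (last dimension fastest).
        for (std::size_t d = dims.size(); d-- > 0;) {
            if (++idx[d] < dims[d]) break;
            idx[d] = 0;
        }
    }
    return mem;
}
```

This naive version needs a second full-size buffer and has poor cache behaviour for large arrays; a real implementation would presumably work block-wise, which the sketch does not attempt.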

I brought this up because I'm writing a generalized HDF C++ library and
when trying to support something like Eigen (and more!), which allows
both C and F orders in the same runtime, it gets confusing on how to IO
to/from HDF files as the current approach relies on language level
wrappers to decide what the right thing to do is, and weakly at that.
But the user may genuinely want to IO in/out a fortran or C ordered
dataset/array to/from a C/fortran dataset/array in any combination for
what makes sense to them and this doesn't really work. I can be left
with baffling scenarios like this failing unless all data written to HDF
files is in C order:

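Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
A_c.setZero();
A_c.row(i).setConstant(5);  // i: any valid row index

Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;

hdf.write("A", A_c);  // hdf: the wrapper library being written

hdf.read("A", A_f);

assert(A_c == A_f);  // element-wise comparison, independent of storage order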

If in this scenario A was already written by a Fortran program, then
code making the above test case work would apply a conversion where none
is needed for a read like this, making this test case's assertion fail:

Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
A_c.setZero();
A_c.row(i).setConstant(5);  // i: any valid row index
Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
hdf.read("A", A_f);  // "A" was written earlier by a Fortran program
assert(A_c == A_f);

And that's why flags need to be saved in the document... the content
needs to specify its storage layout - guessing based on language cannot
cover all cases, and user-made attributes are not the way because that
would be a standard nobody knows about or will use.
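
To show why a stored flag resolves both scenarios, here is a toy stand-in for a file that records the layout next to each dataset. `MockFile`, `Entry` and `Layout` are hypothetical and have nothing to do with the real HDF5 file format; this is a sketch of the idea, not of an implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class Layout { RowMajor, ColMajor };

// Hypothetical stand-in for a file: each dataset keeps its raw buffer, its
// logical shape, and a flag recording the layout the buffer was written in.
struct Entry {
    std::vector<double> raw;
    std::size_t rows, cols;
    Layout layout;
};

class MockFile {
public:
    // Store the caller's buffer untouched and record its layout.
    void write(const std::string& name, const std::vector<double>& buf,
               std::size_t rows, std::size_t cols, Layout layout) {
        assert(buf.size() == rows * cols);
        data_[name] = Entry{buf, rows, cols, layout};
    }

    // Return the dataset in the layout the caller asks for, converting only
    // when the recorded layout differs from the requested one.
    std::vector<double> read(const std::string& name, Layout want) const {
        const Entry& e = data_.at(name);
        if (e.layout == want) return e.raw;
        std::vector<double> out(e.raw.size());
        for (std::size_t r = 0; r < e.rows; ++r)
            for (std::size_t c = 0; c < e.cols; ++c) {
                const std::size_t rm = r * e.cols + c;  // row-major offset
                const std::size_t cm = c * e.rows + r;  // column-major offset
                if (want == Layout::RowMajor) out[rm] = e.raw[cm];
                else                          out[cm] = e.raw[rm];
            }
        return out;
    }

private:
    std::map<std::string, Entry> data_;
};
```

Because the decision to convert is driven by the recorded flag rather than by the language of the reader, a column-major read gives the same logical matrix whether the dataset was written row-major or column-major.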

-Jason

On Tue, May 12, 2015 at 12:16 AM, Werner Benger <werner@cct.lsu.edu> wrote:

Hi Jason,

I was facing the same issues, as pretty much all the use cases I know
of and have in my visualization software and its context use and require
"fortran" order of indexing, including OpenGL graphics. It's not really
an issue with HDF5, as the only thing required is to permute the indices
when accessing the HDF5 API. And the HDF5 tools of course will display
data transposed then. This index permutation is supported in the F5
library via a generic permutation vector that is stored with a group of
datasets sharing the same properties (the F5 library is a C library on
top of HDF5 guiding towards a specific data model for various classes of
data types occurring particularly in scientific visualization):

http://www.fiberbundle.net/doc/structChartDomain__IDs.html

So via the F5 API one would see the fortran-like indexing convention,
whereas whenever accessing data with the lower-level HDF5 API, it's
C-like convention (whereby the permutation vector gives the option of
arbitrary permutations).
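
The idea of a permutation vector can be sketched in a few lines of C++. This is not F5's actual interface; `permute` and the convention that slot d of the output is taken from slot perm[d] of the input are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder a multi-index with a permutation vector: the index passed to the
// lower-level (C-order) API in slot d is taken from slot perm[d] of the
// user-facing index. perm = {N-1, ..., 0} gives the plain Fortran <-> C reversal.
inline std::vector<std::size_t> permute(const std::vector<std::size_t>& user_idx,
                                        const std::vector<std::size_t>& perm) {
    assert(user_idx.size() == perm.size());
    std::vector<std::size_t> out(perm.size());
    for (std::size_t d = 0; d < perm.size(); ++d) {
        assert(perm[d] < user_idx.size());
        out[d] = user_idx[perm[d]];
    }
    return out;
}
```

The same function applies to dimension lists, so a single stored vector can describe both how to reinterpret a dataset's shape and how to translate each access.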

I remember there had been plans by the HDF5 group to introduce "named
dataspaces", similarly to "named datatypes", that could then be stored
in the file as its own entity. Such would be a good place to store
properties of a dataspace as attributes on a dataspace, and to have such
shared among datasets. It would be a natural place to store a
permutation vector, which could be reduced to a simple flag as well to
just distinguish between C and fortran indexing conventions. Of course,
all the related tools would also need to honor such an attribute then.
Until then, one could use an attribute on each dataset and implement
index permutation similar to how the F5 library does it. It may be safer
to use new API functions anyway to not break old code that always
expects C order indexing.

Werner

On 12.05.2015 06:48, Jason Newton wrote:

Hi -

I've been an evangelist for HDF5 for a few years now. It is a noble
and amazing library that solves data storage issues occurring with
scientific and beyond applications - e.g. it can save many developers
from wasting time and money so they can spend that on solving more
original problems. But you guys knew that already. I think there's been
a mistake though - that is the lack of first-class column-vs-row major
storage. In a world where we are split down the middle on which format
we use, depending on the application, library and language we work in,
it is an ongoing reality that there will never be one true standard to
follow. But HDF5 sought to only support row-major - and I can back that
up - standardizing is a good thing. But then, as time has shown, that
really didn't work for a lot of folks - such as those in MATLAB and
Fortran - when they read our data - it looks transposed to them! When
HDF5 utils/our code sees their data - it looks transposed to us! These
are arguably the users you do not want to face these difficulties, as it
is downright embarrassing at times and hard to work around within that
language (ahem, MATLAB again is painful to work with). Not only that,
but it doesn't really scale - it will always take some manual fixing and
there's no standardized mark for whether a dataset is one of these
column-major masquerading datasets. So let me assure you this is quite
ugly to deal with in MATLAB/etc and doesn't seem to be the path many
people take - and it can require skills many people don't have or
understanding that they can't give.

But then, why did we allow saving column-major data in a row-based
standard in the first place? Well, the answer seems to be performance.
Surely it can't take that long to convert the datasets - most of the
time at least - although there would for sure be some memory-based
limitations on transposing as HDF does its IO. But alas - the current
state of the library indicates otherwise, and thus it is the user's job
to handle correctly transforming the data back and forth between
application and party. But wait - wasn't this kind of activity what HDF5
was built to alleviate in the first place?

So then how do we rectify the situation? Well, speaking as a developer
using HDF5 extensively and writing libraries for it - it looks to me
like it should be in the core library, as it is exceedingly messy to
handle on the user side each time. I think the interpretation of the
dataset and its dimensions should be based on dataset creation
properties. This would allow an official marking of what kind of
interpretation the raw storage of the data (and dimensions?) has.
However, this is only half of the battle. We'd need something like the
type conversion system to permute order in all the right places if the
user needs to IO an opposing storage layout. And it should be fast and
light on memory. Perhaps it would merely operate in place as a new
utility subroutine taking in the mem_type and user memory. However, I
can still think of one problem this does not address: compound types
using a mixture of philosophies, with fields being the opposite of the
dataset layout - and this case has me completely stumped, as it
indicates the marking should be at the type level as well. The compound
part of this is a sticky situation, but I'd still motion that the
dataset creation property works for most things that occur in practice.

So... has the HDF5 group tried to deal with this wart yet? Let me know
if anything is on the drawing board.

-Jason

