RFC: Special Values in HDF5

A new Request for Comments (RFC) on the handling of Special Values in HDF5 has just been published at http://hdfgroup.com/pubs/rfcs/RFC_Special_Values_in_HDF5.pdf.

The HDF Group is currently soliciting feedback on this RFC. Community comments will be one of the factors considered by The HDF Group in making the final design and implementation decisions.

Comments may be sent to help@hdfgroup.org.

-Ruth Aydt
The HDF Group

Hi Ruth,

On Tuesday 02 September 2008, Ruth Aydt wrote:

A new Request for Comments (RFC) on the handling of Special Values in
HDF5 has just been published at
http://hdfgroup.com/pubs/rfcs/RFC_Special_Values_in_HDF5.pdf .

The HDF Group is currently soliciting feedback on this RFC.
Community comments will be one of the factors considered by The HDF
Group in making the final design and implementation decisions.

Thanks for sharing this with us. After pondering the different
possibilities for a bit, I'd say that the "Parallel Special Values
Dataset" option looks best to me. Here is my rationale:

- I think that "Parallel Special Values Dataset" is more general
than "Attribute Triplet", in that the former allows describing highly
scattered special values more efficiently than the latter. I
personally find the "Attribute Triplet" more suited to geographical
purposes, but not to general special value distributions.

- As you said in your report, compression will greatly reduce the
space overhead of the extra datasets needed to keep the special values
in the "Parallel Special Values Dataset" approach. The "Attribute
Triplet", on the other hand, won't let you compress the data, so it is
perfectly possible that, in the end, the "Parallel Special Values
Dataset" would actually require less space on disk in many situations
(and not only in the scattered special values scenario).

- Moreover, reading a specific dataset of special values out of
a "Parallel Special Values Dataset" setup would probably be similar in
speed to, or perhaps faster than, an "Attribute Triplet" one. The
former will probably be much faster in a highly scattered special
values scenario. In a more 'geographic' scenario (i.e. the special
values are relatively contiguous), the "Attribute Triplet" approach
could be marginally faster, but if a compressed bit-mask dataset is
used to keep the special values in a "Parallel Special Values Dataset"
setup, that can be very fast to read too (where the crossover point
between the two approaches lies will depend on the spatial
distribution of the special values).

- Simple operations on dataset region selections (i.e. union,
intersection, complement) would be very easy to implement with
the "Parallel Special Values Dataset" approach, and would also perform
fast, IMO. This is because there is an easy conversion path from
special values datasets to bit-mask datasets (in many cases, the
special values dataset will be a bit-mask itself, so no conversion at
all would be needed), and computing unions, intersections or
complements on contiguous datasets is a fast operation on today's
superscalar processors (the integer '&', '|' and '~' operators).

- Finally, and in my opinion, a "Parallel Special Values Dataset" would
integrate better with existing "masked array" implementations in
numerical libraries (I'm thinking of NumPy here, but there are probably
others out there), in that they set up a pair of arrays in memory:
one that contains the regular values, and another (the mask) that says
whether each regular value is valid or not. It is clear that
the "Parallel Special Values Dataset" approach is more general than
this, but the parallelism between the two implementations is equally
evident, and it should allow for a better and more efficient
integration between both libraries (see the sketch right after this
list).
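
For illustration, here is a rough sketch of those last two points. It
assumes h5py and NumPy, and a hypothetical parallel uint8 bit-mask
dataset named 'special' stored next to the main 'data' dataset (the
bit assignments are made up):

import h5py
import numpy as np

ICE, CLOUD = 0x01, 0x02   # hypothetical bit flags for two special-value categories

# write a main dataset plus a compressed parallel bit-mask dataset
with h5py.File("example.h5", "w") as f:
    data = np.random.rand(1000, 1000)
    special = np.zeros(data.shape, dtype=np.uint8)
    special[data > 0.9] |= CLOUD              # mark some points as "cloud"
    special[data < 0.1] |= ICE                # mark some points as "ice"
    f.create_dataset("data", data=data, chunks=(100, 100), compression="gzip")
    # mostly zeros, so the parallel special-values dataset compresses very well
    f.create_dataset("special", data=special, chunks=(100, 100), compression="gzip")

# read back and use bitwise operators as the set operations
with h5py.File("example.h5", "r") as f:
    d = f["data"][...]
    s = f["special"][...]
    union = (s & (ICE | CLOUD)) != 0               # ice or cloud
    both = ((s & ICE) != 0) & ((s & CLOUD) != 0)   # ice and cloud
    valid = ~union                                 # complement: not special at all
    # direct mapping onto NumPy masked arrays
    masked = np.ma.masked_array(d, mask=union)
    print(masked.mean(), both.sum(), valid.sum())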

Having said this, I'm not especially against the "Attribute Triplet"
approach (it is better than nothing), but I think that the "Parallel
Special Values Dataset" has a lot of virtues and could be a better bet
in the long term (due to its generality, compressibility, simplicity
and high level of integration with existing computing libraries).

Cheers,

--
Francesc Alted
Freelance developer
Tel +34-964-282-249

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Hi,

I'd like to add a little to what Francesc is saying here. What struck
me when I read the RFC is that with the Attribute Triplet scenario it's
unclear how to efficiently turn a point coordinate into a "compiled"
special value (i.e. a single bitmasked integer). The "parallel dataset"
scenario solves this nicely, at the cost of having to explicitly
enumerate the values across the dataspace. I think this design is the
better one for general use, given the current limitations of the
dataspace API and the availability of chunking and compression to manage
the storage cost.

It seems like the issue addressed by the "attribute triplet" idea is
how to express the concept of set membership. We have a collection of
points (coordinates in the "main" dataset), each of which can be a
member of one or more externally defined categories. Attribute triplet
arrays express this through a list of region references labelled with
strings. Each region reference defines a set. In this sense, you don't
even need to store a "special value" with the reference; it could
easily be an "attribute doublet". The user can do whatever they like
with the region at read time, including using it to impose a mask
value (via H5Dfill).
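
As a rough sketch of that "attribute doublet" idea (h5py assumed; the
dataset and set names are made up, and the exact way the reference is
attached could certainly differ):

import h5py
import numpy as np

with h5py.File("regions.h5", "w") as f:
    dset = f.create_dataset("main", data=np.random.rand(100, 100))
    # an "attribute doublet": a labelled region reference, with no
    # special value attached to it
    ice_ref = dset.regionref[10:20, 30:40]
    dset.attrs.create("ice", ice_ref,
                      dtype=h5py.special_dtype(ref=h5py.RegionReference))

with h5py.File("regions.h5", "r") as f:
    dset = f["main"]
    ref = dset.attrs["ice"]
    ice_points = dset[ref]   # read only the points belonging to the "ice" set
    print(ice_points.shape)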

It's easy to select and read points which belong to a certain set (like
"ice") with this approach. The weakness, as the RFC implies, is that
there's no easy way to perform set-like operations on dataspaces (union,
intersection, complement, etc.). It's unclear how I would read all
"cloud and ice" points, or all "cloud but not ice" points in a single
operation, or even "all cloud points within this box". The dataspace
API would have to advance significantly for this strategy to be useful
beyond single-dataspace selections.

Conversely, a "parallel dataset" is an explicitly populated lookup
table. Each point contains a bitmask with the containing "sets"
explicitly listed. It's very easy to go from a coordinate to a list of
the containing sets. As Francesc pointed out, using bitmasks also
allows you to use bitwise & and | to replace the missing set operations.
This is much more in line with the traditional "element mask" idea found
in many numerical analysis environments. Even if the specification
didn't require a bitmask, element-wise addressing is still much better
suited to this convention than regions are.
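
A tiny sketch of that lookup-table view (plain NumPy; the bit
assignments are hypothetical):

import numpy as np

ICE, CLOUD, MISSING = 0x01, 0x02, 0x04    # hypothetical category bits

# the "parallel dataset": one bitmask per element of the main dataset
mask = np.zeros((4, 4), dtype=np.uint8)
mask[0, 0] = ICE | CLOUD
mask[1, 2] = CLOUD
mask[3, 3] = MISSING

# coordinate -> list of containing sets is a trivial lookup
names = {ICE: "ice", CLOUD: "cloud", MISSING: "missing"}
point = mask[0, 0]
print([n for bit, n in names.items() if point & bit])     # ['ice', 'cloud']

# the missing set operations become bitwise operations
cloud_not_ice = ((mask & CLOUD) != 0) & ((mask & ICE) == 0)
print(np.argwhere(cloud_not_ice))                          # [[1 2]]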

The RFC mentions the obvious disadvantages: there's no way to get hold
of all points in one category ("ice" or "ice and cloud") without
scanning the entire table. It's also more expensive to add or remove
categories, as you need to explicitly write to each member point, and
the number of categories is limited to the number of bits in the mask.

The limitations of each approach indicate to me that there are really
two well-distinguished use cases here. Perhaps there could even be two
specifications, one for "masked" datasets backed by lookup tables with
bitmasked/enumeration/user-provided values, and one for "set-like"
datasets, with a standardized storage convention for an unlimited number
of annotated region references, perhaps not even associated with
specific numerical values. Finally, I strongly agree with keeping this
out of the core library and in the form of a specification, at least for
now. All of this should be on top of the existing low-level
infrastructure.

Thanks,

Andrew Collette
h5py.alfven.org

I fully agree with the remarks made by Andrew and Francesc. Masks and regions are distinct cases.
Note that it is also possible to define regions in world coordinates (e.g. geographic longitude and latitude), but that is beyond the scope of this RFC.

I would like to remark that I think it is usually much more efficient to store a region as a bounding box and mask than as a list of element indices. Not only in space, but also in testing whether a dataset element is part of the region. Calculating the union, intersection, etc. of regions is also much more efficient that way (and could be done on the fly).
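
A small sketch of that representation (plain NumPy; the (origin, mask)
layout is just one possible convention):

import numpy as np

def region_union(origin_a, mask_a, origin_b, mask_b):
    """Union of two regions, each stored as a bounding-box origin plus a
    boolean mask covering that box."""
    origin_a, origin_b = np.asarray(origin_a), np.asarray(origin_b)
    end_a = origin_a + mask_a.shape
    end_b = origin_b + mask_b.shape
    origin = np.minimum(origin_a, origin_b)    # bounding box of the union
    end = np.maximum(end_a, end_b)
    out = np.zeros(tuple(end - origin), dtype=bool)
    sl_a = tuple(slice(o - g, o - g + n) for o, g, n in zip(origin_a, origin, mask_a.shape))
    sl_b = tuple(slice(o - g, o - g + n) for o, g, n in zip(origin_b, origin, mask_b.shape))
    out[sl_a] |= mask_a    # an intersection would AND the masks over the overlapping box
    out[sl_b] |= mask_b
    return origin, out

a = (np.array([0, 0]), np.ones((3, 3), dtype=bool))
b = (np.array([2, 2]), np.ones((3, 3), dtype=bool))
origin, mask = region_union(a[0], a[1], b[0], b[1])
print(origin, mask.shape)    # [0 0] (5, 5)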

Cheers,
Ger van Diepen

Hi all,

I tried to read the RFC and the responses as carefully as I could, so
here is my 2c:

I agree with the previous opinions that the parallel datasets are much
more flexible. Moreover, I assume they will require less intervention
in the library itself. For me the most important disadvantage is that
it would make the library more complex than it should be, and
therefore less attractive for people to join in. People already
complain that it is pretty complex, and I always advocate that it
should be as simple (but less ugly :wink: ) as XML. Taking this into
consideration, and from an architectural point of view, HDF5 should be
as simple a data format as possible (no special attributes that the
user doesn't recognize having put there), with a library for storing
and retrieving the data. The rest is business domain and has no
business in the core library. And I agree with Francesc that
the "parallel dataset design" is a better fit for general purposes.

In my own use case, I use the parallel datasets to define one-to-many
mappings between topologies. Using the attribute method would bloat the
attributes of the source topology.

Regards,

-- dimitris

Hi,

I have come across a performance issue with HDF5 groups, which really
puzzles me.

Here are the layouts of two HDF5 files:

1) Data organized with groups

HDF5 "/tmp/test.h5" {
FILE_CONTENTS {
group /group2
dataset /group2/dataset1
dataset /group2/dataset2
dataset /group2/dataset3
datatype /datatype1
group /group3
dataset /group3/dataset1
dataset /group3/dataset2
dataset /group3/dataset3
group /group4
dataset /group4/dataset1
dataset /group4/dataset2
dataset /group4/dataset3
datatype /datatype2
group /group1
dataset /group1/dataset1
dataset /group1/dataset2
dataset /group1/dataset3
}
}

2) Data organized with flat datasets

HDF5 "/tmp/test_un.h5" {
FILE_CONTENTS {
dataset /group2dataset1
dataset /group2dataset2
dataset /group2dataset3
datatype /datatype1
dataset /group3dataset1
dataset /group3dataset2
dataset /group3dataset3
dataset /group4dataset1
dataset /group4dataset2
dataset /group4dataset3
datatype /datatype2
dataset /group1dataset1
dataset /group1dataset2
dataset /group1dataset3
}
}

The same data is stored in exactly the same format in each dataset.
Amazingly, the performance of accessing the two files is quite
different.

To read the same amount of data (with the same compression level and
chunk size), it is about 2 times faster to read data from 2) than from
1). By running callgrind to profile the program, I found that the call
to inflate_fast() spent much less time for 2) than for format 1).

Does anyone know why? How would groups affect compression performance
in HDF5?
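
For reference, here is roughly how the two layouts can be produced (a
sketch only, assuming h5py and gzip; the dataset contents and chunk
shape are placeholders, and the committed datatypes are omitted):

import h5py
import numpy as np

data = np.random.rand(1000, 1000)

# 1) datasets organized under groups
with h5py.File("/tmp/test.h5", "w") as f:
    for g in ("group1", "group2", "group3", "group4"):
        grp = f.create_group(g)
        for d in ("dataset1", "dataset2", "dataset3"):
            grp.create_dataset(d, data=data, chunks=(100, 100), compression="gzip")

# 2) the same datasets at the root, with flattened names
with h5py.File("/tmp/test_un.h5", "w") as f:
    for g in ("group1", "group2", "group3", "group4"):
        for d in ("dataset1", "dataset2", "dataset3"):
            f.create_dataset(g + d, data=data, chunks=(100, 100), compression="gzip")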

Any help would be really appreciated!

Thanks,
Zane

What would be interesting for me would be to read a dataset as a
combination of a mask or a mapping and the target dataset, without
having to read both into memory (i.e. do the masking operation on the
fly). I do not know how interesting this is for others.
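
As a rough sketch of the kind of on-the-fly masking meant here (h5py
assumed; the dataset names, block size and the statistic are made up),
reading the data and the mask slice by slice instead of loading either
of them whole:

import h5py

def masked_sum(filename, data_name="data", mask_name="special",
               rows_per_block=256):
    """Accumulate a statistic over the non-special elements, block by block."""
    total = 0.0
    with h5py.File(filename, "r") as f:
        data, mask = f[data_name], f[mask_name]
        for start in range(0, data.shape[0], rows_per_block):
            stop = min(start + rows_per_block, data.shape[0])
            d = data[start:stop]       # only this block is read into memory
            m = mask[start:stop]
            total += d[m == 0].sum()   # apply the mask on the fly
    return total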

Hi Zane,

  That's definitely counterintuitive... :-? Can you write some simple programs that show this behavior, so we can see the details of what you are doing?

  Quincey

On Thursday 04 September 2008, Dimitris Servis wrote:

> What would be interesting for me would be to read a dataset as a
> combination of a mask or a mapping and the target dataset, without
> having to read both into memory (i.e. do the masking operation on the
> fly). I do not know how interesting this is for others.

Hmm, I hadn't thought about this, but it could be interesting in many
situations (for example, when the user doesn't have an implementation
of masked arrays in memory at hand). Ideally, one could even think of
a filter that does such automatic masking/mapping, so that the
'parallel' dataset can be transparent to the end user (they only have
to specify the dataset to read, the region/elements and the desired
mask/map, that's all). Pretty cool.

Cheers,

--
Francesc Alted
Freelance developer
Tel +34-964-282-249

Hi Francesc,

I've always had this idea of defining operations (mathematical and the
like) that can be performed on datasets while they are loaded, so that
you could, for example, add two datasets on the fly, probably saving a
lot of memory for the same I/O. Mapping/masking is also an operation
in this sense. Your mention of filters may be a good solution to this.
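
As a rough sketch of what that could look like with plain block-wise
reads (h5py assumed; the dataset names and block size are made up):

import h5py

def add_on_the_fly(filename, a_name, b_name, out_name, rows_per_block=256):
    """Write out = a + b without ever loading either dataset whole."""
    with h5py.File(filename, "a") as f:
        a, b = f[a_name], f[b_name]
        out = f.create_dataset(out_name, shape=a.shape, dtype=a.dtype,
                               chunks=True, compression="gzip")
        for start in range(0, a.shape[0], rows_per_block):
            stop = min(start + rows_per_block, a.shape[0])
            out[start:stop] = a[start:stop] + b[start:stop]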

thanks!

-- dimitris

Hi Dimitris,

Several years ago C++ classes and an expression grammar were developed
in the casacore package for on-the-fly mathematical operations on N-dim
astronomical image datasets (which can now also be in HDF5 format). It
can also apply masks (or calculate them on the fly) and regions (boxes,
polygons, etc.). It indeed works nicely.

Note that when adding, etc., datasets on the fly, you cannot let HDF5
apply the mask; otherwise you don't know what the corresponding
elements are.

Cheers,
Ger

"Dimitris Servis" <servisster@gmail.com> 09/04/08 12:29 PM >>>

Hi Francesc,

A Thursday 04 September 2008, Dimitris Servis escrigué:
>
> > Hi all,
> >
> > I tried to read the RFC and responses as carefully as I could so
> > here is my 2c:
> >
> > I agree with the previous opinions that the parallel datasets

are

> > much more flexible. Moreover I assume they will require less
> > intervention in the library itself. For me the most important
> > disadvantage is that it will make the library more complex than

it

> > should be and therefore less attractive for people to join in.
> > Already people complain that it is pretty complex and I always
> > advocate it should be as simple (but less ugly :wink: ) as XML.

Taking

> > this under consideration and from the architectural point of

view,

> > HDF5 should be as simple as possible a data format (no special
> > attributes that the user cannot recognize putting them there)

with

> > a library for storing and retrieving the data. The rest is

business

> > domain that has no business in the core library. And I agree

with

> > Fransesc that the "parallel dataset design" is more fit for

general

> > purposes.
> >
> > In my own use case, I use the parallel datasets to define
> > one-to-many mappings between topologies. Using the attribute

method

> > would bloat the attributes of the source topology.
> >
> > Regards,
> >
> > -- dimitris
>
> What would be interesting for me would be to read in a dataset as

a

> combination of a mask or a mapping and the target dataset without
> having to read both in memory (i.e. do the masking operation on

the

> fly). I do not know how interesting this is for others.

Hmm, I didn't think about this, but that could be interesting in

many

situations (for example, when the user doesn't have an

implementation

of masked arrays in memory at hand). Ideally, one can even think

about

a filter that is able to do such automatic masking/mapping, so
the 'parallel' dataset can be transparent to the end user (he only

has

to specify the dataset to read, the region/elements and the desired
mask/map, that's all). Pretty cool.

Cheers,

--
Francesc Alted
Freelance developer
Tel +34-964-282-249

Hi Francesc,

I always had this idea of defining operations (mathematical and the
like)
that can be performed on datasets while they were loaded. So that you
could
add two datasets on the fly for example, thus saving a lot of memory
probably for the same IO. Mapping/masking is also an operation in this
sense. Your mentioning of filters may be a good solution to this.

thanks!

-- dimitris

···

2008/9/4 Francesc Alted <faltet@pytables.com>

> 2008/9/4 Dimitris Servis <servisster@gmail.com>

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Hi Ger & Dimitris,

On Thursday 04 September 2008, Ger van Diepen wrote:

> Hi Dimitris,
>
> Several years ago C++ classes and an expression grammar were developed
> in the casacore package for on-the-fly mathematical operations on N-dim
> astronomical image datasets (which can now also be in HDF5 format). It
> can also apply masks (or calculate them on the fly) and regions (boxes,
> polygons, etc.). It indeed works nicely.

Hmm, I'm not sure if overloading HDF5 filters is the correct path to
implement such general operations; I tend to think that this is
more a matter for the application on top of HDF5. For example, if the
application already has a buffer for doing the I/O, implementing
general operations at the application layer could be quite easy and
probably much more flexible than using filters.

> Note that when adding, etc., datasets on the fly, you cannot let HDF5
> apply the mask; otherwise you don't know what the corresponding
> elements are.

That's true. And this is another reason to defer these complex
operations to the application layer, even masking, provided that the
app already knows how to deal with masked arrays. Filters should only
be used to perform very simple and well-defined operations.

Francesc

Hi Francesc,

Sorry I was a bit unclear.
I don't use filters; it is done in application code on top of HDF5 or
other storage formats for the image data (like FITS).
I fully agree filters are not the appropriate place to do it; I don't
see how a single filter could add two or more data sets from possibly
different HDF5 files.

Cheers,
Ger

Hi Ger & Francesc,

I think Francesc refers to my mentioning of filters, and I think you are both
right that filters are not the appropriate way to do it. My mind jumped to
the idea of using one dataset as a filter for another.

I absolutely agree that the definition of such operations is business
related. However, the implementation of generic dataset combinations would,
I'm afraid, be more low-level, as one would have to operate on an element
basis in order to do it efficiently and not load entire datasets into memory.
AFAIK the only alternative would be to use H5Diterate (can we use two reading
threads there?), and I admit I have no idea how expensive and optimized that
is for such operations. On the other hand, I see the benefit of an
application managing its own buffers (PyTables does it, I think), loading
parts of the two datasets and manipulating them. I am not sure I have got
all the concepts right so far...

That's a really interesting subject, thanks for the replies and help!

-- dimitris

Hi Dimitris,

casacore does not load the entire datasets into memory; the datasets
can be much bigger than memory.
The expression is just another type of image and is evaluated on the
fly: the user asks for a chunk of data from the image expression,
which is then evaluated for that chunk only. Internally it uses
iterators to do this efficiently (usually chunk by chunk).
A reduction function like median can be part of the expression (e.g.
to clip based on a median); it has a special implementation because it
requires histogramming and a partial sort.

Ger

"Dimitris Servis" <servisster@gmail.com> 09/04/08 3:22 PM >>>

Hi Ger & Francesc

Hi Ger & Dimitris,

A Thursday 04 September 2008, Ger van Diepen escrigué:
> Hi Dimitris,
>
> Several years ago C++ classes and an expression grammar were
> developed in the casacore package for on-the-fly mathematical
> operations on N-dim astronomical image datasets (which can now

also

> be in HDF5 format). It can also apply masks (or calculate them on

the

> fly) and regions (boxes, polygons, etc.). It indeed works nicely.

Hmm, I'm not sure if overloading HDF5 filters is the correct path to
implement such a general operations, but I tend to think that this

is

more a matter of the application on top of HDF5. For example, if

the

application already has a buffer for doing the I/O, implementing
general operations at the application layer could be quite easy and
probably much more flexible than using filters.

> Note that when adding, etc datasets on the fly you cannot let HDF5
> apply the mask, otherwise you don't know what the corresponding
> elements are.

That's true. And this is another reason to defer these complex
operations to the application layer --even masking, provided that

the

app already knows how to deal with masked arrays. Filters should be
only used to perform very simple and well defined operations.

Francesc

>
> Cheers,
> Ger
>
> >>> "Dimitris Servis" <servisster@gmail.com> 09/04/08 12:29 PM >>>
>
> Hi Francesc,
>
>
> > A Thursday 04 September 2008, Dimitris Servis escrigué:
> > >
> > > > Hi all,
> > > >
> > > > I tried to read the RFC and responses as carefully as I

could

> > > > so here is my 2c:
> > > >
> > > > I agree with the previous opinions that the parallel

datasets

>
> are
>
> > > > much more flexible. Moreover I assume they will require less
> > > > intervention in the library itself. For me the most

important

> > > > disadvantage is that it will make the library more complex

than

>
> it
>
> > > > should be and therefore less attractive for people to join

in.

> > > > Already people complain that it is pretty complex and I

always

> > > > advocate it should be as simple (but less ugly :wink: ) as XML.
>
> Taking
>
> > > > this under consideration and from the architectural point of
>
> view,
>
> > > > HDF5 should be as simple as possible a data format (no

special

> > > > attributes that the user cannot recognize putting them

there)

>
> with
>
> > > > a library for storing and retrieving the data. The rest is
>
> business
>
> > > > domain that has no business in the core library. And I agree
>
> with
>
> > > > Fransesc that the "parallel dataset design" is more fit for
>
> general
>
> > > > purposes.
> > > >
> > > > In my own use case, I use the parallel datasets to define
> > > > one-to-many mappings between topologies. Using the attribute
>
> method
>
> > > > would bloat the attributes of the source topology.
> > > >
> > > > Regards,
> > > >
> > > > -- dimitris
> > >
> > > What would be interesting for me would be to read in a dataset

as

>
> a
>
> > > combination of a mask or a mapping and the target dataset

without

> > > having to read both in memory (i.e. do the masking operation

on

>
> the
>
> > > fly). I do not know how interesting this is for others.
> >
> > Hmm, I didn't think about this, but that could be interesting in
>
> many
>
> > situations (for example, when the user doesn't have an
>
> implementation
>
> > of masked arrays in memory at hand). Ideally, one can even

think

>
> about
>
> > a filter that is able to do such automatic masking/mapping, so
> > the 'parallel' dataset can be transparent to the end user (he

only

>
> has
>
> > to specify the dataset to read, the region/elements and the

desired

> > mask/map, that's all). Pretty cool.
> >
> > Cheers,
> >
> > --
> > Francesc Alted
> > Freelance developer
> > Tel +34-964-282-249
>
> Hi Francesc,
>
> I always had this idea of defining operations (mathematical and

the

> like)
> that can be performed on datasets while they were loaded. So that

you

> could
> add two datasets on the fly for example, thus saving a lot of

memory

···

2008/9/4 Francesc Alted <faltet@pytables.com>

> 2008/9/4 Francesc Alted <faltet@pytables.com>
> > > 2008/9/4 Dimitris Servis <servisster@gmail.com>
> probably for the same IO. Mapping/masking is also an operation in
> this sense. Your mentioning of filters may be a good solution to
> this.
>
> thanks!
>
> -- dimitris
>
>
>

---------------------------------------------------------------------

>- This mailing list is for HDF software users discussion.
> To subscribe to this list, send a message to
> hdf-forum-subscribe@hdfgroup.org. To unsubscribe, send a message

to

> hdf-forum-unsubscribe@hdfgroup.org.

--
Francesc Alted
Freelance developer
Tel +34-964-282-249

----------------------------------------------------------------------

This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to
hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to

hdf-forum-unsubscribe@hdfgroup.org.

I think Francesc refers to my mentioning of filters and I think you are
both
right that filters are not the appropriate way to do it. My mind jumped
to
the idea of using one dataset as filter for another.

I absolutely agree that the definition of such operations is business
related. However the implementation of generic dataset combinations,
I'm
afraid would be more low level as one would have to operate on an
element
basis in order to do it efficiently and not load entire datasets in
memory.
AFAIK the only alternative would be to use H5Diterate (can we use 2
reading
threads there?) and I admit I have no idea how expensive and optimized
this
is for such operations. On the other hand I see the benefit of an
application managing its own buffers (PyTables for example do it I
think),
loading parts of the two datasets and manipulating them. I am not sure
if I
got all the concepts right until now...

That's a really interesting subject, thanks for the replies and help!

-- dimitris

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

On Thursday 04 September 2008, Dimitris Servis wrote:

> I think Francesc refers to my mentioning of filters and I think you
> are both right that filters are not the appropriate way to do it. My
> mind jumped to the idea of using one dataset as a filter for another.
>
> I absolutely agree that the definition of such operations is business
> related. However, the implementation of generic dataset combinations
> would, I'm afraid, be more low-level, as one would have to operate on
> an element basis in order to do it efficiently and not load entire
> datasets into memory. AFAIK the only alternative would be to use
> H5Diterate (can we use two reading threads there?)

You can use threads, yes. However, HDF5 blocks threads internally at
quite a coarse-grained level while the library is accessing critical
parts (I don't know whether they are working on reducing this), so in
the end you should not see much of an increase in performance, IMO.

> and I admit I have no idea how expensive and optimized that is for
> such operations. On the other hand, I see the benefit of an
> application managing its own buffers (PyTables does it, I think),
> loading parts of the two datasets and manipulating them. I am not
> sure I have got all the concepts right so far...

Yes, PyTables implements buffered I/O (only on compound datasets, as
they are the cornerstone of PyTables). And in fact, the I/O buffers are
used to compute arbitrarily complex expressions (using an enhanced
computing kernel in C named Numexpr [1]) between the columns of the
user's tables, without the need to read the complete dataset into memory.

This approach seems similar to how Ger is using casacore, but a larger
buffer is used instead of single chunks. Using a large buffer is
important because, in order to achieve maximum speed, modern CPUs
require buffers that are generally larger than typical chunk sizes (my
experiments suggest that buffers 10x larger are generally enough to
keep the pipelines busy most of the time, although that depends on the
number of fields and the chunk size).

And this works pretty well, as you can see in the speed-ups achieved by
in-kernel [2] and indexed [3] queries (both use the computational
kernel to speed up data selections).

[1] http://code.google.com/p/numexpr/
[2] http://www.pytables.org/docs/manual/ch05.html#inkernelSearch
[3] http://www.pytables.org/docs/manual/ch05.html#indexedSearches
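
As a toy sketch of the buffered-evaluation idea (h5py plus the numexpr
package; the dataset names, the expression and the block size are made
up, and this is of course far simpler than what PyTables really does):

import h5py
import numexpr as ne

def query_count(filename, rows_per_block=100000):
    """Count rows satisfying an expression, evaluated block by block."""
    hits = 0
    with h5py.File(filename, "r") as f:
        pressure, temperature = f["pressure"], f["temperature"]
        for start in range(0, pressure.shape[0], rows_per_block):
            stop = min(start + rows_per_block, pressure.shape[0])
            p = pressure[start:stop]      # buffers much larger than one chunk
            t = temperature[start:stop]
            sel = ne.evaluate("(p > 10) & (t < 300)")   # whole expression in one pass
            hits += int(sel.sum())
    return hits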

> That's a really interesting subject, thanks for the replies and help!

Yeah, I'm learning quite a bit too!

--
Francesc Alted
Freelance developer
Tel +34-964-282-249
