Hi Dimitris,
casacore does not load entire datasets into memory; the datasets can
be much bigger than memory.
The expression is another type of image and is evaluated on the fly:
the user asks for a chunk of data from the image expression, which is
then evaluated for that chunk only. Internally it uses iterators to do
this efficiently (usually chunk by chunk).
A reduction function like median can be part of the expression (e.g.
to clip based on a median). It has a special implementation because it
requires histogramming and a partial sort.
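A minimal sketch of the idea in plain Python (illustrative only, not casacore's actual API): the expression is itself an image-like object, and asking it for a chunk evaluates the expression for that chunk alone, so the full operands are never combined in memory at once.

```python
# Illustrative "expression image" (not casacore's API): indexing it
# evaluates the expression only for the requested chunk.
class ExprImage:
    def __init__(self, fn, *operands):
        self.fn = fn
        self.operands = operands

    def __getitem__(self, key):
        # Pull only the requested slice from each operand, then apply
        # the expression element-wise to that chunk.
        chunks = (op[key] for op in self.operands)
        return [self.fn(*vals) for vals in zip(*chunks)]

a = list(range(10))      # stand-ins for two large datasets
b = [10] * 10
mean_img = ExprImage(lambda x, y: (x + y) / 2, a, b)
print(mean_img[2:5])     # evaluates elements 2..4 only -> [6.0, 6.5, 7.0]
```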
Ger
"Dimitris Servis" <servisster@gmail.com> 09/04/08 3:22 PM >>>
Hi Ger & Francesc
Hi Ger & Dimitris,
On Thursday 04 September 2008, Ger van Diepen wrote:
> Hi Dimitris,
>
> Several years ago C++ classes and an expression grammar were
> developed in the casacore package for on-the-fly mathematical
> operations on N-dim astronomical image datasets (which can now also
> be in HDF5 format). It can also apply masks (or calculate them on
> the fly) and regions (boxes, polygons, etc.). It indeed works nicely.
Hmm, I'm not sure overloading HDF5 filters is the correct path to
implement such general operations; I tend to think that this is more a
matter of the application on top of HDF5. For example, if the
application already has a buffer for doing the I/O, implementing
general operations at the application layer could be quite easy and
probably much more flexible than using filters.
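A sketch of this application-layer approach, with plain lists standing in for HDF5 dataset reads (the reader callbacks and chunk size are illustrative assumptions): two datasets are combined through one fixed-size buffer pass, so the operation lives in the application rather than in a filter.

```python
# Application-level on-the-fly addition of two "datasets", reading a
# fixed-size chunk of each at a time so neither is fully loaded.
def add_streamed(read_a, read_b, n, chunk=4):
    """Element-wise a + b over n elements, `chunk` elements per pass."""
    out = []
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # In a real application these would be partial dataset reads
        # (e.g. hyperslab selections into a preallocated buffer).
        buf_a = read_a(start, stop)
        buf_b = read_b(start, stop)
        out.extend(x + y for x, y in zip(buf_a, buf_b))
    return out

ds_a = list(range(10))
ds_b = list(range(10, 20))
result = add_streamed(lambda s, e: ds_a[s:e], lambda s, e: ds_b[s:e], 10)
print(result)  # [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
```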
> Note that when adding, etc., datasets on the fly, you cannot let HDF5
> apply the mask; otherwise you don't know what the corresponding
> elements are.
That's true. And this is another reason to defer these complex
operations to the application layer -- even masking, provided that the
app already knows how to deal with masked arrays. Filters should only
be used to perform very simple and well-defined operations.
Francesc
>
> Cheers,
> Ger
>
> >>> "Dimitris Servis" <servisster@gmail.com> 09/04/08 12:29 PM >>>
>
> Hi Francesc,
>
>
> > On Thursday 04 September 2008, Dimitris Servis wrote:
> > >
> > > > Hi all,
> > > >
> > > > I tried to read the RFC and responses as carefully as I
> > > > could, so here is my 2c:
> > > >
> > > > I agree with the previous opinions that the parallel datasets
> > > > are much more flexible. Moreover, I assume they will require
> > > > less intervention in the library itself. For me the most
> > > > important disadvantage is that it will make the library more
> > > > complex than it should be, and therefore less attractive for
> > > > people to join in. Already people complain that it is pretty
> > > > complex, and I always advocate it should be as simple (but
> > > > less ugly) as XML. Taking this into consideration, and from
> > > > the architectural point of view, HDF5 should be as simple as
> > > > possible a data format (no special attributes that the user
> > > > cannot recognize having put there) with a library for storing
> > > > and retrieving the data. The rest is business domain that has
> > > > no business in the core library. And I agree with Francesc
> > > > that the "parallel dataset design" is more fit for general
> > > > purposes.
> > > >
> > > > In my own use case, I use the parallel datasets to define
> > > > one-to-many mappings between topologies. Using the attribute
> > > > method would bloat the attributes of the source topology.
> > > >
> > > > Regards,
> > > >
> > > > -- dimitris
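One concrete way to picture the one-to-many mapping Dimitris mentions, stored as two "parallel" datasets (the names and CSR-style layout here are illustrative assumptions, not his actual schema): an offsets array indexing into a flat targets array, instead of per-element attributes on the source topology.

```python
# Two parallel "datasets" encoding a one-to-many mapping, CSR-style:
# source element i maps to targets[offsets[i]:offsets[i + 1]].
offsets = [0, 2, 2, 5]
targets = [7, 9, 1, 2, 3]

def targets_of(i):
    """Return the target elements mapped from source element i."""
    return targets[offsets[i]:offsets[i + 1]]

print(targets_of(0))  # [7, 9]
print(targets_of(1))  # []   (element 1 maps to nothing)
print(targets_of(2))  # [1, 2, 3]
```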
> > >
> > > What would be interesting for me would be to read in a dataset
> > > as a combination of a mask or a mapping and the target dataset
> > > without having to read both in memory (i.e. do the masking
> > > operation on the fly). I do not know how interesting this is for
> > > others.
> >
> > Hmm, I didn't think about this, but that could be interesting in
> > many situations (for example, when the user doesn't have an
> > implementation of masked arrays in memory at hand). Ideally, one
> > can even think about a filter that is able to do such automatic
> > masking/mapping, so the 'parallel' dataset can be transparent to
> > the end user (he only has to specify the dataset to read, the
> > region/elements and the desired mask/map, that's all). Pretty cool.
> >
> > Cheers,
> >
> > --
> > Francesc Alted
> > Freelance developer
> > Tel +34-964-282-249
>
> Hi Francesc,
>
> I always had this idea of defining operations (mathematical and the
> like) that can be performed on datasets while they are loaded, so
> that you could add two datasets on the fly, for example, thus saving
> a lot of memory, probably for the same IO. Mapping/masking is also
> an operation in this sense. Your mentioning of filters may be a good
> solution to this.
>
> thanks!
>
> -- dimitris
>
>
>
> ----------------------------------------------------------------------
> This mailing list is for HDF software users discussion.
> To subscribe to this list, send a message to
> hdf-forum-subscribe@hdfgroup.org. To unsubscribe, send a message to
> hdf-forum-unsubscribe@hdfgroup.org.
--
Francesc Alted
Freelance developer
Tel +34-964-282-249
I think Francesc refers to my mentioning of filters, and I think you
are both right that filters are not the appropriate way to do it. My
mind jumped to the idea of using one dataset as a filter for another.
I absolutely agree that the definition of such operations is business
related. However, the implementation of generic dataset combinations,
I'm afraid, would be more low-level, as one would have to operate on
an element basis in order to do it efficiently and not load entire
datasets into memory. AFAIK the only alternative would be to use
H5Diterate (can we use 2 reading threads there?), and I admit I have
no idea how expensive and optimized this is for such operations. On
the other hand, I see the benefit of an application managing its own
buffers (PyTables, for example, does this I think), loading parts of
the two datasets and manipulating them. I am not sure I have got all
the concepts right until now...
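The "one dataset as a filter for another" idea can be sketched like this (plain lists stand in for HDF5 partial reads; the chunking is an illustrative assumption): matching chunks of the data and mask datasets are read side by side, and only unmasked values are produced, so neither dataset is ever fully resident in memory.

```python
# Stream a masked read: for each chunk, read the data and the
# corresponding piece of the mask, and yield only the kept values.
def masked_read(data, mask, chunk=3):
    for start in range(0, len(data), chunk):
        d = data[start:start + chunk]   # partial read of the data
        m = mask[start:start + chunk]   # matching partial read of the mask
        for value, keep in zip(d, m):
            if keep:
                yield value

print(list(masked_read([10, 20, 30, 40, 50], [1, 0, 1, 1, 0])))  # [10, 30, 40]
```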
That's a really interesting subject, thanks for the replies and help!
-- dimitris