to filter or not to filter

I am seeking opinions on the computational load of a simple HDF5 filter, which will dictate how I encapsulate images in HDF5.

Problem:
Most of the software that generates images in cryo-EM produces a single density modality on a 3D lattice. Generally the numerical values are IEEE floating point, but on occasion they are unsigned bytes; several other numerical representations are permissible. The range of the values is fairly arbitrary, which is the source of the problem; typically the values span roughly +/-10,000. The value zero has no significant meaning, yet it causes a lot of visualization problems for data between -1 and +1, particularly where division is involved. Worse, some programs attach a special meaning to zero, such as a void value used in masking/clamping, which introduces a new problem of co-mingling density and mask values indistinguishably, forever altering the distinction and the histogram.

Proposed solution:
Shift all density values so they are positive, with one as the minimum value. Zero would be reserved to indicate nothingness, such as clamping to exclude density during mask segmentation. An alternate approach would be to use NaN, but this has several problems, including breaking a lot of software.

When encapsulating my 3D image in HDF5 I could perform the simple shift at write time and create metadata indicating the shift. Doing this does not require supporting a shift filter.
The alternate approach is to keep the density values as-is during encapsulation, and upon reading the file to shift the density values dynamically using an HDF5 filter.
The image sizes range from 30 GB to 4 TB.
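
For concreteness, here is a minimal sketch (HDF5 C API) of the first approach: apply the shift while writing and record it in an attribute. The dataset and attribute names ("density", "density_shift") are made up for the example, and the in-core loop is only illustrative; at 30 GB to 4 TB the shift would of course be applied chunk by chunk.

#include <stdlib.h>
#include "hdf5.h"

/* Write a 3D float volume with a constant shift applied, and record the
 * shift in a scalar attribute so readers can recover the original values.
 * 'n' is the total number of elements (dims[0]*dims[1]*dims[2]). */
void write_shifted(hid_t file, const float *data, const hsize_t dims[3],
                   double shift, size_t n)
{
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "density", H5T_IEEE_F32LE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Shown in-core for brevity; a real writer would stream hyperslabs. */
    float *shifted = malloc(n * sizeof *shifted);
    for (size_t i = 0; i < n; i++)
        shifted[i] = (float)(data[i] + shift);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, shifted);
    free(shifted);

    /* Metadata: the shift that was applied. */
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t attr   = H5Acreate2(dset, "density_shift", H5T_IEEE_F64LE, aspace,
                              H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_DOUBLE, &shift);

    H5Aclose(attr);
    H5Sclose(aspace);
    H5Dclose(dset);
    H5Sclose(space);
}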

My inclination is to go with the first method, for reasons of computation and of not having to maintain/distribute the filter.
But I am curious whether there is any significant computational cost to the second method.

Matthew Dougherty
National Center for Macromolecular Imaging
Baylor College of Medicine

We are dealing with remote sensing data of somewhat smaller size (several
hundred MB to 1 GB) that are two-dimensional and have floating-point
values. However, the general visualization requirements are probably very
similar. I am still wondering why you would want to shift the value range
in the first place. For most of our data, when they are map projected, we
have to deal with zero fill. For visualization purposes, we usually define
zero as the no-data value and exclude it from our statistics. In order to
deal with the occasional outliers in the data we clamp the data to within
two standard deviations of the mean value. That takes care of most issues.
Is that an approach that would work for your case?
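
For what it is worth, a rough sketch of that statistics/clamping step is below; the no-data sentinel and the simple two-pass structure are just for illustration.

#include <math.h>
#include <stddef.h>

/* Compute mean and standard deviation while skipping the no-data value,
 * then clamp all valid samples to mean +/- 2 sigma. */
void clamp_two_sigma(float *data, size_t n, float no_data)
{
    double sum = 0.0, sumsq = 0.0;
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {          /* statistics exclude no-data */
        if (data[i] == no_data) continue;
        sum   += data[i];
        sumsq += (double)data[i] * data[i];
        count++;
    }
    if (count == 0) return;

    double mean  = sum / count;
    double sigma = sqrt(sumsq / count - mean * mean);
    float lo = (float)(mean - 2.0 * sigma);
    float hi = (float)(mean + 2.0 * sigma);

    for (size_t i = 0; i < n; i++) {          /* clamp valid samples only */
        if (data[i] == no_data) continue;
        if (data[i] < lo)      data[i] = lo;
        else if (data[i] > hi) data[i] = hi;
    }
}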

Best regards,
Rudi


--
Rudi Gens
Alaska Satellite Facility, Geophysical Institute, University of Alaska
Fairbanks
903 Koyukuk Dr., P.O. Box 757320, Fairbanks, AK 99775-7320, USA
Phone: 1-907-4747621 Fax: 1-907-4746441
Email: rgens@alaska.edu URL: http://www.gi.alaska.edu/~rgens


Hi Matt,

  I seem to remember that there already is a compression filter in HDF5 that
allows storing an IEEE floating-point number as an integer value with a
floating-point scale and offset parameter attached to it, so on reading
you get floating-point values back, limited to the precision of the underlying
integers, but also benefiting from integer compression schemes.
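
If I remember correctly that is the scale-offset filter; a minimal sketch of enabling it on a chunked float dataset might look like the following. The chunk dimensions, dataset name and the choice of three decimal digits of retained precision are arbitrary assumptions for the example.

#include "hdf5.h"

/* Create a chunked float dataset with the built-in scale-offset filter
 * (D-scaling variant for floating point) enabled on its creation
 * property list. */
hid_t create_scaleoffset_dset(hid_t file, const hsize_t dims[3])
{
    const hsize_t chunk[3] = {64, 64, 64};        /* illustrative chunk size */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_scaleoffset(dcpl, H5Z_SO_FLOAT_DSCALE, 3);  /* keep 3 decimal digits */

    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "density", H5T_IEEE_F32LE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Sclose(space);
    H5Pclose(dcpl);
    return dset;
}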

For 'special values' it might be best to use an attribute on the dataset
that tells which value, or even value range, means something particular,
such as a mask. You could even use the HDF5 internal default fill value for
this purpose - this value will be used if you, for instance, read a chunk
of a chunked dataset that doesn't exist on disk, which makes it
a good candidate for an 'undefined' region in the data.

http://www.hdfgroup.org/HDF5/doc_resource/H5Fill_Values.html

You would just need to set a fill value that does not occur as valid data.
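
Something along those lines could look like this sketch; the sentinel value and chunk dimensions are only examples and would have to be chosen so they cannot collide with real density values.

#include "hdf5.h"

/* Create a chunked dataset whose fill value serves as the 'undefined'
 * marker; chunks that are never written read back as this sentinel. */
hid_t create_with_fill(hid_t file, const hsize_t dims[3])
{
    const float undefined = -9999.0f;             /* example sentinel value */
    const hsize_t chunk[3] = {64, 64, 64};
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_fill_value(dcpl, H5T_NATIVE_FLOAT, &undefined);

    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "density", H5T_IEEE_F32LE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Sclose(space);
    H5Pclose(dcpl);
    return dset;
}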

Shifting data values to circumvent limitations of the visualization
software sounds like a last-resort solution, but if required, the shift
should be fairly quick with a compression-like filter that is specific to
this software. In the HDF5 file itself you might prefer to keep data
values as close to their original values as possible, so doing the
dynamic shift (specific to each viz software that you use) upon
reading sounds better than doing it during writing. I would expect the
computational cost to be negligible compared to disk I/O - however,
there is an overhead in terms of RAM usage, since compression filters
require additional buffer memory. This might be an issue for large
datasets, but it should be controllable with bug-free coding (i.e.
no memory leaks) and sufficiently small chunking of the datasets.
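
For reference, a bare-bones sketch of such a user-defined shift filter, registered with H5Zregister, is below. The filter id, the hard-coded shift and the assumption of 32-bit float chunks are placeholders; a real filter would pass the shift through the cd_values client data and validate the datatype in a can_apply callback.

#include "hdf5.h"

#define SHIFT_FILTER_ID 256          /* user-defined filter ids start at 256 */
#define SHIFT_VALUE     10001.0f     /* placeholder shift */

/* Filter callback: operates on raw chunk bytes, assumed to be 32-bit floats.
 * On read (H5Z_FLAG_REVERSE) the shift is added; on write it is removed so
 * the file keeps the original values (assumes the in-memory buffer carries
 * the shift). */
static size_t shift_filter(unsigned flags, size_t cd_nelmts,
                           const unsigned cd_values[], size_t nbytes,
                           size_t *buf_size, void **buf)
{
    float *vals = *buf;
    size_t n = nbytes / sizeof(float);
    (void)cd_nelmts; (void)cd_values; (void)buf_size;

    if (flags & H5Z_FLAG_REVERSE) {
        for (size_t i = 0; i < n; i++) vals[i] += SHIFT_VALUE;   /* reading */
    } else {
        for (size_t i = 0; i < n; i++) vals[i] -= SHIFT_VALUE;   /* writing */
    }
    return nbytes;                   /* data size is unchanged */
}

static const H5Z_class2_t SHIFT_FILTER_CLASS = {
    H5Z_CLASS_T_VERS, SHIFT_FILTER_ID,
    1, 1,                            /* encoder and decoder present */
    "density shift",
    NULL, NULL,                      /* no can_apply / set_local callbacks */
    shift_filter
};

/* Register once per process; attach to a chunked dataset's creation
 * property list with H5Pset_filter(). */
herr_t register_shift_filter(void)
{
    return H5Zregister(&SHIFT_FILTER_CLASS);
}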

           Werner


--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362