control of scale-offset parameters in HDF5

I am a fan of the scale-offset filter followed by the gzip filter to
really reduce the size of big 3D datasets of weather model data. I am
using this compression strategy with HDF5 to do massively parallel
simulations and writing out one HDF5 file per MPI process.
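
For concreteness, here is a minimal sketch of the kind of property-list
setup I mean; the file and dataset names, dimensions, chunk sizes, and
decimal scale factor are just placeholders:

```c
#include "hdf5.h"

int main(void)
{
    /* One file per MPI process; the name here is just a placeholder. */
    hid_t file = H5Fcreate("model_proc_0000.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[3]  = {128, 128, 128};
    hsize_t chunk[3] = {32, 32, 32};
    hid_t space = H5Screate_simple(3, dims, NULL);

    /* Filters are applied to each chunk in the order they are added. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_scaleoffset(dcpl, H5Z_SO_FLOAT_DSCALE, 3); /* keep ~3 decimal digits */
    H5Pset_deflate(dcpl, 6);                          /* then gzip, level 6 */

    hid_t dset = H5Dcreate2(file, "/w", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```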

I recently discovered, when rendering data spanning multiple files,
that there is a boundary issue as you hop from one dataset to the
next: a slight discontinuity in the uncompressed floating-point values
as you go from one file to the next. I would imagine this has to do
with the internal parameters chosen by the filter algorithm, which
must look for the maximum and minimum values in the dataset being
operated upon; these will vary from file to file (from MPI process to
MPI process).

Is there some way to have the scale-offset filter use global
parameters so that the discontinuities vanish? Before I used HDF5, I
used HDF4 and wrote my own scale/offset filter that used the global
max and min values (determined with a collective MPI call), and this
worked fine. However, I like the transparency of the HDF5 filters and
would prefer not to write my own.
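
Roughly, that collective step amounts to something like the following
sketch (the function and variable names here are just placeholders):

```c
#include <mpi.h>
#include <stddef.h>
#include <float.h>

/* Sketch: reduce each rank's local extrema to global extrema before
 * computing the scale/offset parameters (placeholder names). */
static void global_minmax(const float *buf, size_t n,
                          float *gmin, float *gmax)
{
    float lmin = FLT_MAX, lmax = -FLT_MAX;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] < lmin) lmin = buf[i];
        if (buf[i] > lmax) lmax = buf[i];
    }
    MPI_Allreduce(&lmin, gmin, 1, MPI_FLOAT, MPI_MIN, MPI_COMM_WORLD);
    MPI_Allreduce(&lmax, gmax, 1, MPI_FLOAT, MPI_MAX, MPI_COMM_WORLD);
}
```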

Any suggestions appreciated.

Thanks,

Leigh

···

--
Leigh Orf
Associate Professor of Atmospheric Science
Room 130G Engineering and Technology
Department of Geology and Meteorology
Central Michigan University
Mount Pleasant, MI 48859
(989)774-1923
Amateur radio callsign: KG4ULP

Hi Leigh!

I am a fan of the scale-offset filter followed by the gzip filter to
really reduce the size of big 3D datasets of weather model data. I am
using this compression strategy with HDF5 to do massively parallel
simulations and writing out one HDF5 file per MPI process.

  Glad they are turning out to be useful to you. Adding a "shuffle" filter preprocessing step may improve the compression ratio further.

I recently discovered, when rendering data spanning multiple files,
that there is a boundary issue as you hop from one dataset to the
next: a slight discontinuity in the uncompressed floating-point values
as you go from one file to the next. I would imagine this has to do
with the internal parameters chosen by the filter algorithm, which
must look for the maximum and minimum values in the dataset being
operated upon; these will vary from file to file (from MPI process to
MPI process).

  Hmm, yes, I would expect that...

Is there some way to have the scale-offset filter use global
parameters so that the discontinuities vanish? Before I used HDF5, I
used HDF4 and wrote my own scale/offset filter that used the global
max and min values (determined with a collective MPI call), and this
worked fine. However, I like the transparency of the HDF5 filters and
would prefer not to write my own.

  It's definitely a good idea, but since each dataset is compressed independently, there isn't a way to have a global set of min/max values, at least currently. However, I don't imagine it would be too difficult to add a new "scale type" to the filter... I'll add an issue to our bugtracker and Elena can prioritize it with the other work there. If you'd like to submit a patch or find a little bit of funding for us to perform this work, that'll speed things up. :-)

  Quincey

···

On Jun 26, 2010, at 8:06 AM, Leigh Orf wrote:

Hi Leigh!

I am a fan of the scale-offset filter followed by the gzip filter to
really reduce the size of big 3D datasets of weather model data. I am
using this compression strategy with HDF5 to do massively parallel
simulations and writing out one HDF5 file per MPI process.

   Glad they are turning out to be useful to you. Adding a "shuffle" filter preprocessing step may improve the compression ratio further.

You know, I once tried scaleoffset -> shuffle -> gzip but it didn't
make the files smaller, it made them bigger... or maybe I messed
something up, I'll try it again.

I recently discovered, when rendering data spanning multiple files,
that there is a boundary issue as you hop from one dataset to the
next: a slight discontinuity in the uncompressed floating-point values
as you go from one file to the next. I would imagine this has to do
with the internal parameters chosen by the filter algorithm, which
must look for the maximum and minimum values in the dataset being
operated upon; these will vary from file to file (from MPI process to
MPI process).

   Hmm, yes, I would expect that...

Is there some way to have the scale-offset filter use global
parameters so that the discontinuities vanish? Before I used HDF5, I
used HDF4 and wrote my own scale/offset filter that used the global
max and min values (determined with a collective MPI call), and this
worked fine. However, I like the transparency of the HDF5 filters and
would prefer not to write my own.

   It's definitely a good idea, but since each dataset is compressed independently, there isn't a way to have a global set of min/max values, at least currently. However, I don't imagine it would be too difficult to add a new "scale type" to the filter... I'll add an issue to our bugtracker and Elena can prioritize it with the other work there. If you'd like to submit a patch or find a little bit of funding for us to perform this work, that'll speed things up. :-)

That would probably be the best approach, along with adding a new
routine like H5Pset_scaleoffset_maxmin. A collective MPI_MAX / MPI_MIN
call would get the global values, and they could be fed to the
scaleoffset routine with minimal pain / code changes. If you have
pointers on how to add this kind of functionality, I'd be happy to
try submitting a patch.
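
Just to sketch what I mean (H5Pset_scaleoffset_maxmin does not exist;
this is purely hypothetical, and the variable names are made up):

```c
/* Purely hypothetical sketch: H5Pset_scaleoffset_maxmin is a proposed
 * routine, not part of HDF5. gmin/gmax would come from a collective
 * MPI_Allreduce with MPI_MIN / MPI_MAX, as described above. */
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 3, chunk);
H5Pset_scaleoffset(dcpl, H5Z_SO_FLOAT_DSCALE, 3);
H5Pset_scaleoffset_maxmin(dcpl, (double)gmin, (double)gmax); /* proposed */
H5Pset_deflate(dcpl, 6);
```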

I did look into the code and there are several routines designed to
calculate max and/or min for different datatypes. In essence I would
be removing functionality from the filter, not adding new
functionality!

Concerning the funding, I feel your pain, believe me!

Leigh

···

On Sat, Jun 26, 2010 at 2:02 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:

On Jun 26, 2010, at 8:06 AM, Leigh Orf wrote:

   Quincey


--
Leigh Orf
Associate Professor of Atmospheric Science
Room 130G Engineering and Technology
Department of Geology and Meteorology
Central Michigan University
Mount Pleasant, MI 48859
(989)774-1923
Amateur radio callsign: KG4ULP

Hi Leigh,

Hi Leigh!

I am a fan of the scale-offset filter followed by the gzip filter to
really reduce the size of big 3D datasets of weather model data. I am
using this compression strategy with HDF5 to do massively parallel
simulations and writing out one HDF5 file per MPI process.

       Glad they are turning out to be useful to you. Adding a "shuffle" filter preprocessing step may improve the compression ratio further.

You know, I once tried scaleoffset -> shuffle -> gzip but it didn't
make the files smaller, it made them bigger... or maybe I messed
something up, I'll try it again.

  It's best to put the shuffle filter first to rearrange the uncompressed bytes. Putting it later will have no effect on the compression ratio and will just chew up cycles.
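
  In property-list terms, the filters run in the order they are added to the dcpl, so the shuffle call goes in before the others. A rough sketch (chunk dimensions and scale factor are illustrative):

```c
/* Sketch: register the shuffle filter before the others; HDF5 applies
 * the filters to each chunk in the order they were added to the dcpl.
 * (Chunk dimensions and scale factor here are illustrative.) */
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 3, chunk);
H5Pset_shuffle(dcpl);                              /* byte shuffle first */
H5Pset_scaleoffset(dcpl, H5Z_SO_FLOAT_DSCALE, 3);  /* then scale-offset  */
H5Pset_deflate(dcpl, 6);                           /* then gzip          */
```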

I recently discovered, when rendering data spanning multiple files,
that there is a boundary issue as you hop from one dataset to the
next: a slight discontinuity in the uncompressed floating-point values
as you go from one file to the next. I would imagine this has to do
with the internal parameters chosen by the filter algorithm, which
must look for the maximum and minimum values in the dataset being
operated upon; these will vary from file to file (from MPI process to
MPI process).

       Hmm, yes, I would expect that...

Is there some way to have the scale-offset filter use global
parameters so that the discontinuities vanish? Before I used HDF5, I
used HDF4 and wrote my own scale/offset filter that used the global
max and min values (determined with a collective MPI call), and this
worked fine. However, I like the transparency of the HDF5 filters and
would prefer not to write my own.

       It's definitely a good idea, but since each dataset is compressed independently, there isn't a way to have a global set of min/max values, at least currently. However, I don't imagine it would be too difficult to add a new "scale type" to the filter... I'll add an issue to our bugtracker and Elena can prioritize it with the other work there. If you'd like to submit a patch or find a little bit of funding for us to perform this work, that'll speed things up. :-)

That would probably be the best approach, along with adding a new
routine like H5Pset_scaleoffset_maxmin. A collective MPI_MAX / MPI_MIN
call would get the global values, and they could be fed to the
scaleoffset routine with minimal pain / code changes. If you have
pointers on how to add this kind of functionality, I'd be happy to
try submitting a patch.

  I'm happy to give you some guidance, but I'm on vacation this week, so you'll have to ping me again next week.

I did look into the code and there are several routines designed to
calculate max and/or min for different datatypes. In essence I would
be removing functionality from the filter, not adding new
functionality!

  :-)

  Quincey

···

On Jun 26, 2010, at 10:09 PM, Leigh Orf wrote:

On Sat, Jun 26, 2010 at 2:02 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:

On Jun 26, 2010, at 8:06 AM, Leigh Orf wrote: