Shuffle filter for array data

Hi all,

I was experimenting today with the different compression
options for our data, trying to get some optimal numbers.
Our data are 16-bit images whose dynamic range is limited
most of the time, so I would expect the shuffle filter to
get us some improvement over plain zlib compression. To my
surprise, enabling shuffle did not change the compression
factor at all. Looking at the shuffle filter code, it seems
the reason is the structure of our data. The dataset which
contains the images has a 1-dimensional dataspace, with each
element being a 2- or 3-dimensional image stack:

DATASET "..." {
   DATATYPE H5T_ARRAY { [32][185][388] H5T_STD_I16LE }
   DATASPACE SIMPLE { ( 2132 ) / ( H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 1 )
      SIZE 6608778714 (1.482:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 1 }
   }
}

The image arrays are quite big, so each chunk holds just a
single array most of the time.

My understanding is that the shuffle algorithm re-orders
bytes from multiple elements in a chunk, but because there
is just one element (the whole array) in this case, it does
nothing at all. What I would like shuffle to do here is
shuffle the 16-bit words inside the array, i.e. not treat
the array as a single opaque object but look inside it.

I did some experimenting, and with a small change to the
code I managed to convince it to shuffle things correctly.
The diff is below this message. It does indeed improve
compression of the image data, and the data can be read back
correctly after de-shuffling with the standard code (h5dump
shows identical results).

It would be really helpful for us if something like this
could be added to the HDF5 library. I do not particularly
care about other types such as compounds, but for datasets
whose elements are plain arrays it can probably be done
without breaking compatibility. OTOH, if more options were
added to shuffle to control the shuffling of arrays and
other types of data, it could become even more useful.

Thanks,
Andy

···

----------------------------------------------------------------------
This is the change applied to 1.8.6 code:

*** H5Zshuffle.c.orig 2011-02-14 08:23:19.000000000 -0800
--- H5Zshuffle.c 2011-09-06 17:18:13.022259993 -0700
***************
*** 88,93 ****
--- 88,98 ----
      if(H5P_get_filter_by_id(dcpl_plist, H5Z_FILTER_SHUFFLE, &flags, &cd_nelmts, cd_values, (size_t)0, NULL, NULL) < 0)
        HGOTO_ERROR(H5E_PLINE, H5E_CANTGET, FAIL, "can't get shuffle parameters")
  
+     /* If object is an array use its base type */
+     while (H5T_get_class(type, FALSE) == H5T_ARRAY) {
+         type = H5T_get_super(type);
+     }
+
      /* Set "local" parameter for this dataset */
      if((cd_values[H5Z_SHUFFLE_PARM_SIZE] = (unsigned)H5T_get_size(type)) == 0)
        HGOTO_ERROR(H5E_PLINE, H5E_BADTYPE, FAIL, "bad datatype size")

Hi Andrei,
  Sounds like a good suggestion to me, I'll put it into our issue tracker and we can get it scheduled for an upcoming release. (It might not make it into the 1.8.8 release in November though)

  Thanks for the idea,
    Quincey

···

On Sep 6, 2011, at 7:46 PM, Salnikov, Andrei A. wrote:

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Hi Quincey,

thank you so much, we really appreciate this.

Cheers,
Andy

···

Quincey Koziol wrote on 2011-09-07:

