nbit filter question

I am experimenting with the n-bit filter on floating-point data to reduce
file size, tweaking the datatype's precision settings to discard low-order
bits. Mostly, things are working; however, one result has me confused.

In the documentation (
http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetNbit) I read the
following statement:

"By nature, the N-Bit filter should not be used together with other I/O
filters"

I also read the following in the documentation (
http://www.hdfgroup.org/HDF5/doc/UG/10_Datasets.html), in the discussion of
integer n-bit compression:

"After n-bit compression, none of these discarded bits, known as *padding
bits* will be stored on disk."

While no such statement appears under the floating-point discussion below
that, I assume it also holds for floating-point data.

There is also this statement:

"The n-bit decompression algorithm is very similar to n-bit compression. The
only difference is that at the byte level, compression packs out all padding
bits and stores only significant bits into a continuous buffer (unsigned
char) while decompression unpacks significant bits and inserts padding bits
(zeros) at the proper positions to recover the data bytes as they existed
before compression."

Taken together, these statements lead me to believe that applying n-bit
compression alone should give me a good reduction in file size. However,
that does not happen.

I took the example C code found on
http://www.hdfgroup.org/HDF5/doc/UG/10_Datasets.html and modified it to fill
a 2D array of size 1000x1000 with a known function (something involving logs
and sines, to get variability roughly like what I find in my real scientific
data sets).

Below are four HDF5 files:
seagrape:/users/orf/test% ls -l *.h5

-rw-r--r-- 1 orf users 4004016 Nov 9 15:01 uncompressed-float.h5
-rw-r--r-- 1 orf users 4004016 Nov 9 15:01 nbit.h5
-rw-r--r-- 1 orf users 3398723 Nov 9 15:02 gzip-compressed-float.h5
-rw-r--r-- 1 orf users 880108 Nov 9 15:02 nbit-gzip.h5

uncompressed-float.h5 has no compression whatsoever. As expected, the file
is roughly 1000x1000x4 bytes in size.
nbit.h5 has the n-bit filter applied. It is the same size!
gzip-compressed-float.h5 is the floating-point data with gzip (level 6)
applied, so it's lossless.
nbit-gzip.h5 has the n-bit filter followed by the gzip filter. Lots o'
compression!!

So it seems that the n-bit filter applied to floating-point data stores the
zeroed padding bits after all? I expected to see the file-size reduction
without having to apply gzip compression on top.

I have run several tests and am sure that the data in nbit.h5 has lost
precision: subtracting it from the lossless data gives nonzero differences.

Some more info; note that the dataset name is the same in all four files.

seagrape:/users/orf/test% h5ls -lrv uncompressed-float.h5
Opened "uncompressed-float.h5" with sec2 driver.
/ Group
    Location: 1:96
    Links: 1
/nbit_float Dataset {1000/1000, 1000/1000}
    Location: 1:800
    Links: 1
    Chunks: {1000, 1000} 4000000 bytes
    Storage: 4000000 logical bytes, 4000000 allocated bytes, 100.00% utilization
    Type: IEEE 32-bit big-endian float
seagrape:/users/orf/test% h5ls -lrv gzip-compressed-float.h5
Opened "gzip-compressed-float.h5" with sec2 driver.
/ Group
    Location: 1:96
    Links: 1
/nbit_float Dataset {1000/1000, 1000/1000}
    Location: 1:800
    Links: 1
    Chunks: {1000, 1000} 4000000 bytes
    Storage: 4000000 logical bytes, 3394707 allocated bytes, 117.83% utilization
    Filter-0: deflate-1 OPT {6}
    Type: IEEE 32-bit big-endian float
seagrape:/users/orf/test% h5ls -lrv nbit.h5
Opened "nbit.h5" with sec2 driver.
/ Group
    Location: 1:96
    Links: 1
/nbit_float Dataset {1000/1000, 1000/1000}
    Location: 1:800
    Links: 1
    Chunks: {1000, 1000} 4000000 bytes
    Storage: 4000000 logical bytes, 4000000 allocated bytes, 100.00% utilization
    Filter-0: nbit-5 OPT {8, 0, 1000000, 1, 4, 1, 16, 7}
    Type: 32-bit big-endian floating-point
               (16 bits of precision beginning at bit 7)
               (7 zero bits at bit 0, 9 zero bits at bit 23)
               (significant for 9 bits at bit 7, msb implied)
               (exponent for 6 bits at bit 16, bias is 0x1f)
               (sign bit at 22)
seagrape:/users/orf/test% h5ls -lrv nbit-gzip.h5
Opened "nbit-gzip.h5" with sec2 driver.
/ Group
    Location: 1:96
    Links: 1
/nbit_float Dataset {1000/1000, 1000/1000}
    Location: 1:800
    Links: 1
    Chunks: {1000, 1000} 4000000 bytes
    Storage: 4000000 logical bytes, 876092 allocated bytes, 456.57% utilization
    Filter-0: nbit-5 OPT {8, 0, 1000000, 1, 4, 1, 16, 7}
    Filter-1: deflate-1 OPT {6}
    Type: 32-bit big-endian floating-point
               (16 bits of precision beginning at bit 7)
               (7 zero bits at bit 0, 9 zero bits at bit 23)
               (significant for 9 bits at bit 7, msb implied)
               (exponent for 6 bits at bit 16, bias is 0x1f)
               (sign bit at 22)

Leigh


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Leigh,

Thank you for your report. We will be looking into the problem shortly.

Elena


On Nov 9, 2010, at 4:23 PM, Leigh Orf wrote:

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org