RFC. display a "compression ratio" in h5dump

Dear James, Chris and all HDF users

This is an RFC (Request for Comments) regarding a new feature in h5dump.
As you might know, h5dump has an option to display several properties of the dataset creation property list. If you specify this option at the command line

-p, --properties Print dataset filters, storage layout and fill value

h5dump prints several properties regarding filters, storage layout and fill value

for example

./h5dump -H -p -d deflate tfilters.h5

produces the output

HDF5 "tfilters.h5" {
DATASET "deflate" {
   DATATYPE H5T_STD_I32LE
   DATASPACE SIMPLE { ( 20, 10 ) / ( 20, 10 ) }
   STORAGE_LAYOUT {
      CHUNKED ( 10, 5 )
      SIZE 385
    }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 9 }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE 0
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
}

There was a request to display a "compression ratio" in cases where compression filters are present

The values to compare are

A = the theoretical maximum size of a dataset, obtained by multiplying the number of elements in the dataset by the size in bytes of each element. For example, for a dataset with 25 elements of a 4-byte integer type, this size is 100 bytes.

B = the size obtained from the HDF5 function H5Dget_storage_size, which returns the amount of storage required for a dataset. If the dataset has compression filters, this number is typically smaller than A.

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5D.html#Dataset-GetStorageSize

Note: for the moment, we assume that all chunks are written. For cases where this is not true, a new function, H5Dget_chunk_info, is being developed that will provide a better measure.
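
For illustration, a minimal C sketch of how A and B could be computed with the existing HDF5 API (a sketch only; "dset" is an assumed, already-open dataset handle, and the variable names are ours):

    hid_t    space   = H5Dget_space(dset);                  /* dataspace of the dataset */
    hid_t    type    = H5Dget_type(dset);                   /* datatype of the dataset */
    hssize_t npoints = H5Sget_simple_extent_npoints(space); /* number of elements */
    size_t   esize   = H5Tget_size(type);                   /* size in bytes of each element */
    hsize_t  A       = (hsize_t)npoints * esize;            /* theoretical maximum size */
    hsize_t  B       = H5Dget_storage_size(dset);           /* storage actually required */
    H5Tclose(type);
    H5Sclose(space);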

In our view, this "compression ratio" could be expressed in two ways:

1) a simple ratio, e.g., B/A
2) a percentage, e.g., (A-B)/B

So, what we are asking is whether you have a preferred way to express this: either one of the above formulations, or any other form you would like to suggest.

We propose to print this value after the SIZE information; for example, in the case above

SIZE 385 (51.9%COMPRESSION)

The final print would look like

HDF5 "tfilters.h5" {
DATASET "deflate" {
   DATATYPE H5T_STD_I32LE
   DATASPACE SIMPLE { ( 20, 10 ) / ( 20, 10 ) }
   STORAGE_LAYOUT {
      CHUNKED ( 10, 5 )
      SIZE 385 (51.9%COMPRESSION)
    }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 9 }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE 0
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
}
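
As a formatting sketch for the percentage form (our illustration, not the final implementation): note that the 51.9% in the example corresponds to the savings measured against A, since for this 20 x 10 dataset of 4-byte integers A = 800 bytes, B = 385 bytes, and (A - B)/A = 415/800 is about 51.9%:

    if (A > 0 && B <= A)
        printf("SIZE %llu (%.1f%%COMPRESSION)\n",
               (unsigned long long)B,
               100.0 * (double)(A - B) / (double)A);   /* prints: SIZE 385 (51.9%COMPRESSION) */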

Here's the complete h5dump usage for your reference

usage: h5dump [OPTIONS] file
  OPTIONS
     -h, --help Print a usage message and exit
     -n, --contents Print a list of the file contents and exit
     -B, --superblock Print the content of the super block
     -H, --header Print the header only; no data is displayed
     -A, --onlyattr Print the header and value of attributes
     -i, --object-ids Print the object ids
     -r, --string Print 1-byte integer datasets as ASCII
     -e, --escape Escape non printing characters
     -V, --version Print version number and exit
     -a P, --attribute=P Print the specified attribute
     -d P, --dataset=P Print the specified dataset
     -y, --noindex Do not print array indices with the data
     -p, --properties Print dataset filters, storage layout and fill value
     -f D, --filedriver=D Specify which driver to open the file with
     -g P, --group=P Print the specified group and all members
     -l P, --soft-link=P Print the value(s) of the specified soft link
     -o F, --output=F Output raw data into file F
     -b B, --binary=B Binary file output, of form B
     -t P, --datatype=P Print the specified named datatype
     -w N, --width=N Set the number of columns of output
     -q Q, --sort_by=Q Sort groups and attributes by index Q
     -z Z, --sort_order=Z Sort groups and attributes by order Z
     -x, --xml Output in XML using Schema
     -u, --use-dtd Output in XML using DTD
     -D U, --xml-dtd=U Use the DTD or schema at U
     -X S, --xml-ns=S (XML Schema) Use qualified names in the XML
                          ":": no namespace, default: "hdf5:"
                          E.g., to dump a file called `-f', use h5dump -- -f

Subsetting is available by using the following options with a dataset
attribute. Subsetting is done by selecting a hyperslab from the data.
Thus, the options mirror those for performing a hyperslab selection.
The START and COUNT parameters are mandatory if you do subsetting.
The STRIDE and BLOCK parameters are optional and will default to 1 in
each dimension.

      -s L, --start=L Offset of start of subsetting selection
      -S L, --stride=L Hyperslab stride
      -c L, --count=L Number of blocks to include in selection
      -k L, --block=L Size of block in hyperslab

  D - is the file driver to use in opening the file. Acceptable values
        are "sec2", "family", "split", "multi", "direct", and "stream". Without
        the file driver flag, the file will be opened with each driver in
        turn and in the order specified above until one driver succeeds
        in opening the file.
  F - is a filename.
  P - is the full path from the root group to the object.
  N - is an integer greater than 1.
  L - is a list of integers the number of which are equal to the
        number of dimensions in the dataspace being queried
  U - is a URI reference (as defined in [IETF RFC 2396],
        updated by [IETF RFC 2732])
  B - is the form of binary output: MEMORY for a memory type, FILE for the
        file type, LE or BE for pre-existing little or big endian types.
        Must be used with -o (output file) and it is recommended that
        -d (dataset) is used
  Q - is the sort index type. It can be "creation_order" or "name" (default)
  Z - is the sort order type. It can be "descending" or "ascending" (default)

  Examples:

  1) Attribute foo of the group /bar_none in file quux.h5

        h5dump -a /bar_none/foo quux.h5

  2) Selecting a subset from dataset /foo in file quux.h5

      h5dump -d /foo -s "0,1" -S "1,1" -c "2,3" -k "2,2" quux.h5

  3) Saving dataset 'dset' in file quux.h5 to binary file 'out.bin'
        using a little-endian type

      h5dump -d /dset -b LE -o out.bin quux.h5


--------------------------------------------------------------
Pedro Vicente (T) 217.265-0311
pvn@hdfgroup.org
The HDF Group. 1901 S. First. Champaign, IL 61820

At 10:32 AM 4/30/2008, Christopher Lynnes wrote:

I think B/A is the simpler and therefore more intuitive formulation,
so I vote for that. Your algorithm would appear to work fine for our
purposes.

BTW, would this information also be available in h5repack verbose
output?

--
Christopher Lynnes NASA/GSFC, Code 610.2
301-614-5185

Chris, I call this an ultra fast response! Yes, we can add the same information to h5repack, and thanks for your vote.

Pedro


Chris, what do you suggest we print for the B/A <number>? In the example below, instead of

SIZE 385 (51.9%COMPRESSION)

the format would be (<number>COMPRESSION).

Should we take out the "%" sign? How many digits of precision? Instead of "COMPRESSION", any other nomenclature?

Also, a correction: my formula 2) is a percentage, so it should be

1) a simple ratio, e.g., B/A
2) a percentage, e.g., (A-B)/B * 100

Pedro
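
For concreteness, a worked check with the example dataset above (A = 20 x 10 x 4 = 800 bytes, B = 385 bytes):

    B/A            = 385/800         ≈ 0.48
    (A-B)/B * 100  = 415/385 * 100   ≈ 107.8%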


At 01:58 PM 4/30/2008, McCloskey, David L. wrote:

Everything I've read about compression ratio, including the Wikipedia article (http://en.wikipedia.org/wiki/Data_compression_ratio), says that it should be A/B.

Also, dividing by B to calculate the percentage doesn't make much sense to me. If I saw a percentage, I'd think it was "this is the percentage of the total size that I'm using." So if I compressed 100 bytes to 30 bytes, I'd expect to see 30%, as in "it's using 30% of the space it would if I didn't compress."

A/B makes the most sense to me because it removes all of this ambiguity. If I see a compression ratio of 3, I know that if I didn't have compression, I'd need 3 times the space. That makes it very simple. If it's B/A, I'm not very sure what I'm seeing when I see a number like 0.33.


---

Based on the Wikipedia article, I think the confusion is summarized by this paragraph: "Note: There is some confusion about the term 'compression ratio', particularly outside academia and commerce. In particular, some authors use the term 'compression ratio' to mean 'space savings', even though the latter is not a ratio; and others use the term 'compression ratio' to mean its inverse, even though that equates higher compression ratio with lower compression."

Dave McCloskey


Yes, you are right. From that Wikipedia article, it should be

compression ratio = uncompressed size / compressed size

In our formulas, that translates to

compression ratio = A / B

The article proposes two "standard" ways to write this:

"often notated as an explicit ratio, 5:1 (read "five to one"), or as an implicit ratio, 5X."

Translating into h5dump nomenclature, and using the example above, that would be either

SIZE 385 (5:1 RATIO)

or

SIZE 385 (5X RATIO)

Pedro


At 02:37 PM 4/30/2008, McCloskey, David L. wrote:

I think (5:1 COMPRESSION) reads best: "five to one compression", though (5:1 COMPRESSION RATIO) may be clearer to people unfamiliar with the term compression ratio while still reading well.

Dave McCloskey


Sure, and thanks for catching that. I always go to Wikipedia before doing anything, but this time I slipped.

Wikipedia is great.

Pedro


At 03:23 PM 4/30/2008, James E. Johnson wrote:

You aren't always going to have nice neat integer ratios like 5:1, so you have to decide on how many digits of precision to use.

Sure. IMO, one digit of precision will suffice, but if anybody needs more, we will do it.

Pedro
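
Putting the thread's conclusions together (ratio defined as A/B, the "N:1 COMPRESSION" notation, one digit of precision), a hypothetical C sketch of the final print, with A and B as defined in the original RFC; for the example dataset, A = 800 and B = 385 give about 2.1:1:

    if (B > 0) {
        double ratio = (double)A / (double)B;    /* uncompressed size / compressed size */
        printf("SIZE %llu (%.1f:1 COMPRESSION)\n",
               (unsigned long long)B, ratio);    /* e.g. SIZE 385 (2.1:1 COMPRESSION) */
    }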
