Finding the size of the datasets in a file

Konrad_Hinsen · November 5, 2013, 2:31pm

Hi everyone,

I just spent some time looking for a command-line tool that shows the
size occupied by each dataset in a file. I didn't find anything. The
most promising candidates were h5stat, h5ls, and h5dump, but it seems
that none of them can provide the information I am looking for.

Is there perhaps a third-party tool for that purpose?

I realize that "size" can be defined in lots of ways, but I don't
really care about the details. I have lots of files that each contain
hundreds of datasets, of which most are small but a few are very big.
I am looking for a simple way to identify the big ones. My ideal
definition of size is "how much smaller would the file be if dataset X
were not in there".

Konrad.

···

--
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: research AT khinsen DOT fastmail DOT net
http://dirac.cnrs-orleans.fr/~hinsen/
ORCID: http://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
---------------------------------------------------------------------

Babak_Behzad · November 5, 2013, 2:29pm

"h5dump -p" gives you a per dataset storage_layout information which contains the SIZE and OFFSET of the dataset. I always use it with "-H" command so that it just prints the header of the HDF5 file. For example:

h5dump -pH sample_dataset.h5

Hope this helps,
Babak

···

On 11/05/2013 08:31 AM, Konrad Hinsen wrote:

Hi everyone,

I just spent some time looking for a command-line tool that shows the
size occupied by each dataset in a file. I didn't find anything. The
most promising candidates were h5stat, h5ls, and h5dump, but it seems
that none of them can provide the information I am looking for.

Is there perhaps a third-party tool for that purpose?

I realize that "size" can be defined in lots of ways, but I don't
really care about the details. I have lots of files that each contain
hundreds of datasets, of which most are small but a few are very big.
I am looking for a simple way to identify the big ones. My ideal
definition of size is "how much smaller would the file be if dataset X
were not in there".

Konrad.

lrknox · November 5, 2013, 3:00pm

Hi Konrad,

h5dump with the -p option may give you what you want Combining it with -H will rmove the data from the output, or add -d to limit the output to a specific dataset or -g for a specific group. The output for a dataset looks like this:

         DATASET "BeamLatitude" {
            DATATYPE H5T_IEEE_F32BE
            DATASPACE SIMPLE { ( 24, 96, 5 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED
) }
            STORAGE_LAYOUT {
               CHUNKED ( 12, 48, 5 )
               SIZE 46080
            }
            FILTERS {
               NONE
            }
            FILLVALUE {
               FILL_TIME H5D_FILL_TIME_IFSET
               VALUE -999.3
            }
            ALLOCATION_TIME {
               H5D_ALLOC_TIME_INCR
            }
         }

SIZE is the storage size of the dataset in bytes. If the dataset is compressed, the storage size is the size of the compressed data.

Larry

···

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Konrad Hinsen
Sent: Tuesday, November 05, 2013 8:31 AM
To: hdf-forum@lists.hdfgroup.org
Subject: [Hdf-forum] Finding the size of the datasets in a file

Hi everyone,

I just spent some time looking for a command-line tool that shows the size occupied by each dataset in a file. I didn't find anything. The most promising candidates were h5stat, h5ls, and h5dump, but it seems that none of them can provide the information I am looking for.

Is there perhaps a third-party tool for that purpose?

I realize that "size" can be defined in lots of ways, but I don't really care about the details. I have lots of files that each contain hundreds of datasets, of which most are small but a few are very big.
I am looking for a simple way to identify the big ones. My ideal definition of size is "how much smaller would the file be if dataset X were not in there".

Konrad.
--
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans Synchrotron Soleil - Division Expériences Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: research AT khinsen DOT fastmail DOT net http://dirac.cnrs-orleans.fr/~hinsen/
ORCID: http://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
---------------------------------------------------------------------

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

r.m.van.hees · November 5, 2013, 2:43pm

Hi Konrad,

"h5ls -v" provides an estimate of the datasets in a file. Is not this what you are looking for?

Greetings, Richard

···

On 11/05/2013 03:31 PM, Konrad Hinsen wrote:

Hi everyone,

I just spent some time looking for a command-line tool that shows the
size occupied by each dataset in a file. I didn't find anything. The
most promising candidates were h5stat, h5ls, and h5dump, but it seems
that none of them can provide the information I am looking for.

Is there perhaps a third-party tool for that purpose?

I realize that "size" can be defined in lots of ways, but I don't
really care about the details. I have lots of files that each contain
hundreds of datasets, of which most are small but a few are very big.
I am looking for a simple way to identify the big ones. My ideal
definition of size is "how much smaller would the file be if dataset X
were not in there".

Konrad.

Konrad_Hinsen · November 7, 2013, 3:56pm

Richard van Hees writes:

> "h5ls -v" provides an estimate of the datasets in a file. Is not this
> what you are looking for?

Babak Behzad writes:

> "h5dump -p" gives you a per dataset storage_layout information which
> contains the SIZE and OFFSET of the dataset. I always use it with "-H"
> command so that it just prints the header of the HDF5 file. For example:

Larry Knox writes:

> h5dump with the -p option may give you what you want Combining it
> with -H will rmove the data from the output, or add -d to limit the

Thanks to all of you for these suggestions. Both h5ls -v and h5dump -p
provide the information about the size of the dataset, with h5ls -v
providing more detailed information (allocated size plus real usage).
Unfortunately, both produce tons of other output, requiring serious
postprocessing for extracting just the size information for a large
number of datasets in a large number of files.

Konrad.

···

--
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: research AT khinsen DOT fastmail DOT net
http://dirac.cnrs-orleans.fr/~hinsen/
ORCID: http://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
---------------------------------------------------------------------

r.m.van.hees · November 7, 2013, 4:21pm

Richard van Hees writes:

  > "h5ls -v" provides an estimate of the datasets in a file. Is not this
  > what you are looking for?

Babak Behzad writes:

  > "h5dump -p" gives you a per dataset storage_layout information which
  > contains the SIZE and OFFSET of the dataset. I always use it with "-H"
  > command so that it just prints the header of the HDF5 file. For example:

Larry Knox writes:

  > h5dump with the -p option may give you what you want Combining it
  > with -H will rmove the data from the output, or add -d to limit the

Thanks to all of you for these suggestions. Both h5ls -v and h5dump -p
provide the information about the size of the dataset, with h5ls -v
providing more detailed information (allocated size plus real usage).
Unfortunately, both produce tons of other output, requiring serious
postprocessing for extracting just the size information for a large
number of datasets in a large number of files.

Konrad.

Hi Konrad,

It is likely still not what you want, but maybe my suggestions below are of any help:

[omit "-r" in case your files do not have any groups]

> h5ls -v -r myFile.h5 | egrep 'Dataset|Storage:'

or when you are interested in dataset "myData":

> h5ls -v -r myFile.h5 | egrep 'Dataset|Storage:' | grep -A1 myData | grep -v '\-\-'

Greetings, Richard

···

On 11/07/2013 04:56 PM, Konrad Hinsen wrote:

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

Finding the size of the datasets in a file