HDF5: Deleting Datasets and Recovering Space

I understand section 5.2 of the user guide (below) says that when one
deletes groups/datasets using H5G.unlink, the space on disk is NOT
recovered. This has become an issue in my application. We are
working with very large datasets that after some time no longer need
to be stored. With this limitation, it seems that my HDF5 files are
essentially equivalent to a very large CD-R from a storage
perspective. I understand from the guide as well that I can recover
space by copying the data over to a new file, but when my file size is
several gigabytes, this can be a slow process.

Has this not become an issue for other users and applications? I
understand HDF5 is commonly used for oceanography and satellite
imagery. Wouldn't these applications require intermittent deletion of
data, especially in situations where the file size is on the order of
terabytes?

···

---------------------------------------------------------------------------------------------------------------

From the User Guide:

    The size of the dataset cannot be reduced after it is created. The
dataset can be expanded by extending one or more dimensions, with
H5Dextend. It is not possible to contract a dataspace, or to reclaim
allocated space.

    HDF5 does not at this time provide a mechanism to remove a dataset
from a file, or to reclaim the storage from deleted objects. Through
the H5Gunlink function one can remove links to a dataset from the file
structure. Once all links to a dataset have been removed, that dataset
becomes inaccessible to any application and is effectively removed
from the file. But this does not recover the space the dataset
occupies.

    The only way to recover the space is to write all the objects of
the file into a new file. Any unlinked object is inaccessible to the
application and will not be included in the new file.

---------------------------------------------------------------------------------------------------------------

···

Hi Tom,

  You could use the 'h5repack' utility on your file, which might be an OK solution for you. Also, the latest 1.8.x release (1.8.3 currently) is much more efficient about reusing freed space within a file while it is open, although that free-space information is discarded when the file is closed. The next major version of HDF5 (1.10.x) should have a mechanism for persistent free-space tracking, which will take even more pressure off the problem. However, even with persistent free-space tracking, the file may still need to be repacked if the internal fragmentation becomes too large.
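
  For example, reclaiming the space of unlinked objects is a single command-line invocation (the file names here are just placeholders):

      h5repack bigfile.h5 bigfile_packed.h5

  Only the objects still reachable in bigfile.h5 are copied into bigfile_packed.h5, so the space occupied by anything you have unlinked is recovered.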

  Quincey

···


Tom,

I have not encountered this need in the area of satellite remote
sensing in which I have some experience. In fact, the one application
I know of that only added data was problematic, not because of HDF5
but because of our storage methodology. The process of reducing the
data is subdivided into a sequence of “atomic” steps, and the data are
stored in HDF5 files with unique names at each step. The number of
minutes of data stored in each HDF5 file is chosen to keep the file
sizes and processing times reasonable. If an improvement to a step in
the sequence is developed and all the data are reprocessed, the old,
now-unnecessary files are simply deleted. In our business, once a file
is created and its unique name assigned, it is considered bad form to
modify it.

Versions 1.8.0 and later of HDF5 allow you to link from an HDF5 file
to objects in another HDF5 file using an external link. I think
applications reading the data can follow the path from a "master" file
into external ones transparently, i.e. without knowing whether the
link is external or not, which allows a developer to introduce
external links with a minimum of code changes. If you remove the link
and delete the external file, your disk space is recovered. The key
here is to design your HDF5 hierarchy and external links to match the
pattern of usage you expect, in particular the pattern of how datasets
and groups are deleted when they are no longer needed.
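
As a rough sketch of what this might look like with the C API
(untested, assuming HDF5 1.8+; the file and object names are invented
for illustration):

    #include "hdf5.h"

    int main(void)
    {
        /* Create the "master" file, which holds only links and metadata. */
        hid_t master = H5Fcreate("master.h5", H5F_ACC_TRUNC,
                                 H5P_DEFAULT, H5P_DEFAULT);

        /* Point /chunk_001 in the master file at the dataset /data
           stored in a separate file, chunk_001.h5. */
        H5Lcreate_external("chunk_001.h5", "/data", master, "/chunk_001",
                           H5P_DEFAULT, H5P_DEFAULT);

        /* Applications can now open master.h5 and read /chunk_001 as if
           it were a local dataset. Later, when the data are obsolete,
           remove the link and delete chunk_001.h5 from the filesystem;
           the disk space comes back immediately, with no repacking. */
        H5Ldelete(master, "/chunk_001", H5P_DEFAULT);

        H5Fclose(master);
        return 0;
    }

The granularity of the external files, i.e. one file per deletable
unit, is the design decision that has to match your deletion pattern.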

(Note: I have not used this feature myself; see http://www.docstoc.com/docs/5688152/External-Links-in-HDF5
for a description of the abilities and limitations of this technique.)

–dan

--
Daniel Kahn
Science Systems and Applications Inc.
301-867-2162

···

Thanks for the responses. And great ideas!

For our application, we are using MATLAB and MATLAB's HDF5 library.
Unfortunately, the latest version of MATLAB only supports version
HDF5-1.6.5 of the library, so there is no built-in 'h5repack' function
or capability for external links. I like the idea of keeping several
smaller h5 files rather than one large one; there is less chance of
corruption, etc. But because our processing steps are interlinked at
various levels, we would need that external linking capability to
process across several files 'simultaneously'.

Any ideas on how to access h5repack or the new external linking
capability from a MATLAB that uses an older HDF5 library? MATLAB
offers a way to call C code called 'MEX'. So perhaps I can take some
of those functions from the HDF5 C library, wrap them in a MEX
function, and access them from MATLAB?
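
Something like this little MEX gateway is what I have in mind
(completely untested, the function name is made up, and it assumes I
can build against a separately installed HDF5 1.8 library; I also
don't know whether a second copy of HDF5 can safely coexist with the
one MATLAB loads):

    #include "mex.h"
    #include "hdf5.h"

    /* Hypothetical usage from MATLAB:
         h5_external_link('master.h5', 'chunk_001.h5', '/data', '/chunk_001')
       built with something like:
         mex h5_external_link.c -I<hdf5-1.8>/include -L<hdf5-1.8>/lib -lhdf5 */
    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        if (nrhs != 4)
            mexErrMsgTxt("Expected: master file, target file, target object, link name");

        char *master = mxArrayToString(prhs[0]);
        char *tfile  = mxArrayToString(prhs[1]);
        char *tobj   = mxArrayToString(prhs[2]);
        char *lname  = mxArrayToString(prhs[3]);

        /* Open the master file and add an external link to the target. */
        hid_t fid = H5Fopen(master, H5F_ACC_RDWR, H5P_DEFAULT);
        H5Lcreate_external(tfile, tobj, fid, lname, H5P_DEFAULT, H5P_DEFAULT);
        H5Fclose(fid);

        mxFree(master); mxFree(tfile); mxFree(tobj); mxFree(lname);
    }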

Thanks again,

Tom

···


Thanks! I'll try the h5repack command line tomorrow!

···


Hi Tom,

  Hmm, I know that the "Spring '09" MATLAB release included an HDF5-1.8.x library. Which version are you using? Even if it's an older one, you should be able to run h5repack on your files (it's a standalone command-line utility) without affecting the ability of older versions of the library to read them.

  Quincey

···


h5repack is distributed with HDF5 as a command-line utility; in the
source tree it lives under /hdf5/tools/h5repack.

You can find more details about its usage here:

http://www.hdfgroup.org/HDF5/doc/RM/Tools.html

Pedro

--------------------------------------------------------------
Pedro Vicente
The HDF Group
pvn@hdfgroup.org


···

At 03:25 PM 6/2/2009, Tom wrote:

    Thanks! I'll try the h5repack command line tomorrow!

You're welcome. By the way, the original intent of h5repack was to compress HDF5 files using HDF5 filters (like gzip).

A typical usage with GZIP compression is

h5repack -f GZIP=1 -v file1 file2

which applies GZIP compression at level 1 (the =1 is the compression level) to every object in file1, saves the output in file2, and prints verbose output.

Be sure to use the -v (verbose) option, since it reports extra information about the file, such as the list of objects, the compression ratio achieved, etc.

h5repack works by traversing the original HDF5 file file1 and recreating all of its objects one by one in file2; since only the objects still linked in file1 are copied, this also recovers the space of any deleted (unlinked) datasets.

Pedro



----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.