manipulate data in hdf5

How easy is it to manipulate data in a preexisting hdf5 file?

For example, if we have this in one HDF5 file

/set
   xxx 1 2
   yyy 2 4
   zzz 4 5

We want

/set
  /xxx
   1 2
  /yyy
   2 4
  /zzz
   4 5

Would this be terribly hard to do?

HDF5 is pretty much a write-once data format. There are a few commands like
"move", "rename" and "unlink" to do some obvious stuff, but as far as I
know, manipulating the data itself within a file can only be accomplished by
reading it, doing whatever to it, writing the manipulated data into a new
file, and deleting the old file.
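
For what it's worth, in the 1.8 C API those link operations are H5Lmove
(which covers both "move" and "rename") and H5Ldelete. A rough sketch of
using them to push one of the original poster's entries into its own group,
assuming /set/xxx really is a separate dataset link; the file name is made
up and error checking is omitted:

#include "hdf5.h"

int main(void)
{
    /* Placeholder file name; open read-write so links can be changed. */
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDWR, H5P_DEFAULT);

    /* "move"/"rename": create a destination group, then relink the dataset. */
    hid_t grp = H5Gcreate2(file, "/xxx", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Lmove(file, "/set/xxx", file, "/xxx/data", H5P_DEFAULT, H5P_DEFAULT);
    H5Gclose(grp);

    /* "unlink": remove a link ("/obsolete" is a placeholder); the space it
       occupied is only reclaimed when the file is rewritten, e.g. by h5repack. */
    H5Ldelete(file, "/obsolete", H5P_DEFAULT);

    H5Fclose(file);
    return 0;
}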

Best Regards,
Joonas

> HDF5 is pretty much a write-once data format. There are a few commands like
> "move", "rename" and "unlink" to do some obvious stuff, but as far as I
> know, manipulating the data itself within a file can only be accomplished by
> reading it, doing whatever to it, writing the manipulated data into a new
> file, and deleting the old file.

I'm not sure what this is supposed to mean... HDF5 absolutely supports
in-place modification of data. You read a selection, do whatever to
it, and then just write it back to the dataset with the same selection
(see e.g. H5Dwrite). If you want to change your data model (e.g.
split a 3 x 100 x 100 dataset into three 100 x 100 datasets), then
you'll have to do some additional work, but you don't need to throw
the file away. In the worst case you might end up with some empty
space or fragmentation in the file structure, which would then require
repacking.
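
A minimal sketch of that read-modify-write cycle; the file and dataset names
are placeholders, a 1-D integer dataset is assumed, and error checking is
omitted:

#include "hdf5.h"

int main(void)
{
    hid_t file   = H5Fopen("example.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "/set", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    /* Select two elements starting at index 2 in the file dataspace. */
    hsize_t start[1] = {2}, count[1] = {2};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* Matching two-element dataspace in memory. */
    hid_t mspace = H5Screate_simple(1, count, NULL);

    /* Read the selection, change it, and write it straight back in place. */
    int buf[2];
    H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);
    buf[0] *= 10;
    buf[1] *= 10;
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}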

Andrew

All right, thanks for the clarification - I was incorrect. It's indeed possible
to change existing datasets in the way you described - changing the
elements within a selection without touching the structure (data model).
But let's say you have a 1 x 1000000 dataset and you want to remove elements
100000-600000 to save space; if it's possible to do that elegantly, I'm
interested to know how. I thought it was easier to just create a new file in
cases like this (which is what repack does).

Best Regards,
Joonas

There's no one-step way to do it; first, you would have to manually
move the elements 600,000 through 1,000,000 down to the lower
boundary. Then you can call H5Dset_extent to remove the leftover
elements at the end of the array. It's possible this could result in
unclaimed free space in the file, depending on your use patterns. The
advantage is that if you have other datasets in the file you don't
have to copy all of them over when you change one.
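
In outline, that move-then-shrink sequence might look like the sketch below
(placeholder names, integer data assumed, 0-based indices, the whole block
copied through memory in one go for brevity; note that H5Dset_extent only
works on chunked datasets):

#include <stdlib.h>
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/big1d", H5P_DEFAULT);

    /* Read the 400,000 elements at indices 600,000..999,999 into memory. */
    hsize_t nmove = 400000;
    int *buf = malloc(nmove * sizeof *buf);
    hid_t mspace = H5Screate_simple(1, &nmove, NULL);
    hid_t fspace = H5Dget_space(dset);
    hsize_t start = 600000;
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &nmove, NULL);
    H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);

    /* Write them back starting at index 100,000, overwriting the unwanted range. */
    start = 100000;
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &nmove, NULL);
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);

    /* Chop off the leftover tail: 1,000,000 -> 500,000 elements. */
    hsize_t newsize = 500000;
    H5Dset_extent(dset, &newsize);

    free(buf);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}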

If you're only concerned about the file size, and these elements are
no longer used, you can also consider zeroing them out and using
compression. Even gzip level 1 will be able to pack a stretch of
500,000 zeros down to almost nothing. Of course, you would then have
to keep track of what parts of the dataset are "masked". The most
appropriate solution depends on your particular mix of reads, writes
and shape changes.
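
One caveat: the deflate filter can only be set up when a dataset is created
(chunked layout plus H5Pset_deflate on the creation property list), so the
sketch below creates such a dataset and then zeroes the unwanted range;
names, sizes and the chunk shape are placeholders:

#include <stdlib.h>
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("compressed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* gzip requires chunked storage; both go on the dataset creation plist. */
    hsize_t dims = 1000000, chunk = 10000;
    hid_t space = H5Screate_simple(1, &dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    H5Pset_deflate(dcpl, 1);                       /* gzip level 1 */

    hid_t dset = H5Dcreate2(file, "/big1d", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* "Mask" elements 100,000..599,999 by writing zeros over them;
       all-zero chunks deflate down to almost nothing. */
    hsize_t start = 100000, count = 500000;
    int *zeros = calloc(count, sizeof *zeros);
    hid_t mspace = H5Screate_simple(1, &count, NULL);
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, zeros);

    free(zeros);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Sclose(space);
    H5Pclose(dcpl);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}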

Andrew

Thank you all for the clarifications!

Yep, thanks - the zeroing-plus-gzip trick in particular was ingenious, I'll try it!

As you wrote, the optimal way of doing things is obviously case-dependent.
I'm currently working with a small number of quite large datasets, so I'm
really concerned about file size but not so much about whether the file is
rewritten by me or by repack. Someone who has lots of small, varied
datasets might think differently.

Which leads to a question: is there (going to be) any kind of HDF5 book,
community wiki or something, where this kind of "experience-based wisdom of
serious HDF5 users" would be presented? Having to depend on a mailing list
as the main source of important how-to information feels kind of fragile.

Best Regards,
Joonas
