rsync with hdf5 files

Does anyone out there have any experience using rsync to copy HDF5 files? I've been trying to use rsync to make backups of HDF5 files as they grow, but instead of the expected fairly constant time required for each update, the rsync time increases as the HDF5 file grows. This suggests to me that rsync is re-transferring data instead of just transferring differences. That, or as I add data to the HDF5 file, changes are being made to numerous locations in the file.

I thought maybe the problem was that the time spent doing checksums was causing the increase as the files grew in size, but the rsync output indicates a linear increase in actual data transferred as well, just like the run time.

The files in question contain multiple data sets that are being updated, each of which is stored as chunked, compressed data.

The only thing I can think of to fiddle with on the rsync end is the checksum block size, trying to make it closer to the size of the blocks in the HDF5 file, which is unknown to me at the moment.
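
For example (untested, and both the block size value and the paths below are placeholders until I know the real chunk sizes), I could force a fixed checksum block size with something like:

  rsync -av --block-size=65536 /data/hdf5/ backuphost:/backups/hdf5/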

Alternatively, I could make the files smaller, but that would not be my first choice, as it would be a major design change.

If anyone has any suggestions as to how to resolve this "creeping transfer time" issue, I'd appreciate it.

Hmm, never thought of this before, but one of the first things that comes
to mind is whether rsync supports 'diffing' of HDF5 binary, compressed,
chunked files at all.

My naive understanding of tools like rsync is that they come
pre-packaged with the ability to diff ASCII text files but not binary
files of any kind.

Based on the behavior you describe, I suspect rsync cannot diff your
HDF5 files and so it is doing the only thing it really can do: copy the
whole darn binary file.

So, the next question is whether you can 'smarten' rsync to somehow diff
HDF5 files, maybe using the h5diff tool? That's as far as my thinking
takes me. Good luck.

Mark

It looks like you're wrong. Proof by Wikipedia:
"Unlike diff, the process of creating a delta file has two steps: first a signature file is created from file A, and then this (relatively small) signature and file B are used to create the delta file. Also unlike diff, rdiff works well with binary files."

Compression might screw with that, though. An idea is to use rsync compression instead, and leave the HDF5 files uncompressed. From man rsync:
  -z, --compress compress file data during the transfer
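
As a sketch of what I mean (host and paths invented), if the datasets were stored uncompressed in the file, the backup could be run as:

  rsync -avz /data/hdf5/ backuphost:/backups/hdf5/

so the data is still compressed on the wire, while the on-disk layout should stay friendlier to rsync's delta algorithm.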

Cheers
Paul

Hi,

I don't think it's much of a surprise that rsync can't do small
deltas on binary compressed files. If you change just a single
bit in a file, the compressed files before and after can be
radically different, so a "diff" would be huge... It's rather
well known that rsync has problems with these kinds of files
- that's why gzip has an option named '--rsyncable' which makes
it output files that are a bit larger but can be rsync'ed more
effectively. I don't know how you compress those files; perhaps
if you use gzip you can try that option?
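
For a file compressed with the standalone gzip tool that would look
something like the following (file name invented); note that the option
is a distribution patch and not every gzip build has it:

  gzip --rsyncable bigfile.dat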

                          Best regards, Jens

are you transferring locally, like to an external hard drive?

rsync's -W (whole file) option says this:

   This [-W] is the default when both the source and destination
   are specified as local paths, but only if no batch-writing option
   is in effect.

Does the '--no-whole-file' option help at all?
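
If it is a local copy, forcing the delta-transfer algorithm would look
something like this (the destination path is made up):

   rsync -av --no-whole-file /data/hdf5/ /mnt/backupdrive/hdf5/

Over ssh between two hosts, the delta algorithm should already be the
default, as far as I know.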

==rob

Paul Anton Letnes wrote:

Compression might screw with that, though. An idea is to use rsync compression instead, and leave the hdf5 files uncompressed. From man rsync:
  -z, --compress compress file data during the transfer

Cheers
Paul
  

I'm a bit reluctant to turn off compression due to the storage requirements of the data in question (it's an enormous amount of data that happens to compress very well). Still, aren't the chunks compressed individually? I was under the impression that when compressing, each chunk was individually compressed. As such, the only things that should be changing are those chunks that have had new data added, and the table of contents (I forget the term the devs use). How much is actually changed probably depends a lot on the pre-allocation of data.

Now, I'm using the gzip *filter*, in conjunction with the shuffle filter. There's no indication of an "rsyncable" option here, or in the (admittedly dated) gzip binaries I have installed. Using gzip outside of HDF5 would require an awful lot of reengineering and even so would significantly limit the accessibility of the data.
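
For anyone unfamiliar with that setup, the creation side looks roughly like the sketch below; the file name, dataset name, dimensions and chunk sizes are invented for illustration, not our real values. The point is that shuffle and deflate are applied per chunk, inside the library, at write time.

  #include "hdf5.h"

  /* Sketch: create an extendible, chunked dataset with the shuffle and
   * gzip (deflate) filters enabled on its creation property list. */
  int main(void)
  {
      hsize_t dims[2]    = {256, 1024};
      hsize_t maxdims[2] = {H5S_UNLIMITED, 1024};  /* grows along dim 0   */
      hsize_t chunk[2]   = {256, 1024};            /* made-up chunk size  */

      hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                              H5P_DEFAULT, H5P_DEFAULT);
      hid_t space = H5Screate_simple(2, dims, maxdims);
      hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

      H5Pset_chunk(dcpl, 2, chunk);
      H5Pset_shuffle(dcpl);                        /* shuffle filter       */
      H5Pset_deflate(dcpl, 6);                     /* gzip filter, level 6 */

      hid_t dset = H5Dcreate2(file, "/some_dataset", H5T_NATIVE_DOUBLE,
                              space, H5P_DEFAULT, dcpl, H5P_DEFAULT);

      H5Dclose(dset);
      H5Pclose(dcpl);
      H5Sclose(space);
      H5Fclose(file);
      return 0;
  }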

One thing that I might do to quantify this is save a copy of one of the files between rsyncs and do a binary diff afterward to see what's really changing. Unfortunately I don't have the ins and outs of the HDF5 file format stored in my brain so interpretation of the results of such a test will be time consuming.

John Knutson wrote:

One thing that I might do to quantify this is save a copy of one of the files between rsyncs and do a binary diff afterward to see what's really changing. Unfortunately I don't have the ins and outs of the HDF5 file format stored in my brain so interpretation of the results of such a test will be time consuming.

On a related note, are there any test tools that do a verbose, raw-ish dump of an HDF5 file? That is, something that turns the low-level format stuff into readable text that I can subsequently use diff on, with the details as described in http://www.hdfgroup.org/HDF5/doc/H5.format.html. So far I've been looking at diffs using GNU od with hex output (seriously, who uses octal anymore? :-)
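
Roughly what I have been doing so far, with made-up file names:

  od -A x -t x1 backup_copy.h5 > before.hex
  od -A x -t x1 current.h5     > after.hex
  diff before.hex after.hex | less

which at least shows where the bytes differ, if not what they mean.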

Paul Anton Letnes wrote:
> Compression might screw with that, though. An idea is to use rsync
> compression instead, and leave the hdf5 files uncompressed. From man
> rsync:
> -z, --compress compress file data during the transfer
>
> Cheers
> Paul
>
I'm a bit reluctant to turn off compression due to the storage
requirements of the data in question (it's an enormous amount of data
that happens to compress very well). Still, aren't the chunks
compressed individually? I was under the impression that when
compressing, each chunk was individually compressed.

Yes, chunks are compressed individually. One good reason for that is so
that partial readback doesn't wind up requiring decompression of the
entire dataset. But each compressed chunk can wind up taking a variable
amount of space in the file; some chunks compress really well and others
don't.

As such, the only
things that should be changing are those chunks that have had new data
added, and the table of contents (I forget the term the devs use). How
much is actually changed probably depends a lot on the pre-allocation of
data.

Might there be other things that could contribute to a cascade of
differences? I mean, what about the order of HDF5 writes to the file? If
you overwrite and/or extend an existing dataset, what about the impact
of 'garbage collection'? If you overwrite a portion of a dataset with
new data and the new chunks don't compress into the same space the old
chunks occupied, then I think you can get some re-arrangement of chunks
in the file, and possibly dead space that can't be reclaimed. Are there
timestamps on these things too?
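
If object timestamps turn out to be part of the noise, I believe there
is an object creation property to turn time tracking off; something
along the lines of the fragment below, which I haven't tried myself:

  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_obj_track_times(dcpl, 0);  /* don't record change times on the object */

and then that property list gets used when creating the dataset.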

Now, I'm using the gzip *filter*, in conjunction with the shuffle
filter. There's no indication of an "rsyncable" option here, or in the
(admittedly dated) gzip binaries I have installed.

I was not aware of that option for the gzip application (tool). And I am
certain the HDF5 library does not have a 'property' to affect that in
its dataset creation property lists.

If zlib has a way to produce rsyncable output via its C interface, you
could try writing your own HDF5 filter to use in place of HDF5's
built-in gzip filter. I honestly don't know what effect chunking would
have on that, though.
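
For what it's worth, the skeleton of a user-defined filter looks roughly
like the sketch below. The filter id, the name, and the (empty)
compression logic are all placeholders; the real work would go in the
callback:

  #include "hdf5.h"

  #define MY_FILTER_ID 306   /* placeholder id in the 256-511 testing range */

  /* Filter callback: called per chunk; compress on write, decompress on
   * read. As written it just passes the data through unchanged. */
  static size_t my_filter(unsigned flags, size_t cd_nelmts,
                          const unsigned cd_values[], size_t nbytes,
                          size_t *buf_size, void **buf)
  {
      if (flags & H5Z_FLAG_REVERSE) {
          /* decompress *buf (nbytes long) into a new buffer here */
      } else {
          /* compress *buf here, e.g. with zlib, flushing at
           * content-defined boundaries to keep the output rsync-friendly */
      }
      return nbytes;   /* size of the output buffer */
  }

  static const H5Z_class2_t my_filter_class = {
      H5Z_CLASS_T_VERS,            /* version of the class struct     */
      (H5Z_filter_t)MY_FILTER_ID,  /* filter id                       */
      1, 1,                        /* encoder/decoder present flags   */
      "example rsync-friendly gzip",
      NULL, NULL,                  /* can_apply / set_local callbacks */
      my_filter                    /* the filter callback             */
  };

  /* Register once, then enable it with H5Pset_filter() on the dataset
   * creation property list in place of H5Pset_deflate(). */
  static void register_my_filter(void)
  {
      H5Zregister(&my_filter_class);
  }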

Hi John,

  Try using the 'h5debug' tool in the tools/misc subdirectory. You should be able to walk the low level file structure with it (and a handy copy of the file format spec. :-).
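
  For example (file name invented; the useful addresses come out of the previous dumps):

    h5debug myfile.h5        # dumps the superblock
    h5debug myfile.h5 96     # dumps whatever object lives at address 96

  Each run follows one more link into the file structure.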

  Quincey

Quincey Koziol wrote:

  Try using the 'h5debug' tool in the tools/misc subdirectory. You should be able to walk the low level file structure with it (and a handy copy of the file format spec. :-).

I guess it can be done with this tool, but it sure makes it difficult, requiring
1) multiple runs of the program, probably hundreds of times, to walk an entire file, and
2) enough knowledge about the format to know what address to specify to get it to do the walk.

What I'd *really* like is a tool that dumped the file with the interpreted data in one column and the address and hex dump in another, e.g.

Addresses:

  Base:                            0                     0018: 00 00 00 00 00 00 00 00
  File Free-space Info:            18446744073709551615  0020: ff ff ff ff ff ff ff ff
  End of File:                     17971784              0028: 48 3a 12 01 00 00 00 00
  Driver Information Block:        18446744073709551615  0030: ff ff ff ff ff ff ff ff
  Root Group Symbol Table Entry:   0                     0038: 00 00 00 00

Rob Latham wrote:

are you transferring locally, like to an external hard drive?

rsync's -W (whole file) option says this:

   This [-W] is the default when both the source and destination
   are specified as local paths, but only if no batch-writing option
   is in effect.

Does the '--no-whole-file' option help at all?

==rob
  

I haven't tried that option, but the rsync is a remote rsync over ssh, between two separate hosts.

When I was still running it with the --progress option, it was clearly indicating that it was doing partial file transfers.

I ran rsync with the --stats option to get a bit more detail, and here it is:
Number of files: 429
Number of files transferred: 27
Total file size: 17242186574 bytes
Total transferred file size: 284497273 bytes
Literal data: 6386759 bytes
Matched data: 278246448 bytes
File list size: 9254
File list generation time: 0.050 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 501086
Total bytes received: 5616588

So it looks like, in that particular instance, only about 6MB of the 271MB of transferred file size was literal (changed) data, which was compressed down to ~5MB on the wire. I think I'm going to leave that option on for a while just to see how the literal/matched/received byte counts change over time.

Hi John,

  Interesting idea, I'll keep that in mind for future work on the debug tool.

  Thanks,
    Quincey