puzzled by chunking, storage, and performance.

I'm still trying to milk as much as I can, performance-wise, out of HDF5...

My latest bit of confusion comes from the following seeming paradox. I have two files, poo2.h5 and poo3-d-1.h5. Both files contain exactly the same data, though poo2, being the larger data set, has more blank-filled elements. Also, because of the larger size of the data set in poo2, the data starts at (0, 0, 694080) or thereabouts, vs. (0,0,0) for poo3-d-1. My question is: "why is the smaller data set 10x larger in size (bytes?) than the larger data set with the same data and chunking?"

Is there any way to look at the details of what data is stored in the file, i.e. how many and maybe which chunks are stored, etc.?

HDF5 "poo2.h5" {
DATASET "/Data/IS-GPS-200 ID 2 Ephemerides" {
   DATATYPE "/Types/Ephemeris IS-GPS-200 id 2"
   DATASPACE SIMPLE { ( 26, 160, 1051200 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, 1051200 ) }
   STORAGE_LAYOUT {
      CHUNKED ( 1, 15, 250 )
      SIZE 23228 (52713869.468:1 COMPRESSION)
    }

vs.

HDF5 "poo3-d-1.h5" {
DATASET "/Data/IS-GPS-200 ID 2 Ephemerides" {
   DATATYPE "/Types/Ephemeris IS-GPS-200 id 2"
   DATASPACE SIMPLE { ( 1, 160, 2880 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, 2880 ) }
   STORAGE_LAYOUT {
      CHUNKED ( 1, 15, 250 )
      SIZE 251461 (513.097:1 COMPRESSION)
    }

Hi John,

I'm still trying to milk as much as I can, performance-wise, out of HDF5...

My latest bit of confusion comes from the following seeming paradox. I have two files, poo2.h5 and poo3-d-1.h5. Both files contain exactly the same data, though poo2, being the larger data set, has more blank-filled elements. Also, because of the larger size of the data set in poo2, the data starts at (0, 0, 694080) or thereabouts, vs. (0,0,0) for poo3-d-1. My question is: "why is the smaller data set 10x larger in size (bytes?) than the larger data set with the same data and chunking?"

  The "compression ratio" reported accounts for the sparseness of the chunks in the dataset. You probably have written more data to the smaller dataset.

Is there any way to look at the details of what data is stored in the file, i.e. how many and maybe which chunks are stored, etc.?

  We don't have a way to return a "map" of the chunks for a dataset currently (although it is in our issue tracker).
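  In the meantime, you can compare the bytes actually allocated for a dataset against its logical (dataspace x datatype) size, which is what that ratio reflects; chunks that were never written take no space in the file. h5ls -v, h5stat, and h5dump -p -H (which produced the STORAGE_LAYOUT output above) all report the allocated size. A minimal C sketch along the same lines (the file and dataset names are just the ones from your dump; error checking omitted):

#include <stdio.h>
#include "hdf5.h"

/* Sketch: compare bytes allocated in the file against the logical size
 * of the dataspace.  File/dataset names are the ones from the dump.   */
int main(void)
{
    hid_t   file, dset, space, dtype;
    hsize_t dims[3], maxdims[3], nelem = 1;
    int     i, ndims;

    file  = H5Fopen("poo2.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    dset  = H5Dopen2(file, "/Data/IS-GPS-200 ID 2 Ephemerides", H5P_DEFAULT);
    space = H5Dget_space(dset);
    dtype = H5Dget_type(dset);

    ndims = H5Sget_simple_extent_dims(space, dims, maxdims);
    for (i = 0; i < ndims; i++)
        nelem *= dims[i];

    /* Only chunks that have actually been written occupy file space. */
    hsize_t stored  = H5Dget_storage_size(dset);
    hsize_t logical = nelem * H5Tget_size(dtype);

    printf("stored = %llu bytes, logical = %llu bytes, ratio = %.3f:1\n",
           (unsigned long long)stored, (unsigned long long)logical,
           stored ? (double)logical / (double)stored : 0.0);

    H5Tclose(dtype); H5Sclose(space); H5Dclose(dset); H5Fclose(file);
    return 0;
}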

  Quincey



On Wednesday 22 September 2010 at 21:36:57, John Knutson wrote:

I'm still trying to milk as much as I can, performance-wise, out of
HDF5...

My latest bit of confusion comes from the following seeming paradox.
I have two files, poo2.h5 and poo3-d-1.h5. Both files contain
exactly the same data, though poo2, being the larger data set, has
more blank-filled elements. Also, because of the larger size of the
data set in poo2, the data starts at (0, 0, 694080) or thereabouts,
vs. (0,0,0) for poo3-d-1. My question is: "why is the smaller data
set 10x larger in size (bytes?) than the larger data set with the
same data and chunking?"

[clip]

Maybe compression has something to do with it? poo2 has a 5e7 compression
ratio, while poo3-d-1 has 5e2. While I can understand the latter
figure, the former ratio (5e7) seems far too high. Maybe poo2 is
only made of zeros?

···

--
Francesc Alted

Setting aside the strange sizing issues in the earlier messages for a moment...

Let's say I have a data set, dimensioned ( 26, 160, 1051200 )
and chunked ( 1, 15, 240 )

As I understand it, each individual chunk in the file will be in the following order:
[ 0, 0, 0-239 ] - [ 0, 14, 0-239 ]

and the chunks will be ordered thus:
[ 0, 0, 0 ], [ 0, 0, 240 ] ... [ 0, 0, 1051200 ], [ 0, 15, 0 ], [ 0, 15, 240 ] ... [ 0, 15, 1051200 ]
and so on...

Is that correct?

Should I expect peak read performance by reading one chunk at a time in that order, assuming each chunk is 1MB in size, as is the cache?

I notice there are functions for examining the hit % of the metadata cache... any chance of equivalent functions for the raw data chunk cache?
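In case it's useful context, here's roughly how I'm configuring the per-dataset chunk cache at the moment (just a sketch; the 1 MiB figure and the slot count are my own guesses, and H5Pset_chunk_cache needs HDF5 1.8.3 or later):

#include "hdf5.h"

/* Open the dataset with a per-dataset chunk cache big enough to hold
 * one uncompressed chunk (sketch; numbers are illustrative). */
hid_t open_with_cache(hid_t file)
{
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);

    /* rdcc_nslots: hash slots, ideally a prime ~10-100x the number of
     * chunks that fit in the cache; rdcc_nbytes: total cache size;
     * rdcc_w0 = 0.75: prefer evicting fully read/written chunks first. */
    H5Pset_chunk_cache(dapl, 12421, 1 * 1024 * 1024, 0.75);

    hid_t dset = H5Dopen2(file, "/Data/IS-GPS-200 ID 2 Ephemerides", dapl);
    H5Pclose(dapl);
    return dset;
}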

You may find some of the chunking discussions in this paper of interest:

  http://www.hdfgroup.org/pubs/papers/2008-06_netcdf4_perf_report.pdf

in particular, section 3.2 and portions of sections 4 & 5.

Setting aside the strange sizing issues in the earlier messages for a moment...

Let's say I have a data set, dimensioned ( 26, 160, 1051200 )
and chunked ( 1, 15, 240 )

As I understand it, each individual chunk in the file will be in the following order:
[ 0, 0, 0-239 ] - [ 0, 14, 0-239 ]

and the chunks will be ordered thus:
[ 0, 0, 0 ], [ 0, 0, 240 ] ... [ 0, 0, 1051200 ], [ 0, 15, 0 ], [ 0, 15, 240 ] ... [ 0, 15, 1051200 ]
and so on...

Is that correct?

Chunks are not necessarily ordered on the disk, so the sequence in which you read the chunks shouldn't impact performance.



Thanks. I've read the pertinent sections, and what I'm coming away with is that chunk *sizes* should be designed around the I/O bandwidth of your disk subsystem, while chunk *shapes* should be designed around the access patterns for the data and around the data set itself (avoiding mostly empty chunks and so on, per the section 5.1.2 guidelines)...

What this doesn't really get into, it seems to me, is the role of the raw data chunk cache in all of this.

I don't think contiguous data is even an option for us, as we would have several multi-terabyte data sets which take quite some time just to initialize on disk.
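For what it's worth, the chunked route lets us sidestep that initialization cost entirely by deferring allocation and skipping fill-value writes; a minimal sketch of the dataset-creation properties involved (dimensions taken from the earlier dump, helper name made up, error checking omitted):

#include "hdf5.h"

/* Chunked dataset that allocates chunks lazily and never writes fill
 * values, so creating a huge (even multi-TB) dataset is nearly free.
 * Dimensions and chunk sizes are the ones from the earlier dump.      */
hid_t create_lazy(hid_t loc, hid_t dtype)
{
    hsize_t dims[3]    = {26, 160, 1051200};
    hsize_t maxdims[3] = {H5S_UNLIMITED, H5S_UNLIMITED, 1051200};
    hsize_t chunk[3]   = {1, 15, 250};

    hid_t space = H5Screate_simple(3, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_INCR); /* allocate chunks as written */
    H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);  /* don't pre-write fill values */

    hid_t dset = H5Dcreate2(loc, "Ephemerides", dtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}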


Several users have raised questions regarding chunking in HDF5. Partly in response to these questions, the initial draft of a new "Chunking in HDF5" document is now available on The HDF Group's website:
    http://www.hdfgroup.org/HDF5/doc/_topic/Chunking/

This draft includes sections on the following topics:
    General description of chunks
    Storage and access order
    Partial I/O
    Chunk caching
    I/O filters and compression
    Pitfalls and errors to avoid
    Additional Resources
    Future directions
Several suggestions for tuning chunking in an application are provided along the way.

As a draft, this remains a work in progress; your feedback will be appreciated and will be very useful in the document's development. For example, let us know if there are additional questions that you would like to see treated.

Regards,
-- Frank Baker
   HDF Documentation
   fbaker@hdfgroup.org

Does The HDF Group have any kind of plan/schedule for enabling compression of chunks when using parallel I/O?

The use case being that each process compresses its own chunk at write time and the overall file size is reduced.
(I understand that chunks are preallocated and this makes it hard to implement compressed chunking with Parallel IO).
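For concreteness, the uncompressed baseline looks roughly like the sketch below (made-up sizes, one chunk per rank, collective write); the commented-out H5Pset_deflate line is the bit that, as far as I understand, parallel writes currently reject:

#include <mpi.h>
#include "hdf5.h"

/* Sketch: each rank owns one chunk-sized slab of a 1-D dataset and the
 * ranks write collectively.  N is a made-up per-rank element count.    */
#define N 1024

void write_parallel(const char *fname, const double *mydata,
                    int rank, int nranks)
{
    hsize_t dims[1]  = {(hsize_t)N * nranks};
    hsize_t chunk[1] = {N};
    hsize_t start[1] = {(hsize_t)N * rank}, count[1] = {N};

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    /* H5Pset_deflate(dcpl, 6);  <- the part that parallel writes
     *                              can't currently handle          */
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    hid_t mspace = H5Screate_simple(1, count, NULL);
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, mydata);

    H5Pclose(dxpl); H5Sclose(fspace); H5Sclose(mspace);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Fclose(file); H5Pclose(fapl);
}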

Thanks

JB


Hi John,

Does The HDF Group have any kind of plan/schedule for enabling compression of chunks when using parallel I/O?

  It's on my agenda for the first year of work that we will be starting soon for LBNL. I think it's feasible for independent I/O, with some work. I think collective I/O will probably require a different approach, however. At least with collective I/O, all the processes are available to communicate and work on things together...

  The problem with the collective I/O [write] operations is that multiple processes may be writing into each chunk, which MPI-I/O can handle when the data is not compressed, but since compressed data is context-sensitive, straightforward collective I/O won't work for compressed chunks. Perhaps a two-phase approach would work, where the data for each chunk is shipped to a single process, which updates the data in the chunk and compresses it, followed by one or more passes of collective writes of the compressed chunks.

  The problem with independent I/O [write] operations is that compressed chunks [almost always] change size when the data in the chunk is written (either initially, or when the data is overwritten), and since all the processes aren't available, communicating the space allocation is a problem. Each process needs to allocate space in the file, but since the other processes aren't "listening", it can't let them know that some space in the file has been used. A possible solution to this might involve just appending data to the end of the file, but that's prone to race conditions between processes (although maybe the "shared file pointer" I/O mode in MPI-I/O would help this). Also, if each process moves a chunk around in the file (because it resized it), how will other processes learn where that chunk is, if they need to read from it?

The use case being that each process compresses its own chunk at write time and the overall file size is reduced.
(I understand that chunks are preallocated and this makes it hard to implement compressed chunking with Parallel IO).

  Some other ideas that we've been kicking around recently are:

- Using a lossy compressor (like a wavelet encoder) to put a fixed upper limit on the size of each chunk, making them all the same size. This will obviously affect the precision of the data stored and thus may not be a good solution for restart dumps, although it might be fine for visualization/plot files. It's great from the perspective that it completely eliminates the space allocation problem, though.

- Use a lossless compressor (like gzip), but put an upper limit on the compressed size of a chunk, something that's likely to be achievable, like 2:1 or so. Then, if each chunk can't be compressed to that size, have the I/O operation fail. This eliminates the space allocation issue, but at the cost of possibly not being able to write compressed data at all.

- Alternatively, use a lossless compressor with an upper limit on the compressed size of a chunk, but also allow for chunks that aren't able to be compressed to the goal ratio to be stored uncompressed. So, the dataset will only have two sizes of chunks: full-size chunks and half-size (or third-size, etc) chunks, which limits the space allocation complexities involved. I'm not certain this buys much in the way of benefits, since it doesn't eliminate space allocation, and probably wouldn't address the space allocation problems with independent I/O.

  Any other ideas or input?

    Quincey


Something that puzzles me here is that if my parallel app. applied szip,
gzip or whatever compression to my data on each processor BEFORE ever
passing it to HDF5, I can then successfully engage in write operations
to HDF5 treating the data as an opaque array of bytes of known size
using collective or independent parallel I/O just as any other
'ordinary' HDF5 dataset (using either chunked or contig layouts).

The problem, of course, is that the HDF5 library would not be 'aware' of
the data's true nature (either its original pre-compressed type or the
fact that it had been compressed and by which algorithm(s) etc.).
Subsequent readers would have to 'know' what to do with it, etc.

So, why can't we fix the second half of this problem and invent a way to
hand HDF5 'pre-filtered' data, and bypass any subsequent attempts in
HDF5 to filter it (or chunks thereof) on write. On the read end, enough
information would be available to the library to 'do the right thing'.

I guess another way of saying this is that HDF5's chunking is specified
in terms of the dataset's 'native shape'. For compressed data, why not
'turn that around' and handle chunking as buckets of a fixed number of
compressed bytes of the dataset (where number of bytes is chosen to
equate to #bytes of a chunk as specified in the dataset's 'native'
layout) but when uncompressed yields a variable sized 'chunk' in the
native layout?
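Concretely, the "hand HDF5 an opaque byte array" half of this is already trivial today; the serial core of it is just something like the following sketch (zlib here; the dataset/attribute names and the helper are made up, and real code would check every return value):

#include <stdlib.h>
#include <zlib.h>
#include "hdf5.h"

/* Compress a buffer ourselves, then hand HDF5 an opaque byte array of
 * known size.  HDF5 has no idea the data is compressed; a reader would
 * need the attribute(s) to reconstruct it.  (Sketch only.)             */
hid_t write_precompressed(hid_t loc, const char *name,
                          const void *buf, size_t nbytes)
{
    uLongf clen = compressBound(nbytes);
    Bytef *cbuf = malloc(clen);
    compress2(cbuf, &clen, (const Bytef *)buf, (uLong)nbytes,
              Z_DEFAULT_COMPRESSION);

    hsize_t dims[1] = {clen};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(loc, name, H5T_NATIVE_UCHAR, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_UCHAR, H5S_ALL, H5S_ALL, H5P_DEFAULT, cbuf);

    /* Record the uncompressed size so a reader can inflate it again. */
    hsize_t one[1] = {1};
    hid_t aspace = H5Screate_simple(1, one, NULL);
    hid_t attr = H5Acreate2(dset, "uncompressed_nbytes", H5T_NATIVE_HSIZE,
                            aspace, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t n = nbytes;
    H5Awrite(attr, H5T_NATIVE_HSIZE, &n);

    H5Aclose(attr); H5Sclose(aspace); H5Sclose(space);
    free(cbuf);
    return dset;
}

It's the second half, making the library treat those bytes as the dataset's real (filtered) contents, that's missing.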

Mark


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

Hi Mark,


Something that puzzles me here is that if my parallel app. applied szip,
gzip or whatever compression to my data on each processor BEFORE ever
passing it to HDF5, I can then successfully engage in write operations
to HDF5 treating the data as an opaque array of bytes of known size
using collective or independent parallel I/O just as any other
'ordinary' HDF5 dataset (using either chunked or contig layouts).

  Yes, but this would have to be to different datasets, with the appropriate overhead. (Along with the lack of self-description that you mention below) You are just pushing the space allocation problem to the dataset creation step, in this case. Also, this approach would only work for independent I/O, and possibly only for a subset of those...

The problem, of course, is that the HDF5 library would not be 'aware' of
the data's true nature (either its original pre-compressed type or the
fact that it had been compressed and by which algorithm(s) etc.).
Subsequent readers would have to 'know' what to do with it, etc.

So, why can't we fix the second half of this problem and invent a way to
hand HDF5 'pre-filtered' data, and bypass any subsequent attempts in
HDF5 to filter it (or chunks thereof) on write. On the read end, enough
information would be available to the library to 'do the right thing'.

I guess another way of saying this is that HDF5's chunking is specified
in terms of the dataset's 'native shape'. For compressed data, why not
'turn that around' and handle chunking as buckets of a fixed number of
compressed bytes of the dataset (where number of bytes is chosen to
equate to #bytes of a chunk as specified in the dataset's 'native'
layout) but when uncompressed yields a variable sized 'chunk' in the
native layout?

  Well, as I say above, with this approach, you push the space allocation problem to the dataset creation step (which has its own set of problems), and then performing the I/O would be equivalent to the idea of using a lossless compressor with a [pre-computed by compressing the data] fixed upper limit on the size of the data. I'm concerned about having the application perform the compression directly... Maybe HDF5 could expose an API routine that the application could call, to pre-compress the data by passing it through the I/O filters?
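  (For what it's worth, from the application side such a routine could look like the sketch below. It is essentially the direct chunk write interface that later HDF5 releases expose as H5DOwrite_chunk in the high-level library and then H5Dwrite_chunk in the main library, so treat its availability and exact spelling as assumptions about the HDF5 version in use; error handling omitted.)

#include <stdlib.h>
#include <zlib.h>
#include "hdf5.h"

/* Sketch: the application deflates one chunk's worth of data itself and
 * hands the compressed bytes straight to the library, bypassing the
 * filter pipeline on write.  Assumes the dataset's creation property
 * list already includes H5Pset_deflate, so normal H5Dread calls will
 * decompress transparently.  chunk_offset must be aligned to the
 * dataset's chunk boundaries.                                          */
herr_t write_one_chunk(hid_t dset, const hsize_t *chunk_offset,
                       const void *chunk_buf, size_t chunk_bytes)
{
    uLongf clen = compressBound(chunk_bytes);
    Bytef *cbuf = malloc(clen);
    compress2(cbuf, &clen, (const Bytef *)chunk_buf, (uLong)chunk_bytes,
              Z_DEFAULT_COMPRESSION);

    /* filter mask 0 = "all defined filters (here, deflate) were applied" */
    herr_t status = H5Dwrite_chunk(dset, H5P_DEFAULT, 0,
                                   chunk_offset, clen, cbuf);
    free(cbuf);
    return status;
}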

  Quincey


Maybe HDF5 could allocate some space for the uncompressed data, and if the compressed data doesn't use all that space, re-use the leftover space for other purposes within the same processor, similar to a sparse matrix. This would not reduce the file size when writing the first dataset, but subsequent writes could benefit from it, as would an h5copy of the final dataset later (if copying is an option).

          Werner


--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

  Well, as I say above, with this approach, you push the space
allocation problem to the dataset creation step (which has its own
set of problems),

Yeah, but those 'problems' aren't new to parallel I/O issues. Anyone
that is currently doing concurrent parallel I/O with HDF5 has had to
already deal with this part of the problem -- space allocation at
dataset creation -- right? The point is the caller of HDF5 then knows
how big it will be after its been compressed and HDF5 doesn't have to
'discover' that during H5Dwrite. Hmm puzzling...

I am recalling my suggestion of a '2-pass-planning' VFD where the caller
executes a slew of HDF5 operations on a file TWICE. On the first pass, HDF5
doesn't do any of the actual raw data I/O but just records all the
information about it for a 'repeat performance' second pass. In the
second pass, HDF5 knows everything about what is 'about to happen' and
then can plan accordingly.

What about maybe doing that on a dataset-at-a-time basis? I mean, what
if you set dxpl props to indicate either 'pass 1' or 'pass 2' of a
2-pass H5Dwrite operation. On pass 1, between H5Dopen and H5Dclose,
H5Dwrites don't do any of the raw data I/O but do apply filters and
compute sizes of things it will eventually write. On H5Dclose of pass 1,
all the information of chunk sizes is recorded. Caller then does
everything again, a second time but sets 'pass' to 'pass 2' in dxpl for
H5Dwrite calls and everything 'works' because all processors know
everything they need to know.

  Maybe HDF5 could expose an API routine that the application could
call, to pre-compress the data by passing it through the I/O filters?

I think that could be useful in any case. Like it's now possible to apply
type conversion to a buffer of bytes, it probably ought to be possible
to apply any 'filter' to a buffer of bytes. The second half of this
though would involve smartening HDF5 then to 'pass-through' pre-filtered
data so result is 'as if' HDF5 had done the filtering work itself during
H5Dwrite. Not sure how easy that would be :wink: But, you asked for
comments/input.


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

Replying to multiple comments at once.

Quincey : "multiple processes may be writing into each chunk, which MPI-I/O can handle when the data is not compressed, but since compressed data is context-sensitive"
My initial use case would be much simpler. A chunk would be aligned with the boundaries of the domain decomposition and each process would write one chunk, one at a time. A compression filter would be applied by the process owning the data and then it would be written to disk (much like Mark's suggestion).
a) lossless. Problem understood, chunks varying in size, nasty metadata synchronization, sparse files, issues.
b) lossy. Seems feasible. We were in fact considering a wavelet type compression as a first pass (pun intended). "It's great from the perspective that it completely eliminates the space allocation problem". Absolutely. All chunks are known to be of size X beforehand, so nothing changes except for the indexing and actual chunk storage/retrieval + de/compression.

I also like the idea of using a lossless compression and having the IO operation fail if the data doesn't fit. Would give the user the chance to try their best to compress with some knowledge of the data type and if it doesn't fit the allocated space, to abort.

Mark : Multi-pass VFD. I like this too. It potentially allows a very flexible approach where even if collective IO is writing to the same chunk, the collection/compression phase can do the sums and transmit the info into the hdf5 metadata layer. We'd certainly need to extend the chunking interface to handle variable sized chunks to allow for more/less compression in different areas of the data (actually this would be true for any option involving lossless compression). I think the chunk hashing relies on all chunks being the same size, so any change to that is going to be a huge compatibility breaker. Also, the chunking layer sits on top of the VFD, so I'm not sure if the VFD would be able to manipulate the chunks in the way desired. Perhaps I'm mistaken and the VFD does see the chunks. Correct me anyway.

Quincey : One idea I had and which I think Mark also expounded on is ... each process takes its own data and compresses it as it sees fit, then the processes do a synchronization step to tell each other how much (new compressed) data they have got - and then a dataset create is called - using the size of the compressed data. Now each process creates a hyperslab for its piece of compressed data and writes into the file using collective IO. We now add an array of extent information and compression algorithm info to the dataset as an attribute where each entry has a start and end index of the data for each process.

Now the only trouble is that reading the data back requires a double step of reading the attributes and decompressing the desired piece- quite nasty when odd slices are being requested.

Now I start to think that Mark's double-pass VFD suggestion would do basically this (in one way or another), but maintaining the normal data layout rather than writing a special dataset representing the compressed data.
step 1 : Data is collected into chunks (if already aligned with domain decomposition, no-op), chunks are compressed.
step 2 : Sizes of chunks are exchanged and space is allocated in the file for all the chunks.
step 3 : chunks of compressed data are written
not sure two passes are actually needed, as long as the 3 steps are followed.
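To make step 2 concrete, it's basically just an allgather of the compressed sizes plus a prefix sum for each rank's offset; a sketch (sizes in bytes, helper name made up):

#include <stdlib.h>
#include <mpi.h>

/* Sketch of step 2: every rank learns every chunk's compressed size and
 * derives the byte offset where its own chunk should land in the file. */
long long my_chunk_offset(long long my_size, MPI_Comm comm)
{
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    long long *sizes = malloc(nranks * sizeof *sizes);
    MPI_Allgather(&my_size, 1, MPI_LONG_LONG, sizes, 1, MPI_LONG_LONG, comm);

    long long offset = 0;
    for (int i = 0; i < rank; i++)
        offset += sizes[i];          /* prefix sum over lower ranks */

    free(sizes);
    return offset;                   /* add the file's data start as needed */
}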

...but variable chunk sizes are not allowed in hdf (true or false?) - this seems like a showstopper.
Aha. I understand. The actual written data can/could vary in size, as long as the chunk indices, as they refer to the original dataspace, are regular. Yes?

JB
Please forgive my thinking out aloud


Hi Mark,

  Well, as I say above, with this approach, you push the space
allocation problem to the dataset creation step (which has its own
set of problems),

Yeah, but those 'problems' aren't new to parallel I/O issues. Anyone
that is currently doing concurrent parallel I/O with HDF5 has had to
already deal with this part of the problem -- space allocation at
dataset creation -- right? The point is the caller of HDF5 then knows
how big it will be after it's been compressed, and HDF5 doesn't have to
'discover' that during H5Dwrite. Hmm puzzling...

  True, yes.

I am recalling my suggestion of a '2-pass-planning' VFD where the caller
executes a slew of HDF5 operations on a file TWICE. On the first pass, HDF5
doesn't do any of the actual raw data I/O but just records all the
information about it for a 'repeat performance' second pass. In the
second pass, HDF5 knows everything about what is 'about to happen' and
then can plan accordingly.

  Ah, yes, that may be a good segue into this two-pass feature. I've been thinking about this feature and wondering about how to implement it. Something that occurs to me would be to construct it like a "transaction", where the application opens a transaction, the HDF5 library just records the operations performed with API routines, and then when the application closes the transaction, they are replayed twice: once to record the results of all the operations, and a second time to actually perform all the I/O. That would also help to reduce the collective metadata modification overhead.

What about maybe doing that on a dataset-at-a-time basis? I mean, what
if you set dxpl props to indicate either 'pass 1' or 'pass 2' of a
2-pass H5Dwrite operation. On pass 1, between H5Dopen and H5Dclose,
H5Dwrites don't do any of the raw data I/O but do apply filters and
compute sizes of things it will eventually write. On H5Dclose of pass 1,
all the information of chunk sizes is recorded. Caller then does
everything again, a second time but sets 'pass' to 'pass 2' in dxpl for
H5Dwrite calls and everything 'works' because all processors know
everything they need to know.

  Ah, I like this also!

Maybe HDF5 could expose an API routine that the application could
call, to pre-compress the data by passing it through the I/O filters?

I think that could be useful in any case. Like it's now possible to apply
type conversion to a buffer of bytes, it probably ought to be possible
to apply any 'filter' to a buffer of bytes. The second half of this
though would involve smartening HDF5 then to 'pass-through' pre-filtered
data so result is 'as if' HDF5 had done the filtering work itself during
H5Dwrite. Not sure how easy that would be :wink: But, you asked for
comments/input.

  Yes, that's the direction I was thinking about going.

  I think the transaction idea I mentioned above might be the most general and have the highest payoff. It could even be implemented with poor man's parallel I/O, when the transaction concluded.

  Quincey


Mark : Multi-pass VFD. I like this too.

I used the phrase 'two-pass planning VFD' primarily to jar Quincey's
memory of a discussion we had a few weeks ago. However, the behavior I
proposed is actually a change to HDF5 library internals running on top of
the MPI-IO VFD. Since the gap between CPU and I/O bandwidth is so wide and
only getting wider, I see no problem with doing all the H5Dwrite work
between H5Dopen and H5Dclose twice: once to get sizing information without
actually doing any I/O, and a second time to proceed with the actual I/O
given the known sizing information.

Now, I have conceived of a VFD that could be used to effect the same
behavior over a WHOLE FILE instead of an individual dataset. But, it's
only a glimmer in my eye right now :wink:

At any rate, either of these ideas does involve changes to the application
to essentially tell HDF5 twice what it wants to do.


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

Hi Mark,


  Ah, yes, that may be a good segue into this two-pass feature. I've been thinking about this feature and wondering about how to implement it. Something that occurs to me would be to construct it like a "transaction", where the application opens a transaction, the HDF5 library just records the operations performed with API routines, and then when the application closes the transaction, they are replayed twice: once to record the results of all the operations, and a second time to actually perform all the I/O. That would also help to reduce the collective metadata modification overhead.

  BTW, if we go down this "transaction" path, it allows the HDF5 library to push the fault tolerance up to the application level - the library could guarantee that the atomicity of what was "visible" in the file was an entire checkpoint, rather than the atomicity being on a per-API call basis.

  Quincey


Hi John,

Replying to multiple comments at once.

Quincey : "multiple processes may be writing into each chunk, which MPI-I/O can handle when the data is not compressed, but since compressed data is context-sensitive"
My initial use case would be much simpler. A chunk would be aligned with the boundaries of the domain decomposition and each process would write one chunk, one at a time. A compression filter would be applied by the process owning the data and then it would be written to disk (much like Mark's suggestion).
a) lossless. Problem understood, chunks varying in size, nasty metadata synchronization, sparse files, issues.
b) lossy. Seems feasible. We were in fact considering a wavelet type compression as a first pass (pun intended). "It's great from the perspective that it completely eliminates the space allocation problem". Absolutely. All chunks are known to be of size X beforehand, so nothing changes except for the indexing and actual chunk storage/retrieval + de/compression.

  Yup. (Although it's not impossible for collective I/O)

I also like the idea of using a lossless compression and having the IO operation fail if the data doesn't fit. Would give the user the chance to try their best to compress with some knowledge of the data type and if it doesn't fit the allocated space, to abort.

  OK, at least one other person thinks this is reasonable. :slight_smile:

Mark : Multi-pass VFD.
I like this too. It potentially allows a very flexible approach where even if collective IO is writing to the same chunk, the collection/compression phase can do the sums and transmit the info into the hdf5 metadata layer. We'd certainly need to extend the chunking interface to handle variable sized chunks to allow for more/less compression in different areas of the data (actually this would be true for any option involving lossless compression). I think the chunk hashing relies on all chunks being the same size, so any change to that is going to be a huge compatibility breaker. Also, the chunking layer sits on top of the VFD, so I'm not sure if the VFD would be able to manipulate the chunks in the way desired. Perhaps I'm mistaken and the VFD does see the chunks. Correct me anyway.

  If we go with the multi-pass/transaction idea, I don't think we need to worry about the chunks being different sizes.

  You are correct in that the VFD layer doesn't see the chunk information. (And I think it would be bad to make it so :slight_smile:

Quincey : One idea I had and which I think Mark also expounded on is ... each process takes its own data and compresses it as it sees fit, then the processes do a synchronization step to tell each other how much (new compressed) data they have got - and then a dataset create is called - using the size of the compressed data. Now each process creates a hyperslab for its piece of compressed data and writes into the file using collective IO. We now add an array of extent information and compression algorithm info to the dataset as an attribute where each entry has a start and end index of the data for each process.

Now the only trouble is that reading the data back requires a double step of reading the attributes and decompressing the desired piece- quite nasty when odd slices are being requested.

  Maybe. (Icky if so)

Now I start to think that Mark's double-pass VFD suggestion would do basically this (in one way or another), but maintaining the normal data layout rather than writing a special dataset representing the compressed data.
step 1 : Data is collected into chunks (if already aligned with domain decomposition, no-op), chunks are compressed.
step 2 : Sizes of chunks are exchanged and space is allocated in the file for all the chunks.
step 3 : chunks of compressed data are written
not sure two passes are actually needed, as long as the 3 steps are followed.

...but variable chunk sizes are not allowed in hdf (true or false?) - this seems like a showstopper.
Aha. I understand. The actual written data can/could vary in size, as long as the chunk indices, as they refer to the original dataspace, are regular. Yes?

  Yes.

JB
Please forgive my thinking out aloud

  Not a problem - please continue to participate!

    Quincey


Hmm. That's only true if 'transaction' is whole-file scope, right? I mean,
aren't you going to allow the application to decide what 'granularity' a
transaction should be: a single dataset, a bunch of datasets in a group
in the file, etc.?

If the scope of a 'transaction' is only the whole file, then...

I may be misunderstanding your notions here but I don't think you'd want
to design this around the assumption that a 'transaction' could embody
something that included all buffer pointers passed into HDF5 by the caller
and then HDF5 could automagically FINISH the transaction on behalf of
the application without returning control back to the application.

I think there are going to be too many situations where applications
unwind their own internal data structures, placing data into temporary
buffers that are then handed off to HDF5 for I/O and freed. And, for a
given HDF5 file, this likely happens again and again as different parts
of the application's internal data are spit out to HDF5. But, not to
worry.

My idea included the notion that the application would have to re-engage
in all such 'data prep for I/O' processes a second time. I assume the time
to complete such a process, relative to actual I/O time, is small enough
that it doesn't matter to the application that it has to do it twice. I
think for most applications that would be true, and relatively easy to
engineer as two passes over the work.

Mark


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

  BTW, if we go down this "transaction" path, it allows the HDF5
library to push the fault tolerance up to the application level - the
library could guarantee that the atomicity of what was "visible" in
the file was an entire checkpoint, rather than the atomicity being on
a per-API call basis.

Hmm. That's only true if 'transaction' is whole-file scope, right? I mean,
aren't you going to allow the application to decide what 'granularity' a
transaction should be: a single dataset, a bunch of datasets in a group
in the file, etc.?

Careful fellas... you'll end up implementing a good part of
conventional database transactions and their ACID guarantees before
you're done. And you won't have the benefit of SQL as a lingua
franca. If you want fancy transaction semantics, why not just use a
database vendor with a particularly rich BLOB API?

99% tongue-in-cheek,
Rhys