Replying to multiple comments at once.
Quincey : "multiple processes may be writing into each chunk, which MPI-I/O can handle when the data is not compressed, but since compressed data is context-sensitive"
My initial use case would be much simpler. A chunk would be aligned with the boundaries of the domain decomposition and each process would write one chunk, one at a time. A compression filter would be applied by the process owning the data, and the result would then be written to disk (much like Mark's suggestion).
a) Lossless. Problem understood: chunks varying in size, nasty metadata synchronization, sparse files, and related issues.
b) Lossy. Seems feasible. We were in fact considering a wavelet-type compression as a first pass (pun intended). "It's great from the perspective that it completely eliminates the space allocation problem" - absolutely. All chunks are known to be of size X beforehand, so nothing changes except for the indexing and the actual chunk storage/retrieval plus de/compression.
I also like the idea of using lossless compression and having the I/O operation fail if the data doesn't fit. That would give the user the chance to compress as best they can, with some knowledge of the data type, and to abort if the result doesn't fit the allocated space.
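Something along these lines is what I have in mind - just a rough sketch, with zlib standing in for whatever filter is chosen and CHUNK_BUDGET a made-up name for the fixed space allocated per chunk:

#include <stdio.h>
#include <zlib.h>

#define CHUNK_BUDGET (1 << 20)  /* hypothetical fixed space allocated per chunk (1 MiB) */

/* Compress one chunk into the fixed budget; return 0 on success, -1 if the
 * compressed data doesn't fit (the caller would then abort the write). */
static int compress_into_budget(const unsigned char *src, size_t src_len,
                                unsigned char *dst /* CHUNK_BUDGET bytes */,
                                size_t *dst_len)
{
    uLongf out_len = CHUNK_BUDGET;
    int rc = compress2(dst, &out_len, src, (uLong)src_len, Z_BEST_COMPRESSION);

    if (rc == Z_BUF_ERROR) {
        fprintf(stderr, "chunk won't fit in %d bytes - aborting write\n", CHUNK_BUDGET);
        return -1;
    }
    if (rc != Z_OK)
        return -1;

    *dst_len = (size_t)out_len;
    return 0;
}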
Mark : Multi-pass VFD. I like this too. It potentially allows a very flexible approach where, even if collective I/O is writing to the same chunk, the collection/compression phase can do the sums and pass the info up into the HDF5 metadata layer. We'd certainly need to extend the chunking interface to handle variable-sized chunks, to allow for more/less compression in different areas of the data (actually this would be true for any option involving lossless compression). I think the chunk hashing relies on all chunks being the same size, so any change to that is going to be a huge compatibility breaker. Also, the chunking layer sits on top of the VFD, so I'm not sure the VFD would be able to manipulate the chunks in the way desired. Perhaps I'm mistaken and the VFD does see the chunks - correct me either way.
Quincey : One idea I had, and which I think Mark also expounded on, is this: each process takes its own data and compresses it as it sees fit, then the processes do a synchronization step to tell each other how much (newly compressed) data they have, and then a dataset create is called using the size of the compressed data. Now each process creates a hyperslab for its piece of compressed data and writes into the file using collective I/O. We then add an array of extent information and compression-algorithm info to the dataset as an attribute, where each entry has the start and end index of the data for each process.
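A rough sketch of how that could look, assuming the file was opened with an MPI-IO file access property list (H5Pset_fapl_mpio); the dataset and attribute names are made up, zlib stands in for whatever each process picks, and the attribute here stores per-rank compressed lengths, from which the start/end indices follow by a prefix sum:

#include <mpi.h>
#include <hdf5.h>
#include <zlib.h>
#include <stdlib.h>

void write_compressed(MPI_Comm comm, hid_t file,
                      const unsigned char *raw, size_t raw_len)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* 1. Compress the local block (zlib here; any filter would do). */
    uLongf clen = compressBound((uLong)raw_len);
    unsigned char *cbuf = malloc(clen);
    compress2(cbuf, &clen, raw, (uLong)raw_len, Z_DEFAULT_COMPRESSION);

    /* 2. Exchange compressed sizes so every rank knows its offset and the total. */
    unsigned long long mylen = clen, *lens = malloc(nprocs * sizeof *lens);
    MPI_Allgather(&mylen, 1, MPI_UNSIGNED_LONG_LONG,
                  lens, 1, MPI_UNSIGNED_LONG_LONG, comm);

    unsigned long long offset = 0, total = 0;
    for (int i = 0; i < nprocs; i++) {
        if (i < rank) offset += lens[i];
        total += lens[i];
    }

    /* 3. Create one 1-D byte dataset sized to the total compressed length. */
    hsize_t dims[1] = { (hsize_t)total };
    hid_t fspace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "compressed_blob", H5T_NATIVE_UCHAR,
                            fspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* 4. Each rank selects its hyperslab and writes collectively. */
    hsize_t start[1] = { (hsize_t)offset }, count[1] = { (hsize_t)mylen };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_UCHAR, mspace, fspace, dxpl, cbuf);

    /* 5. Record the per-rank compressed lengths as an attribute for the read side. */
    hsize_t adims[1] = { (hsize_t)nprocs };
    hid_t aspace = H5Screate_simple(1, adims, NULL);
    hid_t attr = H5Acreate2(dset, "compressed_extents", H5T_NATIVE_ULLONG,
                            aspace, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_ULLONG, lens);

    H5Aclose(attr); H5Sclose(aspace); H5Pclose(dxpl);
    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
    free(cbuf); free(lens);
}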
Now the only trouble is that reading the data back requires a double step of reading the attributes and then decompressing the desired piece - quite nasty when odd slices are being requested.
Now I start to think that Mark's two-pass VFD suggestion would do basically this (in one way or another), but maintaining the normal data layout rather than writing a special dataset representing the compressed data.
Step 1 : data is collected into chunks (a no-op if already aligned with the domain decomposition), and the chunks are compressed.
Step 2 : sizes of the chunks are exchanged and space is allocated in the file for all the chunks (sketched just below).
Step 3 : the chunks of compressed data are written.
I'm not sure two passes are actually needed, as long as these three steps are followed.
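For step 2 in isolation, something like this exclusive prefix sum over the exchanged sizes would do (assuming one compressed chunk per rank; base_addr is a placeholder for wherever the file space actually gets allocated):

#include <mpi.h>

/* Given this rank's compressed chunk size, compute where its chunk should land
 * in the file and how much total space must be allocated. */
void plan_chunk_layout(MPI_Comm comm, unsigned long long my_size,
                       unsigned long long base_addr,
                       unsigned long long *my_offset,
                       unsigned long long *total_size)
{
    unsigned long long before_me = 0;

    /* Exclusive prefix sum: bytes of compressed data owned by lower ranks. */
    MPI_Exscan(&my_size, &before_me, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);

    /* MPI_Exscan leaves rank 0's output undefined; it has nothing before it. */
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        before_me = 0;

    *my_offset = base_addr + before_me;

    /* Everyone needs the grand total so the space can be allocated once. */
    MPI_Allreduce(&my_size, total_size, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);
}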
...but variable chunk sizes are not allowed in HDF5 (true or false?) - this seems like a showstopper.
Aha, I understand. The actual written data can vary in size, as long as the chunk indices referring to the original dataspace are regular. Yes?
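If I'm not mistaken, that's exactly what the deflate filter already does in the serial case: the chunk grid over the dataspace stays regular while each stored chunk occupies however many bytes it compressed down to. So I'll answer my own question with a minimal serial sketch (names and sizes arbitrary):

#include <hdf5.h>

/* Regular 64x64 chunk grid over a 256x256 dataset; the on-disk size of each
 * chunk varies with how well its contents compress, but the chunk indices in
 * the dataspace stay fixed. */
void create_deflated_dataset(hid_t file)
{
    hsize_t dims[2]  = { 256, 256 };
    hsize_t chunk[2] = { 64, 64 };

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);          /* gzip level 6 */

    hid_t dset = H5Dcreate2(file, "regular_grid_variable_storage",
                            H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
}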
JB
Please forgive my thinking out loud.
-----Original Message-----
From: hdf-forum-bounces@hdfgroup.org [mailto:hdf-forum-bounces@hdfgroup.org] On Behalf Of Mark Miller
Sent: 22 February 2011 23:43
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] New "Chunking in HDF5" document
On Tue, 2011-02-22 at 14:06, Quincey Koziol wrote:
Well, as I say above, with this approach, you push the space
allocation problem to the dataset creation step (which has its own
set of problems),
Yeah, but those 'problems' aren't new to parallel I/O issues. Anyone
that is currently doing concurrent parallel I/O with HDF5 has had to
already deal with this part of the problem -- space allocation at
dataset creation -- right? The point is the caller of HDF5 then knows
how big it will be after it's been compressed and HDF5 doesn't have to
'discover' that during H5Dwrite. Hmm puzzling...
I am recalling my suggestion of a '2-pass-planning' VFD where the caller
executes a slew of HDF5 operations on a file TWICE. On the first pass, HDF5
doesn't do any of the actual raw data I/O but just records all the
information about it for a 'repeat performance' second pass. In the
second pass, HDF5 knows everything about what is 'about to happen' and
then can plan accordingly.
What about maybe doing that on a dataset-at-a-time basis? I mean, what
if you set dxpl props to indicate either 'pass 1' or 'pass 2' of a
2-pass H5Dwrite operation. On pass 1, between H5Dopen and H5Dclose,
H5Dwrites don't do any of the raw data I/O but do apply filters and
compute sizes of things it will eventually write. On H5Dclose of pass 1,
all the information of chunk sizes is recorded. Caller then does
everything again, a second time but sets 'pass' to 'pass 2' in dxpl for
H5Dwrite calls and everything 'works' because all processors know
everything they need to know.
Maybe HDF5 could expose an API routine that the application could
call, to pre-compress the data by passing it through the I/O filters?
I think that could be useful in any case. Like it's now possible to apply
type conversion to a buffer of bytes, it probably ought to be possible
to apply any 'filter' to a buffer of bytes. The second half of this
though would involve smartening HDF5 then to 'pass-through' pre-filtered
data so the result is 'as if' HDF5 had done the filtering work itself during
H5Dwrite. Not sure how easy that would be.
But, you asked for comments/input.
Quincey
--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org