Mark,
Perhaps, although there will be a huge range of compression ratios in our
simulations. In many cases a bunch of the variables in a given file
contain literally identical floating point values throughout. In other
cases compression is much less effective, only 2:1 or 4:1 for
scale+offset+gzip. With that kind of range, I'm not sure it would be
worth the effort. I'll mull it over some more, though; there may be a
way to make it worthwhile.
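A toy sketch of why the ratio swings so widely, using Python's zlib as a stand-in for HDF5's deflate filter (the array contents here are invented for illustration; real fields sit somewhere between these two extremes):

```python
import random
import struct
import zlib

# One variable where every value is identical (the best case Leigh describes)
constant = struct.pack("<1000d", *([273.15] * 1000))

# One variable with essentially no repeated structure (the worst case)
random.seed(0)
noisy = struct.pack("<1000d", *[random.random() for _ in range(1000)])

# Compression ratio = uncompressed bytes / compressed bytes
ratio_constant = len(constant) / len(zlib.compress(constant))
ratio_noisy = len(noisy) / len(zlib.compress(noisy))
```

The constant field compresses by orders of magnitude; the noisy one barely compresses at all, which is exactly the range that makes a single "target ratio" hard to pick.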
Leigh
On Wed, Dec 15, 2010 at 12:13 PM, Mark Miller <miller86@llnl.gov> wrote:
Hi Leigh,
I guess I am still interested to know whether an approach of specifying
a minimum target compression ratio, and then allowing HDF5 to (possibly
over-)allocate assuming a maximum compressed size, would work for you?

Mark
On Wed, 2010-12-15 at 10:59, Leigh Orf wrote:
>
> On Tue, Dec 14, 2010 at 5:42 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:
> Hi Leigh,
>
>
> [snipped for brevity]
>
> > Quincey,
> >
> > Probably a combination of both, namely, an ideal situation
> > would be a group of MPI ranks collectively writing one
> > compressed HDF5 file. On Blue Waters a 100kcore run with 32
> > cores/MCM could therefore result in say around 3000 files,
> > which is not unreasonable.
> >
> > Maybe I'm thinking about this too simply, but couldn't you
> > compress the data on each MPI rank, save it in a buffer,
> > calculate the space required, and then write it? I don't know
> > enough about the internal workings of hdf5 to know whether
> > that would fit in the HDF5 model. In our particular
> > application on Blue Waters, memory is cheap, so there is
> > lots of space in memory for buffering data.
> >
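The compress-then-measure-then-write idea above, sketched in Python with zlib standing in for the HDF5 filter pipeline (the data contents and the use of a temporary file are invented for illustration):

```python
import os
import struct
import tempfile
import zlib

# Hypothetical per-rank data: one variable's values on this MPI rank.
data = struct.pack("<10000d", *([1.0] * 10000))

buf = zlib.compress(data)  # compress into a memory buffer first...
nbytes = len(buf)          # ...so the exact space required is now known

# Only now touch the disk, writing exactly nbytes in a single operation.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(buf)
written = os.path.getsize(path)
os.remove(path)
```

This is the easy half of the problem; the hard half (which Quincey describes next) is that in a shared file each rank must also learn where its buffer goes before it can write.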
>
>
> What you say above is basically what happens, except that
> space in the file needs to be allocated for each block of
> compressed data. Since each block is not the same size, the
> HDF5 library can't pre-allocate the space or algorithmically
> determine how much to reserve for each process. In the case
> of collective I/O, at least it's theoretically possible for
> all the processes to communicate and work it out, but I'm not
> certain it's going to be solvable for independent I/O, unless
> we reserve some processes to either allocate space (like a
> "free space server") or buffer the "I/O", etc.
>
> Could you make this work by forcing each core to have some specific
> chunking arrangement? For instance, you could make each core's
> subdomain dimensions the same as the chunk dimensions, which actually
> works out pretty well in my application, at least in the horizontal. I
> typically have nxchunk=nx, nychunk=ny, and nzchunk to be something
> like 20 or so. But - now that I think about it, even if that were the
> case you don't know the size of the compressed chunks until you've
> compressed them and you'd still need to communicate the size of the
> compressed chunks amongst cores writing to an individual file.
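What that size-sharing step could look like, sketched in plain Python: an exclusive prefix sum over the gathered compressed sizes yields each rank's write offset. The MPI_Allgather is simulated here by simply holding all four ranks' results in one list; chunk contents are invented.

```python
import itertools
import random
import struct
import zlib

random.seed(1)

# Pretend four ranks each compressed one chunk; sizes come out unequal.
chunks = []
for rank in range(4):
    if rank % 2 == 0:
        vals = [float(rank)] * 500                    # compresses very well
    else:
        vals = [random.random() for _ in range(500)]  # barely compresses
    chunks.append(zlib.compress(struct.pack("<500d", *vals)))

sizes = [len(c) for c in chunks]  # what an MPI_Allgather would exchange

# Exclusive prefix sum of sizes -> each rank's byte offset in the file.
offsets = [0] + list(itertools.accumulate(sizes))[:-1]
total = sum(sizes)  # total space to allocate for this dataset
```

One collective exchange of a single integer per rank is all the coordination the write placement needs, which is the "bit of overhead" mentioned below.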
>
> I don't know enough about hdf5 to understand how the preallocation
> process works. It sounds like you are allocating a bunch of zeroes (or
> something) on disk first, and then doing I/O straight to that space on
> disk? If this is the case then I can see how this necessitates some
> kind of collective communication if you are splitting up compression
> amongst MPI ranks.
>
> Personally I am perfectly happy with a bit of overhead which forces
> all cores to share amongst themselves what the compressed block size
> is before writing if it means we can do compression. Right now I see
> my choices as being (1) compression, but one file per MPI rank and
> therefore lots of files, or (2) no compression and fewer files, but
> perhaps compressing later with h5repack, called in parallel, one
> h5repack per MPI rank as a post-processing step (yuck!).
>
> I'm glad you're working on this, personally I think this is important
> stuff for really huge simulations. In talking to other folks who will
> be using Blue Waters, compression is not much of an issue for many of
> them because of the nature of their data; cloud data, on the other
> hand, tends to compress very well. It would be a shame to fill
> terabytes of disk space with zeroes! I am sure we can still carry out our research
> objectives without compression, but the sheer amount of data we will
> be producing is staggering even with compression.
>
> Leigh
>
>
> Quincey
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> Hdf-forum@hdfgroup.org
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>
>
>
>
> --
> Leigh Orf
> Associate Professor of Atmospheric Science
> Department of Geology and Meteorology
> Central Michigan University
> Currently on sabbatical at the National Center for Atmospheric
> Research in Boulder, CO
> NCAR office phone: (303) 497-8200
--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511
--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200