Reading many small datasets

We read a number of data arrays out of HDF files, and each of these datasets
(e.g. 500x500 arrays of floats) is described by 6 very small datasets, each
a single float for height, width, originx, originy, cell height, cell width.
The HDF file is typically on an NFS (or CIFS) mount. We read from both Linux
& Windows. Opening and closing each of these tiny datasets takes enough time
that it approaches the time taken to read the entire grid itself (only a
megabyte or less, typically).

Is there a faster way to read a batch of small datasets than successive
calls to H5Dopen2, H5Dclose, and H5Dread? Obviously these grid dimensions
could be put in a single struct and read that way, but the simulator that
produces this metadata can't be effectively changed.
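
Concretely, each grid's metadata read currently looks something like the
following sketch (the dataset names here are only illustrative):

#include "hdf5.h"

/* Read one single-float dataset by name; names are hypothetical. */
static float read_scalar(hid_t file, const char *name)
{
    float v = 0.0f;
    hid_t dset = H5Dopen2(file, name, H5P_DEFAULT);
    H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, &v);
    H5Dclose(dset);
    return v;
}

/* Six open/read/close round trips per grid; on an NFS mount each
   one pays network latency comparable to reading the grid itself. */
static void read_grid_metadata(hid_t file, float meta[6])
{
    const char *names[6] = { "/grid0/height",      "/grid0/width",
                             "/grid0/originx",     "/grid0/originy",
                             "/grid0/cell_height", "/grid0/cell_width" };
    for (int i = 0; i < 6; i++)
        meta[i] = read_scalar(file, names[i]);
}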

Many thanks,

Sebastian Good

You say you can't change the data producer. But are you allowed to
change (add to) the file? If so, write a small tool that gathers all
those tiny datasets up into one large one and writes that alternate form
back to the file in a way that won't collide with the data already in
the file or with other apps that might also need to read the original file --
maybe as another dataset, maybe as an attribute of some existing object.

If you can't change the file, then write the data to a separate
(companion) file. In either case, change your app to FIRST go look
for the 'condensed' form of the tiny dataset info before falling back to
the currently slow (brute force) approach.
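
For illustration, the condensing tool could be as simple as this sketch
(dataset names are hypothetical and error checking is omitted):

#include "hdf5.h"

/* One-time pass: gather the six tiny scalar datasets into a single
   6-element dataset that a reader can fetch with one open/read/close. */
void condense(const char *path)
{
    const char *names[6] = { "/grid0/height",      "/grid0/width",
                             "/grid0/originx",     "/grid0/originy",
                             "/grid0/cell_height", "/grid0/cell_width" };
    float buf[6];
    hid_t file = H5Fopen(path, H5F_ACC_RDWR, H5P_DEFAULT);

    for (int i = 0; i < 6; i++) {
        hid_t d = H5Dopen2(file, names[i], H5P_DEFAULT);
        H5Dread(d, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, &buf[i]);
        H5Dclose(d);
    }

    hsize_t dims[1] = { 6 };
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/grid0/condensed_meta", H5T_NATIVE_FLOAT,
                             space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
}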

That's all I can think of. It would be much better, of course, to change
the data producer.

···

On Wed, 2009-08-26 at 16:18, Sebastian Good wrote:


--
Mark C. Miller, Lawrence Livermore National Laboratory
email: mailto:miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

I guess I could have been a little clearer here. If the approach I
suggest is possible, then the benefit is that this 'gather up the small
bits' process can occur 'in the background' BEFORE someone actually
needs to open and read the file with your app. And, furthermore, the
price is only paid once. But it is nonetheless a hassle and a bit
cumbersome. Implicit in my response is that I myself know of no way to
avoid the long succession of H5Dopen/H5Dread/H5Dclose calls.

···

On Wed, 2009-08-26 at 16:11, Mark Miller wrote:


I'd be interested to learn what "can't be effectively changed" means.
So you could change it, but it wouldn't be efficient?

  Werner

···

On Thu, 27 Aug 2009 01:18:09 +0200, Sebastian Good <sebastian@palladiumconsulting.com> wrote:


--
___________________________________________________________________________
Dr. Werner Benger <werner@cct.lsu.edu> Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Good question. We could change it, but it's likely the development effort would be better spent elsewhere. I was hoping for a cheap trick to open multiple datasets.

The simulation uses a finite element mesh, and the datasets I'm talking about are slices from it. It would be better to read hyperslabs from the mesh itself instead of all these little extracts (they are produced for historical reasons). So if we were going to open up the simulator to be friendlier, we'd probably rather drop the slice exports and just read from the mesh. That's the long answer :-)
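
For what it's worth, the hyperslab read we'd rather be doing looks
roughly like this (the mesh dataset name and layout are hypothetical):

#include "hdf5.h"

/* Read one 500x500 plane out of a (nslices x 500 x 500) mesh dataset
   with a hyperslab selection, instead of a separate exported dataset. */
void read_slice(hid_t file, hsize_t slice, float *out /* 500*500 floats */)
{
    hid_t dset   = H5Dopen2(file, "/mesh/values", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    /* Select plane `slice` in the file. */
    hsize_t start[3] = { slice, 0, 0 };
    hsize_t count[3] = { 1, 500, 500 };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* Matching 500x500 memory space. */
    hsize_t mdims[2] = { 500, 500 };
    hid_t mspace = H5Screate_simple(2, mdims, NULL);

    H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, out);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
}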

···

On Aug 26, 2009, at 8:12 PM, Werner Benger wrote:


Hm, yes, historical code is a good reason for many limitations.

What I was wondering is whether it might be easy enough to create a
dataset with a variable (unlimited) dimension and append all these
little data fragments to that single one.

Admittedly, I've never done that myself and don't know about its
performance, but maybe it is sufficiently doable and would improve the
reading performance...?
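
Roughly, I am imagining something like this sketch (untested; the
dataset name and chunk size are just placeholders):

#include "hdf5.h"

/* Create an empty 1-D float dataset with an unlimited dimension;
   extendible datasets must be chunked. */
hid_t create_extendible(hid_t file)
{
    hsize_t dims[1]    = { 0 };
    hsize_t maxdims[1] = { H5S_UNLIMITED };
    hsize_t chunk[1]   = { 64 };

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);

    hid_t dset = H5Dcreate2(file, "/all_grid_meta", H5T_NATIVE_FLOAT,
                            space, H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

/* Append n floats at offset *cur, growing the dataset as needed. */
void append(hid_t dset, const float *vals, hsize_t n, hsize_t *cur)
{
    hsize_t newsize[1] = { *cur + n };
    H5Dset_extent(dset, newsize);

    hid_t fspace = H5Dget_space(dset);
    hsize_t start[1] = { *cur }, count[1] = { n };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, vals);

    H5Sclose(mspace);
    H5Sclose(fspace);
    *cur += n;
}

The reader would then get all the fragments back with a single H5Dread
instead of one open/read/close per fragment.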

···

On Thu, 27 Aug 2009 04:01:06 +0200, Sebastian Good <sebastian@palladiumconsulting.com> wrote:
