HDF5 file layout and quick access to metadata

Lately I have been using HDF5's buffered write (backing store) feature to
write multiple time levels to HDF5 files. Each time level is its own group
(named with a zero-padded character string), and the 3D floating-point
variables are members of each group.
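
The write pattern looks roughly like this (a minimal sketch; the group and
variable names are made up):

#include <hdf5.h>
#include <stdio.h>

/* Sketch: one zero-padded group per time level, 3D float datasets inside. */
void write_time_level(hid_t file, int step, const float *u,
                      hsize_t nx, hsize_t ny, hsize_t nz)
{
    char gname[16];
    snprintf(gname, sizeof(gname), "/%05d", step);  /* e.g. /00042 */

    hid_t group = H5Gcreate2(file, gname, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims[3] = { nz, ny, nx };
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dset = H5Dcreate2(group, "u", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, u);

    H5Dclose(dset);
    H5Sclose(space);
    H5Gclose(group);
}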

My concern - perhaps unfounded - is that the very small bits of what I call
metadata (integers, lists of the variables in the file, and other small
items I write that describe the 3D data and are necessary for my reader
code) will be placed after the huge 3D data, so that accessing them will
require long seeks through the 3D arrays. The only reason I am worried is
that running h5dump on one of these small metadata datasets took more than
10 seconds to output the data on one of my files. I got the impression that
h5dump was perhaps having to make its way through the 3D arrays before
reaching the metadata. However, my C code seemed to access the metadata
quickly, so perhaps it's an issue with h5dump.

So I guess my question is: should I not worry about the order in which data
is written to an HDF5 file, and assume the layout is intelligent enough
that small structures/arrays/integers etc. will be quickly accessible? If
not, how do I force the small stuff to the beginning of the file so it can
be read quickly? I will be looking at thousands of files, each tens of GB
in size and containing possibly dozens of groups (each holding dozens of 3D
floating-point arrays), so I am looking for every way to squeeze out the
fastest I/O I can.

Leigh


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Earth and Atmospheric Science
Central Michigan University

Leigh Orf wrote:
> [...] I got the impression that h5dump was perhaps having to make its way
> through the 3D arrays before reaching the metadata. However, my C code
> seemed to access the metadata quickly, so perhaps it's an issue with
> h5dump.

I hit the same issue when inspecting metadata in some files. `h5dump -H`
(display header information only, no data) was what I needed.
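
For example (file name made up):

h5dump -H somefile.h5

This prints the file structure and dataset headers without dumping any of
the data values.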

- Daniel

Daniel,

But the data is exactly what I want to see when I use h5dump, not the
header. If I just want to traverse the structure, I'll use h5ls -rv, which
I find to be lightning fast.

For instance, I store grid and mesh data in the groups /grid and /mesh -
these hold info like the grid spacing, the Cartesian locations of the
gridpoints, and the x, y, and z sizes of the 3D data in the file. A few
hundred bytes at most.
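
The writes look something like this (a minimal sketch assuming an open file
handle `file`; the dataset name and values are made up):

/* Sketch: the /mesh group holds a handful of tiny datasets. */
hid_t mesh = H5Gcreate2(file, "/mesh", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

float spacing[3] = { 100.0f, 100.0f, 50.0f };  /* dx, dy, dz (made up) */
hsize_t dims[1] = { 3 };
hid_t space = H5Screate_simple(1, dims, NULL);
hid_t dset = H5Dcreate2(mesh, "spacing", H5T_NATIVE_FLOAT, space,
                        H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, spacing);

H5Dclose(dset);
H5Sclose(space);
H5Gclose(mesh);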

Sometimes I just want to run

h5dump -g /mesh

What I find is that this can take 10 or so seconds to complete on a ~2 GB
file that is 99.999% 3D floating-point data.

I am worried that perhaps the mesh data (which I write before the 3D data)
somehow ends up at the end of the file. It could just be an issue with
h5dump. I want to be sure before I write out a few hundred TB of data for
analysis.

FWIW, I have looked at http://www.hdfgroup.org/HDF5/doc/H5.format.html, but
it's more of a spec sheet and I couldn't find anything addressing my
specific question.

Leigh


Hi Leigh,
  I'm guessing it's just a performance issue with how h5dump builds its internal data structures: it needs to walk the entire file up front in order to properly handle forward references (in hard/soft links and in reference datatypes), gather internal statistics, and so on. A normal file open doesn't traverse the entire file; it only loads metadata on demand.
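
  A targeted read from your own code should therefore stay fast regardless
of file size, e.g. (a minimal sketch; the file and dataset names are
assumed from your description):

hid_t file = H5Fopen("somefile.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
hid_t dset = H5Dopen2(file, "/mesh/spacing", H5P_DEFAULT);

/* Only the metadata needed to locate /mesh/spacing gets read here. */
float spacing[3];
H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, spacing);

H5Dclose(dset);
H5Fclose(file);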

  Quincey
