Indexing and fixed number of groups

Hi all,

I'm a long-time user of HDF5 (mostly via the Enzo project), but new to
optimizing and really taking advantage of the features of the library.
We're seeing substantial performance problems at present, and we're
attempting to narrow them down. As a bit of background, the data in our
files is structured such that we have top-level groups (of the form
/Grid00000001, /Grid00000002, etc.), off of each group hang a fixed
number of datasets, and the files themselves are write-once, read-many.
We're in the position that we know in advance exactly how many groups
we have and how many datasets hang off of each group (or at least a
reasonable upper bound), and all of our data is written straight out
with no chunking.
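
For concreteness, the layout looks roughly like the sketch below (h5py
here just for brevity; the field names and array sizes are illustrative
placeholders, not our actual schema):

    import numpy as np
    import h5py

    # Sketch of the layout described above: a fixed, known-in-advance set of
    # top-level groups, each holding the same fixed set of contiguous datasets.
    # "Density"/"Temperature" and the sizes are placeholders, not the real schema.
    n_grids = 300
    field_names = ["Density", "Temperature"]   # in practice ~40 datasets per grid

    with h5py.File("sample.h5", "w") as f:
        for i in range(1, n_grids + 1):
            grp = f.create_group("/Grid%08i" % i)
            for name in field_names:
                data = np.random.random((20, 20, 20))   # a small 3D block, ~10^4 elements
                # Contiguous storage (no chunking, no compression), written once.
                grp.create_dataset(name, data=data)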

What we've found lately is that about 30% of the time to read a dataset
is spent just opening the individual grids; the remainder is the actual
calls to read the data. My naive guess at the source of this behavior
is that opening the groups involves reading a potentially distributed
index. Given our particular situation -- a fixed number of groups and
datasets, and inviolate data on disk -- is there a particular mechanism
or parameter we could set to speed up access to the groups and datasets?

Thanks for any ideas,

Matt

Hi Matt,

  There's no distributed index, really: each group just has a heap with the link info in it and a B-tree that indexes the links. How large are the files you are accessing? Are you using serial or parallel access to them? What system/file system are you using?

  Quincey

Hi Quincey,

Thanks for your reply! Here's some more detailed info about the files.
The tests have been conducted on a couple of file systems -- local disk
(ext4), NFS, and Lustre. The files themselves are about 150-250 MB,
though we often see much larger ones. Each file has on the order of
200-300 groups (grids), each of which has ~40 datasets (all of which
share roughly the same set of names). The datasets themselves are
somewhat small -- we have both 3D and 1D datasets, and the 3D datasets
contain ~10,000 elements on average (so, assuming double precision,
each is only on the order of 80 KB, and a file holds roughly 10,000
small objects in total). All access (read and write) is done in serial
to a single file.

I suppose my question was ill-posed; what I was wondering is whether
there is any way to speed up group opening. The numbers I quoted
earlier, about 30% for grid opening, are a middling case. In some cases
(on local disk, running over the same file 100 times and averaging) we
actually see that opening and closing the groups takes about 50-60% of
the time it takes to open the groups and read the data. A broader
question, I suppose, is: should I be surprised by this? Or is it to be
expected, given how HDF5 operates (and all of the utility it provides)?
Is there low-hanging fruit I should be addressing in how we handle the
data?
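
(For reference, the open/close-vs-read split was measured roughly along
these lines; this is a sketch of the approach rather than the actual
harness, and the file name is a placeholder.)

    import time
    import h5py

    # Rough sketch of the measurement: time opening/closing every group by
    # itself, then time opening the groups plus reading every dataset in full.
    # "sample.h5" is a placeholder file name.
    fn = "sample.h5"
    n_runs = 100

    t_open = 0.0
    t_read = 0.0
    for _ in range(n_runs):
        with h5py.File(fn, "r") as f:
            names = list(f.keys())

            t0 = time.perf_counter()
            for name in names:
                g = f[name]    # opens the group
                del g          # drops the handle; h5py closes the underlying id
            t_open += time.perf_counter() - t0

            t0 = time.perf_counter()
            for name in names:
                g = f[name]
                for dset in g:
                    arr = g[dset][...]   # open each dataset and read it whole
            t_read += time.perf_counter() - t0

    print("group open/close: %.3f s, open + read: %.3f s over %d runs"
          % (t_open, t_read, n_runs))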

Thanks so much for any ideas you might have,

Matt

Hi Matt,

  I would think that this should be faster... Do you have a test program and file(s) that I could profile with?

  Quincey

Hi Quincey,

Sorry for the delay over the weekend. I've gone ahead and posted a
sample dataset, and I used this code:

http://paste.yt-project.org/show/1961/

to open it. Just now I ran this and received:

Read: 0 with time 53.893453 over 1000 runs
Read: 1 with time 197.589276 over 1000 runs

Thanks for any ideas you might have. We control both the creation and
the consumption of these files, for the most part, so we're eager for
solutions that would help on either end. (And if there's anything
particularly bad or naive in my code, pointing that out would be
appreciated, too!) As a side note, we usually use either C++ or h5py to
read the files for analysis, which is where we get hit particularly
hard.
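
(The analysis-side access pattern is roughly the sketch below -- one
grid at a time, reading a few named fields from it; the field names are
placeholders, not our real ones.)

    import h5py

    # Sketch of the analysis-side pattern: open a single grid's group, then
    # read a few named fields from it in full. Field names are placeholders.
    with h5py.File("sample.h5", "r") as f:
        grid = f["/Grid00000001"]              # the group open that shows up in profiles
        density = grid["Density"][...]         # contiguous read of one small dataset
        temperature = grid["Temperature"][...]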

Best,

Matt
