H5Oget_info performance with large number of chunks

Hi,

I'm trying to understand a performance hit that we are
experiencing when examining the tree structure of
our HDF5 files. We originally observed the problem when
using h5py, but it can be reproduced even with the h5ls
command. I tracked it down to a significant delay in
the call to the H5Oget_info_by_name function on a dataset
with a large number of chunks. It looks like when the
number of chunks in a dataset increases (in our case
we have 1-10k chunks), the performance of H5Oget_info
drops significantly. Looking at the I/O statistics, it
seems that the HDF5 library does very many small I/O operations
in this case. Very little CPU time is spent, but the real
time is measured in tens of seconds.

Is this expected behavior? Can it be improved somehow
without drastically reducing the number of chunks?

One more comment about H5Oget_info: it returns a
structure that contains a lot of different info.
In the h5py code, the only member of the structure
that is used is "type". Could there be a
more efficient way to determine just the type of the
object without requiring every other piece of info?

Regards,
Andy
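
A note on the measurement: the pattern described above (little CPU time, long real time) can be checked by timing a call both ways. The following is a minimal sketch in C11, not code from h5py or h5ls; `timed_call`, `timing_t`, `work_fn`, and the `spin` workload are hypothetical names used only for illustration. In practice the callback would wrap the call under suspicion, for example H5Oget_info_by_name.

```c
#include <time.h>

/* Hypothetical helper types for illustration; not part of HDF5. */
typedef void (*work_fn)(void *arg);

typedef struct {
    double wall_seconds; /* elapsed real time */
    double cpu_seconds;  /* processor time used by this process */
} timing_t;

/* Run fn(arg) once and report wall-clock and CPU time separately. */
timing_t timed_call(work_fn fn, void *arg)
{
    struct timespec w0, w1;
    clock_t c0, c1;
    timing_t t;

    timespec_get(&w0, TIME_UTC);
    c0 = clock();
    fn(arg);
    c1 = clock();
    timespec_get(&w1, TIME_UTC);

    t.wall_seconds = (double)(w1.tv_sec - w0.tv_sec)
                   + (double)(w1.tv_nsec - w0.tv_nsec) / 1e9;
    t.cpu_seconds = (double)(c1 - c0) / CLOCKS_PER_SEC;
    return t;
}

/* Stand-in workload: sums 0..iterations-1 so there is something to time. */
typedef struct {
    unsigned long iterations;
    unsigned long result;
} spin_arg;

void spin(void *arg)
{
    spin_arg *s = (spin_arg *)arg;
    volatile unsigned long acc = 0;
    unsigned long i;

    for (i = 0; i < s->iterations; ++i)
        acc += i;
    s->result = acc;
}
```

Calling `timed_call(spin, &arg)` fills `arg.result` and returns both times. When `wall_seconds` is much larger than `cpu_seconds`, the process is mostly waiting rather than computing, which would be consistent with many small I/O operations.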

Hi Andy,

On Mar 8, 2011, at 7:09 PM, Salnikov, Andrei A. wrote: ···

  Ah, yes, we've noticed that too in some of the applications we've worked with (including some of the main HDF5 tools, such as h5ls). As you say, H5Oget_info() is fairly heavyweight, gathering all sorts of information about each object. I do think a lighter-weight call like "H5Oget_type" would be useful. Is there other "lightweight" information that people would like to get back for each object?

  Quincey

Hi Quincey,

Thanks for confirming this. Could you briefly explain what is
going on there and which part of H5O_info_t needs so many reads?
Maybe removing the heavyweight info from H5O_info_t is the right
thing to do, or creating another version of the H5O_info_t structure
that has only lightweight info?

Cheers,
Andy

Quincey Koziol wrote on 2011-03-09: ···

Hi Andy,

> Hi Quincey,
>
> thanks for confirming this. Could you explain briefly what is
> going on there and which part of H5O_info_t needs so many reads?

  The H5Oget_info() call gathers information about the amount of space that the dataset's metadata is using. When a large B-tree indexes the chunks, walking that B-tree can take a fair bit of time.

> Maybe removing heavyweight info from H5O_info_t is the right
> thing to do, or creating another version of H5O_info_t structure
> which has only light-weight info?

  I'm leaning toward another lightweight version. I'm asking the HDF5 community to help me decide what goes into that structure besides the object type.

  Quincey
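
To illustrate why the size accounting described above scales with the number of chunks while a type lookup need not, here is a minimal sketch in C. It is a toy model under stated assumptions, not the HDF5 on-disk format or library code: `toy_node`, `toy_object`, `toy_index_size`, and `toy_get_type` are hypothetical names, and each node visit stands in for one small read.

```c
#include <stddef.h>

/* Toy stand-in for a chunk-index tree node; not the HDF5 layout. */
typedef struct toy_node {
    size_t size_bytes;          /* space this node occupies */
    size_t nchildren;           /* 0 for a leaf */
    struct toy_node **children; /* NULL for a leaf */
} toy_node;

/* Toy object: the type lives in the header, the index hangs off it. */
typedef struct {
    int type;
    toy_node *index_root;
} toy_object;

/* Sum the storage used by the whole index. Every node has to be
 * visited, so *reads grows with the node count. */
size_t toy_index_size(const toy_node *node, size_t *reads)
{
    size_t total;
    size_t i;

    if (node == NULL)
        return 0;
    ++*reads; /* one simulated small read per node */
    total = node->size_bytes;
    for (i = 0; i < node->nchildren; ++i)
        total += toy_index_size(node->children[i], reads);
    return total;
}

/* Reading only the type never touches the index. */
int toy_get_type(const toy_object *obj)
{
    return obj->type;
}
```

In this model `toy_get_type` costs the same regardless of how many chunks exist, whereas `toy_index_size` performs one simulated read per index node.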

On Mar 9, 2011, at 11:15 AM, Salnikov, Andrei A. wrote: ···

Hi Quincey,

Is there a chance we can get this new version in the next release?

Cheers,
Andy

Quincey Koziol wrote on 2011-03-10: ···

Andy,

> Hi Quincey,
>
> is there a chance we can get this new version in the next release?

We actually already have an experimental branch with a similar feature mostly implemented. It lets you specify which fields you want H5Oget_info to fill in. The branch can be found at:

http://svn.hdfgroup.uiuc.edu/hdf5/branches/h5oget_info_by_field/

The new functions are:

herr_t H5Oget_info2(hid_t loc_id, H5O_info_t *oinfo, unsigned fields);
herr_t H5Oget_info_by_name2(hid_t loc_id, const char *name, H5O_info_t *oinfo, unsigned fields, hid_t lapl_id);

The "fields" parameter can contain the following bitflags (combined with "|"):

H5O_INFO_TIME
H5O_INFO_NUM_ATTRS
H5O_INFO_HDR
H5O_INFO_META_SIZE
H5O_INFO_ALL (==H5O_INFO_TIME | H5O_INFO_NUM_ATTRS | H5O_INFO_HDR | H5O_INFO_META_SIZE)

Passing these flags tells the library to fill in the corresponding fields in oinfo. The other fields are always filled in, because doing so carries no performance penalty. In your case, since you only need the type, you can just pass "0". h5ls has also been modified to use these functions, so it should be faster.
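
The flag-driven behaviour described above can be sketched in self-contained C. This is an illustration under stated assumptions, not the code from the experimental branch: the `DEMO_` flag names only mirror the names listed in this message, their numeric values are invented, and `demo_info_t`, `demo_object_t`, and `demo_get_info` are hypothetical. The loop stands in for the expensive index walk.

```c
#include <string.h>

/* Hypothetical flags mirroring the names above; values are assumed. */
#define DEMO_INFO_TIME      0x1u
#define DEMO_INFO_NUM_ATTRS 0x2u
#define DEMO_INFO_HDR       0x4u
#define DEMO_INFO_META_SIZE 0x8u
#define DEMO_INFO_ALL (DEMO_INFO_TIME | DEMO_INFO_NUM_ATTRS | \
                       DEMO_INFO_HDR | DEMO_INFO_META_SIZE)

typedef struct {
    int type;                /* always filled in: cheap */
    long mtime;              /* only with DEMO_INFO_TIME */
    unsigned num_attrs;      /* only with DEMO_INFO_NUM_ATTRS */
    unsigned long hdr_size;  /* only with DEMO_INFO_HDR */
    unsigned long meta_size; /* only with DEMO_INFO_META_SIZE */
} demo_info_t;

typedef struct {
    int type;
    long mtime;
    unsigned num_attrs;
    unsigned long hdr_size;
    unsigned long index_nodes;      /* nodes in the chunk index */
    unsigned long node_size;        /* bytes per index node */
    unsigned long expensive_visits; /* counts simulated node reads */
} demo_object_t;

/* Fill in only the requested fields; returns 0 on success, -1 on bad input. */
int demo_get_info(demo_object_t *obj, demo_info_t *info, unsigned fields)
{
    unsigned long i;

    if (obj == NULL || info == NULL)
        return -1;
    memset(info, 0, sizeof *info);
    info->type = obj->type; /* no flag needed */

    if (fields & DEMO_INFO_TIME)
        info->mtime = obj->mtime;
    if (fields & DEMO_INFO_NUM_ATTRS)
        info->num_attrs = obj->num_attrs;
    if (fields & DEMO_INFO_HDR)
        info->hdr_size = obj->hdr_size;
    if (fields & DEMO_INFO_META_SIZE) {
        /* Stand-in for walking the index: one visit per node. */
        for (i = 0; i < obj->index_nodes; ++i) {
            ++obj->expensive_visits;
            info->meta_size += obj->node_size;
        }
    }
    return 0;
}
```

With `fields` set to 0, the sketch returns the type and never enters the loop; with `DEMO_INFO_ALL` it visits every simulated index node.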

Of course, this is experimental code and should not be used in production, but if you're curious how much a lightweight H5Oget_info would help your performance, you're welcome to try it. If you do, we'd love to hear about your results, and also your thoughts on the interface. For maximum performance, you should configure the library with "--enable-production" (this is needed for this branch; it is not necessary for releases).

Thanks,
-Neil

On 03/11/2011 06:48 PM, Salnikov, Andrei A. wrote: ···

Hi Neil,

I managed to build this branch and test it. It has indeed improved
performance dramatically. As you suggested, I only used a zero value for the
fields argument; other values were not included in my test.
With that value, and checking only the "type" field in H5O_info_t, it
runs much faster than the previous version. 'h5ls' also works better on our
files.

What I find interesting is that there is no version of H5Oget_info_by_idx
that takes a "fields" argument. Is this function so different
from H5Oget_info and H5Oget_info_by_name that it cannot be optimized?

Even without H5Oget_info_by_idx2, I'd be happy to see this branch
included in the next release.

Cheers,
Andy

Neil Fortner wrote on 2011-03-14: ···

Andy,

> Hi Neil,
>
> I managed to build this branch and test it. It has indeed improved
> performance dramatically. As you suggested, I only used a zero value for the
> fields argument; other values were not included in my test.
> With that value, and checking only the "type" field in H5O_info_t, it
> runs much faster than the previous version. 'h5ls' also works better on our
> files.
>
> What I find interesting is that there is no version of H5Oget_info_by_idx
> that takes a "fields" argument. Is this function so different
> from H5Oget_info and H5Oget_info_by_name that it cannot be optimized?
>
> Even without H5Oget_info_by_idx2, I'd be happy to see this branch
> included in the next release.

Glad to hear it improved your performance! It would be easy to add H5Oget_info_by_idx2; we just didn't do that because we only did the minimum needed to test the performance in the case we were looking at, and stopped once we reached that point. We shelved the work because it didn't make a huge difference in that case, but given your report I will look into getting it scheduled sooner rather than later. There is a chance we may change the interface to something like what Quincey suggested. Thanks for taking the time to test this!

-Neil

On 03/20/2011 02:28 AM, Salnikov, Andrei A. wrote: ···

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Hi Neil,

Just reviving this old thread to see if there has been any progress
on this feature. Do you have an update on its status for
the upcoming 1.8.8 release?

Thanks,
Andy

Neil Fortner wrote on 2011-03-21: ···

Hi all,

I have not received any response to my last question to Neil.
Does anybody else know the status of this new feature
(H5Oget_info2 and H5Oget_info_by_idx2), and whether there is a chance
that we will see it in released code any time soon?

Cheers,
Andy

···

Salnikov, Andrei A. wrote on 2011-09-07:

Hi Neil,

just reviving this old thread to see if there was a progress
on this feature. Do you have an update on the status for
the upcoming 1.8.8 release?

Thanks,
Andy

Neil Fortner wrote on 2011-03-21:

Andy,

On 03/20/2011 02:28 AM, Salnikov, Andrei A. wrote:

Neil Fortner wrote on 2011-03-14: >>>> Andy, >>>> >>>> On 03/11/2011 06:48 PM, Salnikov, Andrei A. wrote:

Quincey Koziol wrote on 2011-03-10: >>>>>> Hi Andy, >>>>>> >>>>>> On Mar 9, 2011, at 11:15 AM, Salnikov, Andrei A. wrote:

Quincey Koziol wrote on 2011-03-09: >>>>>>>> Hi Andy, >>>>>>>> >>>>>>>> On Mar 8, 2011, at 7:09 PM, Salnikov, Andrei A. wrote:

Hi,

I'm trying to understand a performance hit that we are
experiencing trying to examine the tree structure of
our HDF5 files. Originally we observed problem when
using h5py but it could be reproduced even with h5ls
command. I tracked it down to a significant delay in
the call to H5Oget_info_by_name function on a dataset
with a large number of chunks. It looks like when the
number of chunks in dataset increases (in our case
we have 1-10k chunks) the performance of the H5Oget_info
drops significantly. Looking at the IO statistics it
seems that HDF5 library does very many small IO operations
in this case. There is very little CPU spent, but real
time is measured in tens of seconds.

Is this an expected behavior? Can it be improved somehow
without reducing the number of chunks drastically?

One more comment about H5Oget_info - it returns a
structure that contains a lot of different info.
In the case of h5py code the only member of the
structure used in the code is "type". could there be
more efficient way to determine just the type of the
object without requiring every other piece of info?

  Ah, yes, we've noticed that in some of the applications we've
worked with also (including some of the main HDF5 tools, like
h5ls, etc). As you say, H5Oget_info() is fairly heavyweight,
getting all sorts of information about each object. I do think a
lighter-weight call like "H5Oget_type" would be useful. Is
there other "lightweight" information that people would like back
for each object?

  Quincey

Hi Quincey,

thanks for confirming this. Could you explain briefly what is
going on there and which part of H5O_info_t needs so many reads?

  The H5Oget_info() call is gathering information about the amount
of space that the metadata for the dataset is using. When there's
a large B-tree for indexing the chunks, walking that B-tree can
take a fair bit of time.
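
To get a feel for why walking the chunk index turns into many small
reads, here is a rough C sketch. It is not HDF5 code: the function
name demo_count_index_nodes and the idea of a fully packed tree with a
fixed fan-out are assumptions made only for illustration. It counts
how many index nodes a complete walk would have to visit, on the
assumption that each visited node costs roughly one small read.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of HDF5): count the nodes in an
 * idealized, fully packed index tree that holds n_chunks entries,
 * where every node has at most `fanout` children.
 * A walk that sums up metadata sizes has to visit every one of them.
 */
unsigned long demo_count_index_nodes(unsigned long n_chunks,
                                     unsigned long fanout)
{
    unsigned long total = 0;        /* nodes counted so far */
    unsigned long level = n_chunks; /* entries to be indexed by the next level up */

    assert(fanout >= 2);            /* a fan-out below 2 would never converge */

    if (n_chunks == 0)
        return 0;

    /* Build the tree bottom-up: each level needs ceil(level / fanout) nodes. */
    do {
        level = (level + fanout - 1) / fanout;
        total += level;
    } while (level > 1);

    return total;
}
```

For example, demo_count_index_nodes(10000, 64) returns 161, so under
these assumptions a single size query on a 10k-chunk dataset touches
161 separate nodes. The fan-out of 64 is just a number picked for the
example; a real file is unlikely to be packed this tightly, so the
actual count would be higher.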

   Maybe removing heavyweight info from H5O_info_t is the right
thing to do, or creating another version of H5O_info_t structure
which has only light-weight info?

  I'm leaning toward another light-weight version. I'm asking the
HDF5 community to help me decide what goes into that structure
besides the object type.

Hi Quincey,

is there a chance we can get this new version in the next release?

We actually already have an experimental branch with a similar
feature mostly implemented. It allows you to specify the fields you
want filled in by H5Oget_info. The branch can be found at:

http://svn.hdfgroup.uiuc.edu/hdf5/branches/h5oget_info_by_field/

The new functions are:

herr_t H5Oget_info2(hid_t loc_id, H5O_info_t *oinfo,
    unsigned fields);
herr_t H5Oget_info_by_name2(hid_t loc_id, const char *name,
    H5O_info_t *oinfo, unsigned fields, hid_t lapl_id);

The "fields" parameter can contain the following bitflags (combined
with "|"):

H5O_INFO_TIME
H5O_INFO_NUM_ATTRS
H5O_INFO_HDR
H5O_INFO_META_SIZE
H5O_INFO_ALL (== H5O_INFO_TIME | H5O_INFO_NUM_ATTRS | H5O_INFO_HDR |
              H5O_INFO_META_SIZE)

Passing these flags tells the library to fill in the corresponding
fields in oinfo. Other fields are always filled in because there is
no performance penalty. In your case, since you only need the type,
you can just pass "0". h5ls has also been modified to use these, so
it should be faster.
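
The idea behind the "fields" argument can be sketched in plain C
without the library. Everything below is a stand-in: the DEMO_* flag
values, the demo_object_t and demo_info_t structures, and the
demo_get_info2 function are invented for this example and only mirror
the names mentioned above. The point it tries to show is that the
cheap members are always filled in, while the expensive index walk
only happens when the caller asks for the metadata size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; only the names echo the ones in the thread. */
#define DEMO_INFO_TIME      0x01u
#define DEMO_INFO_NUM_ATTRS 0x02u
#define DEMO_INFO_HDR       0x04u
#define DEMO_INFO_META_SIZE 0x08u
#define DEMO_INFO_ALL (DEMO_INFO_TIME | DEMO_INFO_NUM_ATTRS | \
                       DEMO_INFO_HDR | DEMO_INFO_META_SIZE)

typedef enum { DEMO_TYPE_GROUP, DEMO_TYPE_DATASET } demo_type_t;

/* Stand-in for an object stored in a file. */
typedef struct {
    demo_type_t   type;
    unsigned long n_chunks;
    unsigned      num_attrs;
    long          mtime;
    unsigned long hdr_size;
} demo_object_t;

/* Stand-in for H5O_info_t, reduced to a handful of members. */
typedef struct {
    demo_type_t   type;      /* always filled in (cheap)          */
    long          mtime;     /* only with DEMO_INFO_TIME          */
    unsigned      num_attrs; /* only with DEMO_INFO_NUM_ATTRS     */
    unsigned long hdr_size;  /* only with DEMO_INFO_HDR           */
    unsigned long meta_size; /* only with DEMO_INFO_META_SIZE     */
} demo_info_t;

/* Counts how often the "expensive" path was taken. */
unsigned long demo_index_walks = 0;

/* Pretend to walk the chunk index; 32 bytes per chunk is a made-up cost. */
unsigned long demo_walk_chunk_index(const demo_object_t *obj)
{
    demo_index_walks++;
    return obj->n_chunks * 32ul;
}

/* Fill in only the members selected by `fields`; returns 0 on success. */
int demo_get_info2(const demo_object_t *obj, demo_info_t *oinfo,
                   unsigned fields)
{
    assert(obj != NULL && oinfo != NULL);

    /* Cheap members: always set. Unrequested members are zeroed. */
    oinfo->type      = obj->type;
    oinfo->mtime     = 0;
    oinfo->num_attrs = 0;
    oinfo->hdr_size  = 0;
    oinfo->meta_size = 0;

    if (fields & DEMO_INFO_TIME)
        oinfo->mtime = obj->mtime;
    if (fields & DEMO_INFO_NUM_ATTRS)
        oinfo->num_attrs = obj->num_attrs;
    if (fields & DEMO_INFO_HDR)
        oinfo->hdr_size = obj->hdr_size;
    if (fields & DEMO_INFO_META_SIZE)
        oinfo->meta_size = demo_walk_chunk_index(obj); /* the slow part */

    return 0;
}
```

A caller that only needs the object type would pass 0 for "fields" and
read oinfo.type afterwards; in this sketch demo_index_walks stays at
zero in that case, which corresponds to the case described above where
only the type is needed.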

Of course, this is experimental code and should not be used in
production, but if you're curious how much a lightweight H5Oget_info
would help your performance you're welcome to try it. If you do,
we'd love to hear about your results, and also your thoughts on the
interface. For maximum performance, you should configure the library
with "--enable-production" (for this branch, not necessary for
releases).

Thanks,
-Neil

Hi Neil,

I managed to build this branch and test it. It has indeed improved
performance dramatically. As you suggested, I only used a zero value
for the fields argument; other values were not included in my test.
With that value, and checking only the "type" field in H5O_info_t, it
runs much faster than the previous version. 'h5ls' also works better
on our files.

What I find interesting is the missing variant of H5Oget_info_by_idx
that would take a "fields" argument. Is this function so different
from H5Oget_info and H5Oget_info_by_name that it cannot be optimized?

Even without H5Oget_info_by_idx2 I'd be happy to see this branch
included in the next release.

Glad to hear it improved your performance! It would be easy to add
H5Oget_info_by_idx2; we just didn't do that because we only did the
minimum needed to test the performance in the case we were looking at,
and stopped after reaching that point. We shelved the work because it
didn't make a huge difference in that case, but with your report I
will look into getting it scheduled sooner rather than later. There is
a chance we may change the interface to something like what Quincey
suggested. Thanks for taking the time to test this!

-Neil

Cheers,
Andy

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org


Andrei,

Hi all,

I have not received any response to my last question to Neil.
Does anybody else know the status of this new feature
(H5Oget_info2 and H5Oget_info_by_idx2) and whether there is a chance
that we will see it in released code any time soon?

Sorry, it will not be included in this upcoming release (HDF5 1.8.8). We will let you know as soon as it is merged into the development branch (1.9). If the feature doesn't require a file format extension, it has a chance to be included in a future maintenance release.

Elena

···

On Oct 14, 2011, at 11:19 AM, Salnikov, Andrei A. wrote:

Cheers,
Andy
