Practical limit on number of objects?

Hi,

     Sorry, this question has probably been asked before, but I couldn't
find anything in the docs, and there doesn't seem to be an archive of
this mailing list.

     Are there known practical limitations on the number of objects
(e.g., groups or datasets)? I'm asking because I've written some test
programs, and HDF5 performance seems to start degrading non-linearly
once the number of objects grows above approximately 50,000-100,000.
Are there parameter settings that can improve this?

     I have two test programs that I'm using to test HDF5:

Program 1:
  Create a new HDF5 file, and write 100000 chunked datasets of
  size (6,2,2) (native double) into the top level. In this test,
  the chunk dimensions are the same as the entire dataset.

Program 2:
  Create a new HDF5 file, create 1000 groups, and write 100
  chunked datasets of size (6,2,2) (native double) into each
  group. In this test, the chunk dimensions are the same as the
  entire dataset.

I'm using chunked datasets, because the next test after this would
extend the dataset sizes from (6,2,2) to (N,2,2), for varying values of
N. I've tried using the split file driver, but the performance of that
is comparable.
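
In outline, Program 1 is essentially the following (a minimal sketch,
not my actual test code; error checking and the H5Dwrite() calls are
omitted, and the file/dataset names are just placeholders):

  #include <stdio.h>
  #include "hdf5.h"

  int main(void)
  {
      hsize_t dims[3] = {6, 2, 2};
      hid_t   file, space, dcpl, dset;
      char    name[32];
      int     i;

      file  = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
      space = H5Screate_simple(3, dims, NULL);
      dcpl  = H5Pcreate(H5P_DATASET_CREATE);
      H5Pset_chunk(dcpl, 3, dims);           /* chunk dims == dataset dims */

      for (i = 0; i < 100000; i++) {
          snprintf(name, sizeof(name), "dset%06d", i);
          dset = H5Dcreate2(file, name, H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
          /* ... H5Dwrite() the (6,2,2) block of doubles here ... */
          H5Dclose(dset);
      }

      H5Pclose(dcpl);
      H5Sclose(space);
      H5Fclose(file);
      return 0;
  }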

     Also, to get better performance, I've had to twiddle various symbol
and storage parameters, but I really have no idea what I'm doing here:

  status = H5Pset_istore_k(fcpl, 1);   /* 1/2-rank of the chunked-storage B-tree */
  status = H5Pset_sym_k(fcpl, 20, 50); /* symbol-table B-tree 1/2-rank and 1/2 leaf size */

[ What I'm really trying to do is figure out a reasonable way of storing
  ragged arrays of ragged arrays of ragged arrays of .... The nesting
  can go pretty deep, and so I was wondering if I could use groups to
  help with the nesting. Unfortunately, with this approach, the number
  of groups used by my program could be on the order of a trillion or
  more, worst-case. I could use alternative encodings (e.g.,
  concatenate all my datasets), but, at that point, I don't know if it's
  worthwhile to use HDF5 any more. ;-( ]

···

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Hi Darryl,

Hi,

    Sorry, this question has probably been asked before, but I couldn't
find anything in the docs, and there doesn't seem to be an archive of
this mailing list.

    Are there known practical limitations on the number of objects
(e.g., groups or datasets)? I'm asking because I've written some test
programs, and the HDF5 performance seems to start non-linearly degrading
once the number of objects grows above approximately 50000-100000
objects. Are there parameter settings that can improve this?

  I would suggest trying the enhancements that come with using the
latest version of the file format, which can be enabled by calling
H5Pset_libver_bounds() with both bounds set to H5F_LIBVER_LATEST. This
should force the library to use the newer data structures for storing
links in groups. We are continuing some work that will speed things up
further, but it's not in a public release yet.
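
  Roughly (a sketch, with error checking omitted):

    hid_t fapl, file;

    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    /* files created with this fapl use the new group structures */
    file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);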

    I have two test programs that I'm using to test HDF5:

Program 1:
  Create a new HDF5 file, and write 100000 chunked datasets of
  size (6,2,2) (native double) into the top level. In this test,
  the chunk dimensions are the same as the entire dataset.

Program 2:
  Create a new HDF5 file, create 1000 groups, and write 100
  chunked datasets of size (6,2,2) (native double) into each
  group. In this test, the chunk dimensions are the same as the
  entire dataset.

I'm using chunked datasets, because the next test after this would
extend the dataset sizes from (6,2,2) to (N,2,2), for varying values of
N. I've tried using the split file driver, but the performance of that
is comparable.

    Also, to get better performance, I've had to twiddle various symbol
and storage parameters, but I really have no idea what I'm doing, here:

  status = H5Pset_istore_k(fcpl, 1);
  status = H5Pset_sym_k(fcpl, 20, 50);

[ What I'm really trying to do is figure out a reasonable way of storing
ragged arrays of ragged arrays of ragged arrays of .... The nesting
can go pretty deep, and so I was wondering if I could use groups to
help with the nesting. Unfortunately, with this approach, the number
of groups used by my program could be on the order of a trillion or
more, worst-case. I could use alternative encodings (e.g.,
concatenate all my datasets), but, at that point, I don't know if it's
worthwhile to use HDF5 any more. ;-( ]

  You can nest HDF5's variable-length datatypes arbitrarily deep - does that give you what you are looking for?
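
  For example, a ragged array of ragged arrays of doubles could be
described with nested variable-length types, roughly (just a sketch):

    hid_t inner, outer;

    inner = H5Tvlen_create(H5T_NATIVE_DOUBLE); /* vlen of double */
    outer = H5Tvlen_create(inner);             /* vlen of the above */
    /* in memory, each element is an hvl_t whose `p' member points to
       an array of hvl_t's, each of which in turn points to doubles */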

  Quincey

···

On Aug 6, 2008, at 2:49 PM, Darryl Okahata wrote:

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


Hi, I have a Java application using HDF through the JNI bindings. I
know from previous discussions on this forum that opening datasets has
some significant overhead and that, in general, reducing the number of
datasets is a good idea. I have done that to the degree possible, but
opening datasets is still, by far, the top hit on my performance
profiles. It is taking significantly longer than reading the actual
data.

The structure of the program makes it likely that once a dataset is
opened it will be revisited later. I have been closing datasets after
each read operation because I am not sure what the consequences are of
simply leaving them open. I am considering implementing a cache to keep
some number of datasets open for the life of the program, or until my
cache limit is reached. Does anyone know how many I can safely keep
open without worrying about running out of some resource, for example
memory, in the C code that is actually managing them? Would 100 be
safe, or 1,000?

The program, by the way, is a genomics visualizer and was just released
to the public at www.broad.mit.edu.

Thanks for any help. I like hdf5, it saves me a lot of time, but I've got to solve this problem by some means to continue using it.

Regards

Jim

···


On Saturday, 09 August 2008, Jim Robinson wrote:

Hi, I have a Java application using HDF through the JNI bindings.
I know from previous discussions on this forum that opening datasets
has some significant overhead and that, in general, reducing the
number of datasets is a good idea. I have done that to the degree
possible, but opening datasets is still, by far, the top hit on my
performance profiles. It is taking significantly longer than reading
the actual data.

If this is the case, then I think that you are using HDF5 in a scenario
that it is not designed for. HDF5 is mainly meant for keeping large
amounts of data in relatively few containers. If what you are trying
to do is keep all your data spread across a lot of containers, then
perhaps an object database (or something else) would be your best bet.

Having said that, the HDF5 1.8.x series implements a much more optimized
cache for metadata that should help you somewhat. See below.

The structure of the program makes it likely that once a dataset is
opened it will be revisited later. I have been closing
datasets after each read operation because I am not sure what the
consequences are of simply leaving them open. I am considering
implementing a cache to keep some number of datasets open for the
life of the program, or until my cache limit is reached.

My experience in this area is that the metadata cache built into the
HDF5 library should be enough, especially the one in HDF5 1.8.0 and
later -- you just should not stress it too much. As a reference, from
a small benchmark that I've done with PyTables (the results should
carry over to an equivalent C benchmark), here is the memory taken for
creating and reading a file with 5000 and 10000 datasets:

HDF5 1.6.7:

Create:
File with 5000 datasets: 45 MB
File with 10000 datasets: 73 MB

Read a subset of 100 datasets:
File with 5000 datasets: 20 MB
File with 10000 datasets: 25 MB

HDF5 1.8.0:

Create:
File with 5000 datasets: 30 MB
File with 10000 datasets: 31 MB

Read a subset of 100 datasets:
File with 5000 datasets: 18 MB
File with 10000 datasets: 19 MB

So, clearly, the 1.8.x series has improved a lot in this area.

Does
anyone know how many I can safely keep open without worrying about
running out of some resource, for example memory, in the C code that
is actually managing them? Would 100 be safe, or 1,000?

If you are still interested in implementing a sort of LRU cache for your
nodes (datasets), you should be aware that the algorithm for
determining the least recently used node also takes time (perhaps a lot
more than the algorithm used to evict from the metadata cache in HDF5),
so you may want to create a cache with a fairly small number of nodes
(my recommendation is not to exceed 256) so as not to add too much
overhead in your implementation.

Just as a reference, in PyTables Pro I have implemented such an LRU
cache (for other reasons than yours) with carefully optimized C code
and, for an LRU cache size of 256 nodes, we see a performance loss of
between 2x and 3x with respect to the metadata cache code in HDF5. Of
course, we get our *own* eviction algorithm for the cache, but we had
to pay a price for that.
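
Just to illustrate the idea, here is a sketch in C of a tiny cache of
open dataset handles (the names are made up, and it uses the simplest
possible flush-everything-when-full policy rather than a true LRU;
H5Dopen2() is the 1.8 call, on 1.6.x it would be H5Dopen()):

  #include <string.h>
  #include "hdf5.h"

  #define CACHE_SIZE 256

  typedef struct {
      char  path[256];   /* dataset path within the file */
      hid_t dset;        /* open dataset handle */
  } cache_entry;

  static cache_entry cache[CACHE_SIZE];
  static int         cache_used = 0;

  /* Return an open handle for `path', reusing a cached one when
   * possible.  When the cache is full, close everything and start
   * over. */
  hid_t cached_open(hid_t file, const char *path)
  {
      int i;

      for (i = 0; i < cache_used; i++)
          if (strcmp(cache[i].path, path) == 0)
              return cache[i].dset;

      if (cache_used == CACHE_SIZE) {
          for (i = 0; i < cache_used; i++)
              H5Dclose(cache[i].dset);
          cache_used = 0;
      }

      strncpy(cache[cache_used].path, path, sizeof(cache[0].path) - 1);
      cache[cache_used].path[sizeof(cache[0].path) - 1] = '\0';
      cache[cache_used].dset = H5Dopen2(file, path, H5P_DEFAULT);
      return cache[cache_used++].dset;
  }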

The program by the way is a genomics visualizer, and was just
released to the public at www.broad.mit.edu.

Thanks for any help. I like hdf5, it saves me a lot of time, but
I've got to solve this problem by some means to continue using it.

HTH,

···

--
Francesc Alted
Freelance developer
Tel +34-964-282-249


Maybe you're correct; the logical model of HDF5 is well suited to my
problem, though, and I'm committed to it in the near term. This design
bias towards large datasets in relatively few containers is not obvious
from the documentation. At any rate, I'm not talking about a huge
number of datasets, just 30-100 or so typically.

HDF5 1.8.x is not an option because, as far as I know, there are no
Java JNI bindings available yet. I will try your suggestion for a
relatively small cache. Thanks very much for the numbers and advice.

Jim


On Saturday, 09 August 2008, you wrote:

Maybe you're correct, the logical model of HDF is well suited to my
problem though and I'm committed to it in the near term. This
design bias towards large datasets in relatively few containers is
not obvious from the documentation.

Perhaps this fact is not obvious from reading the docs, but I've
clearly learned it from the book of my own experience ;-)

At any rate I'm not talking
about a huge number of datasets, just 30-100 or so typically.

30-100 datasets should not be a problem at all, even with the HDF5
1.6.x series.

HDF5 1.8.x is not an option because, as far as I know, there are no
Java JNI bindings available yet. I will try your suggestion for a
relatively small cache. Thanks very much for the numbers and
advice.

Glad that I helped you.

···

--
Francesc Alted
Freelance developer
Tel +34-964-282-249


Hi Francesc, I should have defined "problem" more clearly. In my
application HDF serves up data in small chunks to support an
interactive visualization of genomic data. As the user zooms and pans,
more reads are triggered as additional "tiles" come into view (it's
modeled closely on Google Maps). There are many datasets spread across
multiple HDF5 files. As long as the user is hitting only a few
datasets, zooming and panning are fairly smooth, but when the numbers
get into the "10s", the lag from opening datasets becomes noticeable.

The datasets are organized by zoom level, so by caching datasets,
panning is now smooth. Delays when zooming are more acceptable;
somehow you expect a delay when zooming in, or at least visually it is
less disturbing than jerky panning. I don't really need an LRU cache;
simply dumping everything and starting over when the cache is full is
good enough.

I would really like to try 1.8.x. Are there any plans to develop a
Java interface for that version?

Thanks

Jim


Hi Jim,

If your issue is the time to access the different datasets, then you
will not see much difference between 1.6.x and 1.8.x. In my
benchmarks (made on my pretty old laptop), I'm opening a dataset in
about 200 microseconds when the dataset is not in the HDF5 metadata
cache and in about 35 microseconds when the dataset is in the cache,
regardless of whether I use 1.6.7 or 1.8.0 -- incidentally, when the
dataset metadata is in the cache there is a small advantage of about
25% in favor of 1.6.7, but this should not matter much in your setup.

Most importantly, my experiments show that the time needed to open a
single dataset is approximately *independent* of the number of datasets
in a file (at least in the range of 100 ~ 10000 datasets). At any
rate, I don't find 35 microseconds (this is with PyTables; the time
should certainly be better from a C program) to be an excessive figure
for reopening a dataset. I'd recommend double-checking where your
bottleneck really is (perhaps it is in another piece of code that
depends on the number of datasets visited).

If you really need much more speed than 35 microseconds, and you want to
keep using HDF5, then I'd think about a way to consolidate more
information in your datasets, by adding more dimensions or appending
more data to the existing datasets. Then you can perform direct I/O by
doing hyperslab selections on the interesting parts of your datasets.
Of course, that implies setting up a sort of index to quickly locate
those interesting sub-datasets. Whether your new index is faster than
the cost of re-opening datasets in HDF5 will depend on the complexity
of such an index -- so you should look for a simple enough
implementation.

BTW, after seeing the effectiveness of the new HDF5 1.8.x series, I
think I'm starting to change my mind about HDF5 not being useful when
there are *a lot* of datasets in the same file. The 1.8.x versions
seem to work really nicely in this scenario. Many thanks to the HDF5
team for this :-)

Francesc


--
Francesc Alted
Freelance developer
Tel +34-964-282-249


Well, it came from a 1.8.1 tarball, downloaded on July 24th:

  $ ll hdf5-1.8.1.tar.bz2
  -rw-r--r-- 1 darrylo eesofrd 5083921 Jul 24 16:39 hdf5-1.8.1.tar.bz2

I just re-extracted the tarball, and compared it with the 1.8.1 source
tree from which my h5dump came, and the two source trees are identical,
except for hdf5-1.8.1/Makefile (it appears to have been regenerated by
automake).

     I'll try to rebuild HDF with -g and without optimization (I'm using
an old gcc 4.1.1 compiler). Perhaps that will help. It'll be a few
days, though.

     Thanks.

···

Quincey Koziol <koziol@hdfgroup.org> wrote:

  Hmm, this is odd. What version of h5dump are you using? It doesn't
look like it's 1.8.1...

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.

Hi Darryl,

···

On Aug 14, 2008, at 12:13 PM, Darryl Okahata wrote:

Quincey Koziol <koziol@hdfgroup.org> wrote:

  Hmm, this is odd. What version of h5dump are you using? It doesn't
look like it's 1.8.1...

    Well, it came from a 1.8.1 tarball, downloaded on July 24th:

  $ ll hdf5-1.8.1.tar.bz2
  -rw-r--r-- 1 darrylo eesofrd 5083921 Jul 24 16:39 hdf5-1.8.1.tar.bz2

I just re-extracted the tarball, and compared it with the 1.8.1 source
tree from which my h5dump came, and the two source trees are identical,
except for hdf5-1.8.1/Makefile (it appears to have been regenerated by
automake).

    I'll try to rebuild HDF with -g and without optimization (I'm using
an old gcc 4.1.1 compiler). Perhaps that will help. It'll be a few
days, though.

  Hmm, if you are going to be rebuilding, could you try the current stable 1.8 release code from our public subversion repository: http://svn.hdfgroup.uiuc.edu/hdf5/branches/hdf5_1_8

  It's got some related fixes and might work better for you,
    Quincey


Thanks. It turns out that the expected file suffixes are "-m.h5"
and "-r.h5" (this is for 1.8.1).

     I tracked down the problem to the use of H5Pset_libver_bounds():

        status = H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

If I don't use this, h5dump works. However, the use of the above line
results in:

        $ h5dump ext.h5
        h5dump error: internal error (file h5dump.c:line 4150)

h5dump works if I don't use H5Pset_libver_bounds(). I've attached a
short diff to the example program, "h5_extend.c". If you compile and
run it, the resulting file will not work with h5dump.

[ Also, note that the patch is for 1.8.1, as I had to modify the example
  to work with the 1.8.1 API. ]

h5_extend.c.diffs (1.95 KB)

···

Quincey Koziol <koziol@hdfgroup.org> wrote:

      I believe this should work if you say "h5dump --filedriver split
ext.h5"

--
        Darryl Okahata
        darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.

Hi Darryl,

     I believe this should work if you say "h5dump --filedriver split
ext.h5"

    Thanks. It turns out that the expected file suffixes are "-m.h5"
and "-r.h5" (this is for 1.8.1).

    I tracked down the problem to the use of H5Pset_libver_bounds():

       status = H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

If I don't use this, h5dump works. However, the use of the above line
results in:

       $ h5dump ext.h5
       h5dump error: internal error (file h5dump.c:line 4150)

  Hmm, this is odd. What version of h5dump are you using? It doesn't look like it's 1.8.1...

  Quincey

···

On Aug 13, 2008, at 5:38 PM, Darryl Okahata wrote:

Quincey Koziol <koziol@hdfgroup.org> wrote:

h5dump works if I don't use H5Pset_libver_bounds(). I've attached a
short diff to the example, "h5_extend.c", program. If you compile and
run it, the resulting file will not work with h5dump.

[ Also, note that the patch is for 1.8.1, as I had to modify the example
to work with the 1.8.1 API. ]

--
       Darryl Okahata
       darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.

<h5_extend.c.diffs>


  I would suggest trying the enhancements that come with using the
latest version of the file format, which can be enabled by calling
H5Pset_libver_bounds() with both bounds set to
H5F_LIBVER_LATEST.

     Thanks. I tried this, but the difference is minimal for my test
program.

  You can nest HDF5's variable-length datatypes arbitrarily deep - does
that give you what you are looking for?

     One of the issues is that the data won't fit into memory. Worst
case, the entire pile of data is 10-100+ terabytes in size.

     I'm now trying plan B: concatenating the original datasets
end-to-end into one big dataset. I have another dataset that keeps
track of the locations and sizes of the original datasets within the
big dataset. So far, this seems to be somewhat fast and scalable (the
write times seem to scale roughly linearly). I need to do read
timings, though.
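
Reading back one of the original pieces is then just a hyperslab
selection on the big dataset. Roughly (a sketch; the function name is
mine, and it assumes the concatenation is along the first dimension,
with the offset and row count coming from the index dataset):

  #include "hdf5.h"

  /* Read original segment rows [offset, offset+nrows) of the big
     (N,2,2) dataset of doubles into `buf' (nrows*2*2 values). */
  herr_t read_segment(hid_t big_dset, hsize_t offset, hsize_t nrows,
                      double *buf)
  {
      hsize_t start[3] = { offset, 0, 0 };
      hsize_t count[3] = { nrows,  2, 2 };
      hid_t   fspace, mspace;
      herr_t  status;

      fspace = H5Dget_space(big_dset);
      H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

      mspace = H5Screate_simple(3, count, NULL);
      status = H5Dread(big_dset, H5T_NATIVE_DOUBLE, mspace, fspace,
                       H5P_DEFAULT, buf);

      H5Sclose(mspace);
      H5Sclose(fspace);
      return status;
  }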

     On a different note: does h5dump support split files? I can't seem
to get h5dump to recognize them. I wrote some test code that uses the
split driver to write two files:

  ext.h5.meta
  ext.h5.raw

$ h5dump ext.h5.meta
h5dump error: unable to open file "ext.h5.meta"
$ h5dump ext.h5.raw
h5dump error: unable to open file "ext.h5.raw"
$ h5dump --filedriver split ext.h5.meta
h5dump error: unable to open file "ext.h5.meta"
$ h5dump --filedriver split ext.h5.raw
h5dump error: unable to open file "ext.h5.raw"

I don't think it's a problem with the test code, as it produces a
dumpable "ext.h5" file if I comment out the call to H5Pset_fapl_split()
(the test code originated from the h5_extend.c example).
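
[ For reference, the test code sets up the split driver roughly like
  this (a sketch, error checking omitted), which is where the ".meta"
  and ".raw" suffixes come from:

    hid_t fapl, file;

    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_split(fapl, ".meta", H5P_DEFAULT, ".raw", H5P_DEFAULT);
    file = H5Fcreate("ext.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* -> produces ext.h5.meta and ext.h5.raw */
]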

···

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.

Hi Darryl,

···

On Aug 8, 2008, at 1:22 PM, Darryl Okahata wrote:

    On a different note: does h5dump support split files? I can't seem
to get h5dump to recognize them. I wrote some test code that uses the
split driver to write two files:

  ext.h5.meta
  ext.h5.raw

$ h5dump ext.h5.meta
h5dump error: unable to open file "ext.h5.meta"
$ h5dump ext.h5.raw
h5dump error: unable to open file "ext.h5.raw"
$ h5dump --filedriver split ext.h5.meta
h5dump error: unable to open file "ext.h5.meta"
$ h5dump --filedriver split ext.h5.raw
h5dump error: unable to open file "ext.h5.raw"

I don't think it's a problem with the test code, as it produces a
dumpable "ext.h5" file if I comment out the call to H5Pset_fapl_split()
(the test code originated from the h5_extend.c example).

  I believe this should work if you say "h5dump --filedriver split ext.h5"

  Quincey
