Questions regarding H5Dcreate_anon() and reclaiming unused disk file space

Hi,

I have some questions regarding H5Dcreate_anon() as implemented in
version 1.8.18 of the HDF5 library...

I'd like to use this function to create a temporary test dataset. If
it meets a certain condition, which I basically can't determine until
writing of the test dataset to disk is finished, I'll make it
accessible in the HDF5 file on disk with H5Olink(). Otherwise I'll
discard the temporary dataset and try again with relevant changes.
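
For concreteness, here is a minimal sketch of the flow I have in mind
(the helper functions, the dataset name and the already-open identifiers
are just placeholders, and error checking is omitted):

  /* file_id, space_id and a chunked dcpl_id are assumed to exist already. */
  hid_t dset_id = H5Dcreate_anon(file_id, H5T_NATIVE_DOUBLE, space_id,
                                 dcpl_id, H5P_DEFAULT);

  write_test_dataset_chunk_by_chunk(dset_id);   /* placeholder helper */

  if (test_dataset_meets_condition(dset_id)) {  /* placeholder helper */
      /* Give the anonymous dataset a name so it stays reachable in the file. */
      H5Olink(dset_id, file_id, "/results/test_dataset",
              H5P_DEFAULT, H5P_DEFAULT);
  }
  /* If no link was created, the dataset has no name once it is closed. */
  H5Dclose(dset_id);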

I'd like to be certain of two things that are needed for this approach
to work well:

1) Does the dataset generated by H5Dcreate_anon() actually exist
(transiently) on-disk, rather than being a clever wrapper for some
memory buffer? I am generating the dataset chunked and writing it out
chunk-by-chunk, so insufficient RAM isn't a problem UNLESS there is a
concern with using H5Dcreate_anon() for a dataset too large to fit in
memory at once.

2) I understand that "normal" H5Dcreate() and a dataset write,
followed sometime later by H5Ldelete(), can end up (in 1.8.18)
resulting in wasted space in the file on disk. Can wasted space be
produced similarly by H5Dcreate_anon() when no later call to H5Olink()
is made? [Assume that H5Dclose() gets properly called.] I'm hoping
not ... ?

Thanks in advance for info on this subject!

Regarding the wasted space, I have secondary questions.

3) I know that h5repack can be used to produce a new file without
wasted space. But without h5repack, would the creation of more
datasets in the same file (with library version 1.8.18) re-use that
wasted disk space when possible?

4) There are apparently some mechanisms in 1.10.x for managing /
reclaiming wasted space on disk in HDF5 files? Does it happen
automatically upon any call to H5Ldelete() with the 1.10.x library, or
are some additional function calls needed? I can't really find
anything in the docs about this so a pointer would be much
appreciated. (As noted on this list previously, my employer can't
upgrade to 1.10.x until there is a way to produce 1.8.x backwards
compatible output, but eventually I guess we'll all get there...)

Thanks again,

···

--
Kevin B. McCarty
<kmccarty@gmail.com>

Hi Kevin,

Hi,

I have some questions regarding H5Dcreate_anon() as implemented in
version 1.8.18 of the HDF5 library...

I'd like to use this function to create a temporary test dataset. If
it meets a certain condition, which I basically can't determine until
writing of the test dataset to disk is finished, I'll make it
accessible in the HDF5 file on disk with H5Olink(). Otherwise I'll
discard the temporary dataset and try again with relevant changes.

I'd like to be certain of two things that are needed for this approach
to work well:

1) Does the dataset generated by H5Dcreate_anon() actually exist
(transiently) on-disk, rather than being a clever wrapper for some
memory buffer? I am generating the dataset chunked and writing it out
chunk-by-chunk, so insufficient RAM isn't a problem UNLESS there is a
concern with using H5Dcreate_anon() for a dataset too large to fit in
memory at once.

  Yes, it’s really on disk.

2) I understand that "normal" H5Dcreate() and a dataset write,
followed sometime later by H5Ldelete(), can end up (in 1.8.18)
resulting in wasted space in the file on disk. Can wasted space be
produced similarly by H5Dcreate_anon() when no later call to H5Olink()
is made? [Assume that H5Dclose() gets properly called.] I'm hoping
not … ?

  Yes, there could be some wasted space in the file with H5Dcreate_anon, although it will be less than what could occur with H5Dcreate.

Thanks in advance for info on this subject!

Regarding the wasted space, I have secondary questions.

3) I know that h5repack can be used to produce a new file without
wasted space. But without h5repack, would the creation of more
datasets in the same file (with library version 1.8.18) re-use that
wasted disk space when possible?

  Yes, as long as you don't close & reopen the file. In the 1.8 release sequence, the free file space info is tracked in memory until the file is closed. (In the 1.10 sequence, there's a property to request that this information be tracked persistently in the file.)

4) There are apparently some mechanisms in 1.10.x for managing /
reclaiming wasted space on disk in HDF5 files? Does it happen
automatically upon any call to H5Ldelete() with the 1.10.x library, or
are some additional function calls needed? I can't really find
anything in the docs about this so a pointer would be much
appreciated. (As noted on this list previously, my employer can't
upgrade to 1.10.x until there is a way to produce 1.8.x backwards
compatible output, but eventually I guess we'll all get there…)

  Yes, you want the H5Pset_file_space_strategy property (alluded to above).
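
  A rough sketch of how that looks when creating a file (1.10.1-style API, untested here; the file name is just an example):

  /* Ask the library to track free space persistently and merge freed
   * space back into the free-space managers. */
  hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
  H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_FSM_AGGR,
                             1 /* persist free-space info */, (hsize_t)1);
  hid_t file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);
  H5Pclose(fcpl);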

    Quincey

···

On Aug 22, 2017, at 7:13 PM, Kevin B. McCarty <kmccarty@gmail.com> wrote:

Hi,

I've read many examples, from both the H5TB high-level API and the low-level API for compound HDF5 data types, but I didn't find a good solution for my particular use case. All those examples share one problematic assumption: the data structure (that is, the number of fields and their types and values) must be known a priori. That's the problem in my case: I don't know this structure up front, and I need to build an HDF table dataset not only row-by-row, but also field-by-field within a row.

I need your advice on how to achieve what I want using a proper sequence of HDF API calls.

Let's say my final HDF table will look like this:
['a', 1, 3.14]
['b', 2, 2.11]
['c', 3, 1.89]

So we simply have a HDF table with 3 columns of types: char, int, float
and 3 rows with some values.

Creation of that table must be divided into some "steps".
After 1st "step" I should have a table:
['a']

After 2nd step:
['a', 1]

After 3rd step:
['a', 1, 3.14]

After 4th step:
['a', 1, 3.14]
['b', x, x]

where x after the 4th step is undefined and can be some default value which will be overwritten in the next steps.

How to achieve that use case?

Is it possible to create a table by calling H5TBmake_table(), but having no fields and no records at the beginning and then just call H5TBinsert_field() in the next steps?

Is it possible to pass NULL for the "data" argument of H5TBinsert_field() when inserting a new field into a table dataset that has no records yet?

What about the 4th step - can I create just the first column value for a new record in the table?

I know it's maybe a strange use case, but the problem is that I could have a really huge structure model (a lot of columns and a lot of records) which should be stored in the HDF table dataset, so I need to avoid "collecting" the required information (number of fields, their types, values) by initially iterating over the whole structure.
The second problem is that I have a vector of objects which needs to be stored as an HDF table (where a table row is a given object and the columns are its fields), but all the examples I've seen work only on a C struct.

I would appreciate any advice!

Regards,
Rafal

Let's say my final HDF table will look like this:
['a', 1, 3.14]
['b', 2, 2.11]
['c', 3, 1.89]

So we simply have a HDF table with 3 columns of types: char, int,
float
and 3 rows with some values.

Creation of that table must be divided into some "steps".
After 1st "step" I should have a table:
['a']

After 2nd step:
['a', 1]

After 3rd step:
['a', 1, 3.14]

After 4th step:
['a', 1, 3.14]
['b', x, x]

where x after 4th step is undefined and can be some default values
which will be overwritten in the next steps.

How to achieve that use case?

I have to do something similar for my program tablator

  https://github.com/Caltech-IPAC/tablator

I read in tables in other formats and write out as HDF5. So I do not
know the types of the rows at compile time. It is all in C++. The
details of writing HDF5 are in

  src/Table/write_hdf5/

Is it possible to create a table by calling H5TBmake_table(), but
having no fields and no records at the beginning and then just call
H5TBinsert_field() in the next steps?

I do not think that is going to work, because you need to know the
sizes of rows when you create the table.
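
To illustrate (a rough sketch only, with a hypothetical, already-open
file_id): H5TBmake_table() wants the complete record layout, meaning
the record size, field offsets and member types, before any records exist.

  /* Sketch for the 3-column example; error checking omitted. */
  typedef struct { char c; int i; float f; } Row;
  const char *field_names[3]   = { "c", "i", "f" };
  size_t      field_offsets[3] = { HOFFSET(Row, c), HOFFSET(Row, i),
                                   HOFFSET(Row, f) };
  hid_t       field_types[3]   = { H5T_NATIVE_CHAR, H5T_NATIVE_INT,
                                   H5T_NATIVE_FLOAT };

  /* Create the (empty, extendible) table; rows can be appended later. */
  H5TBmake_table("example", file_id, "table", 3, 0 /* nrecords */,
                 sizeof(Row), field_names, field_offsets, field_types,
                 10 /* chunk */, NULL /* fill */, 0 /* compress */,
                 NULL /* no data yet */);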

Is it possible to have "data" attribute of H5TBinsert_field() function
a NULL value when we insert a new field to a table dataset with no
records yet?

What about 4th step - can I create just a first column value for a new
record in a table?

I do not know of a way to do that. I would end up creating a whole
new table with the new field. You can then populate the empty fields
with appropriate default values.

I know it's maybe a strange use case, but the problem is that I could
have really huge structure model (a lot of columns and a lot of
records) which should be stored in the HDF table dataset, so I need to
avoid "collecting" required information (number of fields, their
types, values) by initial iterating over whole structure.
The second problem is that I have a vector of objects which need to be
stored as HDF table (where table row is the given object and columns
are its fields), but all examples I've seen just work on C struct.

That sounds similar to the internal data structure I use in tablator.

Hope that helps,
Walter Landry

···

Rafal Lichwala <syriusz@man.poznan.pl> wrote:

Thank you for the answers, Quincey!

        Yes, there could be some wasted space in the file with H5Dcreate_anon, although it will be less than what could occur with H5Dcreate.

OK, that is good to know. I guess I'll have to try some practical
tests to see whether the amount of wasted space is small enough for
this approach to be acceptable to us or not.

[me]

4) There are apparently some mechanisms in 1.10.x for managing /
reclaiming wasted space on disk in HDF5 files? Does it happen
automatically upon any call to H5Ldelete() with the 1.10.x library, or
are some additional function calls needed? I can't really find
anything in the docs about this so a pointer would be much
appreciated. (As noted on this list previously, my employer can't
upgrade to 1.10.x until there is a way to produce 1.8.x backwards
compatible output, but eventually I guess we'll all get there…)

        Yes, you want the H5Pset_file_space_strategy property (alluded to above).

I've found the documentation for this function... I also found a
couple of documents here with more detailed and higher-level
descriptions, for which I'll post a link for the sake of anyone else
looking for this info:

https://support.hdfgroup.org/HDF5/docNewFeatures/FileSpace/

I'll spend some time looking at these. Are these documents accurate
with respect to the library (as of 1.10.1)? Or is there a more
up-to-date version that I should look at?

Thanks again,

···

On Wed, Aug 23, 2017 at 8:01 AM, Quincey Koziol <koziol@lbl.gov> wrote:

--
Kevin B. McCarty
<kmccarty@gmail.com>

Hi Walter, hi All,

Thank you for sharing your work.
I've analyzed your code briefly, and it seems you manually create a dataset with a compound type and then put your values into that dataset - is that correct?

But that means that for my use case I need to collect all the columns (and their types) first and then create a compound dataset.
When the number of such columns is really huge, this operation can be time- and resource-consuming. But that's OK if there is no other solution...

If I understand your code correctly, you are collecting your columns (which are separate classes in your case) in vectors and then calculating the column offsets based on std::vector::data() - is that correct?

Any other suggestions from the HDF Forum team that could help solve my use case?

Thank you.

Regards,
Rafal

···

What you want is run-time type definitions, which is something HDF
supports but C doesn't. Have a look at Python/NumPy's dtype for a good
idea of the task you're in for, especially on the C side; it maps really
well to HDF's type system. If you just want something really simple you
don't have to get too crazy.

Basically you need to know how big a type is (a runtime version of sizeof),
how many fields are in it, and the types of those fields, and then prefix-sum
(a cumsum with a zero shifted in from the left) the sizes and arity of those
fields to get the offset of each field of the struct-blob-at-runtime.
Then you have to work with those fields adaptively, since their runtime
type varies, because someone has to work with bytes at the end of the day.
It's still quite fast, and it's essentially what you see happening with
Python's NumPy API.
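
Roughly like this in plain C (a sketch only; no error checking and no
alignment/padding handling):

  /* Build an HDF5 compound type at run time from a list of member types.
   * Offsets come from a prefix sum of the member sizes. */
  hid_t make_runtime_compound(const char **names, const hid_t *types,
                              size_t nfields, size_t *offsets, size_t *total)
  {
      size_t size = 0;
      for (size_t i = 0; i < nfields; ++i) {   /* prefix sum of sizes */
          offsets[i] = size;
          size += H5Tget_size(types[i]);
      }
      hid_t ct = H5Tcreate(H5T_COMPOUND, size);
      for (size_t i = 0; i < nfields; ++i)
          H5Tinsert(ct, names[i], offsets[i], types[i]);
      *total = size;
      return ct;   /* records are then raw byte blobs laid out per offsets[] */
  }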

The table / packet table APIs will have no problem with these types as
records.

FYI, for performance you should still buffer some of these records up. The
alternative approach is storing a structure of arrays, which is a very
common technique - MATLAB uses it when it stores in HDF5, for instance, and
it is more approachable for other users of HDF5 who would rather not venture
into these things. I side with runtime structs, but other software sometimes
isn't so clever. This boils down to each column having its own
dataset/table/packet table of whatever primitive HDF5 type fits, such that
you will likely not have to touch compounds at all. The drawback of this
approach is that you always have to do work to reconstruct the object, which
can be relatively slow in Python, for instance, and involve lots of copying
and metaprogramming. In MATLAB there is no generic solution for gluing it
back together, for instance, so you end up writing boilerplate, and
supporting that is difficult as the documents change.

-Jason

···


Correct.

Cheers,
Walter Landry

···


Hi Rafal,

it looks like both the dimensions and the structural type of your dataset are supposed to be dynamic in your use case. That would be possible, but it would perform very poorly if everything were put into one dataset that is dynamically updated and reorganized, both in memory and on disk, whenever new data is inserted.

I'd rather use a group with many datasets in such a case. Each dataset can have an unlimited dimension in the one direction in which you append data, but uses only one type per dataset, not a compound structure. So when a new column is added, you add a new dataset. To iterate over the fields of your data type, you iterate over the containing group and check the type of each dataset there.
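
A rough sketch of that layout with the low-level C API (identifiers such as group_id are assumed to be open already; error checking and the H5*close calls are omitted):

  /* One 1-D, chunked, extendible dataset per column inside a group.
   * Appending a value means extending the dataset by one and writing
   * the last element. */
  hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {1024};
  hid_t space = H5Screate_simple(1, dims, maxdims);
  hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 1, chunk);
  hid_t col = H5Dcreate2(group_id, "column_0", H5T_NATIVE_INT, space,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);

  /* Append one value to this column: */
  hsize_t newdims[1] = {1}, start[1] = {0}, count[1] = {1};
  H5Dset_extent(col, newdims);
  hid_t fspace = H5Dget_space(col);
  H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
  hid_t mspace = H5Screate_simple(1, count, NULL);
  int value = 1;
  H5Dwrite(col, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, &value);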

Admittedly I have no experience with the HDF5 Table API; it's probably not possible to use it for this, and you'd need to use the lower-level H5D and H5G APIs.

Regards,

        Werner
···


--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362