how to hide a certain field when writing a compound datatype

Kambrian · October 12, 2015, 3:55pm

Hello,

I was expecting that by not inserting a field into memtype, I could avoid
writing that field to disk. However, the attached test file shows that
this is not the case. Even if I do not insert the field, everything is
written and everything is read back as well.

I guess it has to do with the size of memtype. I tried pack() to reduce the
size but the data interpretation then went wrong.

Defining a new struct containing only the wanted fields is not optimal,
since it would require copying the data to the new struct and back, and my
application involves huge amounts of data. What I am trying to hide is
actually a vector field, which I write out separately as an array of
variable-length arrays. Currently, although I have omitted the vector field
from memtype, it is still written and then read back, which corrupts the
memory (the read fills the size and data pointer of each vector with the
values that were written, which are no longer valid pointers).

So is there a way that I can truly hide a particular field from being
written as well as from being read back, without having to define a new
temporary class?

Thank you!

Jiaxin

test_h5compound2.cpp (2.11 KB)

nevion · October 14, 2015, 1:42am

You can't hide it.

Your first option is to copy it to a new packed struct layout, perhaps
performing the conversion in chunks of a suitable size. This would probably
be the fastest and probably pretty friendly on memory & high performance if
you choose the right constants. You can use the type conversion routines to
go from the original memory layout of the structs to the packed one, or
simply do the reduction yourself, which will of course give higher
performance and lower overhead if you preallocate (rather than append) and
avoid all the complicated conversion code.
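
A minimal sketch of the "reduce it yourself" variant in plain C++, with no
HDF5 calls. The `Particle` and `PackedParticle` structs, the
`write_in_chunks` helper and the `sink` callback are all hypothetical names
made up for illustration; `sink` stands in for whatever actually writes one
batch. `#pragma pack` is a widely supported compiler extension, not
standard C++.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory record; `neighbours` must stay out of the file.
struct Particle {
    int id;
    double mass;
    std::vector<int> neighbours;
};

// Hypothetical packed layout holding only the wanted fields.
#pragma pack(push, 1)
struct PackedParticle {
    int id;
    double mass;
};
#pragma pack(pop)

// Convert `src` in batches of at most `chunk` records and hand each batch
// to `sink(const PackedParticle* data, size_t count, size_t first_index)`.
// Returns the number of batches produced.
template <typename Sink>
std::size_t write_in_chunks(const std::vector<Particle>& src,
                            std::size_t chunk, Sink sink) {
    assert(chunk > 0);
    // Scratch buffer is allocated once and reused for every batch.
    std::vector<PackedParticle> buf(std::min(chunk, src.size()));
    std::size_t batches = 0;
    for (std::size_t start = 0; start < src.size(); start += chunk) {
        const std::size_t n = std::min(chunk, src.size() - start);
        for (std::size_t i = 0; i < n; ++i) {
            buf[i].id = src[start + i].id;      // copy wanted fields only
            buf[i].mass = src[start + i].mass;
        }
        sink(buf.data(), n, start);
        ++batches;
    }
    return batches;
}
```

The extra memory is bounded by one chunk of packed records rather than by
the size of the whole dataset, which is the point of converting in chunks.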

Your second option is a structure of arrays, which decouples where each
field goes and so lets you specify exactly what you want. You could then
post-process or preprocess the arrays back into structures wherever is most
appropriate.
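
A rough sketch of the structure-of-arrays layout in plain C++, again
without HDF5 calls. `Particle`, `ParticleColumns`, `to_columns` and
`from_columns` are hypothetical names used only for illustration; each
column is assumed to be written to the file separately.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory record; `neighbours` is handled elsewhere.
struct Particle {
    int id;
    double mass;
    std::vector<int> neighbours;
};

// One contiguous array per wanted field.
struct ParticleColumns {
    std::vector<int> id;
    std::vector<double> mass;
};

// Split an array of structures into per-field arrays.
ParticleColumns to_columns(const std::vector<Particle>& src) {
    ParticleColumns cols;
    cols.id.reserve(src.size());
    cols.mass.reserve(src.size());
    for (const Particle& p : src) {
        cols.id.push_back(p.id);
        cols.mass.push_back(p.mass);
    }
    return cols;
}

// Rebuild structures from the columns; `neighbours` is left empty and
// has to be filled from wherever it was stored.
std::vector<Particle> from_columns(const ParticleColumns& cols) {
    assert(cols.id.size() == cols.mass.size());
    std::vector<Particle> out(cols.id.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i].id = cols.id[i];
        out[i].mass = cols.mass[i];
    }
    return out;
}
```

Since every column is an ordinary array of a single type, the field that
should stay hidden never appears in any layout that gets written.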

HTH,
-Jason

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: x.com

Kambrian · October 14, 2015, 2:07am

Thank you Jason!

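I actually just found a solution after digging really hard into the
documentation. I've posted my answer here:

c++ - hide certain fields of a compound datatype from being written to (or read back from) hdf5 file - Stack Overflow

The key is to specify a properly packed datatype for the file that is
different from the datatype in memory.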
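
To illustrate the idea in plain C++ without calling HDF5: the memory
description keeps each selected member's offset inside the full struct,
while the file description places the same members back to back. The sketch
below is only a model of that mapping, not the library's API or its actual
conversion code; `Field`, `member_offset`, `packed_size` and `gather` are
hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical in-memory record; `neighbours` is never listed as a field.
struct Particle {
    int id;
    double mass;
    std::vector<int> neighbours;
};

// One selected member: its name, where it sits in memory, and its size.
struct Field {
    std::string name;
    std::size_t mem_offset;
    std::size_t size;
};

// Byte offset of a member inside an object, measured on a live instance.
template <typename T, typename M>
std::size_t member_offset(const T& obj, M T::*member) {
    const unsigned char* base = reinterpret_cast<const unsigned char*>(&obj);
    const unsigned char* field =
        reinterpret_cast<const unsigned char*>(&(obj.*member));
    return static_cast<std::size_t>(field - base);
}

// Size of one record once the selected fields are packed back to back.
std::size_t packed_size(const std::vector<Field>& fields) {
    std::size_t total = 0;
    for (const Field& f : fields) total += f.size;
    return total;
}

// Copy only the selected fields of `count` records, spaced `mem_stride`
// bytes apart in memory, into a tightly packed byte buffer.
std::vector<unsigned char> gather(const void* records, std::size_t count,
                                  std::size_t mem_stride,
                                  const std::vector<Field>& fields) {
    const std::size_t rec = packed_size(fields);
    std::vector<unsigned char> out(rec * count);
    const unsigned char* base = static_cast<const unsigned char*>(records);
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t file_offset = 0;  // offset within the packed record
        for (const Field& f : fields) {
            std::memcpy(out.data() + i * rec + file_offset,
                        base + i * mem_stride + f.mem_offset, f.size);
            file_offset += f.size;
        }
    }
    return out;
}
```

Here `mem_stride` would be `sizeof(Particle)`, while the packed record is
only as large as the listed fields, so the unlisted vector member never
reaches the output buffer.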

Bests,

Jiaxin


nevion · October 14, 2015, 3:19am

Right. Given your desire to avoid a copy, I must point out that you are
still making a copy. Maybe I didn't understand your case, though; it seemed
you didn't want any overhead in memory, and maybe in CPU time.

What happens is that the type conversion system is run inside the dataset
write/read. Unless it is provided with a background buffer via the transfer
properties, it allocates a buffer and then runs a batch conversion from the
source memory with the source type to the destination memory with the
destination type.

It's not very heavy at all if you do it in batches of the right size, and
you must weigh this against the fairly high overhead of the HDF API
relative to C++ with smart memory management techniques and user conversion
code. But it's usually not necessary to do that, and the conversion API is
there for free and is robust.

-Jason


Kambrian · October 14, 2015, 1:11pm

Hi Jason,

Thank you again!

> Right. Given your desire to avoid a copy, I must point out that you are
> still making a copy. Maybe I didn't understand your case, though; it
> seemed you didn't want any overhead in memory, and maybe in CPU time.

Yes, memory and CPU overhead is what I am trying to avoid. There is a
maintenance consideration as well: I want to avoid duplicating the struct
definition manually every time it is updated.

> What happens is that the type conversion system is run inside the dataset
> write/read. Unless it is provided with a background buffer via the
> transfer properties, it allocates a buffer and then runs a batch
> conversion from the source memory with the source type to the destination
> memory with the destination type.

Good to know! How is the buffer size determined by default? Is it the size
needed to hold the whole dataset in the destination type, or is the
conversion done one by one on the array elements, so that only one element
needs to be allocated? How can I optimize it if it becomes a concern?

Bests,

Jiaxin

nevion · October 14, 2015, 10:59pm

I wouldn't say the maintenance overhead is generally much, considering you
must still whitelist the members for the projected structure.

The buffer size is the target type size * number of elements. As I said,
it's a batch conversion, because HDF doesn't convert item by item, just in
time, as each is written. That would probably be slower because of syscall
overhead and HDF conversion path overhead.

I already explained how you can optimize: write in suitably sized chunks or
do the conversions yourself. For batched writes, the right size is
something you'll find empirically. Powers of 2 from 32 to 1024 are good
starting points. For appending tables, I use the packet table APIs, but
that's another subject.
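
As a back-of-the-envelope aid, the sketch below tabulates the trade-off
for the suggested batch sizes, taking the "target type size * number of
elements" rule above at face value. It does not measure anything; the
actual timing still has to be found empirically. `BatchPlan`,
`conversion_buffer_bytes` and `candidate_plans` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scratch bytes needed to convert one batch, assuming the buffer is
// target type size * number of elements as stated above.
std::size_t conversion_buffer_bytes(std::size_t target_type_size,
                                    std::size_t elements) {
    return target_type_size * elements;
}

struct BatchPlan {
    std::size_t batch;         // elements per write call
    std::size_t writes;        // write calls needed for the whole dataset
    std::size_t buffer_bytes;  // scratch bytes per write call
};

// One plan per power-of-two multiple of `lo` up to `hi` (32..1024 by
// default, matching the range suggested above).
std::vector<BatchPlan> candidate_plans(std::size_t total_elements,
                                       std::size_t target_type_size,
                                       std::size_t lo = 32,
                                       std::size_t hi = 1024) {
    assert(lo > 0);
    std::vector<BatchPlan> plans;
    for (std::size_t b = lo; b <= hi; b *= 2) {
        const std::size_t writes = (total_elements + b - 1) / b;  // ceiling
        plans.push_back(
            {b, writes, conversion_buffer_bytes(target_type_size, b)});
    }
    return plans;
}
```

Larger batches mean fewer write calls but a bigger scratch buffer per
call; timing a few of these candidates on real data is what picks the
winner.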

Remember not to micro-optimize things that don't matter (memory or CPU).
