I was expecting that by not inserting a field into the memtype, I could avoid writing that field to disk. However, the attached test file shows that this is not the case: even if I do not insert the field, everything is written, and everything is read back as well.

I guess it has to do with the size of the memtype. I tried pack() to reduce the size, but the data was then interpreted incorrectly.

Defining a new struct containing only the wanted fields is not optimal, since it would require copying the data to the new struct and back, while my application involves huge amounts of data. What I am trying to hide is actually a vector field, which I write out separately as an array of variable-length arrays. Currently, although I have omitted the vector field from the memtype, it is still written and then read back, which corrupts memory: the read fills each vector's size and data pointer with the values that were written, which are no longer valid pointers.

So is there a way to truly hide a particular field from being written, as well as from being read back, without having to define a new temporary class?
Your first option is to copy into a new packed struct layout, perhaps performing the conversions in chunks of a suitable size. This would probably be the fastest approach, and fairly friendly on memory, if you choose the right constants. You can use the type conversion routines to go from the original memory layout of the structs to the packed one, or simply do the reduction yourself, which will of course be higher performance / lower overhead if you preallocate (rather than append), and it avoids the complicated conversion code.
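As a sketch of that first option (the struct and field names here are hypothetical, and the actual HDF5 calls are omitted), the chunked projection into a packed layout could look like this:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical application struct; "vec" is the field we want to keep
// out of the compound dataset (it is written separately as vlen data).
struct Particle {
    double x;
    double y;
    std::vector<double> vec;
};

// Packed projection holding only the fields that should go to disk.
struct ParticlePacked {
    double x;
    double y;
};

// Project one chunk of Particles into the packed layout.
std::vector<ParticlePacked> pack_chunk(const Particle* src, std::size_t n) {
    std::vector<ParticlePacked> out;
    out.reserve(n);  // preallocate rather than append blindly
    for (std::size_t i = 0; i < n; ++i)
        out.push_back({src[i].x, src[i].y});
    return out;
}

// Drive the conversion in fixed-size chunks so the temporary buffer
// stays small; `write` would wrap H5Dwrite on the matching hyperslab.
template <class WriteFn>
void write_in_chunks(const std::vector<Particle>& data, std::size_t chunk,
                     WriteFn write) {
    for (std::size_t i = 0; i < data.size(); i += chunk) {
        const std::size_t n = std::min(chunk, data.size() - i);
        write(pack_chunk(&data[i], n));
    }
}
```

The memtype would then describe `ParticlePacked` exactly, so no field hiding is needed at all.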
Your second option is a structure of arrays, which decouples where each field goes and therefore lets you specify exactly what you want. You could then post-process or pre-process the arrays back into structures wherever is most appropriate.
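A minimal sketch of the structure-of-arrays idea (field names again hypothetical): each field lives in its own contiguous buffer, so each one can be written to its own dataset, or simply skipped, and records can be rebuilt on demand:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Structure of arrays: one contiguous buffer per field. Each buffer can be
// written as its own HDF5 dataset, so the vector data never has to appear
// in a compound record at all.
struct ParticlesSoA {
    std::vector<double> x;
    std::vector<double> y;
    std::vector<std::vector<double>> vec;  // written separately as vlen data

    void push(double px, double py, std::vector<double> pv) {
        x.push_back(px);
        y.push_back(py);
        vec.push_back(std::move(pv));
    }
    std::size_t size() const { return x.size(); }
};

// "Post-process them back in as structures": rebuild a record view on demand.
struct ParticleView {
    double x;
    double y;
    const std::vector<double>* vec;
};

inline ParticleView at(const ParticlesSoA& p, std::size_t i) {
    return {p.x[i], p.y[i], &p.vec[i]};
}
```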
Right, so: given your desire to avoid a copy, I must inform you that you are already making one. Maybe I misunderstood your case, though; it seemed you didn't want any memory overhead, and perhaps no CPU time either.
What happens is that the type conversion system is run inside the dataset write/read. Unless it is given a background buffer via the transfer properties, it allocates a buffer and then runs a batch conversion from the source memory, with the source type, to the destination memory, with the destination type.
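A toy model of that mechanism in plain C++ (the element types are made up; in the real API the hook for supplying the buffer yourself is `H5Pset_buffer` on a dataset transfer property list):

```cpp
#include <cstddef>
#include <vector>

// Toy model of what the dataset write does internally: hypothetical
// in-memory and file layouts, converted in one batch before the I/O.
struct SrcElem { double a; double b; float extra; };  // in-memory layout
struct DstElem { double a; double b; };               // packed file layout

void convert_one(const SrcElem& s, DstElem& d) { d.a = s.a; d.b = s.b; }

// The library sizes a scratch buffer for the whole selection (allocated
// internally unless the caller supplied one) and converts in one batch.
std::vector<DstElem> batch_convert(const std::vector<SrcElem>& src) {
    std::vector<DstElem> buf(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        convert_one(src[i], buf[i]);
    return buf;  // this converted buffer is what reaches the file
}
```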
It's not very heavy at all if you do it in batches of the right size, and you must weigh it against the fairly high overhead of the HDF5 API relative to C++ with smart memory management techniques and user conversion code. But it's usually not necessary to do that, and the conversion API comes for free and is robust.
Yes, memory and CPU overhead is what I am trying to avoid. There is a maintenance consideration as well: I want to avoid duplicating the struct definition manually every time it is updated.
Good to know! How is the buffer size determined by default? Is it sized to hold the whole dataset in the destination type, or is the conversion done element by element, so that only a single element needs to be allocated? How can I optimize it if it becomes a concern?
I wouldn't say the maintenance overhead is much in general, considering that you must whitelist the members for the projected structure either way.

The buffer size is the target type size multiplied by the number of elements. As I said, it is a batch conversion: HDF5 does not convert item by item just in time as elements are written, and doing so would probably be slower because of syscall overhead and HDF5 conversion-path overhead.
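In other words (with made-up numbers), the default conversion buffer scales with the size of the selection:

```cpp
#include <cstddef>

// Conversion buffer size as described above: destination element size
// times the number of elements being converted in one batch.
constexpr std::size_t conv_buffer_bytes(std::size_t dst_elem_size,
                                        std::size_t n_elems) {
    return dst_elem_size * n_elems;
}
// A 16-byte packed record over a million elements needs ~16 MB of scratch,
// while converting in chunks of 1024 elements needs only 16 KB at a time.
```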
I already explained how you can optimize: write in suitably sized chunks, or do the conversions yourself. The right batch size is something you'll find empirically; powers of 2 from 32 to 1024 elements are good starting points. For appending to tables, I use the packet table APIs, but that's another subject.

Remember not to micro-optimize things that don't matter (memory or CPU).