VLEN conversion problems

Hello,

I'm trying to write a conversion callback that can translate between
HDF5 variable-length strings and another format (in this case, Python
strings). I have a good deal of experience with the HDF5 C API but
have never used vlens before, so I'm having some difficulty. I've
read the document at:

http://www.hdfgroup.org/HDF5/doc1.6/Datatypes.html#Datatypes-DataConversion

and have based my conversion code on this example. However, I've not
been able to get it to work. So far I've tried writing a scalar
attribute; the created attribute has the correct type according to
h5dump (H5T_STRING, strsize H5T_VARIABLE), but the string itself is
length zero. I have a feeling I'm not using the right representation
for vlen strings. Are HDF5 variable-length strings still represented
by the hvl_t struct? Or are they just char*'s? The heart of my
conversion process is this:

(*buf is provided as an argument to the callback)

for reading:

obj_buf = (PyObject**)buf;
vlen_buf = (hvl_t*)buf;

for (i = 0; i < nelements; i++) {
    obj_buf[i] = PyString_FromStringAndSize((char*)vlen_buf[i].p,
                                            vlen_buf[i].len);
}

and for writing:

for (i = nelements - 1; i >= 0; i--) {
    vlen_buf[i].p = malloc(<string length>);
    vlen_buf[i].len = <string length>;
    memcpy(vlen_buf[i].p, (Python object's char*), <string length>);
}

The full code is here:

http://code.google.com/p/h5py/source/browse/branches/vlen/h5py/newtypes.c?spec=svn310&r=310

I'd appreciate any thoughts or advice people might have on this.

Thanks,
Andrew Collette


----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Thanks to both of you for responding! However, the char** format
doesn't seem to be what's supplied to a conversion callback. To be
clear, I'm writing a pair of callbacks registered on the soft
conversion paths to and from H5T_STRING and H5T_OPAQUE (I use a type
of this class for my Python object pointer). For now, I'm trying to
read this file:

ftp://ftp.hdfgroup.uiuc.edu/hdf_files/hdf5/samples/strings.h5

which also uses strings declared as H5T_STRING with size H5T_VARIABLE.

The values array passed to *buf seems to be neither a char** array nor
an array of hvl_t's, unless there's some kind of padding going on.
Whatever structure this is seems to have a width of 16 bytes (on my
platform, hvl_t's are 8 bytes). It seems like some kind of larger
version of an hvl_t; for example, the first element (corresponding to
the file string "A fight is a contract that takes two people to
honor.") has the following 16 bytes:

53 0 0 0 0 16 0 0 0 0 0 0 4 0 0 0

It's suspicious that the first byte matches the string length exactly. :)
The other elements (whatever they are) repeat with this 16-byte
period, with the string length in the corresponding position. Perhaps
this is some kind of internal HDF5 structure?

Andrew


Hi Andrew,

Hello,

I'm trying to write a conversion callback that can translate between
HDF5 variable-length strings and another format (in this case, Python
strings). I have a good deal of experience with the HDF5 C API but
have never used vlens before, so I'm having some difficulty. I've
read the document at:

http://www.hdfgroup.org/HDF5/doc1.6/Datatypes.html#Datatypes-DataConversion

and have based my conversion code on this example. However, I've not
been able to get it to work. So far I've tried writing a scalar
attribute; the created attribute has the correct type according to
h5dump (H5T_STRING, strsize H5T_VARIABLE), but the string itself is
length zero. I have a feeling I'm not using the right representation
for vlen strings. Are HDF5 variable-length strings still represented
by the hvl_t struct? Or are they just char*'s?

  Variable-length strings in HDF5 are represented as char*'s. You could set up a datatype that was a variable-length sequence of characters, which would use an hvl_t, but that's not what most people want - they have a string in memory (via a char*) and want to store it in an HDF5 file.

The heart of my
conversion process is this:

(*buf is provided as an argument to the callback)

for reading:

obj_buf = (PyObject**)buf;
vlen_buf = (hvl_t*)buf;

for (i = 0; i < nelements; i++) {
    obj_buf[i] = PyString_FromStringAndSize((char*)vlen_buf[i].p,
                                            vlen_buf[i].len);
}

and for writing:

for (i = nelements - 1; i >= 0; i--) {
    vlen_buf[i].p = malloc(<string length>);
    vlen_buf[i].len = <string length>;
    memcpy(vlen_buf[i].p, (Python object's char*), <string length>);
}

The full code is here:

http://code.google.com/p/h5py/source/browse/branches/vlen/h5py/newtypes.c?spec=svn310&r=310

I'd appreciate any thoughts or advice people might have on this.

  I haven't read your code in detail, but switching to char*'s should be the right direction to go in.

  Quincey


On Apr 27, 2009, at 2:23 PM, Andrew Collette wrote:

Hi Andrew,

Thanks to both of you for responding! However, the char** format
doesn't seem to be what's supplied to a conversion callback. To be
clear, I'm writing a pair of callbacks registered on the soft
conversion paths to and from H5T_STRING and H5T_OPAQUE (I use a type
of this class for my Python object pointer). For now, I'm trying to
read this file:

ftp://ftp.hdfgroup.uiuc.edu/hdf_files/hdf5/samples/strings.h5

which also uses strings declared as H5T_STRING with size H5T_VARIABLE.

The values array passed to *buf seems to be neither a char** array nor
an array of hvl_t's, unless there's some kind of padding going on.
Whatever structure this is seems to have a width of 16 bytes (on my
platform, hvl_t's are 8 bytes). It seems like some kind of larger
version of an hvl_t; for example, the first element (corresponding to
the file string "A fight is a contract that takes two people to
honor.") has the following 16 bytes:

53 0 0 0 0 16 0 0 0 0 0 0 4 0 0 0

It's suspicious that the first byte matches the string length exactly. :)
The other elements (whatever they are) repeat with this 16-byte
period, with the string length in the corresponding position. Perhaps
this is some kind of internal HDF5 structure?

  Hmm, yes, I think you've found a bug here. It does look like you are getting an array of the heap IDs that are used to store strings inside the HDF5 file. I don't think we've ever had anyone hook into the string conversion path before and so hadn't noticed the issue...

  Unfortunately, in some sense, this is unfixable since we don't want to expose an API for interacting with the internal heaps in the file (for your conversion routine to call to retrieve the string data). However, what if you just read the strings into memory as "native" strings (i.e. char*'s) and then converted them to opaque objects, either in your code, or with a separate call to H5Tconvert()?

  Quincey


On Apr 27, 2009, at 4:14 PM, Andrew Collette wrote:

Hi,

Well, that's too bad for me. :) Am I correct in thinking this only
applies to reads/writes which have vlen strings "on disk" at one end?
What about non-string vlens? Unfortunately I think this is also why
my attempts to write (as opposed to read) vlens didn't work. I have
thought about using a temporary buffer and H5Tconvert (provided the
correct type is supplied to the callback in this case). However, I'm
not sure how to handle the case with arbitrary file and memory
dataspace selections. The procedure would have to be:

(1) read data in the file-space selection into a temporary contiguous
buffer (gather),
(2) run H5Tconvert on this buffer, and then
(3) copy the result out of the temporary buffer into the destination
buffer according to the memory-space selection (scatter)

I'm not aware of any function in the public C API which will allow me
to do (3), without resorting to things like H5Diterate or temporary
datasets in an H5FD_CORE file... is there a correct way to do this in
the present framework?

As far as fixing it, I understand that you don't want to expose these
kinds of details. However, would it be reasonable to arrange for the
correct type (hvl_t* or char**) to be supplied to conversion
callbacks?

Thanks again,
Andrew Collette


Hi Andrew,

Hi,

Well, that's too bad for me. :) Am I correct in thinking this only
applies to reads/writes which have vlen strings "on disk" at one end?
What about non-string vlens? Unfortunately I think this is also why
my attempts to write (as opposed to read) vlens didn't work.

  You will have the same problem with any variable-length type, I believe. (Probably region references also, since they are stored in heaps inside the HDF5 file)

I have
thought about using a temporary buffer and H5Tconvert (provided the
correct type is supplied to the callback in this case). However, I'm
not sure how to handle the case with arbitrary file and memory
dataspace selections. The procedure would have to be:

(1) read data in the file-space selection into a temporary contiguous
buffer (gather),
(2) run H5Tconvert on this buffer, and then
(3) copy the result out of the temporary buffer into the destination
buffer according to the memory-space selection (scatter)

I'm not aware of any function in the public C API which will allow me
to do (3), without resorting to things like H5Diterate or temporary
datasets in an H5FD_CORE file... is there a correct way to do this in
the present framework?

  You are correct, and there is not a good way to make this happen currently.

As far as fixing it, I understand that you don't want to expose these
kinds of details. However, would it be reasonable to arrange for the
correct type (hvl_t* or char**) to be supplied to conversion
callbacks?

  Yes, something like that should be done. I'll file a bug about it and we can prioritize it with other changes.

  Sorry for the current inconvenience,
    Quincey


On Apr 28, 2009, at 1:32 PM, Andrew Collette wrote:

Hi,

   You will have the same problem with any variable-length type, I
believe. (Probably region references also, since they are stored in heaps
inside the HDF5 file)

OK, I have implemented a version which reads into a buffer first
before conversion. It works well now, except for the added cost of
the buffer and the scatter/gather operation.

Perhaps you can clarify one more issue I ran into... when converting
vlens which happen to occur within compound types, the arguments
"buf_stride" and "bkg_stride" are supplied to my conversion callback.
Presumably these exist to make it possible to convert compound types
in-place one element at a time. I wasn't able to get the following
information from the documentation:

1) What happens if the output type is bigger than the input type? For
example, what if I'm converting one member of a compound type from a
4-byte pointer to an 8-byte hvl_t. Has HDF5 already allocated padding
between elements? It seems like the "overhang" from the new element
will stomp on the beginning of the next field. Should I write the
results to the buffer at the same byte offset as the n-th input
element (buf+(n*4)), or at the offset corresponding to the n-th output
element (buf+(n*8))?

2) Are buf_stride and bkg_stride always exact multiples of the input
and output type sizes? This doesn't seem possible if they're related
to field offsets...

3) I have noticed that when converting elements which are *not*
members of a compound type, buf_stride and bkg_stride are both set to
zero (!). Is this intentional or a bug?

Thanks in advance,
Andrew Collette


Hi Andrew,

Hi,

       You will have the same problem with any variable-length type, I
believe. (Probably region references also, since they are stored in heaps
inside the HDF5 file)

OK, I have implemented a version which reads into a buffer first
before conversion. It works well now, except for the added cost of
the buffer and the scatter/gather operation.

  Good, I'm glad that works.

Perhaps you can clarify one more issue I ran into... when converting
vlens which happen to occur within compound types, the arguments
"buf_stride" and "bkg_stride" are supplied to my conversion callback.
Presumably these exist to make it possible to convert compound types
in-place one element at a time. I wasn't able to get the following
information from the documentation:

1) What happens if the output type is bigger than the input type? For
example, what if I'm converting one member of a compound type from a
4-byte pointer to an 8-byte hvl_t. Has HDF5 already allocated padding
between elements? It seems like the "overhang" from the new element
will stomp on the beginning of the next field. Should I write the
results to the buffer at the same byte offset as the n-th input
element (buf+(n*4)), or at the offset corresponding to the n-th output
element (buf+(n*8))?

  If the size of the destination datatype is larger than the size of the source datatype, the buffer is converted from the last element toward the first element, so that no data in the buffer is overwritten. You should place the first element at the beginning of the buffer.

2) Are buf_stride and bkg_stride always exact multiples of the input
and output type sizes? This doesn't seem possible if they're related
to field offsets...

  They should be multiples of the input & output type sizes, yes. Why do you think that's a problem?

3) I have noticed that when converting elements which are *not*
members of a compound type, buf_stride and bkg_stride are both set to
zero (!). Is this intentional or a bug?

  Those fields aren't used for converting non-compound datatypes, so it's intentional that they are set to zero.

  Hope that helps,
    Quincey


On Apr 29, 2009, at 11:41 PM, Andrew Collette wrote:

Hi,

   If the size of the destination datatype is larger than the size of
the source datatype, the buffer is converted from the last element toward
the first element, so that no data in the buffer is overwritten. You should
place the first element at the beginning of the buffer.

Yes, this strategy is part of the example in the documentation... you
start writing in the empty space at the end of the buffer and work
backwards. However, I have to admit I'm still confused about how this
works with strided input. If I have a 4-byte type, and am given an
8-byte stride, that means I read in data at offsets 0, 8, 16, 24, etc
(every other element). Am I supposed to write back my new types on
these 0, 8, 16, etc. offsets, or should I just pack them all into the
buffer one after another, starting at 0? Does the data "in between"
have any meaning? If my destination type size is larger, it seems
like that "in between" data will get stomped no matter what I do.

2) Are buf_stride and bkg_stride always exact multiples of the input
and output type sizes? This doesn't seem possible if they're related
to field offsets...

   They should be multiples of the input & output type sizes, yes. Why
do you think that's a problem?

I thought the stride values were being used to process compound types
"in-place", which is why I was concerned about the size difference.
If this were true, then for example if a compound type were a 4-byte
int followed by a 1-byte char, then the conversion callback for the
int might be given a buf_stride of 5. Perhaps I'm over-thinking this.
If this isn't the case, could you briefly explain what data is being
skipped over?

   Those fields aren't used for converting non-compound datatypes, so
it's intentional that they are set to zero.

Yes, but you have no way of knowing a priori whether your callback is
being applied to a member of a compound type. The general expression
(which works for both contiguous and strided cases) for calculating
the offset in the buffer is something like:

sometype* buf_ptr;

for(i=0;i<nelements;i++){
    buf_ptr = (sometype*)((char*)buf + i*buf_stride);
    ... do something to modify (*buf_ptr) ...
}

which doesn't respond well to a buf_stride of 0. :) However, it's not
hard to detect 0 and set it to the type size.

Thanks for your help!
Andrew Collette


Hi Andrew,

Hi,

       If the size of the destination datatype is larger than the size of
the source datatype, the buffer is converted from the last element toward
the first element, so that no data in the buffer is overwritten. You should
place the first element at the beginning of the buffer.

Yes, this strategy is part of the example in the documentation... you
start writing in the empty space at the end of the buffer and work
backwards. However, I have to admit I'm still confused about how this
works with strided input. If I have a 4-byte type, and am given an
8-byte stride, that means I read in data at offsets 0, 8, 16, 24, etc
(every other element). Am I supposed to write back my new types on
these 0, 8, 16, etc. offsets, or should I just pack them all into the
buffer one after another, starting at 0? Does the data "in between"
have any meaning? If my destination type size is larger, it seems
like that "in between" data will get stomped no matter what I do.

  "Don't worry about it" :) The compound datatype conversion routine will figure out the correct stride (positive or negative) and will give that information to the scalar conversion routines for the fields in the compound datatype.

2) Are buf_stride and bkg_stride always exact multiples of the input
and output type sizes? This doesn't seem possible if they're related
to field offsets...

       They should be multiples of the input & output type sizes, yes. Why
do you think that's a problem?

I thought the stride values were being used to process compound types
"in-place", which is why I was concerned about the size difference.
If this were true, then for example if a compound type were a 4-byte
int followed by a 1-byte char, then the conversion callback for the
int might be given a buf_stride of 5. Perhaps I'm over-thinking this.
If this isn't the case, could you briefly explain what data is being
skipped over?

  Yes, it's probable that the buf_stride would be 5 in that case (assuming the fields are packed).

  Quincey


On Apr 30, 2009, at 11:26 PM, Andrew Collette wrote:

       Those fields aren't used for converting non-compound datatypes, so
it's intentional that they are set to zero.

Yes, but you have no way of knowing a priori whether your callback is
being applied to a member of a compound type. The general expression
(which works for both contiguous and strided cases) for calculating
the offset in the buffer is something like:

sometype* buf_ptr;

for(i=0;i<nelements;i++){
   buf_ptr = (sometype*)((char*)buf + i*buf_stride);
   ... do something to modify (*buf_ptr) ...
}

which doesn't respond well to a buf_stride of 0. :) However, it's not
hard to detect 0 and set it to the type size.

Thanks for your help!
Andrew Collette
