vlen data and SWMR

SWMR doesn't support vlen, and we want to make vlen data available while writing HDF5. Right now I see a decent way to encode the vlen data in ordinary datasets, which I'll explain below. My question is: what is the best way to get vlen in SWMR, that is, the way that will be easiest for users to work with the data? Working with or funding the HDF Group to develop vlen for SWMR might be the answer (or maybe this feature is already in development?). However, I think users find the vlen data types difficult to work with through h5py and Matlab. The real advantage of vlen, though, is that we could write a row-based schema where each row corresponds to a shot. If our data acquisition system records data for three shots and reduces each shot to a list of features, we could, if we had the vlen type in SWMR, write one dataset like:

DATA
0 [a0, a1, a2]
1 [b0, b1, b2, b3, b4]
2 [c0, c1]

that is, three rows (which I'm labeling 0, 1, 2), each corresponding to one of the three events, with all of that event's features in the row (a* for the first event, b* for the second, etc.).

To simulate vlen for SWMR, I'm thinking of two datasets. One is aligned with the shots and stores the [start, stop) range of where each shot's features are in a second, 'blob' dataset, that is:

RANGE
0 [0,3]
1 [3,8]
2 [8,10]

BLOBDATA
0 a0
1 a1
2 a2
3 b0
4 b1
5 b2
6 b3
7 b4
8 c0
9 c1
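The encoding above can be sketched in plain Python, independent of HDF5. This is only an illustration: the helper names `encode_ragged` and `decode_row` are hypothetical, and the ranges are treated as half-open (stop exclusive), which is what the RANGE values shown above imply.

```python
def encode_ragged(rows):
    """Flatten variable-length rows into (ranges, blob).

    ranges[i] is the half-open (start, stop) span of row i inside blob.
    """
    ranges = []
    blob = []
    for features in rows:
        start = len(blob)
        blob.extend(features)
        ranges.append((start, len(blob)))
    return ranges, blob


def decode_row(ranges, blob, i):
    """Recover the features of row i from the flattened layout."""
    start, stop = ranges[i]
    return blob[start:stop]


rows = [
    ["a0", "a1", "a2"],
    ["b0", "b1", "b2", "b3", "b4"],
    ["c0", "c1"],
]
ranges, blob = encode_ragged(rows)
row_1 = decode_row(ranges, blob, 1)
```

Here `ranges` comes out as the three pairs listed under RANGE and `blob` as the ten entries listed under BLOBDATA.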

Then an h5py user reads the features for row 1 (the b* values) with

r0,r1 = range_ds[1,:]
features_event_1 = blobdata_ds[r0:r1]
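For the writer side, a minimal h5py sketch might look like the following. Several things here are assumptions and not part of the proposal above: the helper `append_shot`, the file name, the dtypes and the chunk sizes are all made up for illustration. Writing and flushing BLOBDATA before RANGE is also only a suggestion, intended so that a reader never sees a range that points past the end of the blob.

```python
import os
import tempfile

import h5py


def append_shot(range_ds, blob_ds, features):
    """Append one shot: grow BLOBDATA first, then add its RANGE row."""
    start = blob_ds.shape[0]
    stop = start + len(features)
    blob_ds.resize((stop,))
    blob_ds[start:stop] = features
    blob_ds.flush()  # make the new blob entries visible first
    row = range_ds.shape[0]
    range_ds.resize((row + 1, 2))
    range_ds[row, :] = (start, stop)
    range_ds.flush()  # only now publish the range for this shot


path = os.path.join(tempfile.mkdtemp(), "shots.h5")  # hypothetical file
shots = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0, 8.0], [9.0, 10.0]]

# SWMR needs the latest file format, and every dataset must exist
# before SWMR mode is switched on.
with h5py.File(path, "w", libver="latest") as f:
    range_ds = f.create_dataset(
        "RANGE", shape=(0, 2), maxshape=(None, 2), dtype="i8", chunks=(1024, 2)
    )
    blob_ds = f.create_dataset(
        "BLOBDATA", shape=(0,), maxshape=(None,), dtype="f8", chunks=(4096,)
    )
    f.swmr_mode = True
    for features in shots:
        append_shot(range_ds, blob_ds, features)

# Read back after the writer has finished. A concurrent reader would
# instead open with swmr=True and call refresh() on each dataset.
with h5py.File(path, "r") as f:
    r0, r1 = f["RANGE"][1, :]
    features_row_1 = f["BLOBDATA"][r0:r1]
    all_ranges = f["RANGE"][:].tolist()
```

The read at the end is the same two-line pattern as above, and it returns an ordinary float64 numpy array.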

On the h5py side, the user is just dealing with numpy arrays of basic types. With the HDF5 vlen type, they have to work with an object-based type introduced to handle vlen data, which gets messy depending on what you are doing.
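For comparison, here is roughly what the vlen route looks like in h5py outside of SWMR. It is a sketch under assumptions: the file name and the float64 base type are made up, and `h5py.vlen_dtype` requires a reasonably recent h5py.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "vlen.h5")  # hypothetical file

with h5py.File(path, "w") as f:
    # One variable-length float64 sequence per row.
    dt = h5py.vlen_dtype(np.dtype("float64"))
    ds = f.create_dataset("DATA", shape=(3,), dtype=dt)
    ds[0] = np.array([1.0, 2.0, 3.0])
    ds[1] = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
    ds[2] = np.array([9.0, 10.0])

with h5py.File(path, "r") as f:
    rows = f["DATA"][:]  # object array; each element is its own ndarray
    outer_kind = rows.dtype.kind
    row_1 = rows[1]
```

The outer array has an object dtype, so vectorized numpy operations do not apply across rows, and each row has to be handled separately.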

Similarly, on the Matlab side, I think users have to deal with cell arrays, which I don't think they have to do otherwise (I don't use Matlab much).

One disadvantage of the two datasets, RANGE and BLOBDATA, is that we have to choose between 0-up and 1-up counting. We'll do 0-up, but then the Matlab/Fortran/Julia users, who use 1-up indexing, have to adjust.
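The adjustment is small if the stored ranges are read as half-open 0-up pairs (an assumption, though it matches the values shown in RANGE): the 1-up inclusive range is first = start + 1, last = stop. The helper name below is hypothetical, and the arithmetic is written in Python only to keep one language in this thread.

```python
def to_one_up_inclusive(start, stop):
    """Convert a 0-up half-open [start, stop) range to a 1-up inclusive pair."""
    return start + 1, stop


ranges_zero_up = [(0, 3), (3, 8), (8, 10)]
ranges_one_up = [to_one_up_inclusive(s, e) for s, e in ranges_zero_up]
# The number of features in each row is unchanged by the conversion.
lengths = [last - first + 1 for first, last in ranges_one_up]
```

A Matlab user would then index the blob as `blob(first:last)`. An empty row (start equal to stop) converts to first = last + 1, which is an empty range in Matlab's colon notation.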

best,

David Schneider

Sorry, I cc'ed my Evernote account when I posted; I meant to bcc it for a record. If you could reply to this message to continue the thread, it would be appreciated.

best,

David Schneider

________________________________________
From: Hdf-forum [hdf-forum-bounces@lists.hdfgroup.org] on behalf of Schneider, David A. [davidsch@slac.stanford.edu]
Sent: Thursday, February 2, 2017 10:49 AM
To: hdf-forum@lists.hdfgroup.org
Cc: evernote email
Subject: [Hdf-forum] vlen data and SWMR

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5