Selecting rows and columns at the same time


#1

Hi, all. I’m very new to HDF5 and have just finished skimming the API reference.

I found that specific rows or columns of a two-dimensional array can be selected by using the H5Sselect_hyperslab function with the H5S_SELECT_OR operator. However, I would like to select specific rows and columns at the same time. I only need a small portion (say 10%) of the rows and columns in a large dataset, so reading whole rows and then dropping some columns would require unnecessary I/O and decompression. I thought there would be an operator or function that returns the intersection of two hyperslab selections (i.e., rows and columns), but I could not find such an API.

So my question is: how should we select or read specific rows and columns of a 2D dataset without wasting time? Are there any APIs I’m missing?


#2

According to the API reference, you can take an intersection by using H5S_SELECT_AND when combining selections (e.g. across multiple successive calls to H5Sselect_hyperslab).

That said, to specify a region which is the intersection of just some rows and just some columns, if the rows and columns are contiguous, it would be a single call to H5Sselect_hyperslab with appropriate values for start, stride and count. If the rows and/or columns are NOT contiguous, then you’d do it with multiple successive calls to H5Sselect_hyperslab and H5S_SELECT_AND. For example, first select all the rows of interest using H5S_SELECT_SET, then select all the cols of interest with H5S_SELECT_AND. The result will be only those entries where the rows and cols intersect.


#3

Thank you for your reply. In my case, I’d like to select multiple rows and columns that are not necessarily contiguous, so I think I need to call H5Sselect_hyperslab once for each row or column. I tried to select multiple rows with H5S_SELECT_SET as you suggested, but it seemed to select only the last row I specified. That is expected, since the docs say H5S_SELECT_SET replaces the existing selection with a new one. If we use H5S_SELECT_OR instead, we can select multiple rows. However, as far as I know, no API lets us take the intersection of the selected rows and the selected columns.


#4

I am curious how this performs. Would it be possible to keep me posted on the throughput (MB/sec), the size of the object … etc.?
Also, is this a C or a C++ application?
best wishes: steven


#5

Oh, my apologies. Of course, you are correct. A single H5Sselect_hyperslab call is needed for each contiguous range of rows (or columns). If you are fortunate enough that all the rows are equally spaced, then I do think the stride and count arguments of the call (together with block, for runs of adjacent rows) could be used to accommodate that case in a single call.
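To make the equally spaced case concrete, here is a minimal sketch in plain C of which indices a one-dimensional hyperslab covers for given start, stride, count and block values. The helper hyperslab_indices is hypothetical and is not part of HDF5; it only mirrors the documented meaning of the four parameters (count blocks of block elements each, with consecutive blocks starting stride apart).

```c
#include <stddef.h>

/* Hypothetical helper (NOT an HDF5 function): enumerate the indices that a
 * 1-D hyperslab with the given start/stride/count/block would cover.
 * Writes at most `cap` indices into `out` and returns the total number of
 * indices covered (count * block), even if that exceeds `cap`. */
size_t hyperslab_indices(unsigned long long start, unsigned long long stride,
                         unsigned long long count, unsigned long long block,
                         unsigned long long *out, size_t cap)
{
    size_t n = 0;
    for (unsigned long long c = 0; c < count; ++c) {
        /* Block number c begins at start + c * stride ... */
        for (unsigned long long b = 0; b < block; ++b) {
            /* ... and spans `block` consecutive indices. */
            if (n < cap)
                out[n] = start + c * stride + b;
            ++n;
        }
    }
    return n;
}
```

For example, start=2, stride=5, count=3, block=1 covers rows 2, 7 and 12, so three equally spaced rows fit in a single H5Sselect_hyperslab call. Irregularly spaced rows have no such description and still need one call per contiguous run.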


#6

Thank you very much. So the approaches currently open to me are either selecting and reading the data in bulk and then dropping the unnecessary parts, or selecting and reading row-wise or column-wise to minimize I/O.


#7

Not quite…here is what it kinda sorta would look like…

H5Screate_simple(); /*create dataspace of size of the dataset in the file or get it from dataset with H5Dget_space() */
H5Sselect_none(); /* just to be safe/clear */
for each row...
    H5Sselect_hyperslab(H5S_SELECT_OR,...);
/* at this point, you have a selection that is the OR of all the rows you want */
for each col...
    H5Sselect_hyperslab(H5S_SELECT_AND,...);
/* at this point, you have a selection which is the intersection of all the rows with all the cols */
H5Dread(); /* a single H5Dread call returns all the desired values...confirm for yourself the ordering of those values in the buffer */

#8

But the columns are disjoint from one another, so ANDing them in one after another will, I think, leave nothing selected…
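The objection can be checked with a toy model. The sketch below does not call HDF5 at all; it stands in a small boolean mask for a dataspace selection (the Mask type, the 4x5 shape and all function names are assumptions made for illustration) and compares the loop from post #7 against the operation that is actually wanted, namely (union of rows) AND (union of columns).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { NROWS = 4, NCOLS = 5 };   /* toy dataset shape, chosen arbitrarily */

/* Stand-in for a dataspace selection: true means "element selected". */
typedef struct { bool on[NROWS][NCOLS]; } Mask;

static void mask_clear(Mask *m) { memset(m, 0, sizeof *m); }

/* Analogue of OR-ing a whole-row hyperslab into the selection. */
static void mask_or_row(Mask *m, size_t r)
{
    for (size_t c = 0; c < NCOLS; ++c) m->on[r][c] = true;
}

/* Analogue of OR-ing a whole-column hyperslab into the selection. */
static void mask_or_col(Mask *m, size_t c)
{
    for (size_t r = 0; r < NROWS; ++r) m->on[r][c] = true;
}

/* Analogue of AND-ing a whole-column hyperslab into the selection:
 * everything outside column `col` is dropped. */
static void mask_and_col(Mask *m, size_t col)
{
    for (size_t r = 0; r < NROWS; ++r)
        for (size_t c = 0; c < NCOLS; ++c)
            if (c != col) m->on[r][c] = false;
}

/* Element-wise intersection of two complete selections. */
static void mask_and(Mask *dst, const Mask *src)
{
    for (size_t r = 0; r < NROWS; ++r)
        for (size_t c = 0; c < NCOLS; ++c)
            dst->on[r][c] = dst->on[r][c] && src->on[r][c];
}

static size_t mask_count(const Mask *m)
{
    size_t n = 0;
    for (size_t r = 0; r < NROWS; ++r)
        for (size_t c = 0; c < NCOLS; ++c)
            if (m->on[r][c]) ++n;
    return n;
}

/* The loop from post #7: OR every row, then AND every column in turn. */
size_t sequential_and(const size_t *rows, size_t nr,
                      const size_t *cols, size_t nc)
{
    Mask m;
    mask_clear(&m);
    for (size_t i = 0; i < nr; ++i) mask_or_row(&m, rows[i]);
    for (size_t j = 0; j < nc; ++j) mask_and_col(&m, cols[j]);
    return mask_count(&m);
}

/* The intended result: build both unions separately, then intersect. */
size_t union_then_and(const size_t *rows, size_t nr,
                      const size_t *cols, size_t nc)
{
    Mask r, c;
    mask_clear(&r);
    mask_clear(&c);
    for (size_t i = 0; i < nr; ++i) mask_or_row(&r, rows[i]);
    for (size_t j = 0; j < nc; ++j) mask_or_col(&c, cols[j]);
    mask_and(&r, &c);
    return mask_count(&r);
}
```

With rows {0, 2} and columns {1, 3, 4}, sequential_and leaves 0 elements selected as soon as the second column is ANDed in, while union_then_and keeps the 6 intersections. The second form needs an operation that combines two complete selections, which a sequence of H5Sselect_hyperslab calls on a single dataspace does not provide.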


#9

Doh!!, right again. Boy, glad you weren’t relying on me to code this. Hmmm…maybe just use H5Sselect_elements?
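If H5Sselect_elements is the route taken, the coordinate list for a 2D dataset is the Cartesian product of the wanted rows and columns, stored as consecutive (row, col) pairs. A minimal sketch of building that list follows; build_point_list and coord_t are hypothetical names, and coord_t merely stands in for HDF5's hsize_t so the snippet compiles without hdf5.h.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for hsize_t (assumption: use hsize_t itself in real HDF5 code). */
typedef unsigned long long coord_t;

/* Hypothetical helper: returns a malloc'd array holding nrows * ncols
 * (row, col) pairs, ordered row by row, and stores the number of points
 * in *npoints. Returns NULL (with *npoints == 0) on empty input or
 * allocation failure. The caller frees the result. */
coord_t *build_point_list(const coord_t *rows, size_t nrows,
                          const coord_t *cols, size_t ncols,
                          size_t *npoints)
{
    *npoints = 0;
    if (nrows == 0 || ncols == 0)
        return NULL;

    coord_t *coords = malloc(2 * nrows * ncols * sizeof *coords);
    if (coords == NULL)
        return NULL;

    size_t k = 0;
    for (size_t i = 0; i < nrows; ++i) {
        for (size_t j = 0; j < ncols; ++j) {
            coords[k++] = rows[i];   /* coordinate in dimension 0 */
            coords[k++] = cols[j];   /* coordinate in dimension 1 */
        }
    }
    *npoints = nrows * ncols;
    return coords;
}

/* With HDF5 available and coord_t replaced by hsize_t, the list would be
 * handed to the library roughly as:
 *   H5Sselect_elements(file_space, H5S_SELECT_SET, npoints, coords);
 * where file_space is the dataset's dataspace (a name assumed here). */
```

The list grows as rows times columns, so for large selections it is worth measuring this against per-row hyperslab reads before committing to it.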


#10

Hi!
I’m in the final steps of wrapping up a new API routine that will allow an application to AND a set of selected rows with a set of selected columns. If you’d like to give this a try, you can check out a copy of the ‘hyperslab_updates’ branch from git: https://koziol@bitbucket.hdfgroup.org/scm/hdffv/hdf5.git Then use one of the new API routines: H5Scombine_hyperslab, H5Smodify_select, or H5Scombine_select. I would imagine that you’d want either H5Smodify_select or H5Scombine_select, since they accept two arbitrary selections as inputs to the operation. H5Smodify_select puts the results back into the first selection, and H5Scombine_select returns a new selection with the results in it.

Quincey

#11

Thank you very much. I think H5Smodify_select is exactly what I want. I’m interested in the release schedule for this feature, since it is still only in a development branch of HDF5. I don’t know HDF5’s release policy, but judging from the timeline of previous releases, I guess I need to wait for the next minor release of HDF5, say version 1.12.0, which may take a few years or so?


#12

Ah, it seems you introduce not only bug fixes but also new API functions in patch releases. So, if I’m fortunate, I can expect those functions to be in version 1.10.6 or thereabouts.


#13

Those new API routines are scheduled to be included in the upcoming (April or May, 2019) 1.12.0 release. The HDF Group is also seriously considering porting them back into the 1.10.x branch.

Quincey

#14

That is great news to me!