AsStrWrapper and numpy.array()


#1

Hi,

I am not sure whether this is a bug, a feature request, or simply bad usage on my part, hence this post.

We use a lot of h5 files with h5py, as well as other binary files that we read with numpy.memmap. Until now we were stuck on h5py 2.8, but I am looking at upgrading to 3. However, I have run into the new behavior when reading string datasets, which now return bytes. As our workflow specifies that we always stick to UTF-8, I’d like to always decode the strings, so I want to use the new AsStrWrapper. However, I am not sure how to use Dataset.asstr() properly, as it only affects __getitem__.

Currently, we have one simple load action valid for both memmap and Dataset:

data = numpy.array(unloaded_data)

where unloaded_data is either a numpy.memmap or an h5py.Dataset. However, AsStrWrapper does not implement __array__, so this still leaves me with bytes.

Why does the wrapper not implement __array__? Is this a design choice, or is it not possible? Or is our approach actually wrong, and should we always use __getitem__?
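
For reference, the workaround I have in mind would go through explicit indexing, roughly as sketched below. The helper name load is made up for this post. The sketch assumes that asstr() raises TypeError for datasets that do not hold strings, which is how I understand h5py 3 to behave, and it relies on duck typing so that memmaps and plain arrays take the old numpy.array() path.

```python
import numpy as np


def load(unloaded_data, encoding="utf-8"):
    """Load a memmap or an h5py-like dataset fully into memory.

    Hypothetical helper: string datasets are decoded via asstr(),
    everything else falls back to numpy.array().
    """
    asstr = getattr(unloaded_data, "asstr", None)
    wrapper = None
    if asstr is not None:
        try:
            # Assumption: asstr() raises TypeError on non-string datasets.
            wrapper = asstr(encoding)
        except TypeError:
            wrapper = None
    if wrapper is not None:
        # Explicit indexing goes through the wrapper's __getitem__,
        # so the bytes are decoded to str.
        return wrapper[()]
    # memmaps, plain arrays and non-string datasets: unchanged behavior.
    return np.array(unloaded_data)
```

This keeps a single entry point for both kinds of input, at the cost of a type check that numpy.array() alone would not need.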

Thanks in advance.


#2

There’s no particular reason AsStrWrapper doesn’t have __array__(). I just didn’t think about it, because I prefer reading data to be somewhat explicit (e.g. dataset[:10]). I wouldn’t have any objection if you want to make a PR adding __array__().
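
To illustrate the idea, here is a rough sketch using a stand-in class. DecodingWrapper is hypothetical and is not h5py's code; it only shows how an __array__() method that delegates to __getitem__ lets numpy.array() pick up the decoded strings. The sketch ignores the copy argument that newer NumPy versions may pass.

```python
import numpy as np


class DecodingWrapper:
    """Hypothetical decode-on-read wrapper (illustration only, not h5py code)."""

    def __init__(self, source, encoding="utf-8"):
        self._source = source      # anything that yields bytes when indexed
        self.encoding = encoding

    def __getitem__(self, args):
        raw = self._source[args]
        if isinstance(raw, bytes):
            # Single element: return a plain str.
            return raw.decode(self.encoding)
        raw = np.asarray(raw)
        decoded = [b.decode(self.encoding) for b in raw.flat]
        return np.array(decoded, dtype=object).reshape(raw.shape)

    def __len__(self):
        return len(self._source)

    def __array__(self, dtype=None, copy=None):
        # Read everything through __getitem__, so decoding is applied.
        arr = self[()]
        return arr if dtype is None else arr.astype(dtype)


raw_bytes = np.array([b"alpha", "b\u00e9ta".encode("utf-8")], dtype=object)
decoded = np.array(DecodingWrapper(raw_bytes))  # uses __array__()
```

With something along these lines, numpy.array(wrapper) and wrapper[()] would return the same decoded data.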


#3

Hi,
Thanks for the answer. I will look into how to make a PR. I think I know how I would change the code, but I am not familiar with git or with h5py’s testing framework. I’ll have a look at the guidelines.


#4

Thanks! If you haven’t already found it, the docs have a rough overview of how to contribute:

https://docs.h5py.org/en/stable/contributing.html#how-to-get-your-code-into-h5py