Does h5py support masked numpy arrays?

Hello, dear HDF5 team.
I’m wondering if you can clarify for me, please, if h5py supports np.ma arrays (i.e., masked numpy arrays). In other words, when I create a dataset - for example as dataSet = groupName.create_dataset("dataSetName", data=np.ma.masked_array([1,2],mask=[True,False])) - will it store data as an np.ma array or as a regular np array? Documentation does not specify that explicitly (or I don’t understand how to interpret that table in the referenced link).
If follow-up questions are allowed on this website, how do I retrieve a masked array that I previously saved to an .hdf5 file?
Thank you in advance.
Ivan

Hi,

Would you please try reporting an issue at the h5py GitHub repository (h5py/h5py)?

I can’t claim to give an authoritative answer, but my understanding is that there is no mechanism to store masked arrays. I think it would only be implemented explicitly in h5py if HDF5 had a convention for storing dataset masks (which I doubt). I think you would have to implement your own convention by storing the field and its mask as separate datasets, with some way of connecting them in your own code, e.g., by specifying the path to the mask as a dataset attribute or by naming the mask dataset by adding “_mask” to the dataset name. We use the former in the nexusformat package, which reads and writes HDF5 files written using the NeXus standard.
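As a rough illustration of that do-it-yourself convention in plain h5py, here is a minimal sketch. The helper names write_masked and read_masked, the attribute name "mask", the "_mask" suffix, and the file name are all assumptions made up for this example; they are not part of the h5py API, and this is not necessarily how nexusformat does it internally.

```python
import os
import tempfile

import h5py
import numpy as np


def write_masked(group, name, arr):
    """Store a masked array as two datasets: the raw values and a boolean mask.

    Hypothetical helper; the 'mask' attribute and '_mask' suffix are just
    one possible convention.
    """
    arr = np.ma.asarray(arr)
    # Write the underlying values explicitly, without relying on how h5py
    # would convert a masked array on its own.
    dset = group.create_dataset(name, data=arr.data)
    mask_name = name + "_mask"
    # getmaskarray always returns a full boolean array, even if nothing is masked.
    group.create_dataset(mask_name, data=np.ma.getmaskarray(arr))
    # Link the two datasets by recording the mask's name as an attribute.
    dset.attrs["mask"] = mask_name
    return dset


def read_masked(group, name):
    """Rebuild a masked array from the dataset and the mask it points to."""
    dset = group[name]
    mask_name = dset.attrs.get("mask")
    if mask_name is None:
        # No mask recorded: return the values with nothing masked.
        return np.ma.masked_array(dset[...])
    return np.ma.masked_array(dset[...], mask=group[mask_name][...])


# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "masked_example.h5")
original = np.ma.masked_array([1, 2], mask=[True, False])

with h5py.File(path, "w") as f:
    write_masked(f, "data", original)

with h5py.File(path, "r") as f:
    restored = read_masked(f, "data")
```

Keeping the mask as an ordinary boolean dataset means any HDF5 tool can still read both pieces, even if it knows nothing about the convention.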


There is a discussion of how we implement masks here.


By the way, the nexusformat package can be used to read and write HDF5 files that do not conform to the NeXus standard. To implement your example, you could use the following code:

import numpy as np
from nexusformat.nexus import NXfield, nxopen

data = NXfield(np.ma.masked_array([1,2],mask=[True,False]))

with nxopen('mydata.h5', 'w') as root:
    root['data'] = data
    print(root.tree)

This produces the following HDF5 file.

root:NXroot
  data = [-- 2]
    @mask = 'data_mask'
  data_mask = [ True False]

If you read it again, the datasets ‘data’ and ‘data_mask’ are returned as a single NXfield, which just wraps the NumPy array with its attributes.

with nxopen('mydata.h5') as root:
    input_data = root['data']

The masked array is contained within the NXfield, accessible as the nxvalue attribute.

masked_array(data=[--, 2],
             mask=[ True, False],
       fill_value=999999)
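
Once you have the masked array back, the usual np.ma operations apply. The short sketch below rebuilds the same array directly with NumPy (so it runs without a file) and shows a few common ways to work with it; the variable names are only for illustration.

```python
import numpy as np

# Same masked array as in the example above: the first element is masked.
masked = np.ma.masked_array([1, 2], mask=[True, False])

# Replace masked entries with a chosen fill value to get a plain ndarray.
filled = masked.filled(0)

# Keep only the unmasked entries.
valid = masked.compressed()

# Reductions skip masked entries.
n_valid = masked.count()
mean_valid = masked.mean()
```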

Thank you very much, @rosborn! Your example with nexusformat looks pretty easy to work with. I think I’ll stick with it.
@hyoklee, I will convert my post to an issue on GitHub.
Ivan
