Viewing 3D array (hdfview) + updating existing dataset (h5py)


#1

Hi All

I’m using h5py to record and update data, especially 3D arrays built with NumPy; I’m facing two issues:

  1. 3D arrays
  • Under NumPy, a 3D array has the structure (d, r, c), where d, r, c are respectively the depth, rows, and columns.
  • When opening the array with HDFView (under Windows in my case), the structure is displayed differently, which is not useful; I mean I would like to visualize one 2D array per depth.
  • To be clear, there’s no issue when retrieving the array to work with: the structure is correct (it’s just a question of visualization).

Is there an option to keep the original NumPy structure (d, r, c) in HDFView? I’ve been looking for such information but I’ve not found anything so far.

  2. Updating a dataset
    When I want to update a dataset, the only way I’ve found is to first delete it and then write a new one from the updated array (in practice, this is time consuming). My question might be naive, but is there another (better) way?

Thanks for any suggestion

Paul


#2

Paul,
One can modify an HDF5 dataset using h5py without deleting and rewriting the whole dataset.

Have you checked h5py docs?
Maybe this h5py example that shows how to update data in a dataset will help too?

Elena
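For reference, here is a minimal sketch of updating values in place with h5py (the file and dataset names are just examples):

```python
import h5py
import numpy as np

# Create a small dataset, then update part of it in place.
with h5py.File("example.h5", "w") as f:
    f.create_dataset("data", data=np.zeros((2, 3, 4), dtype="i8"))

# Re-open and overwrite one 2D slice without recreating the dataset.
with h5py.File("example.h5", "a") as f:
    f["data"][0, :, :] = np.ones((3, 4), dtype="i8")

with h5py.File("example.h5", "r") as f:
    print(f["data"][0, 0, 0])  # → 1
```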


#3

Hi Paul,

I’ve been using numpy for a number of years and I don’t recall any (d,r,c) semantics. Multidimensional arrays have a 1st index, 2nd index, 3rd index, etc. The (left-to-right) index order in (default) numpy is the same as in HDFView.

You can use HDFView to select any combination of the indexes as the 2D slice you wish to view, using Open As on the dataset. You can then “click through” the 3rd dimension using the mouse. This works for arrays of more than 3 dimensions as well.

(There is a bug in version 3.x of HDFView when viewing 4D arrays as images in certain circumstances: the images don’t update correctly as you click through them. I’m told this has been fixed, but I don’t think the fixed version of HDFView has been released yet.)


#4

@epourmal1: I was not accurate enough (my mistake). I’m speaking about a 3D array (or a multidimensional array) to which I’m adding one dimension (one depth slice, as described in my previous post). In other words, I’m not speaking about changing values, but about changing dimensions.

@Daniel: Well, the semantics are mine, that’s right. Here is a basic example:

A = np.zeros((2, 3, 4), dtype=int)
print(f"{A}")

providing
[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]]

(I was expecting the same two 2D matrices in HDFView, and was surprised accordingly, but that’s not the case.)

Paul


#5

Then you have to use a different HDF5 dataset. You cannot re-use the 2D HDF5 dataset you have.

Aleksandar


#6

I’m not using 2D arrays

OK, the basic test below mimics what I’m doing: how would you proceed to update the dataset at each iteration?

Paul

import numpy as np

A = np.zeros((2, 3, 4), dtype=int)
B = np.zeros((1, 3, 4), dtype=int)

n = 2
for i in range(n):
    A = np.concatenate((A, B), axis=0)
    # update the dataset in the hdf5 file

print(f"{A}")


#7

Thanks for the example. What you need to do is to create a resizable HDF5 dataset. See https://docs.h5py.org/en/latest/high/dataset.html#resizable-datasets.

And you do not need to produce the concatenated version of A, as in your example, to update the HDF5 dataset; just write out the B addition: dset[i, :, :] = B[0, :, :].

Aleksandar
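To make this concrete, here is a minimal sketch of a resizable dataset (file and dataset names are just placeholders): create it with maxshape, then grow the first axis and write only the new slice.

```python
import h5py
import numpy as np

B = np.zeros((1, 3, 4), dtype=int)

with h5py.File("resizable.h5", "w") as f:
    # maxshape=(None, 3, 4): only the first axis is unlimited
    dset = f.create_dataset("A", shape=(2, 3, 4), maxshape=(None, 3, 4),
                            chunks=True, dtype="i8")
    for i in range(2, 4):
        dset.resize(i + 1, axis=0)  # grow depth by one
        dset[i, :, :] = B[0, :, :]  # write only the new slice
    print(dset.shape)  # → (4, 3, 4)
```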


#8

Thanks for both pieces of advice.

So far I’ve got an error when trying to modify a dataset that already exists (maybe because the “unlimited”/“maxshape” settings were missing; let me do some tests).

Paul


#9

Here is a small snippet I’m currently working on; two main remarks:

  1. Test case 1: works, but not the best method

Note the printed array is:
[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

whereas HDFView provides the following screenshot (different structure, isn’t it?)
screenshot

  2. Test case 2: doesn’t work when I want to increase the size of the array; that’s my first trial, and I’m still trying to figure out what I’m missing

Paul

import os, time, h5py
import numpy as np

Path = str(os.getcwd())
h5File1 = 'MyFile1.h5'
h5File2 = 'MyFile2.h5'

# test case 1
A = np.zeros((2, 3, 4), dtype=int)
B = np.zeros((1, 3, 4), dtype=int)

t0 = time.time()
n = 2
for i in range(n):
    A = np.concatenate( (A, B), axis = 0)
    
    h5 = h5py.File(Path + '/' + h5File1, 'a')
    MyGroup = h5.require_group('/MyGroup')
    if (i>0): del h5['/MyGroup/SubGroup']
    MyDatasets = MyGroup.create_dataset(name='/MyGroup/SubGroup', 
                                        data=A, 
                                        dtype='d', 
                                        chunks=True, 
                                        compression='gzip', 
                                        compression_opts=9)
    h5.flush()
    h5.close()
t1 = time.time()
print(f" duration for MyFile1= {t1-t0}")
print(f"{A}")


# test case 2
A = np.zeros((2, 3, 4), dtype=int)
B = np.zeros((1, 3, 4), dtype=int)

t0 = time.time()

h5 = h5py.File(Path + '/' + h5File2, 'a')
MyGroup = h5.require_group('/MyGroup')
MyDatasets = MyGroup.create_dataset(name='/MyGroup/SubGroup', 
                                    data=A,
                                    maxshape=(None, None, None),
                                    dtype='d', 
                                    chunks=True,
                                    compression='gzip', 
                                    compression_opts=9)
h5.flush()
h5.close()
    
d,r,c = np.shape(A)   
print(f"d={d}, r ={r}, c={c}") 

n = 2
# for i in range(n):
for i in range(2, 2+n):
    h5 = h5py.File(Path + '/' + h5File2, 'a')
    MyGroup = h5.require_group('/MyGroup')
    MyDatasets = h5.get('/MyGroup/SubGroup') 
    MyDatasets[i, :, :] = B[0, :, :]
    print(f"{MyDatasets}")
    h5.flush()
    h5.close()
t1 = time.time()
print(f" duration for MyFile2= {t1-t0}")

#10

You are missing one MyDatasets.resize() command in the for loop.

# test case 2
A = np.zeros((2, 3, 4), dtype=int)
B = np.zeros((1, 3, 4), dtype=int)

h5 = h5py.File(Path + '/' + h5File2, 'w')
MyGroup = h5.require_group('/MyGroup')
MyDatasets = MyGroup.create_dataset(name='/MyGroup/SubGroup',
                                    data=A,
                                    maxshape=(None, None, None),
                                    dtype='d',
                                    chunks=True,
                                    compression='gzip',
                                    compression_opts=9)
h5.flush()
h5.close()

d,r,c = np.shape(A)
print(f"d={d}, r ={r}, c={c}")

n = 2
# for i in range(n):
for i in range(2, 2+n):
    h5 = h5py.File(Path + '/' + h5File2, 'a')
    MyGroup = h5.require_group('/MyGroup')
    MyDatasets = h5.get('/MyGroup/SubGroup')
    MyDatasets.resize(i + 1, axis=0)
    MyDatasets[i, :, :] = B[0, :, :]
    print(f"{MyDatasets}")
    h5.flush()
    h5.close()

Below are some additional remarks once you start storing real data:

  • Use None in maxshape only for resizable dimensions.
  • Be careful with chunks=True for your real data. Verify that it calculates an appropriate chunk shape for your typical read and write access; it is much better to set the chunk shape explicitly.
  • Resizing HDF5 datasets frequently, like above, is definitely not recommended. It is better to do the resizing in bigger steps than by one. The HDF5 library will not write the unused part of the HDF5 dataset into the file.
  • Opening and closing HDF5 files inside a for loop is also not recommended.
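A sketch that applies these remarks (file and dataset names are placeholders): open the file once, set an explicit chunk shape, grow the dataset in larger steps, and trim to the final size at the end.

```python
import h5py
import numpy as np

B = np.zeros((1, 3, 4), dtype="i8")
n_total = 10   # number of slices to append (example value)
grow_step = 4  # resize in bigger steps, not by one

with h5py.File("grow.h5", "w") as f:  # one open/close for the whole loop
    dset = f.create_dataset("A", shape=(0, 3, 4), maxshape=(None, 3, 4),
                            chunks=(1, 3, 4),  # explicit chunk shape
                            compression="gzip")
    for i in range(n_total):
        if i >= dset.shape[0]:
            dset.resize(dset.shape[0] + grow_step, axis=0)
        dset[i, :, :] = B[0, :, :]
    dset.resize(n_total, axis=0)  # trim unused rows at the end
    print(dset.shape)  # → (10, 3, 4)
```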

Aleksandar


#11

Thanks Aleksandar for your support; your advice is highly appreciated.

I am aware that my approach is not fully satisfactory and could be improved (especially if I want to introduce parallelization, a topic that is absolutely not familiar to me).

“The HDF5 library will not write the unused part of the HDF5 dataset into the file” => I do not understand what is behind this. I supposed that “flush” forces the HDF5 file to be written (or updated)? Am I wrong?

One topic remains open, I think: why does HDFView not display 3D arrays as expected?

Thanks again

Paul


#12

HDFView expects the fastest-changing dimension ‘0’ to be on the right and the depth ‘2’ to be the slowest-changing dimension on the left, i.e. the order is [2, 1, 0]. At first glance that seems to match Python, unless Python assumes the opposite order?
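A quick check shows that NumPy’s default (C, row-major) layout does have the last index varying fastest, which is the same convention HDF5 and HDFView use:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)  # default C (row-major) order
# In memory, the last index varies fastest:
assert a.ravel()[1] == a[0, 0, 1]   # stepping the last index moves by 1
assert a.ravel()[4] == a[0, 1, 0]   # stepping the middle index moves by 4
assert a.ravel()[12] == a[1, 0, 0]  # stepping the first index moves by 12
print("C order: last index is the fastest-changing one")
```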


#13

Python versus Hdfview; here is a small snippet:

import os, h5py
import numpy as np
d, r, c = (2, 3, 4)

A = np.zeros( (d, r, c), dtype=int)
print(f"A = {A}")

Path = str(os.getcwd())
h5 = h5py.File(Path + '/h5file.h5', 'a')
MyGroup = h5.require_group('/MyGroup')
MyDatasets = MyGroup.create_dataset(name='/MyGroup/SubGroup', 
                                    data=A,
                                    dtype='d')
h5.flush()
h5.close()

Which provides under Python:
screenshot_python

And under hdfview:
screenshot_hdfview

Same?
As said before, I don’t change anything in any further processing, but it’s a bit “disturbing” not to see what it is supposed to be. It works fine for 2D arrays otherwise :smile:

Paul
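For what it’s worth, a round-trip check (using a throwaway file name) confirms that the stored shape and values are exactly the NumPy ones; the difference is only in how HDFView renders the slices:

```python
import h5py
import numpy as np

A = np.arange(24).reshape(2, 3, 4)
with h5py.File("roundtrip.h5", "w") as f:
    f.create_dataset("A", data=A)

# Reading the data back gives exactly the original shape and values.
with h5py.File("roundtrip.h5", "r") as f:
    back = f["A"][...]
print(back.shape)  # → (2, 3, 4)
assert np.array_equal(back, A)
```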


#14

No, you are not wrong; the flush operation does exactly what you think.

I was talking about HDF5 dataset chunks that do not have any data. Say you resize your dataset but then do not use up all the new extra space. The HDF5 library does not write unused, “empty” chunks. This means you don’t have to resize the dataset by one just to avoid having unused space in the file.

Aleksandar
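This can be observed directly with a sketch like the one below (throwaway file name; it assumes a recent h5py, where `DatasetID.get_num_chunks()` is available): allocate room for 100 slices but write only one, and only that one chunk ends up stored in the file.

```python
import h5py
import numpy as np

with h5py.File("sparse.h5", "w") as f:
    # Room for 100 slices, chunked one slice per chunk.
    dset = f.create_dataset("A", shape=(100, 3, 4), maxshape=(None, 3, 4),
                            chunks=(1, 3, 4), dtype="i8")
    dset[0, :, :] = np.ones((3, 4), dtype="i8")  # write only one slice

with h5py.File("sparse.h5", "r") as f:
    # get_num_chunks() reports the chunks actually allocated in the file;
    # untouched chunks take no space.
    n = f["A"].id.get_num_chunks()
print(n)  # far fewer than the 100 chunks the shape would suggest
```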