Merge multiple h5 files containing groups into one h5

akanksha.vision · April 12, 2020, 9:29pm

Hi,

Sorry, I could not find a solution on the web, so I am writing here.
I have 2 or more h5 files in a directory and I want to merge all into a single h5 file. All the h5 files have the same structure, having three groups such as:
>>> f.keys()
[u'TauClusters', u'TauJets', u'TauTracks']

Inside each group I have many datasets such as:

('Collection:', u'TauClusters')
(' Variable:', u'CENTER_LAMBDA')
(' Variable:', u'SECOND_LAMBDA')
(' Variable:', u'SECOND_R')

len(f['TauJets/pt'])
187503

Every dataset within a file has the same number of entries, but the number of entries differs from file to file.

Can anyone please let me know if it is possible to merge these files in a Python script? Any brief script will be very helpful.
I am putting the files here, in case: CERNBox
To open these files I need:
f = h5py.File('sig_0p_train_%d.h5', 'r', driver='family', memb_size=8*1024**3)

Thanks and regards,
Akanksha

ken.walker · April 13, 2020, 6:55pm

Is it OK if I reference answers I posted on StackOverflow?
If so, I wrote several examples of ways to do this with PyTables or h5py modules in Python.
Here are the links:
Using Pytables:

Using h5py:

gheber · April 14, 2020, 12:21pm

Thanks for the reply, Ken. Yes, it’s perfectly fine to reference StackOverflow or other venues. The more the merrier! G.

ken.walker · April 15, 2020, 1:39am

The SO post I mentioned shows the process to copy data between 2 HDF5 files. It works with driver=family, but might be confusing (it confused me at first!).
When you create an HDF5 file with driver=family, the data is divided into a series of files based on the %d naming used to create the file. In your example it is 'sig_0p_train_%d.h5'.

You don't need to open each of the files individually. Just open the file with the same name declaration (but in 'r' mode), and the driver magically handles the rest for you.

Here is a simple procedure for your file.

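import h5py

# Open the whole family through its %d name pattern; the driver locates the member files.
with h5py.File('sig_0p_train_%d.h5', 'r', driver='family', memb_size=8*1024**3) as h5r:
    print('h5 files opened to read with family driver')

    # Write everything into one ordinary (non-family) HDF5 file.
    with h5py.File('sig_0p_train_all.h5', 'w') as h5w:
        for group in h5r.keys():
            print(group)
            for ds in h5r[group].keys():
                # Read the full dataset into memory, then write it under group/ds.
                ds_arr = h5r[group][ds][:]
                print(ds, ':', ds_arr.dtype, ds_arr.shape)
                h5w.create_dataset(group + '/' + ds, data=ds_arr)
    print('done')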
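
If the datasets are large, reading each one into a NumPy array first may be more than is needed. A possible alternative, sketched below, uses h5py's Group.copy to copy each top-level group straight into the output file. The helper name copy_all_groups is made up for illustration, and the sketch is untested against the family-driver files from this thread.

```python
import h5py

def copy_all_groups(src, dst):
    """Copy every top-level object of the open file `src` into the open file `dst`.

    `src` and `dst` are h5py.File objects; opening them (with or without
    driver='family') is left to the caller.
    """
    for name in src.keys():
        # Group.copy recurses into the group and carries attributes along;
        # the copy keeps the same name in the destination.
        src.copy(name, dst)
```

With the files above it would be called as copy_all_groups(h5r, h5w) inside the two nested with blocks, in place of the two for loops.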

akanksha.vision · April 15, 2020, 10:56am

Dear Ken,

Thank you so much for your replies. The code is very crisp and precise.
I agree with your point regarding the %d, but it seems there was a problem when the files were created. The %d does not seem to pick up all the files. I printed the number of entries and got:

len(sig_0p_train_0['TauJets/pt']) : 187444
len(sig_0p_train_1['TauJets/pt']) : 187811
len(sig_0p_train_all['TauJets/pt']) : 187444 #this should be 375255

However, I have managed to write a longer script with a loop that merges the files. I will use the brief, nice version from your answer where I can.
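
For anyone landing here later, a loop-based merge of this kind could look roughly like the sketch below. It is not the script mentioned above. The names append_file and merge_files are made up for illustration, and it assumes every file has the same group/dataset layout, that each dataset is at least 1-D, and that dtype and trailing dimensions match across files. Each dataset is read into memory one at a time.

```python
import h5py

def append_file(src, dst):
    """Append every dataset in the open file `src` to the matching dataset in `dst`."""
    for group in src.keys():
        for name in src[group].keys():
            data = src[group][name][:]
            path = group + '/' + name
            if path not in dst:
                # First file: create the dataset with an unlimited first axis
                # so that later files can be appended to it.
                dst.create_dataset(path, data=data,
                                   maxshape=(None,) + data.shape[1:])
            else:
                out = dst[path]
                old = out.shape[0]
                # Grow along axis 0 and fill the new rows.
                out.resize(old + data.shape[0], axis=0)
                out[old:] = data

def merge_files(filenames, out_name, **open_kwargs):
    """Concatenate the files in `filenames` into `out_name`.

    `open_kwargs` is passed to h5py.File for the inputs (assumption: all
    inputs are opened the same way).
    """
    with h5py.File(out_name, 'w') as dst:
        for fn in filenames:
            with h5py.File(fn, 'r', **open_kwargs) as src:
                append_file(src, dst)
```

Assuming each input can be opened on its own, the call for the two files here would be something like merge_files(['sig_0p_train_0.h5', 'sig_0p_train_1.h5'], 'sig_0p_train_all.h5'), after which TauJets/pt should hold 187444 + 187811 = 375255 entries.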

Thank you once again.

ken.walker · April 15, 2020, 1:25pm

Hello, akanksha.vision,
Check your file sizes (here values from my Windows system):
sig_0p_train_0.h5: 148,966,181
sig_0p_train_1.h5: 149,266,236
sig_0p_train_all.h5: 364,425,952
This shows that _all.h5 is more than twice the size of either _0.h5 or _1.h5.

Note you don’t know which %d.h5 file actually contains the data. I used HDFView to see the data. When I opened sig_0p_train_0.h5, I could see ALL of the data (all groups and datasets, all rows and columns). As I said, I’m not familiar with the driver=family method. I think it magically glues the data from all the %d.h5 files together when you operate on them.

In other words, I think sig_0p_train_0['TauJets/pt'] and sig_0p_train_1['TauJets/pt'] refer to the exact same data. You can verify by comparing values from a few different rows.
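
A quick check along these lines might look like the sketch below. The helper name same_dataset is made up for illustration, and how the two files are opened is left to the caller. Note that np.array_equal also returns False when the lengths differ or when the data contains NaN.

```python
import h5py
import numpy as np

def same_dataset(file_a, file_b, path):
    """Return True if `path` holds identical values in both open h5py files."""
    a = file_a[path][:]
    b = file_b[path][:]
    # array_equal checks shape and values together.
    return bool(np.array_equal(a, b))
```

For example, same_dataset(f0, f1, 'TauJets/pt') with f0 and f1 being the two opened files.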

Good luck.
-Ken

ken.walker · April 16, 2020, 1:11am

I created a simple example that shows how a family of h5 files is created with driver=family, and then how to read and copy the data from the family of files to one file. The first file write will create 3 files named famdriver_test_0/1/2.h5 (the %d numbering starts at 0), and the second file write will create 1 file named famdriver_export.h5 by copying datasets from the family of files.

import h5py
import numpy as np

print('create h5 file with family driver')

# Each member file holds at most memb_size bytes (1 MiB here), so the three
# 800 kB datasets are spread over several famdriver_test_N.h5 files.
with h5py.File('famdriver_test_%d.h5', 'w', driver='family', memb_size=1024**2) as h5f:
    print('h5 files created with family driver')
    arr = np.random.random(100000).reshape(1000, 100)
    h5f.create_dataset('ds1', data=arr)
    arr = np.random.random(100000).reshape(1000, 100)
    h5f.create_dataset('ds2', data=arr)
    arr = np.random.random(100000).reshape(1000, 100)
    h5f.create_dataset('ds3', data=arr)
    print('done')

# Reopen the family with the same name pattern and memb_size. The export
# must happen inside this block, while h5r is still open.
with h5py.File('famdriver_test_%d.h5', 'r', driver='family', memb_size=1024**2) as h5r:
    print('h5 files opened with family driver')

    with h5py.File('famdriver_export.h5', 'w') as h5w:
        for key1 in h5r.keys():
            print(key1)
            ds_arr = h5r[key1][:]
            print(ds_arr.dtype, ds_arr.shape)
            h5w.create_dataset(key1, data=ds_arr)
    print('done')
