H5i_dec_ref hangs

I guess I’m missing a bit of context here. Now we are talking about dataset region references? Where are they coming from? (The MPI-IO VFD doesn’t support those, because they are non-fixed size.) And what about those attributes? Are they created and written collectively? (There is no partial I/O for attributes.)

G.

I’m using region references to store image cutouts for individual spectra. These region references are stored in datasets that are created collectively, but they are written by a single writer in serial mode after the collective operation finishes. The erroneous file I have sent you contains only empty datasets, without any attempt to write the region references.

However, I also tried to create the file without those region reference datasets to see whether they were causing the trouble and the error was still there. I will provide you that file as well.

The attributes are written in collective mode by all the workers; they are just copy-pasted FITS header attributes, nothing fancy going on here.

Can you modify (or tell me how to modify) our toy example to mimic what you are trying to do? G.

Hi, thank you for the second option :slight_smile: If I can avoid studying the C API at this point, I’d be very happy.

I have added comments starting with //JN #; there are 6 of them.

#include "hdf5.h"
#include <mpi.h>

#include <assert.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    MPI_Comm comm = MPI_COMM_WORLD;
    int comm_rank, comm_size;
    hid_t file;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(comm, &comm_size);
    MPI_Comm_rank(comm, &comm_rank);

    { // create the file in parallel
        hid_t fapl;
        char fname[] = "foo.h5";
        assert((fapl = H5Pcreate(H5P_FILE_ACCESS)) != H5I_INVALID_HID);
        assert(H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL) >= 0);
        assert((file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl)) !=
               H5I_INVALID_HID);
        assert(H5Pclose(fapl) >= 0);
    }

    // all processes have to call H5Dcreate() to create a dataset, even if the
    // dataset will be accessed by only one MPI rank.
    for (int i = 0; i < comm_size; ++i)
    {
        //JN #1 - Create a tree structure of groups - approx. 10 levels of groups for each dset
        char name[255];
        hid_t fspace, dset;
        assert((fspace = H5Screate_simple(1, (hsize_t[]){10 + i}, NULL)) !=
               H5I_INVALID_HID);
        assert(sprintf(name, "%d", i) > 0);
        //JN #2 - The dset should be created in the leaf node group (the group at the bottom)
        assert((dset = H5Dcreate(file, name, H5T_STD_I32LE, fspace, H5P_DEFAULT,
                                 H5P_DEFAULT, H5P_DEFAULT)) != H5I_INVALID_HID);
        //JN #3 - For every dset, add approx. 100 string attributes (fixed length) to it.
        //JN #4 - Create another dataset "regref", length 1000, type RegionReference.
        assert(H5Dclose(dset) >= 0);
        assert(H5Sclose(fspace) >= 0);
    }

    { // each rank opens and writes its "own" dataset
        //JN #5- this remains the same
        hid_t dset;
        char name[255];
        int* data = (int*)malloc((10+comm_rank)*sizeof(int));
        for (int i = 0; i < 10+comm_rank; ++i)
            data[i] = comm_rank;

        assert(sprintf(name, "%d", comm_rank) > 0);
        assert((dset = H5Dopen(file, name, H5P_DEFAULT)) >= 0);

        assert(H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                        H5P_DEFAULT, data) >= 0);

        assert(H5Dclose(dset) >= 0);
        free(data);
    }

    assert(H5Fclose(file) >= 0);

    //JN #6- if rank == 0, reopen the file in serial mode and write the region references to the "regref" datasets.

    MPI_Finalize();

    return EXIT_SUCCESS;
}

Point #4 can also be moved to the serial phase; the regref datasets are needed only after all of the regular datasets are written.
Point #6 - here it occurred to me that this might never work, but I haven’t found any mention of it in the documentation. If a file is created in parallel mode, can it then be opened and written in serial mode, and vice versa (that is, can it be accessed with the MPI-IO driver and with other drivers, provided the file is closed and re-opened in between)? This is really crucial to me, since region referencing was one of the major reasons I chose HDF5 for my problem.

We could have a call where I share my screen with you and explain my reasons for doing what I’m doing. An article about this will also shortly be available in Astronomy & Computing.

All good.

Yes, that’s fine. There is only one HDF5 file format and you can’t tell if a file was created/written in serial or parallel mode. The file driver is set in a file access property list, which is constructed at runtime and not stored in the file.
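For instance, in h5py terms (a minimal sketch, assuming an MPI-enabled HDF5/h5py build; the file and dataset names are made up):

import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD

# create and close the file with the MPI-IO driver (all ranks participate)
f = h5py.File("roundtrip.h5", "w", driver="mpio", comm=comm)
f.create_dataset("x", (comm.Get_size(),), dtype="i4")  # dataset creation is collective
f.close()
comm.Barrier()

# reopen the same file serially with the default driver, on one rank only
if comm.Get_rank() == 0:
    with h5py.File("roundtrip.h5", "r+") as f:
        f["x"][0] = 42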

G.

Hi Gerd,

I have tried an alternative solution with a single writer and multiple readers, where all processes send the preprocessed data to the master, but unfortunately it’s way too slow. The master rank spends all its time deserializing MPI messages while the others are waiting for it (one image is roughly 100 MB in memory after preprocessing).

I am now trying to pinpoint the issue I’m experiencing. I have produced a minimal example, see below:

import h5py
from mpi4py import MPI
import numpy as np

H5PATH = "../results/SDSS_cube_parallel.h5"
f = h5py.File(H5PATH, 'r+', driver='mpio', comm=MPI.COMM_WORLD)

rank = MPI.COMM_WORLD.Get_rank()

if rank == 0:
    ds = f["/semi_sparse_cube/5/22/90/362/1450/5802/23208/92832/4604806771.19/3551/(2048, 1489)/frame-u-004899-2-0260.fits"]
    ds.write_direct(np.ones((1489, 2048, 2)))
if rank == 1:
    ds = f["/semi_sparse_cube/5/22/90/362/1450/5802/23208/92832/4604806842.83/8932/(2048, 1489)/frame-z-004899-2-0260.fits"]
    ds.write_direct(np.ones((1489, 2048, 2)))
f.close()

Use it with this h5 file (2.8 MB, attached) while running:

mpiexec -n 2 python script.py

The h5 file was produced in serial mode by just one process (without MPI).

Hope this helps you investigate the matter - you could also try C code on this file to see whether it produces the same issue.

If interested in a C++ solution, please send me the size/shape of the dataset – and make corrections to your posted algorithm if needed.

If rank 0 -> Open H5 file in serial mode.
If rank 0 -> create 10 datasets.
If rank 0 -> Close H5 file in serial mode.
Barrier
Open the file in parallel mode.
Each rank writes to its own dataset (rank = index of the dataset).
Close the file in parallel mode.

steve

Hi Steven,

thanks for the reply - the size of the dataset can be seen in the example above: (1489, 2048, 2).

The correct algorithm is below - I think the problematic parts might be the attributes or the region reference datasets.

If rank 0 -> Open H5 file in serial mode.
If rank 0 -> create 10 groups
If rank 0 -> inside each group, create a float type dataset (10 in total)
If rank 0 -> inside each group, create a region reference type dataset (10 in total), one dimension, size 10
If rank 0 -> add attributes to the datasets above (vlen strings)
If rank 0 -> Close H5 file in serial mode.
Barrier
Open the file in parallel mode.
Each rank writes to its own float type dataset (rank = index of the dataset).
Close the file in parallel mode.
If rank 0 -> open file in serial mode
If rank 0 -> for each group -> take the region reference dataset and write into it a reference to a 64x64x1 region of a float type dataset (just an example; I think it doesn't matter which dataset you choose for the region reference)
If rank 0 -> close the file

In Python, this gets stuck on “Close the file in parallel mode”.
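Roughly, my Python pattern for the algorithm above looks like this (a sketch with illustrative names and shapes, not my actual code):

import h5py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
FNAME = "cube.h5"
SHAPE = (1489, 2048, 2)

if rank == 0:  # serial phase: one writer creates everything
    with h5py.File(FNAME, "w") as f:
        for i in range(size):
            grp = f.create_group(f"group_{i}")
            ds = grp.create_dataset("data", SHAPE, dtype="f4")               # float dataset
            grp.create_dataset("regref", (10,), dtype=h5py.regionref_dtype)  # region references
            ds.attrs["HEADER"] = "copied FITS header value"                  # vlen string attribute
comm.Barrier()

with h5py.File(FNAME, "r+", driver="mpio", comm=comm) as f:    # parallel phase
    f[f"group_{rank}/data"][...] = np.ones(SHAPE, dtype="f4")  # each rank writes its own dataset
# ^ closing the file here is where it hangs for me
comm.Barrier()

if rank == 0:  # serial phase again: write the region references
    with h5py.File(FNAME, "r+") as f:
        for i in range(size):
            ds = f[f"group_{i}/data"]
            f[f"group_{i}/regref"][0] = ds.regionref[0:64, 0:64, 0:1]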

Cheers,

Jiri

Do the two datasets that you are trying to write exist in the file? What is their layout? Have they been allocated? (For example, contiguous layout uses late allocation. Creating such a dataset without writing any dataset elements does not allocate any storage for dataset elements in the file.)
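One quick way to check the allocation from h5py (a sketch; the file and dataset names are made up) is to look at the allocated storage size:

import h5py

with h5py.File("foo.h5", "r") as f:
    ds = f["/some/group/dataset"]
    # maps to H5Dget_storage_size: 0 bytes means no storage has been allocated yet
    print(ds.id.get_storage_size())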

G.

At first I didn’t quite get your problem; I had to peruse the thread to see that you want this to scale to millions of files. I took on this challenge and added reference interface calls to H5CPP; to test this feature you have to check out the repository:

https://github.com/steven-varga/h5cpp.git
git checkout master
make install # or just copy/link the 'h5cpp' directory to '/usr/local/include'
cd examples/reference && make clean && reset && make

The implementation allows arbitrary shapes, but the h5::exp::read and h5::exp::write interface - which dereferences the regions for you and does the I/O ops - allows only a single block selection.
Bob’s your uncle! Easy, pythonic selection for HDF5 with modern C++. Give me a few days to check your original question.

This is how the syntax looks:

 
#include <armadillo>
#include <vector>
#include <h5cpp/all>

int main(){
    h5::fd_t fd = h5::create("ref.h5", H5F_ACC_TRUNC);
	{
        h5::ds_t ds = h5::create<float>(fd,"01",  
            h5::current_dims{10,20}, h5::chunk{2,2} | h5::fill_value<float>{1} );
        
        h5::reference_t ref = h5::reference(fd, "01", h5::offset{2,2}, h5::count{4,4});
        h5::write(fd, "single reference", ref);
        /* you can factor out `count` this way :  h5::count count{2,2};  */ 
        std::vector<h5::reference_t> idx {
            // The HDF5 C API requires fd + dataset name, instead of an hid_t to the ds: wishy-washy
            h5::reference(fd, "01", h5::offset{2,2}, h5::count{4,4}),
            h5::reference(fd, "01", h5::offset{4,8}, h5::count{1,1}),
            h5::reference(fd, "01", h5::offset{6,12}, h5::count{3,3}),
            h5::reference(fd, "01", h5::offset{8,16}, h5::count{2,1})
        };
        // dataset shape can be controlled with dimensions; in this case it is 2x2
        // and is not related to the selected regions!!!
        // data type is H5R_DATASET_REGION when dataspace is provided, otherwise OBJECT
        h5::write(fd, "index", idx, h5::current_dims{2,2}, h5::max_dims{H5S_UNLIMITED, 2});
    }
    { // we're going to update the regions referenced by the set of region-references
      // stored in "index"
        h5::ds_t ds = h5::open(fd, "index");
        std::vector<float> color(50, 9); // the referenced dataset "01" holds floats
        // this is to read from selection
        for(auto& ref: h5::read<std::vector<h5::reference_t>>(ds))
            h5::exp::write(ds, ref, color.data());
    }

    { // we are reading back data from the regions, now they all must be 'color' value '9'
        h5::ds_t ds = h5::open(fd, "index");
        // this is to read from selection
        for(auto& ref: h5::read<std::vector<h5::reference_t>>(ds)){
            arma::fmat mat = h5::exp::read<arma::fmat>(ds, ref);
            std::cout << mat << "\n";
        }
    }
    { // for verification
        std::cout << h5::read<arma::fmat>(fd, "01") << "\n\n";

    }
}

dump:


h5dump ref.h5
HDF5 "ref.h5" {
GROUP "/" {
   DATASET "01" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
      DATA {
      (0,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (1,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (2,0): 1, 1, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (3,0): 1, 1, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (4,0): 1, 1, 9, 9, 9, 9, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (5,0): 1, 1, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (6,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 1, 1, 1, 1,
      (7,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 1, 1, 1, 1,
      (8,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 9, 1, 1, 1,
      (9,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1
      }
   }
   DATASET "index" {
      DATATYPE  H5T_REFERENCE { H5T_STD_REF_DSETREG }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( H5S_UNLIMITED, 2 ) }
      DATA {
         DATASET "/01" {
            REGION_TYPE BLOCK  (2,2)-(5,5)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
         DATASET "/01"  {
            REGION_TYPE BLOCK  (4,8)-(4,8)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
         DATASET "/01"  {
            REGION_TYPE BLOCK  (6,12)-(8,14)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
         DATASET "/01"  {
            REGION_TYPE BLOCK  (8,16)-(9,16)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
      }
   }
   DATASET "single reference" {
      DATATYPE  H5T_REFERENCE { H5T_STD_REF_DSETREG }
      DATASPACE  SCALAR
      DATA {
         DATASET "/01" {
            REGION_TYPE BLOCK  (2,2)-(5,5)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
      }
   }
}
}

steve

There is no problem with the C API calls. I executed the proposed algorithm on an 8-core machine (I only had a laptop at hand) and it did the job. The project is uploaded to my GitHub page; in the makefile, please adjust the srun -n 8 -w io ./mpi-reference-test line to srun -n 8 ./mpi-reference-test, removing the -w io part.
Note: instead of 10 groups, etc., I altered your spec to use the number of cores, which was 8 on my laptop.

Here is the software:

/* copyright steven varga, vargaconsulting 2021, june 08, Toronto, ON, Canada;  MIT license
*/

#include <mpi.h>  /* MUST precede H5CPP includes */
#include <h5cpp/all>
#include <string>
#include <vector>
#include <fmt/format.h>


constexpr hsize_t m=1489, n=2048, k=2;
constexpr const char* filename = "mpi-reference.h5";
constexpr float data_value = 3.0;

int main(int argc, char **argv) {

    int rank_size, current_rank;
    MPI_Init(NULL, NULL);
    MPI_Info info = MPI_INFO_NULL;
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Comm_size(MPI_COMM_WORLD, &rank_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &current_rank);
    
    if(current_rank == 0) { // serial mode, ran on rank 0
        h5::fd_t fd = h5::create(filename, H5F_ACC_TRUNC);
        for(int i=0; i<rank_size; i++){
            // this is your dataset:
            h5::ds_t ds = h5::create<double>(fd, fmt::format("{:03d}/dataset", i), h5::chunk{149,256,1}, h5::current_dims{m,n,k});
            // VL string attribute written to it
            ds["attribute name"] = "this is a vl string attribute";
            // and a reference, with rank_size -- which is 10 in the description
            h5::create<h5::reference_t>(fd, fmt::format("{:03d}/reference", i), h5::current_dims{static_cast<hsize_t>(rank_size)}, 
                h5::chunk{2});
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    { // parallel mode
        h5::fd_t fd = h5::open(filename, H5F_ACC_RDWR, h5::mpiio({MPI_COMM_WORLD, info}));
        h5::write(fd,  fmt::format("{:03d}/dataset",current_rank), 
            std::vector<float>(m*n*k, 1.0), // <-- single shot write: dataset size must match file space size 
            h5::independent);               // collective | independent
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if(current_rank == 0) { // serial mode, ran on rank 0
        // the in-memory buffer must cover at least the referenced region, or bad things happen
        std::vector<float> in_memory_values(64*64, data_value);

        h5::fd_t fd = h5::open(filename, H5F_ACC_RDWR);
        for(int i=0; i<rank_size; i++){
            h5::reference_t reference = h5::reference(fd, // CREATE a reference
                fmt::format("{:03d}/dataset", i),         // dataset the region is within
                h5::offset{1, 1, 0},                      // location of the region 
                h5::count{64,64,1});                      // size of region 
            // write data into the referenced region; the memory space of the passed pointer must match the region
            h5::exp::write(fd, fmt::format("{:03d}/dataset", i) , reference, in_memory_values.data());
            // don't forget to write the REFERENCE into the index file for later use,
            // NOTE: the order of data and reference IO calls is arbitrary
            h5::write(fd, fmt::format("{:03d}/reference", i), 
                reference, 
                h5::count{1}, h5::offset{0} ); // <-- which cell you update within file space
        };
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    
    return 0;
}

PS: don’t forget to pull the h5cpp GitHub repo, as I added a write I/O call variant.


Hi gents,

sorry for the delay, just came back from vacation.

@steven, I understand what you are doing, and I agree it’s an implementation of my algorithm in C++.

But maybe I just missed the point - how does this solve it for the Python implementation? Should I compile your program, cythonize it, and call it from Python? Then I’d essentially be duplicating the h5py functionality because, if I understand correctly, h5py already claims it can do this by itself.

@gheber The datasets are created via the Group object method “require_dataset” with a contiguous layout. I’ll try to write zeroes into them during creation and then get back to you on whether it changes anything.

Thanks for getting back. I think I inquired whether you were interested in a C++ solution, and I took your reply as an implicit yes. Then I added this functionality to H5CPP, because I could :slight_smile:

If you want actual HPC, where the P stands for Performance, you do want an implementation along the lines of what I provided. Because of the de-referencing there is a lot of ‘padding’ in such an approach - hidden behind library calls.

I use the following approach with Julia, and turn it into a shared object:

extern "C" {
   void my_exported_function(char*, char*, int [, ...]);
}

Alternatively you could just turn over to the dark side of C++, and let the power flow through your fabric!
cheers: steve


I don’t know if Group.require_dataset supports H5Pset_alloc_time, which you want to set to H5D_ALLOC_TIME_EARLY. (You can also suppress writing fill values if the whole thing gets overwritten.) Writing at least one value has the same net effect (the allocation gets triggered).

G.

Hi Gerd, you actually hit it. For contiguous layout, it is enough to write one element to get the dataset allocated; then every parallel write works like a charm and the file closes without issues. This was not obvious to me, even though I did notice that the file containing only metadata was too small for the space to have been allocated. This is, by all means, a big leap for my project - I’m finally able to parallelize the preprocessing on a bigger scale.

So the lesson learned in this topic is: if you want to write to a dataset in parallel, you need to make sure its storage is allocated before the parallel file open happens, otherwise the program will just hang when the file is closed.
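In h5py terms, the workaround amounts to something like this during the serial creation phase (a sketch; names and shapes are illustrative):

import h5py

with h5py.File("cube.h5", "w") as f:  # serial phase, single writer
    ds = f.create_dataset("data", (1489, 2048, 2), dtype="f4")  # contiguous => late allocation by default
    ds[0, 0, 0] = 0.0  # writing a single element triggers allocation of the full extent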

I’m now trying to figure out chunked storage - if I just change the datasets to use chunks of size (128, 128, 2), the first parallel batch write runs OK, but if I try to write a second batch it hangs in the write_direct call. Does chunked storage require some special kind of care during allocation or during the writes themselves?

Thank you very much for your help so far!

Cheers,

Jiri


Are you using H5D_ALLOC_TIME_EARLY and H5D_FILL_TIME_ALLOC or H5D_FILL_TIME_NEVER with the chunked dataset? (I assume you are not using any kind of compression…)

G.

I’m using defaults and just writing one zero to one element of the dataset.

I checked h5py: it uses a cythonized h5d.create procedure and doesn’t seem to support passing any arguments for these.

h5d.pyx has:

dsid = H5Dcreate(loc.id, cname, tid.id, space.id,
             pdefault(lcpl), pdefault(dcpl), pdefault(dapl))

and the default dcpl is created by h5p.pyx:

cdef hid_t newid
newid = H5Pcreate(cls.id)
return propwrap(newid)

So whatever the H5Pcreate procedure uses.

I’d be happy to test it if you could point me to some guide on how to handle the properties manually via Python.

And yes, I’m not using any compression.

Jiri

If you use the h5py low level API, you can do basically everything that HDF5 supports (so long as h5py exposes the functions). It would look something like this (untested):

dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)
space = h5py.h5s.create_simple((10,))
dsid = h5py.h5d.create(f.id, b'data',  h5py.h5t.NATIVE_INT32, space, dcpl=dcpl)

# Wrap in a Dataset object to use the high-level API again
ds = h5py.Dataset(dsid)

As you can see, the basic pattern is that anything with an H5X prefix in the C API belongs to a corresponding h5py.h5x module, and most C functions that take an HDF5 ID as the first argument are exposed as methods (e.g. PropDCID.set_alloc_time()). The low-level API is documented at https://api.h5py.org/, along with a page in the main docs about how the high-level and low-level fit together.


Hi all, sorry for the longer delay.

Yes, indeed this was it. I can now write chunked datasets in parallel by using the following:

dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)
dcpl.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
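In context, the creation path now looks roughly like this (a sketch combining the low-level pattern above with these two calls; names and shapes are illustrative):

import h5py

f = h5py.File("cube.h5", "w")  # serial creation phase

dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
dcpl.set_chunk((128, 128, 2))                   # chunked layout
dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)  # allocate all chunks at creation time
dcpl.set_fill_time(h5py.h5d.FILL_TIME_NEVER)    # skip fill values; the data is fully overwritten later

space = h5py.h5s.create_simple((1489, 2048, 2))
dsid = h5py.h5d.create(f.id, b"data", h5py.h5t.IEEE_F32LE, space, dcpl=dcpl)
ds = h5py.Dataset(dsid)  # wrap to use the high-level API again
f.close()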

However, after all this work (and thank you very much for your contributions - it wouldn’t have been possible without your help :wink: ), I’m a little disappointed by the performance:

Chunked

Workers   Time [s]   Efficiency   Write speed [MB/s]
1         188        100%         30
2         99         95%          58
4         59         80%          97
8         49         48%          117

Contiguous

Workers   Time [s]   Efficiency   Write speed [MB/s]
1         103        100%         56
2         54         95%          106
4         36         71%          158
8         31         42%          184

I’m testing this on an NVMe disk with a 2 GB/s write speed, so that should not be the bottleneck; however, the efficiency drops quite quickly as more processes are added.

Looking at the slog file, it seems the writes are indeed not a problem. This one is for one producer and 8 workers:

All of the workers work independently (read from disk, do about 1 second of preprocessing, write to 4 different datasets).

I’m trying to play with the buffer properties, but they are not available within the h5py low-level API. @thomas1, how can I set a property that does not have a dedicated method on the Python property class? There seem to be lots more that exist only in the underlying HDF5 implementation but don’t have setters and getters in Python.

Cheers,

Jiri

As far as I know, you can only access the property list functions that h5py exposes. Which specific functions do you want to use?