H5i_dec_ref hangs

nadvornik.ji · June 7, 2021, 9:13am

Hi Steven,

thx for reply - the size of the data set can be seen in the example above - 1489, 2048, 2.

The correct algorithm is below - I think the problematic parts might be the attributes or the region reference datasets along.

If rank 0 -> Open H5 file in serial mode.
If rank 0 -> create 10 groups
If rank 0 -> inside each group, create a float type dataset (10 in total)
If rank 0 -> inside each group, create a region reference type dataset (10 in total), one dimension, size 10
If rank 0 -> add attributes to the datasets above (vlen strings)
If rank 0 -> Close H5 file in serial mode.
Barrier
Open the file in parallel mode.
Each rank writes to its own float type dataset (rank = index of the dataset).
Close the file in parallel mode.
If rank 0 -> open file in serial mode
If rank 0 -> for each group -> take a region reference dataset and write a 64x64x1 region reference to a float type dataset into it (just an example, I think it doesn't matter which dataset you choose for the region reference)
If rank 0 -> close the file

In python, this gets stuck on: “Close the file in parallel mode”.

Cheers,

Jiri

gheber · June 8, 2021, 4:14pm

Do the two datasets that you are trying to write exist in the file? What is their layout? Have they been allocated? (For exampole, contiguous layout uses late allocation. Creating such a dataset without writing any dataset elements does not allocate any storage for dataset elements in the file.)

G.

steven · June 8, 2021, 4:39pm

At first didn’t quite get your problem; had to peruse the thread to see you want this to scale to millions of files. Took on this challenge and added reference interface calls to H5CPP; to test this feature you have to check out the repository:

https://github.com/steven-varga/h5cpp.git
git checkout master
make install # or just copy/link the 'h5cpp' directory to '/usr/local/include'
cd examples/reference && make clean && reset && make

The implementation allows arbitrary shapes, but on the h5::exp::read and h5::exp::write interface: which dereferences the regions for you, and does IO ops – allows only a single block selection.
Bob’s your uncle! Easy, pythonic selection for HDF5 with modern C++. Give me a few days to check your original question.

This is how the syntax looks:

 
#include <armadillo>
#include <vector>
#include <h5cpp/all>

int main(){
    h5::fd_t fd = h5::create("ref.h5", H5F_ACC_TRUNC);
	{
        h5::ds_t ds = h5::create<float>(fd,"01",  
            h5::current_dims{10,20}, h5::chunk{2,2} | h5::fill_value<float>{1} );
        
        h5::reference_t ref = h5::reference(fd, "01", h5::offset{2,2}, h5::count{4,4});
        h5::write(fd, "single reference", ref);
        /* you can factor out `count` this way :  h5::count count{2,2};  */ 
        std::vector<h5::reference_t> idx {
            // The HDF5 CAPI reqires fd + dataset name, instead of hid_t to ds: wishy-washy 
            h5::reference(fd, "01", h5::offset{2,2}, h5::count{4,4}),
            h5::reference(fd, "01", h5::offset{4,8}, h5::count{1,1}),
            h5::reference(fd, "01", h5::offset{6,12}, h5::count{3,3}),
            h5::reference(fd, "01", h5::offset{8,16}, h5::count{2,1})
        };
        // datset shape can be controlled with dimensions, in this case is 2x2
        // and is not related to the selected regions!!! 
        // data type is H5R_DATASET_REGION when dataspace is provided, otherwise OBJECT
        h5::write(fd, "index", idx, h5::current_dims{2,2}, h5::max_dims{H5S_UNLIMITED, 2});
    }
    { // we going to update the regions referenced by the set of region-references 
      // stored in "index"
        h5::ds_t ds = h5::open(fd, "index");
        std::vector color(50, 9);
        // this is to read from selection
        for(auto& ref: h5::read<std::vector<h5::reference_t>>(ds))
            h5::exp::write(ds, ref, color.data());
    }

    { // we are reading back data from the regions, now they all must be 'color' value '9'
        h5::ds_t ds = h5::open(fd, "index");
        // this is to read from selection
        for(auto& ref: h5::read<std::vector<h5::reference_t>>(ds)){
            arma::fmat mat = h5::exp::read<arma::fmat>(ds, ref);
            std::cout << mat << "\n";
        }
    }
    { // for verification
        std::cout << h5::read<arma::fmat>(fd, "01") << "\n\n";

    }
}

dump:


h5dump ref.h5
HDF5 "ref.h5" {
GROUP "/" {
   DATASET "01" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
      DATA {
      (0,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (1,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (2,0): 1, 1, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (3,0): 1, 1, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (4,0): 1, 1, 9, 9, 9, 9, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (5,0): 1, 1, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      (6,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 1, 1, 1, 1,
      (7,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 1, 1, 1, 1,
      (8,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 9, 1, 1, 1,
      (9,0): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1
      }
   }
   DATASET "index" {
      DATATYPE  H5T_REFERENCE { H5T_STD_REF_DSETREG }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( H5S_UNLIMITED, 2 ) }
      DATA {
         DATASET "/01" {
            REGION_TYPE BLOCK  (2,2)-(5,5)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
         DATASET "/01"  {
            REGION_TYPE BLOCK  (4,8)-(4,8)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
         DATASET "/01"  {
            REGION_TYPE BLOCK  (6,12)-(8,14)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
         DATASET "/01"  {
            REGION_TYPE BLOCK  (8,16)-(9,16)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
      }
   }
   DATASET "single reference" {
      DATATYPE  H5T_REFERENCE { H5T_STD_REF_DSETREG }
      DATASPACE  SCALAR
      DATA {
         DATASET "/01" {
            REGION_TYPE BLOCK  (2,2)-(5,5)
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 10, 20 ) / ( 10, 20 ) }
         }
      }
   }
}
}

steve

steven · June 9, 2021, 1:33pm

There is no problem with the C API calls, I executed the proposed algorithm on an 8 core (I only had a laptop at hand) and did the job. The project is uploaded to my github page, in the makefile please adjust the srun -n 8 -w io ./mpi-reference-test to srun -n 8 ./mpi-reference-test, removing the -w io part.
note: instead of 10 groups, etc I altered your spec to using number of cores; which was 8 on my laptop.

Here is the software:

/* copyright steven varga, vargaconsulting 2021, june 08, Toronto, ON, Canada;  MIT license
*/

#include <mpi.h>  /* MUST preceede H5CPP includes*/ 
#include <h5cpp/all>
#include <string>
#include <vector>
#include <fmt/format.h>


constexpr hsize_t m=1489, n=2048, k=2;
constexpr const char* filename = "mpi-reference.h5";
constexpr float data_value = 3.0;

int main(int argc, char **argv) {

    int rank_size, current_rank, name_len;
    MPI_Init(NULL, NULL);
	MPI_Info info  = MPI_INFO_NULL;
	MPI_Comm comm  = MPI_COMM_WORLD;

    MPI_Comm_size(MPI_COMM_WORLD, &rank_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &current_rank);
    
    if(current_rank == 0) { // serial mode, ran on rank 0
        h5::fd_t fd = h5::create(filename, H5F_ACC_TRUNC);
        for(int i=0; i<rank_size; i++){
            // this is your dataset:
            h5::ds_t ds = h5::create<double>(fd, fmt::format("{:03d}/dataset", i), h5::chunk{149,256,1}, h5::current_dims{m,n,k});
            // VL string attribute written to it
            ds["attribute name"] = "this is a vl string attribute";
            // and a reference, with rank_size -- which is 10 in the description
            h5::create<h5::reference_t>(fd, fmt::format("{:03d}/reference", i), h5::current_dims{static_cast<hsize_t>(rank_size)}, 
                h5::chunk{2});
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    { // parallel mode
        h5::fd_t fd = h5::open(filename, H5F_ACC_RDWR, h5::mpiio({MPI_COMM_WORLD, info}));
        h5::write(fd,  fmt::format("{:03d}/dataset",current_rank), 
            std::vector<float>(m*n*k, 1.0), // <-- single shot write: dataset size must match file space size 
            h5::independent);               // collective | independent
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if(current_rank == 0) { // serial mode, ran on rank 0
        // dimension >= referenced region or bad things happen
        std::vector<float> in_memory_values(64*64, data_value);

        h5::fd_t fd = h5::open(filename, H5F_ACC_RDWR);
        for(int i=0; i<rank_size; i++){
            h5::reference_t reference = h5::reference(fd, // CREATE a reference
                fmt::format("{:03d}/dataset", i),         // dataset the region is within
                h5::offset{1, 1, 0},                      // location of the region 
                h5::count{64,64,1});                      // size of region 
            // write data into the referenced region, the memepry space of passed pointer must match the region 
            h5::exp::write(fd, fmt::format("{:03d}/dataset", i) , reference, in_memory_values.data());
            // don't forget to write the REFERENCE into the index file for later use,
            // NOTE: the order of data and reference IO calls is arbitrary
            h5::write(fd, fmt::format("{:03d}/reference", i), 
                reference, 
                h5::count{1}, h5::offset{0} ); // <-- which cell you update within file space
        };
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    
	return 0;
}

Ps: dont forget to pull the h5cpp github repo, as I added a write io call variant.

nadvornik.ji · June 29, 2021, 7:44am

Hi gents,

sorry for the delay, just came back from vacation.

@steven, I understand what you are doing, and I agree it’s an implementation of my algorithm in C++.

But maybe I just missed the point how does it solve the Python implementation? Should I compile your program and cythonize it and call it from Python? Then I’d be essentially duplicating the h5py functionality because if I understand correctly, it already claims it can do this by itself.

@gheber The datasets are created via the Group object method “require_dataset” with a contiguous layout. I’ll try to write zeroes into them during the creation and then get back to you if it changes anything.

steven · June 29, 2021, 12:31pm

Thanks for getting back: I think I inquired if you were interested in a C++ solution, I took your reply as an implicit yes. Then I added this functionality to H5CPP, because I could

If you want actual HPC where P stands for Performance, you do want your implementation along the way I provided. Because of the de-referencing there are lot’s of ‘padding’ in such an approach – hidden behind library calls.

I use the following approach with Julia, and turn it into a shared object:

extern "C" {
   void my_exported_function(char*, char*, int [, ...]);
}

Alternatively you could just turn over to the dark side of C++, and let the power flow through your fabric!
cheers: steve

gheber · June 29, 2021, 3:49pm

I don’t know if Group.require_dataset supports H5Pset_alloc_time, which you wanna set to H5D_ALLOC_TIME_EARLY. (You can also suppress writing fill values if the whole thing gets overwritten.) Writing at least one value has the same net effect (the allocation being triggered).

G.

nadvornik.ji · July 24, 2021, 10:02am

Hi, Gerd, you actually hit it. For contiguous layout, it is enough to write one element to get it allocated and then every parallel write works like a charm and the file closes without issues. This is actually something that was not obvious to me even though I could notice that the file with the metadata was too small to have the space allocated. This is, by all means, a big leap in my project - I’m finally able to parallelize the preprocessing on bigger scale.

So the lesson learned for this topic is that if you want to write in parallel to a dataset, you need to make sure it is allocated before the parallel file open happens, otherwise it will just hang on the file closure.

I’m now trying to figure out the chunked storage - if I just change the datasets to use chunks of size (128,128,2), the first parallel batch write runs ok, but if I try to write a second batch it hangs on the write_direct procedure. Does the chunked storage require some special kind of care during allocation or during the writes themselves?

Thank you very much for your help so far!

Cheers,

Jiri

gheber · July 27, 2021, 4:00pm

Are you using H5D_ALLOC_TIME_EARLY and H5D_FILL_TIME_ALLOC or H5D_FILL_TIME_NEVER with the chunked dataset? (I assume you are not using any kind of compression…)

G.

nadvornik.ji · August 7, 2021, 1:12pm

I’m using defaults and just writing one zero to one element of the dataset.

I’m checking the h5py and it uses a cythonized h5d.create procedure and doesn’t seem to support passing any arguments for these.

h5d.pyx has:

dsid = H5Dcreate(loc.id, cname, tid.id, space.id,
             pdefault(lcpl), pdefault(dcpl), pdefault(dapl))

and the default dcpl is created by h5p.pyx:

cdef hid_t newid
newid = H5Pcreate(cls.id)
return propwrap(newid)

So whatever the H5Pcreate procedure uses.

I’d be happy to test it if you could point me to some guide how to handle the properties via python manually.

And yes, I’m not using any compression.

Jiri

thomas1 · August 11, 2021, 9:24am

If you use the h5py low level API, you can do basically everything that HDF5 supports (so long as h5py exposes the functions). It would look something like this (untested):

dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)
space = h5py.h5s.create_simple((10,))
dsid = h5py.h5d.create(f.id, b'data',  h5py.h5t.NATIVE_INT32, space, dcpl=dcpl)

# Wrap in a Dataset object to use the high-level API again
ds = h5py.Dataset(dsid)

As you can see, the basic pattern is that anything with an H5X prefix in the C API belongs to a corresponding h5py.h5x module, and most C functions that take an HDF5 ID as the first argument are exposed as methods (e.g. PropDCID.set_alloc_time()). The low-level API is documented at https://api.h5py.org/, along with a page in the main docs about how the high-level and low-level fit together.

nadvornik.ji · October 8, 2021, 11:19am

Hi all, sorry for the longer delay.

Yes, indeed this was it. I can now write chunked datasets in parallel by using the following:

dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)
dcpl.set_fill_time(h5py.h5d.FILL_TIME_NEVER)

However, after all this work (and thank you very much for your contribution, it wouldn’t be possible without your help ), I’m getting a little disappointed by the performance:

Chunked
Workers	Time [s]	Efficiency	Write speed [MB/s]
1	188	100%	30
2	99	95%	58
4	59	80%	97
8	49	48%	117

Contiguous
Workers	Time [s]	Efficiency	Write Speed[MB/s]
1	103	100%	56
2	54	95%	106
4	36	71%	158
8	31	42%	184

I’m testing this on a NVMe disk with 2GB/s write speed, so that should not be a problem, however, the efficiency drops quite quickly with adding multiple processes.

When looking at the slog file it seems the writes are indeed not a problem. This one is for one producer, 8 workers:

All of the workers work independently (read from disk, do cca 1 sec preprocessing, write to 4 different datasets).

I’m trying to play with the buffer properties, but they are not available within the h5py low level API. @thomas1 how can I set a property that has not a special method in the python property class? There are lots more that seem to exist only in the HDF5 implementation beneath but don’t have setters and getters in python.

Cheers,

Jiri

thomas1 · October 8, 2021, 12:20pm

As far as I know, you can only access the property list functions which h5py exposes. Which specific functions are you wanting to use?

nadvornik.ji · October 8, 2021, 12:57pm

I was thinking the https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetBuffer . I would set the buffer to ~100MB instead of 1 and then observe whether it behaves differently.

I’m not sure what to compare against though. What is a reasonable expectation for the write bandwidth the HDF5 can achieve when using independent writes?

Jiri

thomas1 · October 8, 2021, 1:28pm

Aha, yes. Most of the data transfer plist functions are not exposed in h5py, perhaps because you have to do a lot with the low-level API to use them. It’s not difficult if you’d like them, though - it’s even documented how to expose new functions.

Apologies if it was mentioned somewhere, but are you specifically wanting to convert data on writing? Your minimal example seems to have float32 datasets, whereas np.ones() will make a float64 array by default, so HDF5 has to convert it. This is probably slower than writing the same data type (and I think H5Pset_buffer is irrelevant if there’s no conversion involved). Even if you do need to convert, I might try doing the conversion with NumPy before giving data to HDF5 (arr.astype(np.float32)).

I don’t have a good feel for what bandwidth you can expect, I’m afraid.

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

H5i_dec_ref hangs