Creating a Group per MPI process using parallel HDF5 with the C++ API

maric · September 21, 2018, 4:08pm

Hello everyone,

I have asked this question on Stack Overflow and got some guidance, but I am still not sure about what I should do.

I am trying to use the C++ API of parallel HDF5 (PHDF5) to write a simple parallel program in which each MPI process creates a group in an HDF5 file. This is what I have so far, slightly modified from the Stack Overflow post:

#include <iostream>
#include <sstream>
#include <string>

#include <mpi.h>
#include "H5Cpp.h"

using namespace H5;
using namespace std;

int main(void)
{
    MPI_Init(NULL, NULL); 

    // Get the number of processes
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Get the rank of the process
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    auto acc_tpl1 = H5Pcreate(H5P_FILE_ACCESS);
    /* set Parallel access with communicator */
    H5Pset_fapl_mpio(acc_tpl1, MPI_COMM_WORLD, MPI_INFO_NULL);

    // Creating the file with H5File stores only a single group with 4 MPI processes.
    auto testFile = H5File("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, acc_tpl1);

    for (unsigned int i = 0; i < size; ++i)
    {
        std::stringstream ss; 
        ss << "/RANK_GROUP" << rank; 
        string rankGroup {ss.str()}; 
        // Create the rank group with testFile.
        if (! testFile.exists(rankGroup))
        {
            cout << rankGroup << endl; 
            testFile.createGroup(rankGroup);
        }
    }

    // Release the file-access template 
    H5Pclose(acc_tpl1);

    // Release the testFile 
    testFile.close();

    MPI_Finalize();

    return 0;
}

I have compiled the program using:

h5c++ test-mpi-group-creation.cpp -o test-mpi-group-creation

h5c++ wraps g++ (GCC) 8.2.0. I am using the community/hdf5-openmpi 1.10.3-1 HDF5 library with Open MPI support on Arch Linux, together with extra/openmpi 3.1.2-1.

The output of the program execution is:

mpirun -np 4 ./test-mpi-group-creation 2>&1 | tee log 
/RANK_GROUP0
/RANK_GROUP3
/RANK_GROUP2
/RANK_GROUP1

This looks fine: every process executes the branch and reports the group it is creating. However, when I examine the test.h5 file, I see that only the process with rank 1 has created a group:

h5ls -lr test.h5 
/                        Group
/RANK_GROUP1             Group

I still have a few questions:

I am using a mix of the C++ and C APIs. I used the C API to set up the MPI driver for the file, based on the online documentation. How can I do the same using the C++ API?
Why aren’t the groups written to the test.h5 file as expected?

bljones · October 23, 2018, 6:13pm

Hello!

Group creation is a collective operation: all processes must participate in it. See this page for information on how HDF5 functions must be called (collectively or independently):

https://portal.hdfgroup.org/display/HDF5/Collective+Calling+Requirements+in+Parallel+HDF5+Applications
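
As a rough sketch of what "collective" means for the program in the question: every rank has to issue the same sequence of group-creation calls, so the group name has to be built from the loop index i rather than from the local rank. The helper below is hypothetical (the names createRankGroups and rankGroupNames and the callback parameter are assumptions, not part of HDF5). It takes the actual creation call as a callback so that it compiles without MPI or HDF5 installed.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: issues one creation call per rank, in the same order
// on every process. 'createGroup' stands in for the real HDF5 call.
template <class CreateGroup>
void createRankGroups(int size, CreateGroup createGroup)
{
    for (int i = 0; i < size; ++i)
    {
        // Use the loop index, not the caller's own rank, so that all
        // processes ask for the same set of groups.
        createGroup("/RANK_GROUP" + std::to_string(i));
    }
}

// Convenience for inspection: the names each process would create.
std::vector<std::string> rankGroupNames(int size)
{
    std::vector<std::string> names;
    createRankGroups(size, [&names](const std::string& name) { names.push_back(name); });
    return names;
}
```

With the C API, the callback would presumably wrap H5Gcreate2(file, name.c_str(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT) followed by H5Gclose on the returned handle, where file is the hid_t of a file opened with the MPI-IO file access property list.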

This page has general information on using Parallel HDF5:

https://portal.hdfgroup.org/display/HDF5/Parallel+HDF5

Also, Parallel HDF5 APIs are only available for C and Fortran.

-Barbara

steven · October 23, 2018, 6:37pm

Hello,

I just noticed this post, and would like to add that H5CPP is a new approach to an HDF5 C++ interface and will support MPI by early December 2018.
The project may be freely downloaded from GitHub; see http://h5cpp.ca for details.

Best wishes,
Steven

maric · October 25, 2018, 3:41pm

Hi Steven,

What is the difference between H5CPP and HighFive? It seems that H5CPP focuses on writing data used by numerical linear algebra algorithms, which is also very interesting, but I currently need a simple-to-use API for error arrays.

Thanks!

Best regards,
Tomislav

maric · October 25, 2018, 3:42pm

Hi Barbara,

Thanks for the help!

Will the C++ HDF API be extended to support parallel IO with MPI at some point?

Best regards,
Tomislav

steven · October 25, 2018, 5:30pm

Hello Tomislav,

While a forum is not the right place to compare two software solutions with slightly overlapping goals and properties, this link to the H5CPP website provides a detailed description of what the project is about. The provided examples are rather simple. The template library's Python-like constructs (driven by an embedded parser based on compile-time template metaprogramming) make the function calls devilishly easy.

All HDF5 CAPI calls are profiled and continuously monitored for performance: check out the packet table implementation and compare it with others. But wait: even that will improve about 100-fold once I revisit that interface and replace the current H5Dwrite with the H5Owrite optimized version.
H5CPP compiler technology matched with templates aims for non-intrusive persistence for C++, as The HDF Group and I presented at the Chicago C++ user group meeting at the end of August.

Currently the major linear algebra packages and std::vector<POD_struct | integral types> are supported, and in a few weeks the entire STL will be added.
In addition, with the newest version you get full error handling, RAII, attributes, and comprehensive coverage of HDF5 properties, so that tuning is added only when it is needed.

Not sure if your ‘error arrays’ fit any of the containers mentioned?

steven

maric · October 26, 2018, 3:13pm

Hi Steven,

Thanks a lot for the detailed information. I just want to use HDF5 in my C++ applications for computational physics to store my results, preferably also in parallel using MPI. My error arrays are simply fixed-length collections of double values that I want to categorize into groups and document with metadata.

Thanks again!

Tomislav

steven · October 26, 2018, 3:47pm

To group them I would place the arrays in different directories; this is what I do with financial datasets. Each dataset can have attributes (strings, structs, arrays, …), which would hold your description. Would this work?

In any event: can you provide a minimal working example of the C++ code block, and possibly a schema of how you want it laid out in HDF5 format? That way I could provide you with an example, which should be fairly simple to put together.

Steve

maric · October 26, 2018, 4:45pm

Hi Steven,

I am currently trying out the C API as well, so I don’t want to take up your time with the example. I’ll take a look at H5CPP once I have tried the C API. In my honest opinion, the C API is cumbersome to use in C++ code, as I am used to destructors doing the cleanup work for files, groups, etc., so I will most likely switch to H5CPP later, but I think I should at least take a look at the C API first.

Thanks a lot for your help!

Best regards,
Tomislav

steven · October 27, 2018, 10:22pm

Tomislav,

Take a look at the H5CPP design pattern, which guarantees RAII for all resources and automatic implicit/explicit conversion to CAPI hid_t handles. In other words, H5CPP handles can be passed directly to CAPI calls and will be cleaned up (closed) when leaving the code block.
The life span can be extended beyond the code block with std::move semantics.

Put differently: for patterns not yet implemented in H5CPP, you just pass the resource like this:

{ // code block
    h5::fd_t fd = h5::open("hdf5_file.h5", ... optional args );
    h5::gr_t gr{ H5Gopen(fd, ...)};   // your CAPI call: NOTE the BRACES!
    H5Giterate( gr, ... ); // do your thing
} // guaranteed cleanup of resources: fd, gr

All resources are wrapped, and the conversion can be set to implicit | explicit | prohibit, so it does exactly what you are used to.
Here is the list of resource ids:

	/*file:  */ h5::fd_t;  /*dataset:*/	h5::ds_t; /*packet table:*/ h5::pt_t;
	/*attrib:*/ h5::at_t;  /*group:  */	h5::gr_t;  /*object:*/      h5::ob_t;
	/*space: */ h5::sp_t; 
	/*datatype:*/   h5::dt_t;

The brace-enclosed initializer list / direct initialization takes ownership of hid_t CAPI-style handles and calls the right H5??close in the destructor.
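
A minimal sketch of the ownership pattern described above, not H5CPP's actual implementation: a move-only wrapper that takes over a raw integer handle on direct initialization, converts implicitly back to the raw handle, and calls a close function exactly once in its destructor. All names here (RawHandle, UniqueHandle, countingClose, DemoHandle) are hypothetical, and the closer only counts calls so the sketch runs without HDF5.

```cpp
#include <utility>

// Stand-in for an hid_t-style integer handle; a negative value means "invalid".
using RawHandle = long long;

// Hypothetical move-only owner. 'Close' is any function that releases a handle,
// for example a thin wrapper around H5Gclose in real HDF5 code.
template <void (*Close)(RawHandle)>
class UniqueHandle
{
public:
    explicit UniqueHandle(RawHandle h = -1) noexcept : h_(h) {}
    ~UniqueHandle() { reset(); }

    // Copying is disabled so a handle is never closed twice.
    UniqueHandle(const UniqueHandle&) = delete;
    UniqueHandle& operator=(const UniqueHandle&) = delete;

    // Moving transfers ownership and leaves the source invalid.
    UniqueHandle(UniqueHandle&& other) noexcept : h_(std::exchange(other.h_, -1)) {}
    UniqueHandle& operator=(UniqueHandle&& other) noexcept
    {
        if (this != &other)
        {
            reset();
            h_ = std::exchange(other.h_, -1);
        }
        return *this;
    }

    // Implicit conversion so the wrapper can be passed where a raw handle is expected.
    operator RawHandle() const noexcept { return h_; }

private:
    void reset()
    {
        if (h_ >= 0) Close(h_);
        h_ = -1;
    }

    RawHandle h_;
};

// Demo closer that only counts calls, so the pattern can be observed without HDF5.
int closeCalls = 0;
void countingClose(RawHandle) { ++closeCalls; }
using DemoHandle = UniqueHandle<countingClose>;
```

Writing DemoHandle g{rawValue}; inside a block then mirrors the h5::gr_t gr{ H5Gopen(fd, ...) }; line above: the wrapper owns the handle until the block ends or until it is moved elsewhere with std::move.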
hope it helps
steven

maric · October 30, 2018, 4:26pm

Hi Steven,

Thanks a lot for the help! I will look into H5CPP as soon as the time allows!

Best regards,
Tomislav

maric · October 31, 2018, 9:44am

Hi Steven,

I have tried the C API to learn a bit, and it really does seem cumbersome to use in a C++ program, because of the resource management but also the data conversion for multidimensional datasets.

Also the native C++ API to HDF5 seems to make it difficult to do the most natural thing, write a 1D std::vector as a dataset into a file.

As a C++ programmer that wants to organize his research data with HDF5, all I want to do is store single doubles and 1D double vectors as errors from my simulation into an organized structure, that uses metadata to reflect the structure of the parameter study / simulation I am computing, and this already seems to be a problem for the C++ API and other C++ libraries I have tried so far.

I am trying out H5CPP next; I may bother you with questions.

Tomislav

steven · October 31, 2018, 2:16pm

The simplest way is:

#include <algorithm>
#include <vector>
#include <h5cpp/all>

int main(){

    std::vector<double> v(10);
    std::fill(std::begin(v), std::end(v), 1e0);
    // create the HDF5 container; see the H5CPP or CAPI docs for details, they match
    // the returned 'fd' descriptor can be passed to CAPI calls, cleans up after itself,
    // and is binary compatible with the CAPI hid_t
    h5::fd_t fd = h5::create("example.h5",H5F_ACC_TRUNC);
    // what good would H5CPP be if it didn't know of std::vector?
    h5::write(fd,"stl/vector/full.dat", v);
    // note: it does all the 'things' you've been asking people on this forum about, and is capable of a lot more
}

Tomislav, keep in mind that while H5CPP does the right thing without requiring any knowledge of the CAPI, if you do know the CAPI you should easily be able to integrate and mix CAPI code with H5CPP's template-metaprogramming-assisted technology. See this link for supported objects. H5CPP provides nearly 100% coverage of CAPI internals; only the documentation is lagging**, and FYI: full STL support is coming soon!

** Attributes are not documented yet; h5aread and h5awrite should work for all supported objects plus rank 0 objects, character arrays, …. Wait a few weeks for eye-catching syntactic sugar, or look at examples/attributes for the current status.

In addition to the simple case, here is a more complicated one, which you can freely download from the project's GitHub page:

int main(){
	// RAII will close the resource; no need for H5Fclose( any_longer );
	h5::fd_t fd = h5::create("example.h5",H5F_ACC_TRUNC);
	{
		std::vector<double> v(10);std::fill(std::begin(v), std::end(v), 1e0 );
		h5::write(fd,"stl/vector/full.dat", v); // simplest example

		//An elaborate example to demonstrate how to use H5CPP when you know the details, but no time/budget
		//to code it. The performance must be on par with the best C implementation -- if not: shoot an email and I fix it
		h5::create<double>(fd,"stl/vector/partial.dat",
				// arguments can be written any order without loss of performance thanks to compile time parsing
				h5::current_dims{20,10,5},h5::max_dims{H5S_UNLIMITED,10,5}, h5::chunk{1,10,5} | h5::gzip{9} );

		// you have some memory region you liked to read/write from, and H5CPP doesn't know of your object + no time to
		// fiddle around you want it done:
		// SOLUTION: write/read from/to memory region, NOTE the type cast: h5::write<DOUBLE>( ... );
		h5::write<double>(fd,"stl/vector/partial.dat",  v.data(), h5::count{3,1,1}, h5::offset{2,1,1} );
	}


	{ // creates + writes entire POD STRUCT tree
		// THIS MAY REQUIRE THE H5CPP LLVM-assisted compiler technology
		std::vector<sn::example::Record> vec = h5::utils::get_test_data<sn::example::Record>(20);
		h5::write(fd, "orm/partial/vector one_shot", vec );
		// dimensions and other properties specified additional argument 
		h5::write(fd, "orm/partial/vector custom_dims", vec,
				h5::current_dims{100}, h5::max_dims{H5S_UNLIMITED}, h5::gzip{9} | h5::chunk{20} );
		// you don't need to remember order, compiler will do it for you without runtime penalty:
		h5::write(fd, "orm/partial/vector custom_dims different_order", vec,
			 h5::chunk{20} | h5::gzip{9}, 
			 h5::block{2}, h5::max_dims{H5S_UNLIMITED}, h5::stride{2}, h5::current_dims{100}, h5::offset{3} );
	}
	{ // read entire dataset back
		using T = std::vector<sn::example::Record>;
		auto data = h5::read<T>(fd,"/orm/partial/vector one_shot");
		std::cerr <<"reading back data previously written:\n\t";
		for( auto r:data )
			std::cerr << r.idx <<" ";
		std::cerr << std::endl;
	}
}

And in closing: as the architect of H5CPP I am keenly interested in your problem. Perhaps we can discuss it in detail so you get what you want with less work?

steven
