HDF5 virtual dataset out of a file with a compound type?


#1

Hi,

one of our applications uses HDF5 files with a compound type. Something like the following (simplified here):

,----
| HDF5 "test.h5" {
| GROUP "/" {
| ATTRIBUTE "GRID_DIMENSIONS" {
| DATATYPE H5T_STD_I32LE
| DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
| }
| ATTRIBUTE "X_AXIS" {
| DATATYPE H5T_IEEE_F64LE
| DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
| }
| ATTRIBUTE "Z_AXIS" {
| DATATYPE H5T_IEEE_F64LE
| DATASPACE SIMPLE { ( 70 ) / ( 70 ) }
| }
| GROUP "Module" {
| DATASET "g_data" {
| DATATYPE H5T_COMPOUND {
| H5T_IEEE_F64LE "temp";
| H5T_IEEE_F64LE "density";
| H5T_ARRAY { [3] H5T_IEEE_F64LE } "B";
| H5T_ARRAY { [3] H5T_IEEE_F64LE } "V";
| H5T_ARRAY { [20] H5T_IEEE_F64LE } "dm";
| H5T_ARRAY { [9] H5T_IEEE_F64LE } "jkq";
| }
| DATASPACE SIMPLE { ( 70, 3, 3 ) / ( 70, 3, 3 ) }
| }
| }
| }
| }
`----

And now I’m trying to find a way to create a virtual dataset (or similar) so that I can “extract/filter” data from the compound type. The idea was that perhaps I could create, for example, a virtual dataset “density”, taking the actual data from the “density” component of the actual g_data dataset.

Is that something that can be done?

Any ideas/pointers that can help me with this?

Many thanks,


#2

Solution

This solution may be checked out from my GitHub page. See the Makefile and the related examples installed into /usr/share/h5cpp/examples/.

  1. create a C++ POD struct, in a namespace or not, to describe the “Module/g_data” dataset
  2. read the dataset chunk by chunk, in one shot, or …
  3. invoke the h5cpp compiler to fill in the missing type descriptor
  4. invoke a C++17 compiler to compile and link the project against -lhdf5 -lz -ldl -lm

Create POD struct

Manually describe the dataset from h5dump -pH test.h5 as a C++ POD struct:

#ifndef  H5TEST_STRUCT_01 
#define  H5TEST_STRUCT_01

namespace sn {
	struct record_t {     // POD struct with nested namespace
		double temp;
		double density;
		double B[3];
		double V[3];
		double dm[20];
		double jkq[9];
	};
}
#endif

write software

as if you had all the HDF5 type descriptors available. In fact, forget about the details; you won’t need them. Just write your software as if you didn’t know much about the HDF5 CAPI:

#include <iostream>
#include <vector>
#include "struct.h"
#include <h5cpp/core>
	// generated file must be sandwiched between core and io 
	// to satisfy template dependencies in <h5cpp/io>  
	#include "generated.h"
#include <h5cpp/io>
int main(){
	h5::fd_t fd = h5::create("test.h5", H5F_ACC_TRUNC);
	{ // this is to create the dataset
		h5::create<sn::record_t>(fd, "/Module/g_data", h5::max_dims{70,3,3} );
		// chunk must be set for partial access (huge files):  h5::chunk{1,3,3}
	}
	{ // read entire dataset back
		using T = std::vector<sn::record_t>;
		// for partial read be certain dataset is chunked, see documentation @ sandbox.h5cpp.org
		auto dataset = h5::read<T>(fd,"/Module/g_data");

		for( auto rec:dataset ) // this is your HPC loop
			std::cerr << rec.temp <<" ";
		std::cerr << std::endl;
	}
}

invoke h5cpp compiler

The LLVM-based source code transformation tool will fill in the details for you: a minimal type descriptor for the H5CPP template library to save data into HDF5 format.

#ifndef H5CPP_GUARD_jAkGV
#define H5CPP_GUARD_jAkGV

namespace h5{
    //template specialization of sn::record_t to create HDF5 COMPOUND type
    template<> hid_t inline register_struct<sn::record_t>(){
        hsize_t at_00_[] ={3};            hid_t at_00 = H5Tarray_create(H5T_NATIVE_DOUBLE,1,at_00_);
        hsize_t at_01_[] ={20};            hid_t at_01 = H5Tarray_create(H5T_NATIVE_DOUBLE,1,at_01_);
        hsize_t at_02_[] ={9};            hid_t at_02 = H5Tarray_create(H5T_NATIVE_DOUBLE,1,at_02_);

        hid_t ct_00 = H5Tcreate(H5T_COMPOUND, sizeof (sn::record_t));
        H5Tinsert(ct_00, "temp",	HOFFSET(sn::record_t,temp),H5T_NATIVE_DOUBLE);
        H5Tinsert(ct_00, "density",	HOFFSET(sn::record_t,density),H5T_NATIVE_DOUBLE);
        H5Tinsert(ct_00, "B",	HOFFSET(sn::record_t,B),at_00);
        H5Tinsert(ct_00, "V",	HOFFSET(sn::record_t,V),at_00);
        H5Tinsert(ct_00, "dm",	HOFFSET(sn::record_t,dm),at_01);
        H5Tinsert(ct_00, "jkq",	HOFFSET(sn::record_t,jkq),at_02);

        //closing all hid_t allocations to prevent resource leakage
        H5Tclose(at_00); H5Tclose(at_01); H5Tclose(at_02); 

        //if not used with h5cpp framework, but as a standalone code generator then
        //the returned 'hid_t ct_00' must be closed: H5Tclose(ct_00);
        return ct_00;
    };
}
H5CPP_REGISTER_STRUCT(sn::record_t);

#endif

Actual Output:

h5cpp  struct.cpp -- -std=c++17 -I/usr/include -I/usr/include/h5cpp-llvm -Dgenerated.h
H5CPP: Copyright (c) 2018     , VargaConsulting, Toronto,ON Canada
LLVM : Copyright (c) 2003-2010, University of Illinois at Urbana-Champaign.
g++ -I/usr/include -o struct.o  -std=c++17 -c struct.cpp
g++ struct.o -lhdf5  -lz -ldl -lm -o struct	
./struct
h5dump -pH test.h5
HDF5 "test.h5" {
GROUP "/" {
   GROUP "Module" {
      DATASET "g_data" {
         DATATYPE  H5T_COMPOUND {
            H5T_IEEE_F64LE "temp";
            H5T_IEEE_F64LE "density";
            H5T_ARRAY { [3] H5T_IEEE_F64LE } "B";
            H5T_ARRAY { [3] H5T_IEEE_F64LE } "V";
            H5T_ARRAY { [20] H5T_IEEE_F64LE } "dm";
            H5T_ARRAY { [9] H5T_IEEE_F64LE } "jkq";
         }
         DATASPACE  SIMPLE { ( 70, 3, 3 ) / ( 70, 3, 3 ) }
         STORAGE_LAYOUT {
            CONTIGUOUS
            SIZE 0
            OFFSET 18446744073709551615
         }
         FILTERS {
            NONE
         }
         FILLVALUE {
            FILL_TIME H5D_FILL_TIME_IFSET
            VALUE  H5D_FILL_VALUE_DEFAULT
         }
         ALLOCATION_TIME {
            H5D_ALLOC_TIME_LATE
         }
      }
   }
}
}

best wishes:
steven


#3

We can quibble about terminology here, but I believe what you want might be called a ‘dataset view’ rather than a virtual dataset (VDS). VDS is based on the idea of stitching together a dataset based on selections from other datasets. You are looking for a “projection” on the datatype. You can certainly do partial I/O on a VDS of a compound type, but to my knowledge there is currently no standard vehicle to store this like a view (a precompiled query) in the database world.
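To make that distinction concrete, here is a minimal sketch using h5py (an assumption on my part; the file names src0.h5, src1.h5, and vds.h5 are illustrative, and VDS requires HDF5 1.10 or later) of what VDS does provide: stitching selections of other datasets into a new shape, while the element type stays fixed.

```python
import h5py
import numpy as np

# two small source files, each holding a 1-D float dataset
for i in range(2):
    with h5py.File(f"src{i}.h5", "w") as f:
        f.create_dataset("data", data=np.full(5, float(i)))

# stitch both sources into one 2x5 virtual dataset: VDS virtualizes
# the *shape* via selections; the element type itself is untouched
layout = h5py.VirtualLayout(shape=(2, 5), dtype="f8")
for i in range(2):
    layout[i] = h5py.VirtualSource(f"src{i}.h5", "data", shape=(5,))

with h5py.File("vds.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("stacked", layout, fillvalue=-1.0)

# the virtual dataset reads like an ordinary 2-D dataset
with h5py.File("vds.h5", "r") as f:
    stacked = f["stacked"][...]
```

Note there is no place in VirtualLayout to say “only the density member”: the dtype applies to whole elements, which is exactly the limitation discussed here.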


#4

Hi Steven,

thanks for the complete sample code, but either I’m missing something or your code doesn’t really address my question. I already have a .h5 file with the compound data in it. My question is how to easily extract one of the components of that compound type as if it were a separate dataset. I don’t know if it is possible at all, but ideally I would like to be able to do something like:

h5dump -d density test.h5

where the actual heavy data for density is only in the g_data compound dataset, and the “density” dataset is only a virtual dataset, dataset view, whatever…

Cheers,


#5

Hi gheber,

yes, I’m not sure at all about which HDF5 feature I could use to attain this, but a dataset view sounds more promising. I know how to get the dataset that I want with code. For example, with Python I already do something like:

f['Module/g_data']['density'][:,:,:]

but I would like to find a way to have that as “a precompiled query”, as you put it, so that with standard HDF5 tools, for example h5dump, I could do:

h5dump -d /density test.h5

This would simplify things a lot in a number of our use cases for these files.
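For context, the field access above can be reproduced end to end with h5py (a sketch; the dtype mirrors the h5dump output from the first post, and the file name is illustrative). h5py’s field indexing reads just the requested member, which is the “projection” in question, only not persisted in the file:

```python
import h5py
import numpy as np

# POD layout mirroring the compound type from the h5dump output
rec_t = np.dtype([("temp", "f8"), ("density", "f8"),
                  ("B", "f8", (3,)), ("V", "f8", (3,)),
                  ("dm", "f8", (20,)), ("jkq", "f8", (9,))])

data = np.zeros((70, 3, 3), dtype=rec_t)
data["density"] = np.arange(70 * 3 * 3, dtype="f8").reshape(70, 3, 3)

with h5py.File("test.h5", "w") as f:
    f.create_dataset("Module/g_data", data=data)

# field indexing: h5py reads only the 'density' member of each record
with h5py.File("test.h5", "r") as f:
    density = f["Module/g_data"]["density"]   # a plain (70, 3, 3) array
```

The missing piece remains exactly what the thread asks for: making such a field visible as a first-class dataset to tools like h5dump without code.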

Thanks,


#6

Hi Angelv,

I do not know how to do that. Nor do I know the motivation, is it to:

  • save space?
  • reach a field of the heterogeneous dataset, hmm: faster?

As I said to Gerd Heber this morning, when we briefly discussed your case: please provide me an implementation and let’s compare properties; I know what you want, and you can’t have it that way.

You are missing the part that the dataset has an actual physical layout, and a cost associated with I/O ops from/to a location, which is a function of the underlying storage (possibly a block device), … .

best: steve


#7

Actually Gerd is right saying I didn’t quite get what you want to do. Honestly I am still not quite getting it:

This is similar to what I posted, except the sieving part is not hidden. You can use Armadillo C++ or std::vector as a sieve buffer to get a similar effect; add chunking for better performance. In any event, I would just write the utility, using the code base I posted for you combined with Boost program_options. Consider: it is only a few lines for the special case – the general case with arbitrary compound types is much harder to do.

best: steve


#8

How about something like this?

h5dump -d Module/g_data --fields density,... test.h5

--fields would take a list of field names. If the type turned out not to be a compound, or a field was non-existent, a warning would be printed / an error generated.

If that fits the bill, you could file a request in JIRA https://jira.hdfgroup.org/.

G.


#9

Hi Steven,

what I want to get is simple: to have an easy way to access one of the components of the compound data just as if it was its own 3D dataset (without having to write code for it, being somehow part of the description of the HDF5 file itself).

The motivation for it comes from visualizing the data in these files (although I can imagine other scenarios where this could be useful). Currently we use VisIt, and since this is, obviously, a home-grown HDF5 type of file, VisIt cannot read it. We use HDF5 for the ‘heavy’ data and XDMF for the ‘light’ data (description of where the actual data is, grid dimensions, type of grid, etc.). So, for example, for the density field in the XDMF file I will specify where the data for this 3D field is coming from, for example with something like:

<DataItem
    Name="Points"
    Dimensions="100 200 300 3"
    Format="HDF">
    MyData.h5:/XYZ
</DataItem>

But XDMF has no syntax for accessing a component in a compound type, so there is no way for me to get the density field into VisIt. So I was hoping that with virtual datasets or a dataset view or some other advanced trick, I could get it as part of the HDF5 file description itself, and so I could access it from within VisIt.

What we do now is to take the test.h5 file and, with a Python script, extract the individual 3D scalar fields to individual datasets (well, actually we transform them into .vtk files, but the idea is the same), which we can then easily read from within VisIt, ParaView, Mayavi, etc. This is not ideal (we have to post-process our HDF5 files and we end up replicating the data, and these files can be quite big), so I’m trying to find a way out of this.

I hope it is clearer now.

Any hints/ideas very welcome.


#10

Hi gheber,

if it is something specific to h5dump I guess it would not be what I’m looking for. The low-level machinery to extract the data would be right, but it would not be part of the test.h5 file itself.

Ideally what I would like is to somehow include this description of how to get the “density” values embedded in the test.h5 file itself, so any tool/program that understands HDF5 could be “fooled” into believing that “density” is a 3D dataset on its own right.

That would be fantastic: I could have the actual data nicely grouped together in a compound datatype (which for a number of reasons works much better in our application), but then I could access the 3D fields for each of its components with the simplicity of simple individual datasets.

I hope it makes sense.
Thanks


#11

It does. You can still file an improvement request. What (I believe) you are saying is that our array variables have two characteristics: shape and (element-)type. VDS has opened the shape to “virtualization,” but left the type untouched. You are suggesting to open the type to conversion (dropping a few components of a compound is an example of that) and let users compose (and persist!) such “views.” This is certainly compatible w/ the HDF5 data model, since layout, the mechanics of how dataset elements are stored and produced, is an implementation detail.

For the time being, you could mimic this behavior with a convention: you could create a committed type and decorate it with an object reference attribute, which refers to the dataset you’d like to read/reduce. (If you had more than one dataset, you could create a special group and link them all together.) There would be an extra step in which you read the (committed) datatype, make it (its native version) the in-memory type of your H5Dread, open (de-reference) the object reference, and start reading. You’d have to know this protocol/interpretation, but it would tide you over until we have fully fledged views. Does that make sense?


#12

Thanks for the details. I don’t see any simple way of doing this other than the code block I posted – which probably beats the Python version; let’s see what others have to say.

If this problem has a budgeted value either contact the HDFGroup consulting services or get in touch with me privately.
best: steve


#13

Taking a second look, I learned of the VisIt visualisation software, and it turns out it has a plugin framework for file formats it doesn’t yet know about. That appears to be your case: basically the view @gheber mentioned is provided through this custom plugin, or transfer function, that you need to develop.

Having said this, you are still bound by the existing constraints of the physical layout: the throughput is limited by the ratio of useful to not-useful fields within a read block. With a clever re-design this limitation may be lifted; but that is another cup of tea.
On page 89 of this GettingDataIntoVisIt2.0.0.pdf (5.2 MB) document, you find the description of the development process, which you may implement in-house or contract out to an external software developer.

My previously posted code is applicable, and if the POD structure matches, you can use the generated.h without modification (you won’t need the h5cpp compiler). To improve performance, make sure your chunk size is optimal and that you do partial I/O while sieving the dataset. It should look something like this:

auto ds = h5::open(fd, "dataset");
/* make room for the sieve buffer; a smart pointer is used here, but
   std::vector or any supported linalg object will also do fine */
std::unique_ptr<double[]> buffer(new double[buffer_size]);

for(int i = 0; i < n_chunks; i++){
    // when used with HDF5 1.10.4 or greater, and no `h5::stride | h5::block`
    // is specified, this call uses direct chunk read -- very hard to beat the throughput!
    h5::read(ds, buffer.get(), h5::offset{i * buffer_size}, h5::count{buffer_size});
    /* ** do the sieving here, chunk by chunk:
        copy the field you need into the VTK data structure ** */
}

There must be a significant improvement because:

  1. optimized direct chunk read
  2. on-demand copy directly from the HDF5 container to the VTK object
  3. C++ is, hmm, not Python?

steve


#14

Thanks Steven. I had considered writing a plugin for VisIt (our application is written in C; the Python example is only from sample post-processing scripts, and we do all the reading/writing of the compound data with no problem), but that would lock us into VisIt. Other users of our code might want to visualize the results using, for example, ParaView, so that is why I’m trying to find a solution at the HDF5 file level, which would have broader applicability.
Cheers,
Angel


#15

Many thanks, Gerd. I will file an improvement request then.

As for the second paragraph, where you mention how to mimic this behaviour, I’m not sure I properly follow. I will try to see if I understand what you mean when I have some time to look at the implementation details, likely on Monday. Cheers.


#16

Let’s say your original compound dataset, fields X, Y, Z, is linked as /a/b/c/dataset. Let’s say you’d like a “dataset view” for just the Y component. You can define and link a compound datatype object (committed datatype) that has just the Y field. Let’s say we’ve linked it as /e/f/g/Y_only. Since it’s an “HDF5 object”, it can be decorated with HDF5 attributes. We can create an attribute, say "source", of type HDF5 object reference and initialize the reference to point to our original dataset.

When the time comes to read just the Y component, we need to supply 1) the source dataset, 2) the in-memory type, and 3) in-memory and in-file selections. Looking at /e/f/g/Y_only, we get the source dataset by de-referencing the value of the "source" attribute. The in-memory type is the native version of the datatype object at /e/f/g/Y_only (i.e., H5Tget_native_type of H5Topen of /e/f/g/Y_only). In other words, the “decorated” data type object contains all that’s required to “encode” your dataset view (minus dataspace selections).

Variants are possible, for example, if the source dataset is in a different file, or you want multiple source datasets. In any event, you will mimic an aggregate of the datatype you want to read and source dataset(s), and a protocol how they are named and to be interpreted.
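Assuming h5py, the convention described above could be sketched roughly like this (the paths /a/b/c/dataset and /e/f/g/Y_only and the "source" attribute name follow the made-up protocol from this thread, not any official API):

```python
import h5py
import numpy as np

full_t = np.dtype([("X", "f8"), ("Y", "f8"), ("Z", "f8")])
data = np.zeros((2, 2), dtype=full_t)
data["Y"] = [[1.0, 2.0], [3.0, 4.0]]

with h5py.File("view_demo.h5", "w") as f:
    ds = f.create_dataset("a/b/c/dataset", data=data)
    # commit a datatype holding only the field(s) of the "view"
    grp = f.require_group("e/f/g")
    grp["Y_only"] = np.dtype([("Y", "f8")])
    # decorate it with an object reference pointing at the source dataset
    grp["Y_only"].attrs["source"] = ds.ref

# a reader that only knows the convention
with h5py.File("view_demo.h5", "r") as f:
    view = f["e/f/g/Y_only"]         # the committed datatype object
    src = f[view.attrs["source"]]    # de-reference -> the source dataset
    field = view.dtype.names[0]      # "Y"
    y = src[field]                   # read just that member
```

The reader never hard-codes the source path or the field name; both are recovered from the decorated datatype object, which is the point of the protocol.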

G.


#17

Hi,

way too busy during January, so I postponed this for too long… I was planning to file an improvement request for this stuff, but realized that I don’t know where I should do it. What would be the right channel to ask for such an improvement request?

Many thanks,
AdV


#18

Hi,

You can send an email to help @ hdfgroup.org or submit via the web at https://help.hdfgroup.org.

Thanks!

Lori Cooper
Product Marketing Associate


#19

I don’t know where time has gone …

I created today the feature request:
https://jira.hdfgroup.org/servicedesk/customer/portal/2/SUPPORT-1258

Many thanks,