HDF5 C++ Webinar Followup - recording and Q&A transcript


#1

Hi all,

This post contains information on the HDF5 C++ Webinar, which took place on January 24th, 2019. It also serves as an introduction to this section of our forum, the C++ User’s Group. Feel free to post in this category whether your topic relates to the webinar content or to HDF5 C++ in general.

As a reminder, you can read more about the presentations from the webinar on our blog.

You can catch the recording of the webinar at https://youtu.be/7A5dPL7zrj0

A transcript of the question and answer session has been posted on our blog at https://www.hdfgroup.org/2019/01/hdf5-c-webinar-followup/

We hope you appreciate these presentations from our community members and welcome your participation in our community discussions. Our team is eager to learn and work directly with you on your HDF5 initiatives and programs. Let us know how we can help by contacting us at info@hdfgroup.org.

Again, thank you to our community member presenters, and to all who could join us.


#2

Good presentation! The C++ wrappers are a hot topic for me, having created my own internal version closer to the h5cpp “wrapper”; I’m sorry I missed this one, as I fell ill at the time.

I have some responses to the questions and meeting content.

  1. Biggest issue - H5CPP and h5cpp cannot have the same name. Draw lots, anything; just one of the projects has to rename, and the sooner the better. I had already used the h5cpp “wrapper” and HighFive and didn’t even know H5CPP was a different thing. Now I have to explain the situation to others, and HDF5 is the victim in terms of adoption and maturity perceptions, on top of the injury the confusion causes to engineering projects.

  2. State that the internal HDF5 C++98 wrapper API is deprecated and in maintenance mode for commercial customers. Put it in the documentation. Redirect to the other libraries. Elena basically said all of these things, but the position hasn’t been made absolutely clear or put somewhere people visiting the site will pick up on.

  3. I do not see the reason for a public Ntuple in both H5CPP and the h5cpp wrapper at this point, though I understand how and why it was created. I’m an h5py / pandas user too, and I can do row- or column-major (struct of arrays or array of structs) with the h5cpp-styled libraries, depending on where I want one or the other. Not a criticism!

  4. I need more time to evaluate H5CPP from Steven Varga. I’ve already been down my own road with the LLVM/Clang approach it has taken (https://github.com/nevion/metapod) and with libraries more in the vein of the h5cpp “wrapper”; however, I didn’t find much on how to build compound types manually - I want both as first-class citizens. It probably has it, but the front-facing documentation page got in the way and just pointed to h5cpp at every opportunity. Out-of-the-box support for Eigen and other matrix flavours is definitely noted, but I think the most noteworthy thing - which, again, I need to audit - is the new locally implemented “packet table”. I’ve got lots of history with the packet table as it exists already (search the ML…)

  5. Between the h5cpp wrapper and H5CPP, I think you both built great libraries. But you don’t do different things; you fill the exact same gaps as far as I can tell. Consolidate again - the choice will only fragment the community and cause harm in the long term. HDF Group, please make a recommendation for a C++ library, even if softly through blog posts or some such. h5py is the most fully defined “hdf5” library in Python, and even that situation is somewhat more murky than helpful (I don’t believe it got a recommendation at any point), but I’m not sure things will become clear to people in C++ without a recommendation, since the crowd drawn to C++ HDF5 has been considerably smaller than the Python one.

  6. Thread safety for independent threads - partly for performance reasons, reading from multiple locations for higher IO throughput (yes, RAID overlaps with this, but it doesn’t tap out a NAS or JBOD using one file at a time, and sometimes you want to amortize latency, too) - but support for independent threads not blocking each other is the real gift that keeps on giving. Modern programs have threads doing a lot of work, not just computations - even OpenMP programs. This isn’t so much on the side of the wrappers, though; it is an internal limitation of HDF5 more than anything else. With “unified memory” / SVM I see no reason to involve HDF5 with GPU memory directly, and no complications there other than care for performance, but I see lots of reasons to support thread concurrency.


#3

Thank you for the questions!

  1. Please note that h5cpp is a neutral, descriptive name discovered and used independently in two different parts of the world. We did have a friendly discussion about whether to settle this pressing issue by arm wrestling or beer drinking. For now we’re both focusing on delivering a better user experience.

  2. Great question for The HDF Group; I am curious about their response.

  3. Thank you for the feedback. Tuples / hypercubes of POD structs are indeed not the linear-algebra way to store data, but they are a valid and native representation for events from various fields: particle colliders, financial markets, real-time bidding, sensor networks on oil rigs, etc. Preserving structure is not an unusual idea in these fields; if your use case is different, please refer to the data primitives supported by the major linear algebra systems, or to raw memory pointers.

  4. When H5CPP was presented at Chicago C++ user group meetings, the LLVM compiler dependency was criticized; in response, Linux binary packages are provided so users can try compiler-assisted ‘introspection’ and give feedback. If you prefer manual work, please read the generated.h files in the examples. You need to specialize h5::register_struct<your_type>(){ } and then register it with the macro H5CPP_REGISTER_STRUCT. Here is an example:

namespace h5 {
   template<> hid_t inline register_struct<sn::example::Record>(){
        hid_t ct_00 = H5Tcreate(H5T_COMPOUND, sizeof(sn::example::Record));
        // describe each member with H5Tinsert; see the HDF5 C API COMPOUND datatype docs for details,
        // e.g. (member names below are illustrative, not part of the original example):
        // H5Tinsert(ct_00, "idx", HOFFSET(sn::example::Record, idx), H5T_NATIVE_ULLONG);
        return ct_00; // <-- note: the returned hid_t will be closed by H5CPP
   }
}
// don't forget to register the structure with H5CPP templates
H5CPP_REGISTER_STRUCT(sn::example::Record);

Thank you for your ‘audit’; let me know how it went. As of now the packet table runs on direct chunk IO and is near bare-hardware speed, and as I develop the library further it will gain extensive filtering support with a multithreading option. The packet table has been reworked from the original 2011 design to accommodate matrices, vectors, and element-wise append. See the ‘examples’ directory for details.
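To make that concrete, here is a minimal sketch of what element-wise append onto the packet table can look like. It is loosely modelled on the H5CPP examples directory rather than copied from it: the fields of sn::example::Record, the file and dataset names, and the exact property arguments (h5::max_dims, h5::chunk, h5::gzip) are assumptions to check against the shipped examples.

// illustrative event record -- the real fields of sn::example::Record are not shown in this thread
namespace sn { namespace example {
    struct Record {                  // a plain POD struct: one "event" per row
        unsigned long long idx;      // sequence number
        double             value;    // measurement, price, sensor reading, ...
    };
}}

#include <h5cpp/core>                // H5CPP templates
#include "generated.h"               // h5::register_struct<> specialization (compiler generated, or hand written as above)
#include <h5cpp/io>

int main(){
    h5::fd_t fd = h5::create("events.h5", H5F_ACC_TRUNC);
    // an extendable, chunked stream of records: the reworked packet table
    h5::pt_t pt = h5::create<sn::example::Record>(fd, "stream-of-events",
                      h5::max_dims{H5S_UNLIMITED}, h5::chunk{1024} | h5::gzip{9});
    for(unsigned long long i = 0; i < 100000; ++i){
        sn::example::Record rec{i, 0.0};   // fill with the next event
        h5::append(pt, rec);               // element-wise append onto the packet table
    }
}   // RAII: pt and fd are closed when they go out of scope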

  5. Thank you for sharing your insight and for advising an independent European group and an independent Canadian group to work together. HDF5 users are diverse; as a Canadian I embrace diversity by showing the difference.
    The three projects are similar in the sense that they all tie into the HDF5 C API to some level. However, there are differences at first glance:
  • seamless POD struct support through compiler-assisted reflection vs something else
  • easy-to-use pythonic experience based on template metaprogramming vs something else
  • header-only library with no dependencies other than the HDF5 C API vs a linked library
  • high-performance IO based on chunk IO vs plain old HDF5 C API calls
  6. Threads, processes, MPI-ROMIO, RAID, … all support some level of parallelism. I suggest reading up on C++11 threading primitives and taking relevant courses to further expand on what we already know: having several threads access the same IO device will only make it slower.
    There are cases where threads make a difference; a filtering pipeline is a good example. Dedicate a thread as an IO server and make requests from the other threads (see the sketch after this list). This has been discussed in the context of the HDF Group’s SWMR approach. A much better option is to use MPI-IO, which does the same thing with an added re-ordering of the blocks so that it does the right thing.
    HDF5 internally is very clean and fast; in my internal study it performs near bare-hardware speed, on par with the underlying filesystem. This gutted-out version of HDF5 is/will be the base of H5CPP.
    Direct CUDA DMA from/to disk is a hot topic; it doubles the available bandwidth, which is a significant gain in machine learning. One way to reorder priorities is through grants/donations or a specific contract.
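The IO-server pattern mentioned in point 6 can be sketched with nothing but C++11 primitives: producer threads enqueue requests and a single dedicated thread is the only one that touches the HDF5 handles. This is a generic sketch, not an H5CPP or HDF5 C API feature; the Request type and the commented-out write call are placeholders for whatever you actually persist.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// placeholder for whatever gets persisted (an event record, a chunk, ...)
struct Request { int payload; };

class IoServer {
    std::queue<Request> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;

    void run() {                         // the only thread that would touch HDF5 handles
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [&]{ return done_ || !q_.empty(); });
            while (!q_.empty()) {
                Request r = q_.front(); q_.pop();
                lk.unlock();
                // ... replace with the real write, e.g. h5::append(pt, r) or H5Dwrite(...) ...
                (void)r;
                lk.lock();
            }
        }
    }
public:
    IoServer() : worker_(&IoServer::run, this) {}
    void submit(Request r) {             // called from any producer thread
        { std::lock_guard<std::mutex> lk(m_); q_.push(r); }
        cv_.notify_one();
    }
    ~IoServer() {                        // drain the queue, then stop the worker
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
};

int main() {
    IoServer io;
    std::vector<std::thread> producers;
    for (int t = 0; t < 4; ++t)
        producers.emplace_back([&io, t]{ for (int i = 0; i < 1000; ++i) io.submit({t * 1000 + i}); });
    for (auto& p : producers) p.join();
}   // ~IoServer runs here and joins the worker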

All in all, if you find H5CPP interesting and need support to make it fit into your existing or new C++ project, you can reach me at steven@vargaconsulting.ca


#4

RE: “HDF Group recommendations”

[I do work for The HDF Group, but I don’t speak for The HDF Group.]

Pieter Hintjens’ comment on the “Architecture of the 0MQ Community” comes
to mind where he explains that, “A lot of languages have multiple bindings
(…), written by different people over time or taking varying approaches.
We don’t regulate these in any way. There are no ‘official’ bindings.
You vote by using one or the other, contributing to it, or ignoring it.”

I think a ‘recommendation’ in the sense of “The HDF Group publicizes an interesting piece
of work from the community” or “We have seen someone doing something similar.
Why don’t you take a look at this?” would be fair.
Beyond that it’s going out on a limb.

Many good (some might argue: most) things are happening in the HDF5 ecosystem
because The HDF Group is NOT involved.