What do you want to see in "HDF5 2.0"?

This library looks very promising. Thanks for drawing my attention to this.

I feel like we should consider dropping the existing C++ wrappers and blessing a library like this as “official”.

2 Likes

Maintain C API compatibility from API version 1.8 through the latest. The primary objective is to maintain build and run compatibility for all NetCDF-4 versions from 4.0 forward.

HOWEVER, it would make a lot of sense to do this through a legacy wrapper which is not part of the new core library. That would be good enough for me.

1 Like

Just out of curiosity, can you give us an example where “fake nesting” via the attribute names (e.g., /a/b/c/t) and user-defined types won’t do the trick?
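For readers unfamiliar with the idea, here is a minimal sketch of such a "fake nested" attribute, assuming an already-open dataset `dset_id` (the attribute name and value are purely illustrative; attribute names are not interpreted as paths):

```cpp
#include <hdf5.h>

// Sketch of "fake nesting": a single attribute whose name encodes a
// hierarchy, e.g. "/a/b/c/t", attached to an already-open dataset.
void add_fake_nested_attr(hid_t dset_id, double value)
{
    hid_t space_id = H5Screate(H5S_SCALAR);
    hid_t attr_id  = H5Acreate2(dset_id, "/a/b/c/t", H5T_NATIVE_DOUBLE,
                                space_id, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr_id, H5T_NATIVE_DOUBLE, &value);
    H5Aclose(attr_id);
    H5Sclose(space_id);
}
```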

Consider adopting Zstandard (zstd/zst) as a "first-class" compression algorithm, and implement/bundle/link it in the "official" HDF5 distribution the way gzip and szip are handled today.

Zstd has gained tremendous popularity in recent years due to its extremely efficient compression/decompression (thousands of MB/s on a single core) combined with a quite acceptable compression ratio (on par with gzip). I am aware that there is a filter plugin for it, but that's not quite the same as built-in support.

In my opinion, one of the great things about HDF5 is portability. When using only official library features, I can assume that the HDF5 files I make are universally readable everywhere. If I suddenly start using filter plugins, that is no longer the case. I rely on HDF5 for storing simulation data, for instance, and this data is opened with Paraview. Paraview binaries ship with HDF5 support, so my files are readable there. Gzip is too slow for practical use, and if I use Zstd nobody will be able to open my files, because they lack the plugin.

Having official in-library Zstd compression would be awesome and allow using compression in many new cases!
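For contrast, a rough sketch of the difference as it stands today, assuming the registered Zstandard filter ID 32015 and a third-party plugin that takes the compression level as its single filter parameter (both assumptions about the external plugin, not part of libhdf5):

```cpp
#include <hdf5.h>

// Built-in gzip: always available, one dedicated call on the dataset
// creation property list.
void enable_gzip(hid_t dcpl_id)
{
    H5Pset_deflate(dcpl_id, 6);                  // gzip, compression level 6
}

// Zstd today: the generic filter call with a registered filter ID, which
// only works if the dynamically loaded plugin can be found at runtime.
void enable_zstd_plugin(hid_t dcpl_id)
{
    const unsigned cd_values[1] = {3};           // assumed: plugin reads level from cd_values[0]
    H5Pset_filter(dcpl_id, (H5Z_filter_t)32015,  // assumed registered zstd filter ID
                  H5Z_FLAG_OPTIONAL, 1, cd_values);
}
```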

2 Likes

Integrated zstd support would be wonderful. I've been working with the HDF Group and plugin packagers on this.

Zstandard has now been integrated into the Linux kernel at multiple levels and into the conda packaging system. I suspect this is due to its wide applicability and tunability.

Zstd should be included in the standard plugin distribution, both directly and via Blosc.

Further integration along the lines of how gzip is integrated would be great.

While we are talking about this in the HDF5 2.0 thread, inclusion of Zstandard as a builtin filter could come pre-2.0. It is not a breaking change.

Moreover, the matter is really about distribution. Right now one can download an HDF5 installer, and then there is a separate plugin installer. An alternative would be for the default installer to include the plugins and to offer a minimal installer without them.

When it comes to filters, we’ll be working to better integrate the filter plugins rather than creating built-in filters.

1 Like

The problem with filters as separate plugins is the following:

I develop a CFD code. We store data in HDF5 files. The files can be opened with tools like Paraview, and we also use a lot of custom Python programs for processing the data. The data is typically computed on an HPC system, and the visualization happens somewhere else. The systems used for processing and visualization are completely "random"; it could be any system, really.

I can include as many plugins and filters as I wish in my CFD code. That's no problem. But when I go to a random system for postprocessing and open the file in Paraview, we instantly have a problem: Paraview ships its own HDF5, which might be a completely different version, built with different compilers and with different features. This is outside my control.

So if I compress my dataset with Zstd on the CFD-code side, I will need to distribute the plugin along with the dataset to everyone who wants to have a look, because the Paraview installation does not have the plugin installed. Even that is impossible in practice: 80% of the users would still not be able to get the plugin running properly, and I would need to distribute plugins for Mac, Linux and Windows, which I have no way to do. I do not even have a Windows or Mac computer to compile such a thing on…
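To make that burden concrete: every consumer of the file has to point its copy of HDF5 at the plugin somehow, for example via the HDF5_PLUGIN_PATH environment variable or a call like the following (the directory is purely illustrative):

```cpp
#include <hdf5.h>

// Each consumer of the file must make the filter plugin findable, e.g.
//   export HDF5_PLUGIN_PATH=/opt/hdf5-plugins
// or, from code that links against libhdf5, by prepending a search path:
void register_plugin_dir()
{
    H5PLprepend("/opt/hdf5-plugins");   // illustrative path to the zstd plugin
}
```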

Gzip compression currently works because everyone has it in the default HDF5 build, but that is so incredibly slow that it is practically useless for any purpose.

If one managed something like -DHDF5_BUILD_PLUGINS=YES, with "YES" as the default, in the main HDF5 CMake build system, then it might work out. But anything that is opt-in, a separate download, etc. will never reach the widespread adoption needed for it to be useful in a situation where data is shared across systems and users and there is no single HDF5 library, I'm afraid…

I think "fake nesting" would actually do the trick. Though I am concerned about the run-time performance of this approach. It also appears rather cumbersome to work this way with complex metadata.

For large numbers of attributes (> 8), you are looking at B-tree/heap storage performance. That threshold is configurable; see H5Pset_attr_phase_change.
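For example, a minimal sketch of raising that threshold on a dataset creation property list (the 32/24 values are illustrative):

```cpp
#include <hdf5.h>

// Raise the compact/dense attribute-storage thresholds on a dataset
// creation property list; attributes stay in compact storage up to
// max_compact and switch back below min_dense.
hid_t make_dcpl_with_more_compact_attrs()
{
    hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_attr_phase_change(dcpl_id, /*max_compact=*/32, /*min_dense=*/24);
    return dcpl_id;
}
```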

Can you give us an example of what you had in mind for complexity? Feel free to open a separate thread.

I would also caution against taking the notion “user metadata = attributes” too literally. Nobody says you must use HDF5 attributes for metadata. I’ve seen many HDF5 files (good ones!) that didn’t use attributes at all. The HDF5 data model offers certain primitives, which we had to name. How domain concepts are mapped onto these primitives is up to domain experts (and we are happy to assist). Don’t let names fool you when creating such a mapping.

G.

  1. Incorporation of zstd and bitshuffle filters as part of the standard build and distribution.
  2. An official C++ library that the HDF Group makes or selects; the old one should be removed. I'd pitch ess-dmsc's as the best and most straightforward take.
  3. Thread-parallel reads and writes of datasets, as long as nothing else is using the same dataset on the write side. I'd compromise to file level instead of dataset level. The read side should always be fully parallel.
  4. Compression that can be one-shot and does not require chunking. For a long time now, setting a chunk size and not having enough data has resulted in a crash on write; removing the chunking requirement is one way of dealing with this. I'd also settle for better handling of chunks when less than one chunk of data is written (see the sketch after this list).
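For context on item 4, a minimal sketch of the coupling it refers to: with the current library, a filter only takes effect on a chunked dataset, so a chunk shape has to be chosen up front even when the eventual data size is not known.

```cpp
#include <hdf5.h>

// Today, compression is tied to chunked layout: H5Pset_deflate (or any
// other filter) has no effect unless a chunk shape is also set.
hid_t make_compressed_dcpl(hsize_t chunk_rows, hsize_t chunk_cols)
{
    hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    const hsize_t chunk[2] = {chunk_rows, chunk_cols};
    H5Pset_chunk(dcpl_id, 2, chunk);   // required before any filter applies
    H5Pset_deflate(dcpl_id, 6);        // gzip level 6
    return dcpl_id;
}
```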
1 Like

How about a configuration file for libhdf5's many settings? I am not aware of any other way to share libhdf5 settings except as code snippets in a particular programming language. The use of such a config file would have to be via an explicit API function and not something happening silently in the background.
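For example, the only way to "share" something like a chunk-cache and alignment configuration today is a code snippet along these lines (the values are illustrative), which is exactly what a config file could replace:

```cpp
#include <hdf5.h>

// Today, sharing settings means sharing code like this.
hid_t make_tuned_fapl()
{
    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_cache(fapl_id, 0, 12421, 64 * 1024 * 1024, 0.75);  // 64 MiB raw-data chunk cache
    H5Pset_alignment(fapl_id, 4096, 1024 * 1024);             // align objects >= 4 KiB on 1 MiB boundaries
    return fapl_id;
}
```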

The order of processing would be something like:

  1. Any libhdf5 setting not in the config file keeps its default value.
  2. The value of all libhdf5 settings in the config can be changed in the application code.

The config file format could be TOML or YAML or…

-Aleksandar

2 Likes

One specific thing I’d like: a non-callback version of all iterators (things like H5Ovisit & H5Dchunk_iter). I.e. there would be functions to create an iterator, get the next value, and destroy it. This would be cleaner to integrate in h5py, and I imagine in other language bindings too. Technically, this doesn’t need to wait for 2.0, but it feels like a fairly big change.
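To make that concrete, a purely hypothetical sketch of the shape such an API could take; none of these functions or types exist in HDF5 today, and the names are invented:

```cpp
#include <hdf5.h>

// Purely hypothetical sketch -- H5Dchunk_iter_create/next/close and
// H5D_chunk_info_t are invented names, not part of any HDF5 release.
void list_chunks(hid_t dset_id)
{
    hid_t it = H5Dchunk_iter_create(dset_id, H5P_DEFAULT);   // hypothetical
    H5D_chunk_info_t info;                                   // hypothetical struct
    while (H5Dchunk_iter_next(it, &info) > 0) {              // hypothetical: pull values, no callback
        // info.offset, info.size, info.addr would be consumed here
    }
    H5Dchunk_iter_close(it);                                 // hypothetical
}
```

A pull-style interface like this is easier to wrap in language bindings because the binding controls when each step runs, instead of having the library call back into foreign code.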

More broadly… It feels like HDF5 tries to be a lot of things to a lot of people, and as a result, there are a lot of things that it does, but not especially well. I see maybe three big areas that HDF5 is trying to deal with (obviously these don’t have nice neat dividing lines):

  1. A file format for storing multidimensional data, including things like chunking and compression
  2. A format for describing data stored elsewhere - virtual datasets, external links, external data…
  • This sounds like a minor variant of 1, but it brings a whole new set of concerns, because to make it transparent the library has to deal with opening and closing files, raising possibilities like permissions errors and running out of file descriptors where they weren’t previously expected. E.g. many error cases when accessing a virtual dataset just show up as empty data.
  3. A framework for accessing & manipulating multidimensional data in C - with reference counting, the virtual object layer, async support…
  • A particular thread of this is to bring the HDF5 model to alternative storage formats, often involving ‘the cloud’ :cloud:

It seems like these often conflict. E.g. echoing other people, we’d like to read the HDF5 file format (point 1) from multiple threads. There’s no inherent barrier to reading and parsing data in parallel. But reference counting (part of point 3) is a major obstacle to multithreading (as the Python world knows all too well - it’s a big part of why the infamous Global Interpreter Lock is so hard to get rid of).

I don’t know how to address all this! :person_shrugging: From my perspective, it would be tempting to split it up somehow - maybe have a lower level piece purely for dealing with the file format, and build the ‘C framework’ on top of that. Then bindings to other languages could reuse the parser part, and make their own framework to fit in with their own language & ecosystem. But I imagine this would involve a ton of work and difficult decisions.

The above is mostly with my h5py hat on, rather than my EuXFEL hat.

3 Likes

What would the community think about dropping reference counting from HDF5 2.0?

Some background: when HSDS was initially being developed, I tried to mirror the library's reference-counting feature, but it wasn't really practical for a distributed service architecture like HSDS. So, to keep things (relatively) simple, in HSDS link creation and deletion are independent of HDF object creation/deletion. E.g., if you delete the last link to a dataset, you are left with an anonymous dataset (a dataset that you can access using its id, but not via any h5path). Now that I've gotten used to it, I don't really see how not having ref counting is especially problematic from a user point of view, and it actually has some benefits (like enabling persistent anonymous datasets/groups).
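As a point of comparison, the current C library already has a creation-time flavor of this idea; a minimal sketch, assuming an already-open `file_id` and dataspace `space_id` (the link name is illustrative):

```cpp
#include <hdf5.h>

// The C library already supports creation-time anonymous datasets: the
// object exists and is usable via its id, but has no link until one is added.
hid_t create_then_link(hid_t file_id, hid_t space_id)
{
    hid_t dset_id = H5Dcreate_anon(file_id, H5T_NATIVE_DOUBLE, space_id,
                                   H5P_DEFAULT, H5P_DEFAULT);
    // ... write to dset_id while it is still anonymous ...
    H5Olink(dset_id, file_id, "made_visible", H5P_DEFAULT, H5P_DEFAULT);
    return dset_id;
}
```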

Dropping ref counting would be a breaking change, but there's no better time to do so than v2.0, right?

Re: cloud page storage formats, the HSDS schema (hsds/obj_store_schema_v2.md at master · HDFGroup/hsds · GitHub) is compatible with the HDF5 data model and has been used extensively with HSDS. A VOL connector that read and wrote directly to the HSDS schema would be a nice complement to Library->rest-vol->HSDS.

1 Like

Isn't that putting the cart before the horse? Reference counting is an implementation detail and a means to an end (or multiple ends), i.e., to something that users care about. What is that end, X? Assuming X is still required, is there a better (in a to-be-defined sense) way to implement it in the future? Isn't that the question?

HSDS is an independent implementation of the HDF5 data model. There’s nothing about reference counts in the model. Unless there are changes to the data model, the data model underlying any HDF5 2.0 library or other implementation is still the same as for 1.x.

Without context, “HDF5 2.0” is as misleading as “Big 2.0.” Is Big 2.0 bigger than Big 1.0? :grin:

G.

1 Like

H5CPP uses reference counting as advertised by the HDF5 C API, and delegates RAII management to the underlying C API calls. On my side this could be replaced with std::shared_ptr<T> if the wind turns, but hid_t values are integers/descriptors with a story behind them, including how many references have been made, etc.
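For readers less familiar with the pattern, a minimal sketch (not H5CPP itself) of RAII built directly on the C API's identifier reference counts:

```cpp
#include <hdf5.h>

// Minimal sketch (not H5CPP): RAII on top of the C API's own
// identifier reference counting.
class handle {
    hid_t id_;
public:
    explicit handle(hid_t id) noexcept : id_(id) {}   // adopt an already-open identifier
    handle(const handle& other) : id_(other.id_) {
        if (id_ >= 0) H5Iinc_ref(id_);                // copies share the identifier
    }
    handle& operator=(const handle&) = delete;        // assignment omitted from the sketch
    ~handle() {
        if (id_ >= 0) H5Idec_ref(id_);                // last owner closes the object
    }
    hid_t get() const noexcept { return id_; }
};
```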

I would like to ask for support for missing data, i.e., an array of logical size N_1 that contains only N_2 < N_1 items, with the remaining array entries masked, typically because the data is missing.

This would be helpful for at least two use cases:

  • Experimental measurements where data is missing, e.g. Electron backscatter diffraction (EBSD). Ping @mike.jackson
  • Modular simulation tools, where output exists only on part of the domain, e.g. when a locally refined model is used.

NumPy's masked arrays (Masked arrays — NumPy v1.26 Manual) offer related functionality, but the masked data is actually there and can be accessed, so storing a masked array can be done in HDF5 with two arrays (one with values and one with Booleans), potentially lumped together as a compound type. My proposal differs in that only the existing data is stored, to reduce the memory footprint. So the idea could also be seen as a way of storing sparse matrices (see Sparse matrix - Wikipedia).
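For comparison, a minimal sketch of the two-array workaround just described, expressed as a compound type (field names are illustrative):

```cpp
#include <hdf5.h>

// The workaround described above: store value + validity flag together as a
// compound type; every element is physically present, masked or not.
struct masked_double {
    double        value;
    unsigned char valid;   // 1 = real data, 0 = missing
};

hid_t make_masked_type()
{
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(masked_double));
    H5Tinsert(t, "value", HOFFSET(masked_double, value), H5T_NATIVE_DOUBLE);
    H5Tinsert(t, "valid", HOFFSET(masked_double, valid), H5T_NATIVE_UCHAR);
    return t;
}
```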

The advantage over hand-written solutions would be that HDF5 could use the most appropriate data structure/layout in the backend.

1 Like

Is this proposal: https://docs.hdfgroup.org/hdf5/rfc/RFC_Sparse_Chunks180830.pdf along the lines of what you are looking for?

1 Like