What do you want to see in "HDF5 2.0"?

derobins · July 21, 2022, 3:14pm

After HDF5 1.14.0 releases later this year, we’d like to make some more extensive changes to the library and API than we have in the past. HDF5 has been around for 25 years now and the API has not changed much. We’ve never even deprecated features we marked as obsolete at the HDF5 1.6 to 1.8 switch almost fifteen years ago.

“HDF5 2.0” is a chance to reconsider some of our past ideas. A bit of an “API reset” that we can use to address technical debt, while keeping the good things the same.

This is also a good time to start discussing what features should be prioritized in new library development.

What do you most want to see in HDF5 in 2023? Are there things in the API that cause you pain? Tell us about it below!

derobins · July 21, 2022, 3:16pm

Some things I can think of right off the bat:

Semantic versioning
All API calls return herr_t (no more cutting type sizes in half so we can return -1 as the ‘bad’ value)
Smarter about library-allocated memory (esp. filters & API calls that return library-allocated buffers)
Variable-length data handled better
Fix bad naming (e.g. “sec2” --> “posix”)
Actually remove calls marked as “deprecated”
Retire the multi VFD (but keep the split VFD aspect - we just don’t need multiple metadata channels)
Sanitize the metadata read code (the source of most CVE issues)
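
To make the "all API calls return herr_t" item concrete, here is a minimal sketch in plain C. It does not use the HDF5 library: `herr_t` is redefined locally as a stand-in, and both function names are hypothetical, invented only for illustration. The first function mixes the value and the error code in one signed return, which is what costs half the range; the second returns status only and passes the value through an out-parameter.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for HDF5's herr_t so the sketch builds without the library. */
typedef int herr_t;

/* Hypothetical "old style": size and error share one signed return value,
 * so only half of the 64-bit range can carry a real size. */
int64_t old_style_get_size(const uint64_t *stored_size)
{
    if (stored_size == NULL || *stored_size > (uint64_t)INT64_MAX)
        return -1; /* an error, or a size that cannot be represented */
    return (int64_t)*stored_size;
}

/* Hypothetical "new style": herr_t carries status only; the size travels
 * through an out-parameter and can use the full unsigned range. */
herr_t new_style_get_size(const uint64_t *stored_size, uint64_t *size_out)
{
    if (stored_size == NULL || size_out == NULL)
        return -1;
    *size_out = *stored_size;
    return 0;
}
```

With the second form, a caller checks the return value for failure and reads the result from `size_out`, so no legal size has to be reserved as the "bad" value.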

I have a lot more ideas, but this should be enough to get people chatting. I can also expound on internal things, if anyone has strong interest/feelings there.

daniel.kahn · July 21, 2022, 3:42pm

Actually deprecate calls marked as “deprecated”

Not sure what you mean here, I think marking a call as “deprecated” does deprecate it. Do you mean ‘Actually remove calls marked as “deprecated”?’

epourmal1 · July 21, 2022, 5:28pm

For HDF5 2.0 are you considering

File Format fixes/extensions to support missing features, for example, full support for UTF-8, new implementation of VL types, support for sparse data, complete fix for avoid-truncate feature and lifting some limits (e.g., number of attributes when creation order is tracked)
Performance improvements, for example, implementing shared chunk cache to reduce memory consumption when working with many chunked datasets, support for page buffering in parallel, sub-setting performance

or is it general code/API cleanup only?

derobins · July 21, 2022, 5:32pm

Oops. Remove is what I intended to say. Fixed.

derobins · July 21, 2022, 5:39pm

I feel like everything should be on the table, but there are finite resources available at The HDF Group for unfunded mandates. That said, if we have a larger conversation with the community about features and priorities, non-HDF-Group people could pitch in and we could accomplish more in HDF5 2.0, 3.0, etc.

I do think that there is a lot of low-hanging fruit out there that we should take care of in short order, so I’m tempted to make HDF5 2.0 be about implementing obvious, needed improvements and API changes in a relatively short period of time and then taking on larger tasks in the next version.

jan-willem.blokland · July 22, 2022, 12:47pm

Is the following item also on the list: pHDF5 in combination with non-trivial data transform function and compression?

Currently, this combination is not possible because compression needs to be done collectively while the data transform function should be applied independently. This has been discussed in https://forum.hdfgroup.org/t/phdf5-1-12-data-transform-in-combination-with-compression-filter/8799. I have no clue how much effort it would take to make this possible, but maybe the mentioned item “Smarter about library-allocated memory (esp. filters & API calls that return library-allocated buffers)” could be a stepping stone for it.

Furthermore, semantic versioning would be a great feature for HDF5.
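
As a rough illustration of what semantic versioning would let applications check, here is a small C sketch. The `semver_t` type and both functions are hypothetical and not part of HDF5. The rule assumed is the usual one: the major version must match and the installed library must not be older than the version the application was built against. The special treatment of major version 0 is ignored here.

```c
#include <stdbool.h>

/* Hypothetical version triple; not an HDF5 type. */
typedef struct {
    unsigned major;
    unsigned minor;
    unsigned patch;
} semver_t;

/* Orders two versions: negative if a < b, zero if equal, positive if a > b. */
int semver_compare(semver_t a, semver_t b)
{
    if (a.major != b.major)
        return a.major < b.major ? -1 : 1;
    if (a.minor != b.minor)
        return a.minor < b.minor ? -1 : 1;
    if (a.patch != b.patch)
        return a.patch < b.patch ? -1 : 1;
    return 0;
}

/* True if a library at version `lib` can serve an application built
 * against version `built`: same major version, and not older. */
bool semver_is_compatible(semver_t lib, semver_t built)
{
    return lib.major == built.major && semver_compare(lib, built) >= 0;
}
```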

kittisopikulm · July 22, 2022, 4:12pm

I would like to see conformity between efile_prefix, virtual_prefix, and elink_prefix.

H5P_SET_EFILE_PREFIX is currently misdocumented. The actual default is the current working directory rather than the location of the HDF5 file. HDF5 2.0 should change the default to match the current documentation, i.e. relative to the HDF5 file.
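
To show the difference between the two behaviors, here is a sketch of "relative to the HDF5 file" resolution in plain C. The helper is hypothetical and not library code; it assumes POSIX-style `/` separators and treats names starting with `/` as absolute. As far as I know, the library already lets callers opt into this behavior by passing the special prefix `${ORIGIN}` to H5Pset_efile_prefix; the sketch only illustrates the path arithmetic.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of HDF5: builds the path of an external raw
 * data file relative to the directory of the HDF5 file that references it.
 * Returns 0 on success, -1 if the result does not fit in `out`. */
int resolve_relative_to_h5(const char *h5_path, const char *ext_name,
                           char *out, size_t out_size)
{
    const char *slash = strrchr(h5_path, '/');
    /* Length of the directory part, including the trailing '/'.
     * Zero means the HDF5 file has no directory component. */
    size_t dir_len = slash ? (size_t)(slash - h5_path) + 1 : 0;
    int n;

    if (ext_name[0] == '/') /* absolute external names are used unchanged */
        n = snprintf(out, out_size, "%s", ext_name);
    else
        n = snprintf(out, out_size, "%.*s%s", (int)dir_len, h5_path, ext_name);

    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With the current default, `raw.bin` would instead be looked up in whatever directory the process happens to be running in, which is why moving a file pair to another machine or mount point can break it.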

derobins · July 25, 2022, 1:50pm

I feel like the efile_prefix issue could be fixed in HDF5 1.13.x

kittisopikulm · July 25, 2022, 3:04pm

Changing the default external file prefix would be a breaking change for many software packages. Changing the default was evaluated before, but it was determined it was not a backwards compatible change. If this were caught earlier, it would arguably be a bug fix, but it seems too late for that.

What could be fixed in 1.13 and on other branches is the documentation, which is currently incorrect.

HDF5 2.0 would be the appropriate time to introduce a breaking change.

derobins · July 27, 2022, 4:57pm

1.13.x are experimental releases so we do allow breaking changes in those. We just couldn’t move it to 1.12 or earlier.

kittisopikulm · July 27, 2022, 5:59pm

I’m basing my statements on an early proposed patch to address this matter. @epourmal (@epourmal1 ?) stated the following then:

On Friday, March 27, [2015], we reviewed the patch and concluded that we cannot accept it as it is. Applications that rely on the current behavior will be broken.

If this could be changed now that would be great. Having the external files be relative to the HDF5 file as the default makes a lot of sense to me especially with heterogeneous networked file systems.

epourmal1 · July 27, 2022, 7:31pm

I support the change. There is enough time to alert community members after HDF5 1.14.0 is out.

kittisopikulm · August 5, 2022, 4:15pm

I’ve seen some posts about memory mapping, and I’ve been thinking about how users may be using this.

Abstracting this, what would be useful is a separation between the layout and the actual reading and writing of the data. Working near instrumentation and data acquisition, my files often have exactly the same layout, and the data is much larger than the metadata.

Thus, I would like to have a layout phase where I just describe the type and shape of the attributes and data.

At some later point in time, I would like to actually write the data without changing the layout. The layout and metadata (and perhaps compact datasets) become read-only while the data can be modified in place. I think this could greatly simplify parallel I/O. If I know that the layout is immutable, then it should be relatively safe to memory map the data portion of the file. This memory mapped data section can then be easily interfaced with using multithreading.

Utilities such as h5ls do help to elucidate the file structure and layout. Ultimately I would like to see this made more accessible via the API. Perhaps one gripe I’ve observed is how the tools often end up using private APIs. It would be great if all the tools could be made to use only public APIs. If additional APIs need to be made public to support the tools, then I think this would be an overall benefit. In part, this is driving my interest in H5Dchunk_iter.

An overall concept is to make it very clear to the user about where the data lives in the file and how it is stored. While the HDF5 specification is open, it can be very hard for a normal user to interpret, forcing reliance on the API for all I/O.
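
As an illustration of "where the data lives in the file", here is a sketch of the arithmetic for the simplest case: an unfiltered, contiguous dataset stored in row-major (C) order. The helper and its parameter names are hypothetical. `data_addr` stands for the file address at which the raw data starts, which for a contiguous dataset I believe can be obtained with H5Dget_offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset in the file of one element of a
 * contiguous, row-major dataset with no filters applied.
 * Returns 0 on success, -1 if the index is out of bounds. */
int element_file_offset(uint64_t data_addr, size_t rank,
                        const uint64_t *dims, const uint64_t *index,
                        uint64_t elem_size, uint64_t *offset_out)
{
    uint64_t linear = 0;

    for (size_t i = 0; i < rank; i++) {
        if (index[i] >= dims[i])
            return -1;
        /* Row-major: the last dimension varies fastest. */
        linear = linear * dims[i] + index[i];
    }
    *offset_out = data_addr + linear * elem_size;
    return 0;
}
```

If the layout is known to be immutable, offsets computed this way stay valid for the lifetime of the file, which is the property that would make memory mapping the data region reasonably safe.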

kittisopikulm · August 5, 2022, 4:19pm

Another thought I’ve had is about Morton coding or using Z curves. This might be doable before 2.0 with clever allocation of chunks.

Essentially the idea would be to encapsulate the functionality of neuroglancer precomputed for example:

github.com

google/neuroglancer/blob/master/src/neuroglancer/datasource/precomputed/sharded.md

# Sharded format

The unsharded [multiscale volume](./volume.md#unsharded-chunk-storage),
[mesh](./meshes.md#unsharded-storage-of-multi-resolution-mesh-manifest) and [skeleton
formats](./skeletons.md#unsharded-storage-of-encoded-skeleton-data) store each volumetric chunk or per-object
mesh/skeleton in a separate file; in general a single file corresponds to a single unit of data that
Neuroglancer may retrieve.  Separate files are simple to read and write; however, if there are a
large number of chunks, the resulting large number of small files can be highly inefficient with
storage systems that have a high per-file overhead, as is common in many distributed storage
systems.  The "sharded" format avoids that problem by combining all "chunks" into a fixed number of
larger "shard" files.  There are several downsides to the sharded format, however:
- It requires greater complexity in the generation pipeline.
- It is not possible to re-write the data for individual chunks; the entire shard must be
  re-written.
- There is somewhat higher read latency due to the need to retrieve additional index information
  before retrieving the actual chunk data, although this latency is partially mitigated by
  client-side caching of the index data in Neuroglancer.

The sharded format uses a two-level index hierarchy:
- There are a fixed number of shards, and a fixed number of minishards within each shard.

This file has been truncated.
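
For readers unfamiliar with the Morton (Z-curve) coding mentioned at the start of this post, here is a minimal sketch of a plain 3-D Morton encoder in C. This is the textbook bit-interleaving scheme, not necessarily the exact variant the neuroglancer format uses, and the function name is hypothetical. It assumes each chunk coordinate fits in 21 bits so that the result fits in 64 bits.

```c
#include <stdint.h>

/* Interleaves the low 21 bits of x, y and z into one 63-bit Morton code.
 * Bit i of x lands at position 3*i, bit i of y at 3*i + 1, and bit i of z
 * at 3*i + 2, so chunks that are close in 3-D space tend to get nearby codes. */
uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t code = 0;

    for (unsigned bit = 0; bit < 21; bit++) {
        code |= (uint64_t)((x >> bit) & 1u) << (3 * bit);
        code |= (uint64_t)((y >> bit) & 1u) << (3 * bit + 1);
        code |= (uint64_t)((z >> bit) & 1u) << (3 * bit + 2);
    }
    return code;
}
```

Allocating chunks in increasing Morton order of their scaled chunk coordinates is one way the "clever allocation of chunks" could be approached.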

miller86 · August 5, 2022, 5:55pm

Quoting derobins (post 2), with replies inline:
Some things I can think of right off the bat:

Smarter about library-allocated memory (esp. filters & API calls that return library-allocated buffers)
can you include space performance testing of this too
Actually remove calls marked as “deprecated”
+1
Retire the multi VFD (but keep the split VFD aspect - we just don’t need multiple metadata channels)
+1
Sanitize the metadata read code (the source of most CVE issues)
+1
a time and space performance test suite
in-memory groups or “windowed” groups where a whole group subtree is handled by keeping entirely in memory once opened, can be explicitly sync’d w/disk by caller and is synced when closed.
more read-agnosticism…to the extent possible, readers should be able to be somewhat blind to the types used by the writer and still successfully read. An example (https://github.com/visit-dav/visit/blob/4fdb19fac58d28725c61f833e34ba04556281671/src/databases/Chombo/avtChomboFileFormat.C#L614-L640) is an array of 3 doubles vs. a struct of 3 doubles.
thread-parallel reads (and maybe writes) for same and different datasets in same file
decouple parallel HDF5 from serial HDF5 so that only a single install point (-L/path/to/install -lhdf5 -lpar_hdf5 gets parallel features) serves both
A cook-book suite of real-world examples (e.g. not contrived for testing purposes but from real-world use cases) which are documented (and linked to other documentation sources like API ref, design, etc.) which demonstrate how to use HDF5 for common cases as well as how NOT to use HDF5. Best how-not-to-use example I have (https://github.com/markcmiller86/hdf5stuff/tree/master/graph_of_udts) is serialization of hierarchical data structures…naive users often wind up using HDF5 groups as the nodes in their hierarchy and this has huge negative performance implications
A way for apps to specify default properties (which may be different from the lib deployed/installed defaults) to be followed within the current executable (somewhat related to next item)
compression “strategies”…where callers don’t have to manipulate compression directly on each and every dataset written but can tell the lib what “strategy” they wish to follow and then on each write, it does something useful (e.g. compress int types with gzip but compress float types with zfp)
A simplified mode for error stack reporting to report just caller’s failed call (not internals)
A compression test-suite test-bed with appropriate raw-data files where compression of the same raw data via hdf5 is compared, routinely, with compression via common unix command-line tools and the performance differences are understood.
routine (3-4 x per year) scalability testing to tens of thousands of parallel tasks (we can provide compute resources)
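
One way to picture the compression “strategies” item from the list above is a small lookup that the library could consult on each write. The sketch below is plain C; every type and function name in it is hypothetical, none of it is HDF5 API, and the `min_bytes` threshold is an extra assumption added for illustration.

```c
#include <stddef.h>

/* Hypothetical element classes and filter choices; not HDF5 identifiers. */
typedef enum { ELEM_INTEGER, ELEM_FLOAT, ELEM_OTHER } elem_class_t;
typedef enum { FILTER_NONE, FILTER_GZIP, FILTER_ZFP } filter_choice_t;

/* A strategy the caller sets once instead of configuring every dataset. */
typedef struct {
    filter_choice_t for_integer;
    filter_choice_t for_float;
    size_t          min_bytes; /* assumed knob: leave smaller datasets uncompressed */
} compression_strategy_t;

/* Picks the filter for one dataset according to the strategy. */
filter_choice_t strategy_pick(const compression_strategy_t *s,
                              elem_class_t cls, size_t dataset_bytes)
{
    if (dataset_bytes < s->min_bytes)
        return FILTER_NONE;

    switch (cls) {
    case ELEM_INTEGER:
        return s->for_integer;
    case ELEM_FLOAT:
        return s->for_float;
    default:
        return FILTER_NONE;
    }
}
```

A strategy of `{ FILTER_GZIP, FILTER_ZFP, 1024 }` would then express the example from the list: gzip for integer types, zfp for float types, nothing for tiny datasets.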

miller86 · August 10, 2022, 11:31pm

A better way to handle the “direct write” (or read) case (so that it behaves like an ordinary write (or read)) so that objects which are compressed in memory can be written (or read) compressed to files without having to uncompress and recompress them. Bottom line, some consumers might want the data uncompressed when read while others might want the data to remain compressed even after having been read. On writes, if the data is already compressed in memory (maybe the caller needs to tell HDF5 that it is with a property), it should just go to disk compressed.

kittisopikulm · August 11, 2022, 7:36am

H5Dwrite_chunk (formerly H5DOwrite_chunk) allows you to write compressed or uncompressed data directly to disk. H5Dread_chunk allows you to read compressed data directly from disk.

https://support.hdfgroup.org/HDF5/doc/Advanced/DirectChunkWrite/

ajelenak · August 22, 2022, 2:41pm

Here’s another suggestion that is hopefully worthy of the “2.0” version…

Alternative storage representations of HDF5 data in cloud object stores indicate there is very little difference between a contiguous dataset and a chunked dataset with only one chunk and no filters applied. How about removing the contiguous storage layout and only have chunked and compact?

Aleksandar

kittisopikulm · August 22, 2022, 7:05pm

You’ve made me realize that there is an internal HDF5 issue at the moment: the number of bytes in a chunk is stored internally as a 32-bit integer.

github.com/HDFGroup/hdf5, src/H5Dpkg.h (ae414872f):

          * The fastest-varying dimension is assumed to reference individual bytes of
          * the array, so a 100-element 1-D array of 4-byte integers would really be a
          * 2-D array with the slow varying dimension of size 100 and the fast varying
          * dimension of size 4 (the storage dimensionality has very little to do with
          * the real dimensionality).
          *
          * The chunk's file address, filter mask and size on disk are not key values.
          */
          typedef struct H5D_chunk_rec_t {
             hsize_t  scaled[H5O_LAYOUT_NDIMS]; /* Logical offset to start */
             uint32_t nbytes;                   /* Size of stored data */
             uint32_t filter_mask;              /* Excluded filters */
             haddr_t  chunk_addr;               /* Address of chunk in file */
          } H5D_chunk_rec_t;
          
          /*
          * Common data exchange structure for indexed storage nodes.  This structure is
          * passed through the indexing layer to the methods for the objects
          * to which the index points.
          */
          typedef struct H5D_chunk_common_ud_t {

Furthermore the H5D_chunk_iter_op_t is about to expose this 32-bit value to the public API rather than as hsize_t:

github.com/HDFGroup/hdf5, src/H5Dpublic.h (21ec33785):

          typedef int (*H5D_chunk_iter_op_t)(const hsize_t *offset, uint32_t filter_mask, haddr_t addr, uint32_t size,
                                             void *op_data);

Edit: issue created: [BUG] H5D_chunk_iter_op_t exposes `size` (`nbytes`) as `uint32_t` rather than `hsize_t` (Issue #2056, HDFGroup/hdf5 on GitHub)
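
To show why the 32-bit field matters once a single chunk can stand in for a whole contiguous dataset, here is a short C sketch. The helper names are hypothetical. It computes a chunk's uncompressed size with 64-bit arithmetic and checks whether that size would survive being stored in a `uint32_t` such as `nbytes`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Computes the uncompressed size in bytes of one chunk using 64-bit math.
 * Returns false if the product itself would overflow 64 bits. */
bool chunk_nbytes(const uint64_t *chunk_dims, size_t rank,
                  uint64_t elem_size, uint64_t *nbytes_out)
{
    uint64_t n = elem_size;

    for (size_t i = 0; i < rank; i++) {
        if (chunk_dims[i] != 0 && n > UINT64_MAX / chunk_dims[i])
            return false;
        n *= chunk_dims[i];
    }
    *nbytes_out = n;
    return true;
}

/* True if the size can be stored in a 32-bit field without truncation. */
bool fits_in_uint32(uint64_t nbytes)
{
    return nbytes <= UINT32_MAX;
}
```

For example, a 1024 x 1024 x 1024 chunk of 4-byte integers is exactly 4 GiB, one byte count past what 32 bits can hold, so a narrowing conversion to `uint32_t` would yield 0.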
