What do you want to see in "HDF5 2.0"?

I’ve seen some posts about memory mapping, and I’ve been thinking about how users may be using this.

Abstracting this, what would be useful is a separation between the layout and the actual reading and writing of the data. Working near instrumentation and data acquisition, I often have files that share exactly the same layout, where the data is much larger than the metadata.

Thus, I would like to have a layout phase where I just describe the type and shape of the attributes and data.

At some later point in time, I would like to actually write the data without changing the layout. The layout and metadata (and perhaps compact datasets) become read-only while the data can be modified in place. I think this could greatly simplify parallel I/O. If I know that the layout is immutable, then it should be relatively safe to memory map the data portion of the file. This memory mapped data section can then be easily interfaced with using multithreading.
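As a sketch of what this two-phase workflow can already look like today, here is a minimal h5py example (file, dataset, and attribute names are made up): phase one describes the layout only, phase two fills in the data later without altering the layout.

```python
import h5py
import numpy as np

# Phase 1: describe the layout only -- names, shapes, dtypes,
# attributes. No bulk data is written yet.
with h5py.File("acq_demo.h5", "w") as f:
    d = f.create_dataset("frames", shape=(16, 16), dtype="u2")
    d.attrs["detector"] = "example"

# Phase 2, possibly much later or from another process: write the
# data in place without touching the (now fixed) layout.
with h5py.File("acq_demo.h5", "r+") as f:
    f["frames"][...] = np.ones((16, 16), dtype="u2")
```

What the library does not promise today is that the layout stays immutable between the two phases; that guarantee is the new part of the request.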

Utilities such as h5ls do help to elucidate the file structure and layout. Ultimately, I would like to see this made more accessible via the API. Perhaps one gripe I’ve observed is how the tools often end up using private APIs. It would be great if all the tools could be made to use only public APIs. If additional APIs need to be made public to support the tools, then I think this would be an overall benefit. In part, this is driving my interest in H5Dchunk_iter.
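For what it’s worth, some of this chunk-level layout information is already reachable through public calls. A small sketch, assuming h5py 3.x built against HDF5 ≥ 1.10.5 (which wraps H5Dget_num_chunks / H5Dget_chunk_info on the low-level dataset ID; the file and dataset names are illustrative):

```python
import h5py
import numpy as np

with h5py.File("layout_info_demo.h5", "w") as f:
    d = f.create_dataset("d", shape=(8, 8), chunks=(4, 4), dtype="i4")
    d[...] = np.arange(64, dtype="i4").reshape(8, 8)
    # Public-API view of the layout: logical offset, file address,
    # and stored size of every chunk.
    n = d.id.get_num_chunks()
    infos = [d.id.get_chunk_info(i) for i in range(n)]

for info in infos:
    print(info.chunk_offset, info.byte_offset, info.size)
```

H5Dchunk_iter would provide the same information via a callback in a single pass, rather than one call per chunk index.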

An overall concept is to make it very clear to the user where the data lives in the file and how it is stored. While the HDF5 specification is open, it can be very hard for a normal user to interpret, forcing reliance on the API for all I/O.

1 Like

Another thought I’ve had is about Morton coding or using Z curves. This might be doable before 2.0 with clever allocation of chunks.

Essentially, the idea would be to encapsulate the functionality of, for example, neuroglancer’s precomputed format.
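For illustration, here is a minimal pure-Python sketch of Morton (Z-order) indexing: interleaving the bits of the chunk coordinates gives a 1-D ordering that keeps spatial neighbors close together on disk, which is the point of the "clever allocation of chunks". How chunks would actually be allocated along that curve is the open question.

```python
def morton_index(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y to get a Z-order (Morton) index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit positions
    return z

# Order a grid of 2-D chunk coordinates along the Z curve.
chunks = [(x, y) for y in range(4) for x in range(4)]
z_ordered = sorted(chunks, key=lambda c: morton_index(*c))
```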

[quote=“derobins, post:2, topic:10003, full:true”]
Some things I can think of right off the bat:

  • Smarter about library-allocated memory (esp. filters & API calls that return library-allocated buffers)
    Can you include space performance testing of this, too?

  • Actually remove calls marked as “deprecated”
    +1

  • Retire the multi VFD (but keep the split VFD aspect - we just don’t need multiple metadata channels)
    +1

  • Sanitize the metadata read code (the source of most CVE issues)
    +1

  • a time and space performance test suite

  • in-memory groups or “windowed” groups, where a whole group subtree is kept entirely in memory once opened; it can be explicitly sync’d w/ disk by the caller and is synced when closed.

  • more read-agnosticism…to the extent possible, readers should be able to be somewhat blind to the types used by the writer and still successfully read. An example (https://github.com/visit-dav/visit/blob/4fdb19fac58d28725c61f833e34ba04556281671/src/databases/Chombo/avtChomboFileFormat.C#L614-L640) is reading data written either as an array of 3 doubles or as a struct of 3 doubles.

  • thread-parallel reads (and maybe writes) for same and different datasets in same file

  • decouple parallel HDF5 from serial HDF5 so that only a single install point (-L/path/to/install -lhdf5 -lpar_hdf5 gets parallel features) serves both

  • A cook-book suite of real-world examples (e.g. not contrived for testing purposes but from real-world use cases) which are documented (and linked to other documentation sources like API ref, design, etc.) which demonstrate how to use HDF5 for common cases as well as how NOT to use HDF5. Best how-not-to-use example I have (https://github.com/markcmiller86/hdf5stuff/tree/master/graph_of_udts) is serialization of hierarchical data structures…naive users often wind up using HDF5 groups as the nodes in their hierarchy, and this has huge negative performance implications

  • A way for apps to specify default properties (which may be different from the lib deployed/installed defaults) to be followed within the current executable (somewhat related to next item)

  • compression “strategies”…where callers don’t have to manipulate compression directly on each and every dataset written but can tell the lib what “strategy” they wish to follow; then on each write, it does something useful (e.g. compress int types with gzip but compress float types with zfp)

  • A simplified mode for error stack reporting to report just caller’s failed call (not internals)

  • A compression test-suite test-bed with appropriate raw-data files where compression of the same raw data via hdf5 is compared, routinely, with compression via common unix command-line tools and the performance differences are understood.

  • routine (3-4 x per year) scalability testing to tens of thousands of parallel tasks (we can provide compute resources)
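As a thought experiment, the “compression strategies” item above could start life as a thin policy layer outside the library. A hypothetical sketch (the strategy table, names, and parameters are illustrative, not a real API): the caller declares the policy once, and each dataset write looks up its filter from the dtype kind instead of being configured individually.

```python
# Hypothetical compression "strategy" table: filter choice per dtype
# kind. Names and parameters are illustrative only.
STRATEGY = {
    "int":   ("gzip", 6),   # deflate level 6 for integer data
    "float": ("zfp", 8),    # fixed-rate zfp for floating-point data
}

def pick_filter(dtype_kind: str):
    """Return (filter_name, parameter) for a dtype kind, or (None, None)."""
    return STRATEGY.get(dtype_kind, (None, None))
```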

2 Likes

A better way to handle the “direct write” (or read) case, so that it behaves like an ordinary write (or read) and objects which are already compressed in memory can be written (or read) compressed without having to uncompress and recompress them. Bottom line: some consumers might want the data uncompressed when read, while others might want the data to remain compressed even after having been read. On writes, if the data is already compressed in memory (maybe the caller needs to tell HDF5 that it is with a property), it should just go to disk compressed.

1 Like

H5Dwrite_chunk (formerly H5DOwrite_chunk) allows you to write compressed or uncompressed data directly to disk. H5Dread_chunk allows you to read compressed data directly from disk.

https://support.hdfgroup.org/HDF5/doc/Advanced/DirectChunkWrite/
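A minimal sketch of that round trip (h5py exposes these calls as write_direct_chunk / read_direct_chunk on the low-level dataset ID; the file and dataset names are made up):

```python
import zlib

import h5py
import numpy as np

raw = np.arange(8, dtype="i4")

with h5py.File("direct_demo.h5", "w") as f:
    d = f.create_dataset("d", shape=(8,), chunks=(8,), dtype="i4",
                         compression="gzip")
    # The data is already zlib-compressed in memory; hand it to the
    # file as-is, bypassing the filter pipeline on the way out.
    d.id.write_direct_chunk((0,), zlib.compress(raw.tobytes()))

with h5py.File("direct_demo.h5", "r") as f:
    normal = f["d"][...]  # ordinary read path decompresses transparently
    # Another consumer can keep the chunk compressed by reading it directly.
    filter_mask, compressed = f["d"].id.read_direct_chunk((0,))
```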

Here’s another suggestion that is hopefully worthy of the “2.0” version…

Alternative storage representations of HDF5 data in cloud object stores indicate there is very little difference between a contiguous dataset and a chunked dataset with only one chunk and no filters applied. How about removing the contiguous storage layout and keeping only chunked and compact?

Aleksandar

2 Likes

You’ve made me realize that there is an internal HDF5 issue at the moment: chunk sizes (the number of bytes in a chunk) are stored internally as 32-bit integers.

Furthermore, H5D_chunk_iter_op_t is about to expose this 32-bit value in the public API rather than using hsize_t.

Edit, issue created: https://github.com/HDFGroup/hdf5/issues/2056

Actually, we are almost there after the new indexing for chunked datasets was introduced in 1.10.0.

Current APIs and the programming model still require the use of H5Pset_chunk, but this call could be omitted if there is only one chunk (i.e., contiguous storage). Compression could then also be used on a “contiguous” dataset.
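In h5py terms, the single-chunk idea is just a chunked dataset whose one chunk spans the whole shape; a small sketch (file and dataset names are illustrative):

```python
import h5py
import numpy as np

shape = (64, 64)
with h5py.File("one_chunk_demo.h5", "w") as f:
    # One chunk covering the full shape: effectively "contiguous"
    # layout, but the filter pipeline (here gzip) is still available.
    d = f.create_dataset("d", shape=shape, chunks=shape, dtype="f4",
                         compression="gzip")
    d[...] = np.zeros(shape, dtype="f4")
```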

I couldn’t convince Quincey to introduce this change in 1.10.0 and the rest is history.

Elena

1 Like

Requests:

  • Set filters with ID code strings, not numbers.
  • Set filter parameters with string keyword arguments.
  • Official registry for filter ID code strings. The current registry is a good start, if made official for existing ID code strings, not just the numbers.

The use of contiguous storage is well entrenched in the HDF5 universe. It has certain advantages, such as plain simplicity and optimal subset access on local storage. I would prefer that support for contiguous storage be sustained. If compression is desired, just go to chunked storage, as intended by design.

Why should contiguous storage not just be a simple case of chunked storage? Contiguous storage is basically just chunked storage with a single chunk and no filters, no?

Could you give an example of an ID code string?

I think I understand and agree with the spirit of this request. It’s much easier to remember “gzip” or maybe “lzma2” as the identifier for a filter than “032105”. That said, can’t this already be achieved by adding a layer on top of the existing interface that keeps a mapping between strings and numbers? I don’t think the table would ever get so large that a linear search of it would have a negative performance impact. And, I honestly don’t think this needs to wait for an HDF5-2.0 or for THG to implement it to make it happen. It may already be implemented somewhere in the world :wink:
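Such a layer really is small. A sketch, with the built-in IDs taken from H5Zpublic.h and the registered IDs from the HDF Group filter registry (the registered values are worth double-checking against the registry):

```python
# A thin name -> numeric-ID layer over the existing filter interface.
FILTER_IDS = {
    "gzip": 1,          # H5Z_FILTER_DEFLATE
    "shuffle": 2,       # H5Z_FILTER_SHUFFLE
    "fletcher32": 3,    # H5Z_FILTER_FLETCHER32
    "szip": 4,          # H5Z_FILTER_SZIP
    "n-bit": 5,         # H5Z_FILTER_NBIT
    "scale-offset": 6,  # H5Z_FILTER_SCALEOFFSET
    "bzip2": 307,       # registered filter IDs below
    "lzf": 32000,
    "zfp": 32013,
    "zstd": 32015,
}

def filter_id(name: str) -> int:
    """Look up the numeric HDF5 filter ID for a filter name string."""
    try:
        return FILTER_IDS[name.lower()]
    except KeyError:
        raise ValueError(f"unknown filter name: {name}") from None
```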

1 Like

“Filter ID code strings”. That is a mouthful, sorry, but I was trying to be complete in more than one way. I am referring to the “Name” column in the HDF5 registry, in addition to simple names for the built-in filters. Here are a few examples.

gzip, n-bit, scale-offset, shuffle, BZIP2, LPC-Rice

You mean like h5py? https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline

dset = f.create_dataset("zipped", (100, 100), compression="gzip")
1 Like

Yes, like that, but build the name registry into the HDF5 core library, so that standard filter names are available and guaranteed outside of Python.

1 Like

Another thing to think about for HDF5 2.0…if you examine a lot of the functionality needed to manage metadata in an HDF5 file, you will find it is quite similar to what file systems have to do to manage their “media”. This kind of layering of the same abstractions, one upon the other, to provide ever more abstract storage objects is very similar to the IP protocol stack. On the one hand, this layering permits implementation of various abstractions. On the other hand, it feels duplicative, wasteful and unnecessarily complex. Are there ways that an HDF5 2.0 could avoid this and instead utilize some pieces of the lower-level media abstractions directly, rather than reimplementing its own to produce an HDF5 “container” inside of a file system container (inside of a spinning-disk container), and so on?

thread-parallel reads (and maybe writes) for same and different datasets in same file

+1 This is the only thing I wish for :pray:

In general, thread-efficiency across the library, not just thread-safety through the mega-lock.

3 Likes

Why should contiguous storage not just be a simple case of chunked storage?

Thread-parallel reads are not too hard to pull off now or soon.

  1. Locate the data (H5Dget_offset or H5Dchunk_iter).
  2. Memory map the file (you may need to turn off locking).
  3. Read the memory mapped file with multiple threads.
  4. Do parallel decompression, if needed.

What I really need the library to do is to efficiently tell me where to find or put the data rather than actually doing it.
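The steps above can be sketched with h5py plus the standard library. A minimal example, assuming a contiguous, unfiltered dataset so that step 1 is just H5Dget_offset and step 4 drops out (file name and sizes are made up):

```python
import mmap
import threading

import h5py
import numpy as np

# Step 1: create a contiguous dataset and ask the library where it lives.
with h5py.File("mm_demo.h5", "w") as f:
    d = f.create_dataset("d", data=np.arange(1024, dtype="f8"))
    offset = d.id.get_offset()  # byte offset of the raw data in the file

out = np.empty(1024, dtype="f8")

def reader(lo, hi, mm):
    # Step 3: each thread copies its own element range out of the mapping.
    out[lo:hi] = np.frombuffer(mm, dtype="f8", count=hi - lo,
                               offset=offset + lo * 8)

# Step 2: memory-map the (closed) file read-only.
with open("mm_demo.h5", "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        threads = [threading.Thread(target=reader,
                                    args=(i * 256, (i + 1) * 256, mm))
                   for i in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

For a chunked dataset, step 1 would instead walk the chunk addresses (H5Dchunk_iter or H5Dget_chunk_info) and each thread would decompress its own chunks.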

2 Likes