What do you want to see in "HDF5 2.0"?

Actually, we are almost there after the new indexing for chunked datasets was introduced in 1.10.0.

The current APIs and programming model still require a call to H5Pset_chunk, but this call could be omitted if there is only one chunk (i.e., contiguous storage). Compression could then also be used on a “contiguous” dataset.
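For illustration, here is roughly what the current programming model requires if you want compression on what is conceptually a contiguous dataset: declare the whole dataset as a single chunk. This is only a sketch; the file and dataset names are made up.

    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2] = {100, 100};

        hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);

        /* Compression requires chunked layout, so the workaround today is to
         * make the chunk the same size as the dataset ("one big chunk"). */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, dims);
        H5Pset_deflate(dcpl, 6);          /* gzip, level 6 */

        hid_t dset = H5Dcreate2(file, "zipped", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }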

I couldn’t convince Quincey to introduce this change in 1.10.0 and the rest is history.

Elena


Requests:

  • Set filters with ID code strings, not numbers.
  • Set filter parameters with string keyword arguments.
  • Official registry for filter ID code strings. The current registry is a good start, if made official for existing ID code strings, not just the numbers.

The use of contiguous storage is well entrenched in the HDF5 universe. It has certain advantages, such as plain simplicity and optimal subset access on local storage. I would prefer that support for contiguous storage be sustained. If compression is desired, just go to chunked storage, as intended by design.

Why should contiguous storage not just be a simple case of chunked storage? Contiguous storage is basically just chunked storage with a single chunk and no filters, no?

Could you give an example of an ID code string?

I think I understand and agree with the spirit of this request. It’s much easier to remember “gzip” or maybe “lzma2” as the identifier for a filter than “032105”. That said, can’t this already be achieved by adding a layer on top of the existing interface that keeps a mapping between strings and numbers? I don’t think the table would ever get so large that a linear search of it would have a negative performance impact. And, I honestly don’t think this needs to wait for HDF5 2.0 or for THG to implement it to make it happen. It may already be implemented somewhere in the world :wink:
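As a rough sketch of such a layer in C (the table contents are just examples, not an official registry; the BZIP2 entry uses its registered third-party ID):

    #include <string.h>
    #include "hdf5.h"

    /* Example mapping from filter name strings to registered filter IDs. */
    static const struct { const char *name; H5Z_filter_t id; } filter_table[] = {
        { "gzip",         H5Z_FILTER_DEFLATE },
        { "shuffle",      H5Z_FILTER_SHUFFLE },
        { "n-bit",        H5Z_FILTER_NBIT },
        { "scale-offset", H5Z_FILTER_SCALEOFFSET },
        { "bzip2",        307 },  /* registered third-party filter ID */
    };

    /* Look the name up and add the corresponding filter to a dataset
     * creation property list; returns a negative value for unknown names. */
    static herr_t set_filter_by_name(hid_t dcpl, const char *name,
                                     size_t cd_nelmts, const unsigned cd_values[])
    {
        for (size_t i = 0; i < sizeof filter_table / sizeof filter_table[0]; i++)
            if (strcmp(name, filter_table[i].name) == 0)
                return H5Pset_filter(dcpl, filter_table[i].id,
                                     H5Z_FLAG_OPTIONAL, cd_nelmts, cd_values);
        return -1;
    }

Whether the mapped filter is actually available at run time can still be checked with H5Zfilter_avail.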


“Filter ID code strings”. That is a mouthful, sorry, but I was trying to be complete in more than one way. I am referring to the “Name” column in the HDF5 registry, in addition to simple names for the built-in filters. Here are a few examples.

gzip, n-bit, scale-offset, shuffle, BZIP2, LPC-Rice

You mean like h5py? https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline

dset = f.create_dataset("zipped", (100, 100), compression="gzip")

Yes, like that, but build the name registry into the HDF5 core library, so that standard filter names are available and guaranteed, outside of python.


Another thing to think about for HDF5 2.0… if you examine a lot of the functionality needed to manage metadata in an HDF5 file, you will find it is quite similar to what file systems have to do to manage their “media”. This layering of the same abstractions, one upon the other, for the purpose of providing ever more abstract storage objects is very similar to the IP protocol stack. On the one hand, this layering permits the implementation of various abstractions. On the other hand, it feels duplicative, wasteful and unnecessarily complex. Are there ways that an HDF5 2.0 could avoid this and instead use pieces of the lower-level media abstractions directly, rather than reimplementing its own to produce an HDF5 “container” inside of a file system container (inside of a spinning disk container), etc.?

thread-parallel reads (and maybe writes) for the same and different datasets in the same file

+1 This is the only thing I wish for :pray:

In general, thread-efficiency across the library, not just thread-safety through the mega-lock.



Thread-parallel reads are not too hard to pull off now (or soon):

  1. Locate the data (H5Dget_offset or H5Dchunk_iter).
  2. Memory map the file (you may need to turn off locking).
  3. Read the memory mapped file with multiple threads.
  4. Do parallel decompression, if needed.

What I really need the library to do is to efficiently tell me where to find or put the data rather than actually doing it.
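A minimal sketch of steps 1–3 for a contiguous, unfiltered dataset (the file and dataset names are made up; you may also need to disable file locking, e.g. via the HDF5_USE_FILE_LOCKING environment variable):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include "hdf5.h"

    int main(void)
    {
        /* Step 1: ask the library where the raw data lives in the file. */
        hid_t   file   = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t   dset   = H5Dopen2(file, "data", H5P_DEFAULT);
        haddr_t offset = H5Dget_offset(dset);        /* HADDR_UNDEF for chunked/compact layouts */
        hsize_t nbytes = H5Dget_storage_size(dset);
        H5Dclose(dset);
        H5Fclose(file);

        /* Step 2: memory-map the underlying file directly. */
        int   fd   = open("example.h5", O_RDONLY);
        char *base = mmap(NULL, (size_t)(offset + nbytes), PROT_READ, MAP_SHARED, fd, 0);
        const char *data = base + offset;

        /* Step 3: hand disjoint slices of data[0 .. nbytes) to worker threads.
         * Step 4 (parallel decompression) only applies to filtered chunks,
         * which would be located with H5Dchunk_iter instead of H5Dget_offset. */

        munmap(base, (size_t)(offset + nbytes));
        close(fd);
        return 0;
    }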


UTF-16 for all strings.

I’m curious, why UTF-16 rather than UTF-8?

.NET uses some form of UTF-16 internally (not sure about other platforms), so it would make translation to and from it easier. Also, UTF-8 requires 1-4 bytes per character, which means that when creating fixed-length strings we ideally should allocate 4x the maximum string length to be safe. UTF-16 should hopefully mean only 2x, although I’m aware it’s not that simple.

I am not an HDF5 library developer but I think support for UTF-16 can likely be done even before “HDF5 2.0”. Strings in HDF5 are stored as bytes of a specific encoding. Currently only two encodings are recognized when defining a string datatype: ASCII and UTF-8. UTF-16 would just be another constant to add. It’s up to the client application to decode these bytes according to the datatype encoding information.
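For reference, this is how the character set is declared on a string datatype today; a UTF-16 variant would presumably just be one more H5T_cset_t value (the name in the comment is hypothetical).

    #include "hdf5.h"

    /* Fixed-length string datatype whose stored bytes are declared as UTF-8. */
    hid_t make_utf8_string_type(size_t nbytes)
    {
        hid_t type = H5Tcopy(H5T_C_S1);
        H5Tset_size(type, nbytes);          /* storage size in bytes, not characters */
        H5Tset_cset(type, H5T_CSET_UTF8);   /* today only H5T_CSET_ASCII and H5T_CSET_UTF8 exist;
                                               a UTF-16 flag (e.g. H5T_CSET_UTF16) would be new */
        return type;
    }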


Yes, agreed. In my understanding, the ‘encoding’ is just a hint to the client about how to decode the stored bytes; it doesn’t affect anything in the library.
I’ve gone with assuming everything is UTF-8 (and setting all the appropriate encoding flags) since it’s a superset of ASCII.

The problem with UTF-16 is the not-uncommon issue that what is claimed to be UTF-16 is actually invalid UTF-16 (see https://utf8everywhere.org/ and https://simonsapin.github.io/wtf-8/). UTF-8 is less commonly mislabelled (and can more easily be recognised as such through libraries like chardet).

.NET and Qt represent strings internally in UTF-16 because UTF-16 was the native Unicode representation of Windows NT and later versions. UTF-8 is much more common today, though, and even Windows is slowly moving to UTF-8 as the default encoding (see https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page). Thus, I am not sure UTF-16 support is worth it.