Why do filters have an upper limit for cd_nelmts?

Greetings,

It looks like I’ve been trying to push the limits of the filter API too far :slight_smile:

Users can provide H5Pset_filter() with auxiliary data that must be given as an array of unsigned integers. The number of elements in that array is given by the cd_nelmts argument. At first I thought it was odd to have an array of integers (as opposed to an opaque void *), but then I realized that this probably comes from the common use case for auxiliary data, which is to provide compression settings for the corresponding filter.
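
For reference, the common case looks something like this (the filter id and the meaning of the two values are made up for illustration):

#include <hdf5.h>

int main() {
    const H5Z_filter_t MY_FILTER_ID = 256;  // first id of the user-defined range
    unsigned cd_values[2] = {9, 4096};      // e.g. compression level, block size

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[1] = {1024};
    H5Pset_chunk(dcpl, 1, chunk);           // filters require a chunked layout
    // cd_nelmts == 2: the library copies two unsigned ints into the pipeline
    H5Pset_filter(dcpl, MY_FILTER_ID, H5Z_FLAG_MANDATORY, 2, cd_values);
    H5Pclose(dcpl);
    return 0;
}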

The filter I’m working with expects to receive some rich metadata, and this is where things start to get interesting. It’s not possible to cast an arbitrary block of data to int * and provide it to H5Pset_filter(), because underneath the HDF5 library reallocates that data if “cd_nelmts > H5Z_COMMON_CD_VALUES” (defined as 4). Since the internal copy-after-reallocation handles the data as unsigned, things can get messy quite easily (i.e., data corruption).
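
To make the hazard concrete, here is one way to pack an arbitrary blob safely before handing it to H5Pset_filter(); pack_blob is a helper name I made up:

#include <cstddef>
#include <cstring>
#include <vector>

// cd_values is copied element-wise as unsigned int, so a blob whose size is
// not a whole multiple of sizeof(unsigned) must be zero-padded up front,
// otherwise the tail bytes can be dropped or misread.
std::vector<unsigned> pack_blob(const void *blob, std::size_t nbytes) {
    std::size_t nelmts = (nbytes + sizeof(unsigned) - 1) / sizeof(unsigned);
    std::vector<unsigned> cd(nelmts, 0u);   // padded to an element boundary
    std::memcpy(cd.data(), blob, nbytes);
    return cd;                              // pass cd.size() as cd_nelmts
}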

Further, I found that the pipeline decoder callback (in H5Pocpl.c) expects no more than 256 bytes of auxiliary data to have been previously provided to the filter.

I have some ideas on how to work around this limitation, but first I’d like to have a better understanding of why these constraints exist. Could someone shed some light here?

Thanks in advance!
Lucas

While I was not aware of any upper limit on the size of the cd_nelmts array, I have used the array to pass a handful of doubles via type punning and made it work OK, as in the generic interface for the H5Z-ZFP filter. That, however, won’t work for Fortran.
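
The punning amounts to something like the sketch below; this is a loose illustration of the idea, not the actual H5Z-ZFP code, and the helper name is made up:

#include <hdf5.h>
#include <cstring>

// Reinterpret the bit patterns of a few doubles as unsigned ints and ship
// them through cd_values; each double spans sizeof(double)/sizeof(unsigned)
// slots (typically 2).
void set_double_params(hid_t dcpl, H5Z_filter_t filter_id,
                       const double *vals, size_t nvals) {
    unsigned cd_values[32];                 // assumes nvals is small
    std::memcpy(cd_values, vals, nvals * sizeof(double));
    H5Pset_filter(dcpl, filter_id, H5Z_FLAG_MANDATORY,
                  nvals * (sizeof(double) / sizeof(unsigned)), cd_values);
}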

Also note that you do not need to use HDF5’s generic H5Pset_filter() interface to manipulate the filter. You can create your own, properly typed, custom properties, as I also do for the H5Z-ZFP filter. The only downside is that you break the ability to use the filter solely as a dynamically loaded plugin, because the caller then needs the implementation of the custom property-list functions.
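
In sketch form, H5Pinsert2() lets you attach a typed value to a plist instance without registering it on the class (the property name and payload struct here are illustrative):

#include <hdf5.h>

struct zfp_params { int mode; double rate; };   // illustrative payload

hid_t make_dcpl_with_params(const zfp_params &p) {
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    zfp_params tmp = p;
    // name, size, and a pointer to the initial value; the six callbacks
    // (set/get/delete/copy/compare/close) are left NULL in this sketch
    H5Pinsert2(dcpl, "zfp_params", sizeof(zfp_params), &tmp,
               NULL, NULL, NULL, NULL, NULL, NULL);
    return dcpl;
}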

Interesting, I didn’t realize I could use H5Pinsert2() and friends. Thanks for the hint, Miller!

@miller86 On the note of H5Pinsert: when sneaking extra data along with a dataset hid_t, the dapl is not preserved across calls. Did you have a similar experience?

Well, depending on your flow of interactions with the library, when you add custom properties you might also need to define the HDF5 API equivalent of copy constructors for them. See the detailed docs for H5Pinsert and friends for generic property lists. In my case, I needed the properties only for the hand-off from caller to dataset writes, so I didn’t define a copy constructor. In more complicated workflows (e.g. library interactions), maybe I should. In fact, now that I think about it, I should probably add an issue ticket to H5Z-ZFP for that.
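
The “copy constructor” analogy looks roughly like this with H5Pregister2(); names are illustrative and the deep-copied payload is just a string:

#include <hdf5.h>
#include <cstring>
#include <cstdlib>

struct pipeline_state { char *buf; };       // illustrative per-plist state

// invoked whenever a plist carrying the property is copied: deep-copy the
// pointed-to state so each plist owns its own buffer
static herr_t copy_cb(const char *, size_t, void *value) {
    pipeline_state *s = static_cast<pipeline_state *>(value);
    if (s->buf) s->buf = strdup(s->buf);
    return 0;
}

// invoked when a plist carrying the property is closed: release the buffer
static herr_t close_cb(const char *, size_t, void *value) {
    free(static_cast<pipeline_state *>(value)->buf);
    return 0;
}

void register_state_prop() {
    pipeline_state def = {NULL};
    // callback slots: create, set, get, delete, copy, compare, close
    H5Pregister2(H5P_DATASET_ACCESS, "pipeline_state", sizeof(def), &def,
                 NULL, NULL, NULL, NULL, copy_cb, NULL, close_cb);
}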

Hi Mark, thank you for getting back to me! Did you want to look at this sketch? prop.cpp (3.8 KB)

Basically it creates a custom dapl, then registers it with the H5P_DATASET_ACCESS class, and verifies the property is present; so far everything checks out. Then it opens a dataset and checks whether the dapl retrieved from the dataset id has the same property set: it does not.
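
In compressed form the check looks roughly like this (illustrative names rather than the full prop.cpp, assuming a property “pipeline_state” registered on the H5P_DATASET_ACCESS class):

#include <hdf5.h>

void check(hid_t file) {
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    // reports true: the registered property is on the freshly created dapl
    htri_t before = H5Pexist(dapl, "pipeline_state");

    hid_t dset = H5Dopen2(file, "dataset", dapl);
    hid_t got  = H5Dget_access_plist(dset);
    // observed behavior: the dapl retrieved from the dataset id is not the
    // one passed to H5Dopen2, so the custom property is missing
    htri_t after = H5Pexist(got, "pipeline_state");
    (void)before; (void)after;
    H5Pclose(got); H5Pclose(dapl); H5Dclose(dset);
}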

In my understanding, once the property is registered, the provided callbacks should be invoked at some point. Probably I am doing it wrong, I just can’t seem to find why, or how. I also tried to register an entire class, with a similar result. In fact I spent some time on this in January, when I worked on a custom pipeline with BLAS level 3-like blocking and possibly multi-threaded compression. I didn’t find a solution and then forgot about it until you mentioned the technique.

best: steve

Hmmm…thanks for the code! I compiled and ran it and got the same behavior.

I am thinking that dataset ACCESS properties are NOT persisted to the file. CREATION properties, yes, but not ACCESS properties as those are allowed to vary.

So, I tried modifying your code, prop.cpp (4.6 KB), in some key ways:

  • As a sanity test, I switched from ACCESS to CREATE properties, as per the above reasoning.
  • I noticed you created the dataset with H5P_DEFAULT properties before you registered the new sub-class, so I changed the order of those operations, thinking it would have the effect of changing the meaning of H5P_DEFAULT for creation properties. It didn’t.
  • So, I explicitly used the creation properties created with the H5Pcreate call in place of H5P_DEFAULT in H5Dcreate, and that indeed ensured the dataset so created had the new sub-class on the create props (see the sketch after this list).
  • I added writing to the dataset in case HDF5 optimized anything away for an empty dataset.
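
Roughly, the third bullet amounts to this (class and property names are made up):

#include <hdf5.h>

hid_t create_with_custom_dcpl(hid_t file) {
    // derive a sub-class from the standard dataset-create class
    hid_t cls = H5Pcreate_class(H5P_DATASET_CREATE, "my_create_class",
                                NULL, NULL, NULL, NULL, NULL, NULL);
    int def = 42;
    H5Pregister2(cls, "my_prop", sizeof(int), &def,
                 NULL, NULL, NULL, NULL, NULL, NULL, NULL);

    hid_t dcpl = H5Pcreate(cls);            // an instance of the sub-class
    hsize_t dims[1] = {16};
    hid_t space = H5Screate_simple(1, dims, NULL);
    // passing dcpl here, instead of H5P_DEFAULT, is what made the new
    // sub-class show up on the created dataset's create props
    hid_t dset = H5Dcreate2(file, "d", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Sclose(space); H5Pclose(dcpl); H5Pclose(cls);
    return dset;
}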

Nonetheless, after all of that, I still don’t get a result where the new CREATE properties appear to be persisted to the file.

Thanks for the update to the code and for the observations! I am not expecting the properties to persist within the file, only to keep/maintain their state while the dataset is open. If the dapl followed the dcpl behaviour, then one could pass additional state information along to opened datasets.

Currently, dapl:

  1. dataset is opened with a dapl property
  2. the dapl property can be retrieved from the opened dataset
  3. the retrieved dapl doesn’t match the one passed along to H5Dopen

dcpl:

  1. dataset is created with a dcpl property
  2. the dcpl property can be retrieved from the freshly created dataset
  3. the custom property callbacks are called: at least copy_prop
  4. the properties do match

It is an interesting find indeed that the dcpl does follow the advertised H5Pregister2 behaviour and maintains its state.

@miller86 I further modified the original file, and also tracked down why the dapl is not propagated and what minimal modification would get the dapl copied over in the H5Dopen and H5Dcreate calls.

This GitHub page has the details of the problem, and a possible solution based on the observation @miller86 made: the dcpl propagates custom properties through H5Dopen and H5Dcreate.

dapl.patch (4.6 KB)
prop.cpp (8.5 KB)

Wow! Hopefully THG will have a quick look at your code to see if it is sensible to integrate into the HDF5 library proper.

I am not expecting the properties to persist within the file, only to keep/maintain their state while the dataset is open.

Based on the example code you originally posted, your second call to H5Dget_create_plist after you closed and then re-opened the dataset suggests that this is what you want. Your first call, just after the dataset creation, does indicate the custom properties are there. But once you close that dataset, the only record of it the HDF5 library has is what is then in the file (i.e. persisted). So, your second call to H5Dopen followed by H5Dget_create_plist, expecting to still find those custom properties, is an expectation that your custom create properties should be persisted.
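
Compressed into code, the sequence is (illustrative names; custom_dcpl is assumed to carry the custom create property):

#include <hdf5.h>

void demo(hid_t file, hid_t custom_dcpl) {
    hsize_t dims[1] = {8};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "d", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, custom_dcpl, H5P_DEFAULT);
    hid_t p1 = H5Dget_create_plist(dset);   // custom property still visible
    H5Dclose(dset);

    dset = H5Dopen2(file, "d", H5P_DEFAULT);
    hid_t p2 = H5Dget_create_plist(dset);   // rebuilt from what the file
                                            // stores: custom property gone
    H5Pclose(p1); H5Pclose(p2); H5Sclose(space); H5Dclose(dset);
}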

FWIW, I believe you may have uncovered 1 or maybe 2 bugs here.

  1. Should customization of an HDF5 standard property class change the meaning of H5P_DEFAULT when it is used as an argument for a property of that class? At best, I think the documentation is vague on this point, and I think it is wholly reasonable to conclude that maybe it should. However, your example code demonstrates that this is NOT the case.
  2. Should customized dataset CREATE properties be persisted to the file? I believe this is the root of the problem in your existing code, where you open the dataset you previously created and cannot find your custom properties there. Standard CREATE properties certainly are persisted. So, I think it is reasonable for a caller who has customized the dataset CREATE properties to expect those customized properties to get persisted right along with the standard ones.

On issue 2, I am wondering how that unpredictably sized and shaped customized data would wind up getting stored. I mean, suppose for some strange reason your customized properties were really, really large? How/where might the HDF5 library store that data? It isn’t attribute data. It isn’t dataset data. It’s kinda/sorta object header/metadata.

So, after reading the documentation a bit more: while I do think there is room for improvement in the documentation, it is clear that H5Pcreate_class creates a new class and returns an hid_t for that newly created class. That makes it pretty clear that if you want the customized class you’ve defined, you have to use that hid_t and not H5P_DEFAULT.

Now, I believe THG has at one time or another considered a way of re-defining what H5P_DEFAULT means for a dapl or a dcpl, or wherever else H5P_DEFAULT can be used, and there might even be an RFC for an API enhancement. I briefly searched but couldn’t find it, though.

Based on the example code you originally posted, your second call to H5Dget_create_plist after you closed and then re-opened the dataset suggests that this is what you want.

The H5Dclose is there to make certain I can re-open the dataset from a clean state with H5Dopen(..., dapl_id). The emphasis is on the dapl_id passed to the H5Dcreate and H5Dopen API calls, and how it interacts with H5Dget_create_plist and H5Dget_access_plist. Of the two, the dapl is the more interesting, since it is passed to both API calls, and if it preserved its state – which at this point it doesn’t – then the H5Dget_access_plist call would return the same property list.

Returning a property list only makes sense if it is the same one you saved. That way one can save raw pointers to objects and retrieve them later in a different call.
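
For example, something along these lines (hypothetical names; this round-trip is exactly what doesn’t work for the dapl today):

#include <hdf5.h>

struct my_pipeline;                         // opaque user object

hid_t stash_pointer(my_pipeline *p) {
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    // store the raw pointer itself as the property value
    H5Pinsert2(dapl, "pipeline_ptr", sizeof(my_pipeline *), &p,
               NULL, NULL, NULL, NULL, NULL, NULL);
    return dapl;
}

my_pipeline *retrieve_pointer(hid_t dset) {
    my_pipeline *p = NULL;
    hid_t dapl = H5Dget_access_plist(dset);
    // would hand back the same pointer, if the dapl state were preserved
    H5Pget(dapl, "pipeline_ptr", &p);
    H5Pclose(dapl);
    return p;
}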

To give you an example of how I am using this in H5CPP, one can pass custom data access properties as a list to h5::open:

auto fd = h5::open("some_file.h5", H5F_ACC_RDWR );
auto ds = h5::open(fd, "dataset",  
     h5::high_throughput | h5::julia );

The idea is to choose the filtering pipeline with h5::builtin | h5::high_throughput, and to control how objects such as std::hashmap<K,V> should be persisted (the h5::julia | h5::python | h5::matlab way), so they can be retrieved from the respective systems without additional code. If this worked out, it would open up the possibility of using the HDF5 format for general compiler-assisted persistence in modern C++.

Although in H5CPP I can work around this problem by attaching the dapl to the h5::ds_t dataset handle, this breaks the contract of H5CPP being binary compatible with the underlying HDF5 C API.

I hope the above argument makes sense to THG.

  2. Should customized dataset CREATE properties be persisted to the file? I believe this is the root of the problem in your existing code, where you open the dataset you previously created and cannot find your custom properties there. Standard CREATE properties certainly are persisted. So, I think it is reasonable for a caller who has customized the dataset CREATE properties to expect those customized properties to get persisted right along with the standard ones.

In my opinion it should not. There are attributes to color datasets and groups, and their lifespan is indeed tied to their parent. I argue that properties are the ‘attributes’ of object handles/dataset descriptors, and a property list’s lifespan should be tied to its parent object, the dataset descriptor.

For instance, if I want to replace the HDF5-provided filter chain with one of my own (which has different design criteria and performance properties), then the easiest way of tying in this experimental pipeline is to use a property list. Using the advertised property-list interface, one can initialize and shut down the object properly. Unfortunately this is not possible with the current implementation of the dapl.

H5P_DEFAULT is unaffected by the proposed changes. The code path only replicates/mirrors the existing mechanism, the one used with the dcpl.
Although in H5CPP the h5::default properties are initialised with sensible values, with good results/acceptance so far, I don’t see how that is relevant to the original problem: the HDF5 C API doesn’t preserve the Data Access Property List, it makes one up on the go instead.

Mark and Steven,

Thank you for the great discussion!

I created a JIRA issue HDFFV-10934 and will bring this up with the developers at the developers meeting this week.

The report is timely, since we are currently looking into the property-list implementation to address performance issues and are reviewing the existing PL implementation, documentation, etc.

Thanks again!

Elena

Steven’s proposed changes have been merged and will be included in the next releases of versions 1.12, 1.10, and 1.8. Thanks for your work, Steven. We will credit your contribution in the next RELEASE.txt file.