There are more details there, but to summarise: There’s a clear, albeit niche, use case where you don’t want timestamps, namely when you want to be able to generate byte-for-byte identical HDF5 files. But it’s not obvious how to get any timestamps from h5py, or with the HDF5 command line tools (h5ls, h5dump, h5stat). And the docs for H5Oget_info3 say that only ctime is implemented anyway. And, going by discussions here and on h5py, it looks like no-one cares much about these limitations, maybe because timestamps aren’t used much.
But maybe I’m missing something? If there’s some use case where the timestamps that are implemented are useful, and the existing ways to access them are sufficient, I’d like to know, so we can weigh that up against the use case where we know timestamps are a nuisance.
@thomas1 hope you are well!? We met virtually at the last HUG event, I enjoyed your presentation! IMHO this is a very specific feature, one that you would possibly take on a day trip on a few special occasions, but that in most cases would just take up room in the rucksack, so to speak.
Truth be told, I am not a major consumer of Python; my use cases are always across platforms: write in one language (C++), consume from others: h5py, Julia, MATLAB, R, … My point is that niche features with an associated cost have less value. By cost I mean maintenance, performance, and so on.
Personally, when I use a timestamp, I add it directly as an attribute, or as a field in the dataset, to signal intent.
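For readers who want to see what an explicit timestamp could look like, here is a minimal h5py sketch. The dataset name `measurements`, the attribute name `created_utc`, and both helper functions are hypothetical choices for illustration, not anything prescribed by HDF5 or h5py.

```python
from datetime import datetime, timezone

import h5py


def write_with_explicit_timestamp(path, values):
    """Write `values` to a dataset and record the creation time as an attribute."""
    # ISO 8601 string in UTC; the format is an assumption, any convention works.
    created = datetime.now(timezone.utc).isoformat()
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("measurements", data=values)
        # The attribute name is hypothetical; it makes the intent visible to any reader.
        dset.attrs["created_utc"] = created
    return created


def read_timestamp(path):
    """Read the explicit timestamp back from the dataset attribute."""
    with h5py.File(path, "r") as f:
        return f["measurements"].attrs["created_utc"]
```

Because the timestamp is ordinary user data, it should be readable from any tool or language that can read attributes, which matters for the cross-platform case described above.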
I should have said explicitly that I’m interested in use cases no matter what tools & languages you use to work with HDF5, because files written by h5py may well be read by other tools. This is why I didn’t post in the h5py category.
Maybe a bit late to the discussion, but I’ll add that I took a similar decision to disable object timestamps by default when using rhdf5. I had several users grumble that different md5 sums were generated for “identical” files, which confused some part of their workflow, and no one gave me the counterexample you’re looking for either.
Thanks, that’s useful to know (and not too late). Can I ask how long ago you made that change and whether you’ve had any complaints since? I think that we’re most likely to find out about use cases from people complaining after we’ve changed the default, so if someone else has already tried the experiment…
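To illustrate why an embedded timestamp breaks checksum comparisons, here is a toy sketch using only the Python standard library. The `fake_file_bytes` container is a made-up layout for illustration and is not the HDF5 file format; the timestamp values are arbitrary.

```python
import hashlib
import struct


def md5_hex(payload: bytes) -> str:
    """Return the hex md5 digest of a byte string."""
    return hashlib.md5(payload).hexdigest()


def fake_file_bytes(data: bytes, timestamp=None) -> bytes:
    """Toy container (NOT HDF5): an optional 8-byte timestamp header, then the data."""
    header = b"" if timestamp is None else struct.pack("<q", timestamp)
    return header + data


data = b"identical dataset contents"

# Same data written one minute apart, with the write time embedded.
with_ts_a = fake_file_bytes(data, timestamp=1_600_000_000)
with_ts_b = fake_file_bytes(data, timestamp=1_600_000_060)

# Same data written without any embedded time.
no_ts_a = fake_file_bytes(data)
no_ts_b = fake_file_bytes(data)

print(md5_hex(with_ts_a) == md5_hex(with_ts_b))  # False: the headers differ
print(md5_hex(no_ts_a) == md5_hex(no_ts_b))      # True: the bytes are identical
```

The same reasoning applies to any checksum-based workflow: if the writer stores the current time somewhere in the file, two runs with the same inputs produce different bytes.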
I think that might be a good idea, and I’d say changing a default option is acceptable in a major release. But obviously it’s up to you how you weigh the balance between sensible defaults and compatibility. HDF5 is widely used enough that there’s probably someone out there using the timestamps.
You’re obviously welcome to take what happens with h5py as another data point for that decision. We’ve merged the change now, so it will probably be in the next release.
Back in 2014, HDF5 object time stamping was disabled in the netCDF library, after occasional, recurring requests for bit-for-bit reproducibility. Since then, I have seen appreciation for BFB reproducibility, but no laments about loss of time stamping, as best I can recall.
I will add my vote for changing the default for object time stamping to “disabled”.
I am probably one of the few people who care about these timestamps. Since our users create a lot of data (stored across various files, which are not always HDF5 files) within a processing project, a timestamp gives them additional information about the context of the data within the project.
If the default for object time tracking is changed, please communicate this clearly. It is easy to re-enable the tracking, but it is hard to fix HDF5 files afterwards once they have been written without proper object timestamps.
Furthermore, for me, it is a bit unfortunate that not all of the timestamps have been implemented yet.
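As a rough sketch of what re-enabling could look like in h5py, the `track_times` keyword of `create_dataset` can be set explicitly per dataset. The helper name `create_tracked_dataset` is hypothetical, and this only covers datasets created through this call, not groups or objects written by other tools.

```python
import h5py


def create_tracked_dataset(path, name, values):
    """Create a dataset with HDF5 object time tracking explicitly switched on."""
    with h5py.File(path, "w") as f:
        # Passing track_times explicitly makes the behaviour independent of
        # whatever default the installed h5py version uses.
        f.create_dataset(name, data=values, track_times=True)
```

Setting the option explicitly in code that depends on the timestamps is probably the safest way to stay unaffected by a change in the default.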