Hi Matt,
3) I'm not sure I agree with the desirability implied by the objection noted. Having it non-opaque would be desirable for provenance.
I think a version 4 UUID, in addition to the creation date, MAC address/computer name, & user account.
We could allow an application to choose which type of UUID to store. I've filed a bug for adding a UUID to a file and will amend it to suggest giving the application the choice of which version of the UUID to store.
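To make the "choice of UUID version" concrete, here is a minimal sketch using Python's standard `uuid` module (not the HDF5 API; purely illustrative of the two flavors being discussed). Version 1 is the non-opaque, provenance-friendly option, since it embeds a timestamp and usually the host's MAC address; version 4 is purely random and therefore opaque:

```python
import uuid

# Version 1 embeds the creation timestamp and (usually) the host's MAC
# address: the non-opaque, provenance-friendly flavor discussed above.
provenance_id = uuid.uuid1()

# Version 4 is purely random, and therefore opaque.
opaque_id = uuid.uuid4()

print(provenance_id.version)  # 1
print(opaque_id.version)      # 4
```

An application choosing the version would simply store one or the other as the file's identifier.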
It would be good to put two UUIDs in the datasets as part of
http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfo
The first UUID would be written once, during creation. (UUID-DS-birth, corresponding to birth time)
The second UUID would be updated every time the dataset is written. (UUID-DS-modified, corresponding to mod or change time)
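The birth/modified bookkeeping above can be sketched as follows. This is a stand-in, not HDF5 code: a plain dict plays the role of the dataset's object header, and the attribute names UUID-DS-birth / UUID-DS-modified are the ones proposed in this thread:

```python
import uuid

def create_dataset():
    """Stand-in for dataset creation: both UUIDs are minted once,
    at birth, and start out equal."""
    birth = str(uuid.uuid4())
    return {"UUID-DS-birth": birth, "UUID-DS-modified": birth, "data": []}

def write_dataset(ds, values):
    """Every write re-mints only UUID-DS-modified; the birth UUID is
    never touched after creation."""
    ds["data"].extend(values)
    ds["UUID-DS-modified"] = str(uuid.uuid4())

ds = create_dataset()
birth = ds["UUID-DS-birth"]
write_dataset(ds, [1, 2, 3])
# After a write: birth UUID unchanged, modified UUID regenerated.
```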
Hmm, I'm still reluctant to add a UUID to each object... Wouldn't the offset of the object in the file combined with its birth/change time be unique?
This all might be unnecessary if the time_t value is inherently unique.
Unfortunately, time_t's are in units of seconds, so it's certainly possible to create many objects that would have the same timestamp. :-/
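The one-second resolution of time_t makes collisions easy to demonstrate; a quick sketch (Python stdlib, truncating to whole seconds the way a time_t would):

```python
import time

# time_t-style stamps have one-second resolution, so many objects
# created in a tight loop end up with identical timestamps.
stamps = [int(time.time()) for _ in range(1000)]
collisions = len(stamps) - len(set(stamps))
# collisions is almost certainly large: hundreds of "objects"
# created within the same one or two seconds.
```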
If I copy the HDF5 "internal" UUID from one file to another along with the accessible contents, I do not need to manipulate the target file's timestamp to match the source file's, so applications that depend on the stamp could not tell the difference. I would ask all applications to check the UUID instead.
Hmm, I don't think copying the UUID to the new file is a good idea - the UUID for each file created by the HDF5 library should be unique. The new file should get its own UUID...
I agree if you are doing it through the HDF API.
If you are doing it with a cp a.hdf b.hdf, then it is unavoidable.
Yup.
So there are several potential states of an HDF5 file UUID:
1) the original file with the original UUID
2) a duplicate file with the original UUID (e.g. cp), that is byte for byte identical.
3) a duplicate file with a different UUID, but other than the UUID is byte for byte identical
4) a duplicate file with a different UUID, but has been modified (e.g. new datasets added).
5) the original file with the original UUID, but has been later modified (e.g. new datasets added).
The main concern is ending up with HDF5 files that have the same UUID but different data, with no way to figure out what the relationship is between them.
Yes, that would be my primary concern also.
A lesser concern is having two files that are identical except for the UUID, with no way to figure out what the relationship is between them.
Agree.
Possible methods to manage these states:
1) keep a log within the HDF file, noting the date every time it was opened
2) keep a log within the HDF file, noting the date every time it was modified.
I imagine that just keeping the last modification time would be sufficient to distinguish between files with the same UUID. (Although the other information might be useful in certain circumstances where a stronger log of provenance was desired).
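The "UUID plus last modification time" check can be sketched like this; the field names `uuid` and `mtime` are hypothetical stand-ins for whatever the HDF5 file header would actually store:

```python
def same_logical_file(a, b):
    """Two file headers describe the same contents only if both the
    UUID and the last-modification stamp match."""
    return a["uuid"] == b["uuid"] and a["mtime"] == b["mtime"]

original = {"uuid": "abc", "mtime": 100}
cp_copy  = {"uuid": "abc", "mtime": 100}  # state 2: byte-for-byte cp
modified = {"uuid": "abc", "mtime": 250}  # state 5: same UUID, later edits

# The cp duplicate is indistinguishable (as expected), while the
# same-UUID-but-modified file is flagged by its newer mtime.
```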
If I am doing a high-level copy of a dataset from one HDF5 file to another (not sure if there is an API call for this)
Yes, we introduced H5Ocopy() in HDF5 1.8.0.
, it would be useful to maintain the original UUID-DS-birth, but the UUID-DS-modified would change.
Hmm, I think we should make it optional which times to preserve - similar to the '-p' flag for UNIX cp, only possibly with some more control about what aspects to preserve.
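A sketch of that `cp -p`-style option, again with a dict standing in for the dataset and a hypothetical `copy_dataset` in place of a real H5Ocopy() call: preserving the birth UUID is optional, while the modified UUID is always re-minted in the destination:

```python
import uuid

def copy_dataset(src, preserve_birth=True):
    """Hypothetical H5Ocopy-like copy. With preserve_birth (the cp -p
    analogue) the original UUID-DS-birth travels with the data;
    UUID-DS-modified is always regenerated for the new copy."""
    dst = {"data": list(src["data"]),
           "UUID-DS-modified": str(uuid.uuid4())}
    dst["UUID-DS-birth"] = (src["UUID-DS-birth"] if preserve_birth
                            else str(uuid.uuid4()))
    return dst

src = {"UUID-DS-birth": "b-0001", "UUID-DS-modified": "m-0001",
       "data": [1, 2, 3]}
kept  = copy_dataset(src, preserve_birth=True)   # lineage preserved
fresh = copy_dataset(src, preserve_birth=False)  # brand-new identity
```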
As opposed to creating a new dataset, reading the data out of the old dataset, writing it into the new dataset, and doing a close.
Yes, the HDF5 library wouldn't have any idea of what you were doing in that case.
Quincey
On Apr 3, 2009, at 5:02 PM, Matthew Dougherty wrote: