provenance

When an HDF file is created, is any unique metadata embedded in the file, such as a date?

Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA

···

=========================================================================

Each object in an HDF5 file should have its own "birth time, modified time, change
time, access time":

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfo

Not sure to what extent this is stored in the file itself; I guess an "access time"
would not work well for a read-only file.

  Werner

···

--
___________________________________________________________________________
Dr. Werner Benger <werner@cct.lsu.edu> Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Thanks, Werner.

I'm not sure what access time means, or what the difference is between modification time and change time.

I would guess the modified time is the latest, and previous times are not kept.

Matthew Dougherty

···

Hi Matthew,

thanks, Werner.

not sure what access time means, or the difference between modification time and change time.

  A caveat before I explain these further: only the birth time is fully working and the modification time is only updated for certain dataset operations. These fields were not fully finished before the 1.8.0 release and we haven't had a chance to come back and flesh them out.

  Here are the time fields that are available in the object header, for each object in the file:
    Birth Time (btime) - When the object was created
    Access Time (atime) - When the object was last accessed (read or write)
    Modification Time (mtime) - When the object's raw data was last modified
    Change Time (ctime) - When the object's metadata was last modified

  These should correspond to the fields in the POSIX "stat" call ("man 2 stat"). Also, currently the mtime field is tracking metadata modifications on datasets, but that's a bug and will be corrected when this feature is wrapped up.
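For comparison, the POSIX-style fields mentioned above can be read from any ordinary file. The following is a minimal sketch in Python (standard library only). It inspects a throwaway temporary file, not an HDF5 object, and the helper name `file_times` is made up for illustration. POSIX `stat` itself defines only access, modification, and change times; a birth time is exposed as `st_birthtime` on some platforms only, so the sketch treats it as optional.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_times(path):
    """Return the stat timestamps of `path` as timezone-aware datetimes."""
    st = os.stat(path)

    def to_dt(seconds):
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    times = {
        "atime": to_dt(st.st_atime),  # last access
        "mtime": to_dt(st.st_mtime),  # last modification of the file data
        "ctime": to_dt(st.st_ctime),  # last status (metadata) change on POSIX
    }
    # Birth time is not part of POSIX stat; only some platforms expose it.
    birth = getattr(st, "st_birthtime", None)
    times["btime"] = to_dt(birth) if birth is not None else None
    return times


# Create a small temporary file, report its timestamps, then clean up.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example")
    tmp_path = tmp.name

times = file_times(tmp_path)
for name, value in times.items():
    print(name, value)
os.remove(tmp_path)
```

These are file-system timestamps for the file as a whole; the per-object times described in this message live inside the HDF5 file and are a separate mechanism.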

  BTW, tracking these times for an object can be disabled by calling H5Pset_obj_track_times() with the 'track_times' parameter set to FALSE.

I would guess the modified time is the latest, and previous times are not kept.

  Yes, that's true.

    Quincey

···

As far as I remember, there is no time written in the file. The first thing
that is written after the user block is the signature of the superblock. I
think Matthew should use OS functions to check the times Werner described.

···

Is there a unique identifier stored internally within an HDF file when it is created?

Something that would distinguish it from another HDF file created two seconds later, and that cannot be disabled or modified.
I do not want to use the operating system/file system creation date or the parameters in fstat; I would prefer an HDF API.

···

Hi Matthew,

  That's a good idea... Would a UUID fit what you are thinking? (http://en.wikipedia.org/wiki/Uuid)

    Quincey

···

few ideas:

1) It should be allowable to have more than one UUID.
They may be independent of each other, and added at different times.

2) UUID conforms to Open Software Foundation standards, which is good.
Looking at the website mentioned: "Version 1 scheme has been criticized in that it is not sufficiently 'opaque'; it reveals both the identity of the computer that generated the UUID and the time at which it did so."

3) Not sure I agree with the desirability implied by the objection noted. Having it non-opaque would be desirable for provenance.
I am thinking of a version 4 UUID in addition to the creation date, MAC/computer name, and user account.

4) Have a non-changeable flag, set at HDF file creation, that would override calls to H5Pset_obj_track_times, ignoring a 'track_times' parameter set to FALSE.
Set it at creation, and modifications and changes are always noted.

5) A dataset, accessed only by internal HDF infrastructure (that is not directly writeable using HDF apis) that centrally logs all the changes.
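To make the opacity trade-off in points 2 and 3 concrete, here is a small sketch using Python's standard `uuid` module. It is only an illustration of the two UUID versions and does not use any HDF5 call; the helper name `uuid1_timestamp` is invented here. A version 1 UUID carries a 60-bit timestamp (100 ns units since 1582-10-15) and a 48-bit node field, which is normally derived from a network hardware address but may be random if none can be determined. A version 4 UUID is random and reveals neither.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 timestamps count 100-nanosecond intervals from this date.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_timestamp(u):
    """Recover the creation time embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("only version 1 UUIDs embed a timestamp")
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


v1 = uuid.uuid1()  # non-opaque: time and node can be read back
v4 = uuid.uuid4()  # opaque: random bits only

print("v1:", v1)
print("  created:", uuid1_timestamp(v1).isoformat())
print("  node   :", format(v1.node, "012x"))
print("v4:", v4, "(version", str(v4.version) + ")")
```

A version 4 UUID stored next to separately recorded creation date, host, and user fields, as suggested in point 3, would keep the identifier itself opaque while still recording the provenance details.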

···

Hi Quincey

if it would be possible to copy it, I would also be interested. My use case
is this: when deleting large objects from the file, I copy all objects to a
new file and delete the old one. But some applications may depend on the
time stamp of the file and therefore I would have to modify the time stamp
of the new file using OS functions. Any unique ID would solve this.

thanks!

-- dimitris

···

Would we like to have shared UUIDs, such as to specify that "this set of HDF5
files belongs together"? This could be very useful for files produced during
some multiprocessor simulation.

  Werner

···

Hi Matthew,

few ideas:

1) should be allowable to have more than one UUID.
They may be independent of each other, and added at different times.

  Hmm, what do you mean here? Below you were asking for a single unique ID for each HDF5 file...

2) UUID conforms to Open Software Foundation standards, which is good.
looking at the website mentioned: Version 1 scheme has been criticized in that it is not sufficiently 'opaque'; it reveals both the identity of the computer that generated the UUID and the time at which it did so.

3) Not sure I agree with the desirability implied by the objection noted. Having it non-opaque would be desirable for provenance.
I think version 4 UUID in addition to the creation date, mac/computer name, & user account.

  We could allow an application to choose which type of UUID to store. I've filed a bug for adding a UUID to a file and will amend it to suggest giving the application the choice of which version of the UUID to store.

4) have a non changeable flag set in the HDF creation that would override calls to H5Pset_obj_track_times ignoring 'track_times' parameter set to FALSE.
set it at creation and modifications & changes are always noted.

  Hmm, I don't think that's very helpful, really. We don't have any other "override" properties like this...

5) A dataset, accessed only by internal HDF infrastructure (that is not directly writeable using HDF apis) that centrally logs all the changes.

  This is a _lot_ more intensive to implement; I don't think we can go in this direction without some real funding for the effort. :-)

  Quincey

···

Hi Dimitris,

Hi Quincey

if it would be possible to copy it, I would also be interested. My use case is this: when deleting large objects from the file, I copy all objects to a new file and delete the old one. But some applications may depend on the time stamp of the file and therefore I would have to modify the time stamp of the new file using OS functions. Any unique ID would solve this.

  Are you wanting to copy objects' access/modification/etc times, or the UUID for the file?

    Quincey

···

Hi Werner,

Would we like to have shared UUID's, such as to specify that "this set of HDF5
files belongs together"? This could be very useful for files produced during
some multiprocessor simulation.

  Hmm, that might be better off being specified in some other way, perhaps with an attribute on the root group, or as information in each file's user block.

  Quincey

···

Hi Quincey

Sorry, it was not clear. I meant the UUID. If I copy the HDF5 "internal" UUID
from one file to another along with the accessible contents, I do not need
to manipulate the time stamp of the target file to match that of
the source file so that applications that depend on the stamp will not tell
the difference. I ask all applications to check the UUID.

thanks!

-- dimitris

···

Hi Dimitris,

Hi Quincey

sorry it was not clear. I meant the UUID. If I copy the HDF5 "internal" UUID from one file to another along with the accessible contents, I do not need to manipulate the time stamp of the target file to be the same with that of the source file, so that applications that depend on the stamp will not tell the difference. I ask all applications to check the UUID.

  Hmm, I don't think copying the UUID to the new file is a good idea - the UUID for each file created by the HDF5 library should be unique. The new file should get its own UUID...

    Quincey

···

1) should be allowable to have more than one UUID.
They may be independent of each other, and added at different times.

  Hmm, what do you mean here? Below you were asking for a single unique ID for each HDF5 file...

1) Other scientific groups may have their own UID schemes (e.g., LSIDs (Life Science Identifiers) or DOIs).

2) We definitely need an HDF-created UID that is not easy to change.

3) Tracking the provenance of an HDF file might be accomplished by logging a UID (e.g., a time) whenever the file is opened.
The HDF file then has a collection of open times, which are unique to that file.
If any write activity occurs after opening, the open UID is flagged as such.
When a file is copied, the files diverge and are identified by different open UIDs.

Such a provenance scheme should be automatic and optional.
In some instances you don't want the overhead, for example when you are doing a million opens.
In such a case you get one UID when the HDF was created.
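The open-log idea in point 3 can be modelled in a few lines. This is a purely hypothetical sketch in Python: `OpenRecord`, `OpenLog`, and `shared_history` are names invented for illustration, nothing like this exists in the HDF5 library, and the log is held in memory rather than inside a file. Each open gets a random UUID plus a timestamp (the message above suggests a time alone could serve as the UID), and a flag records whether a write followed.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class OpenRecord:
    """One entry per file open (hypothetical structure)."""
    open_id: str
    opened_at: datetime
    wrote: bool = False  # set once any write follows this open


@dataclass
class OpenLog:
    """Hypothetical per-file log of opens; not an HDF5 feature."""
    records: list = field(default_factory=list)

    def record_open(self):
        rec = OpenRecord(str(uuid.uuid4()), datetime.now(timezone.utc))
        self.records.append(rec)
        return rec

    def copy(self):
        # A file copy carries the history so far; later opens diverge.
        return OpenLog([OpenRecord(r.open_id, r.opened_at, r.wrote)
                        for r in self.records])

    def shared_history(self, other):
        """Number of leading open records the two logs have in common."""
        count = 0
        for mine, theirs in zip(self.records, other.records):
            if mine.open_id != theirs.open_id:
                break
            count += 1
        return count


original = OpenLog()
original.record_open().wrote = True   # creation: opened and written
duplicate = original.copy()           # the file is copied at this point
original.record_open()                # read-only open of the original
duplicate.record_open().wrote = True  # the copy is opened and modified

print(original.shared_history(duplicate))  # prints 1
```

The shared prefix shows where two files had a common history, and the `wrote` flags after that point show which of them changed.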

  We could allow an application to choose which type of UUID to store. I've filed a bug for adding a UUID to a file and will amend it to suggest giving the application the choice of which version of the UUID to store.

sounds good, would like to have one to choose from that is not opaque, should include time, computer, username.
audit trails are key to provenance.

4) have a non changeable flag set in the HDF creation that would override calls to H5Pset_obj_track_times ignoring 'track_times' parameter set to FALSE.
set it at creation and modifications & changes are always noted.

  Hmm, I don't think that's very helpful, really. We don't have any other "override" properties like this...

main concern is the audit trail gets turned off.

···

Hi Dimitris,

Hi Quincey

sorry it was not clear. I meant the UUID. If I copy the HDF5 "internal"
UUID from one file to another along with the accessible contents, I do not
need to manipulate the time stamp of the target file to be the same with
that of the source file, so that applications that depend on the stamp will
not tell the difference. I ask all applications to check the UUID.

   Hmm, I don't think copying the UUID to the new file is a good idea -
the UUID for each file created by the HDF5 library should be unique. The
new file should get its own UUID...

This seems to come down to the question:
Does the UUID refer to:
a) the file at a point in time, irrespective of contents.
b) the file contents at a point in time.
c) the file contents irrespective of time.
d) whatever the user wants it to refer to as 'unique'.

I may misunderstand the intended usage. However it would be useful to
document which is the intended use case.

I'd be interested in (d) since it opens the scope for domain specific
use cases. I think the first 3 can be accommodated in some way by OS
level data?
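The difference between an identifier for the file as created (roughly option a) and one for its contents (option c) can be sketched as follows. This is an illustration in Python with the standard library only; `instance_id` and `content_id` are hypothetical names, and neither corresponds to an existing HDF5 feature.

```python
import hashlib
import uuid


def content_id(data: bytes) -> str:
    """Identifier derived only from the bytes: same contents, same ID."""
    return hashlib.sha256(data).hexdigest()


def instance_id() -> str:
    """Identifier assigned at creation: unrelated to the contents."""
    return str(uuid.uuid4())


payload = b"same bytes in both files"

# Two files created separately that happen to hold identical contents.
file_a = {"instance": instance_id(), "content": content_id(payload)}
file_b = {"instance": instance_id(), "content": content_id(payload)}

print(file_a["instance"] == file_b["instance"])  # False: distinct files
print(file_a["content"] == file_b["content"])    # True: identical contents
```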

Mark

···

3) Not sure I agree with the desirability implied by the objection noted. Having it non-opaque would be desirable for provenance.
I think version 4 UUID in addition to the creation date, mac/computer name, & user account.

  We could allow an application to choose which type of UUID to store. I've filed a bug for adding a UUID to a file and will amend it to suggest giving the application the choice of which version of the UUID to store.

It would be good to put two UUIDs in the datasets as part of

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfo

The first UUID would be written once, during creation (UUID-DS-birth, corresponding to birth time).

The second UUID would be updated every time the dataset is written (UUID-DS-modified, corresponding to modification or change time).

This all might be unnecessary if the time_t value is inherently unique.

If I copy the HDF5 "internal" UUID from one file to another along with the accessible contents, I do not need to manipulate the time stamp of the target file to be the same with that of the source file, so that applications that depend on the stamp will not tell the difference. I ask all applications to check the UUID.

  Hmm, I don't think copying the UUID to the new file is a good idea - the UUID for each file created by the HDF5 library should be unique. The new file should get its own UUID...

I agree if you are doing it through the HDF API.

If you are doing it with a cp a.hdf b.hdf, then it is unavoidable.

So there are several potential states of a HDF file UUID:

1) the original file with the original UUID
2) a duplicate file with the original UUID (e.g. cp), that is byte for byte identical.
3) a duplicate file with a different UUID, but other than the UUID is byte for byte identical
4) a duplicate file with a different UUID, but has been modified (e.g. new datasets added).
5) the original file with the original UUID, but has been later modified (e.g. new datasets added).

the main concern is ending up with hdf files that have the same UUID and have different data, with no way to figure what the relationship is between them.
A lesser concern is having two files that are identical, except for UUID, with no way to figure what the relationship is between them.
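As a rough illustration of how the states above could be told apart, the sketch below pairs a file UUID with a digest of the data and compares two such pairs. It is written in Python with the standard library only and rests on assumptions: the UUID is something the file would have to store (no such field exists today), the digest is taken over the contents excluding that UUID, and `FileIdentity`, `identify`, and `relationship` are hypothetical names.

```python
import hashlib
import uuid
from typing import NamedTuple


class FileIdentity(NamedTuple):
    file_uuid: str    # hypothetical UUID stored in the file
    data_digest: str  # digest of the contents, excluding the UUID itself


def identify(file_uuid: str, data: bytes) -> FileIdentity:
    return FileIdentity(file_uuid, hashlib.sha256(data).hexdigest())


def relationship(a: FileIdentity, b: FileIdentity) -> str:
    same_uuid = a.file_uuid == b.file_uuid
    same_data = a.data_digest == b.data_digest
    if same_uuid and same_data:
        return "identical copy"
    if same_uuid:
        return "same UUID, different data"
    if same_data:
        return "different UUID, identical data"
    return "different UUID, different data"


original_uuid = str(uuid.uuid4())
original = identify(original_uuid, b"dataset-1")
copied = identify(original_uuid, b"dataset-1")              # e.g. cp a.hdf b.hdf
modified = identify(original_uuid, b"dataset-1 dataset-2")  # changed after the copy
rewritten = identify(str(uuid.uuid4()), b"dataset-1")       # copied through an API

print(relationship(original, copied))     # identical copy
print(relationship(original, modified))   # same UUID, different data
print(relationship(original, rewritten))  # different UUID, identical data
```

The second case is the main concern raised above: the comparison detects that the files differ, but by itself it cannot say which one changed or when, which is where a modification log would come in.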

possible methods to manage the states

1) keep a log within the HDF file, noting the date every time it was opened
2) keep a log within the HDF file, noting the date every time it was modified.

If I am doing a high-level copy of a dataset from one HDF file to another (not sure if there is an API call for this), it would be useful to maintain the original UUID-DS-birth, but the UUID-DS-modified would change.

This is as opposed to creating a new dataset, reading the data out of the old dataset, writing it into the new dataset, and doing a close.

Matt

···

Hi Matthew,

1) should be allowable to have more than one UUID.
They may be independent of each other, and added at different times.

  Hmm, what do you mean here? Below you were asking for a single unique ID for each HDF5 file...

1) other scientific groups may have their own UID schemes (eg, LSID-life science IDs, DOI)

  Yes, they could add those IDs to objects as metadata. Perhaps we could come up with a suggested standard, but I don't think we could easily work with every group's scheme in particular.

2) definitely need an HDF created UID that is not easy to change.

  Yes.

3) to track the provenance of an HDF file might be accomplished by logging a UID (eg time) when an HDF file is opened.
then the HDF file has collection of open times, which are unique to that file.
If any write activity occurs after opening, then the open UID is flagged as such.
when a file is copied, then the files diverge and are identified by different open UIDs.

such a provenance scheme should be automatic and optional.
some instances you don't want the overhead, such as you might be doing a million opens.
In such a case you get one UID when the HDF was created.

  Good ideas toward provenance features, yes.

  We could allow an application to choose which type of UUID to store. I've filed a bug for adding a UUID to a file and will amend it to suggest giving the application the choice of which version of the UUID to store.

sounds good, would like to have one to choose from that is not opaque, should include time, computer, username.
audit trails are key to provenance.
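To illustrate the difference between an opaque and a non-opaque UUID, here is a small sketch using Python's standard `uuid` module (an illustration only, not anything the HDF5 library does). A version 4 UUID is random, while a version 1 UUID embeds a timestamp and a node ID (usually a MAC address). No standard UUID version carries a username, so that would have to be stored separately. The variable names are made up for the example.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 4: random bits only, so nothing can be read back out of it.
opaque = uuid.uuid4()

# Version 1: built from a 60-bit timestamp plus a 48-bit node ID
# (usually a MAC address, or a random value if none can be found).
traceable = uuid.uuid1()

# The v1 timestamp counts 100-nanosecond intervals since 1582-10-15 UTC.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)
created_at = GREGORIAN_EPOCH + timedelta(microseconds=traceable.time // 10)

# The node ID as 12 hex digits, identifying the generating machine.
node_hex = f"{traceable.node:012x}"

print(opaque.version, traceable.version)
print(created_at.isoformat(), node_hex)
```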

  I think we are working to different purposes here. I'm just trying to get a unique ID into the file (and perhaps for each object) and don't want to tie it into any provenance effort. I also want to pursue the provenance idea, but it should be a separate, probably higher-level, project (which might use the UUID for some purpose).

4) have a non-changeable flag, set at HDF creation, that would override calls to H5Pset_obj_track_times by ignoring a 'track_times' parameter set to FALSE.
set it at creation and modifications & changes are always noted.

  Hmm, I don't think that's very helpful, really. We don't have any other "override" properties like this...

main concern is the audit trail gets turned off.

  Sure, I understand.

    Quincey

···

On Mar 24, 2009, at 6:30 PM, Matthew Dougherty wrote:

On Mar 24, 2009, at 5:02 PM, Quincey Koziol wrote:

Hi Mark,

Hi Dimitris,

Hi Quincey

sorry it was not clear. I meant the UUID. If I copy the HDF5 "internal"
UUID from one file to another along with the accessible contents, I do not
need to manipulate the time stamp of the target file to be the same as
that of the source file, so that applications that depend on the stamp will
not tell the difference. I ask all applications to check the UUID.

       Hmm, I don't think copying the UUID to the new file is a good idea -
the UUID for each file created by the HDF5 library should be unique. The
new file should get its own UUID...

This seems to come down to the question:
Does the UUID refer to:
a) the file at a point in time, irrespective of contents.
b) the file contents at a point in time.
c) the file contents irrespective of time.
d) whatever the user wants it to refer to as 'unique'.

I may misunderstand the intended usage. However it would be useful to
document which is the intended use case.

  Good points! :-) Yes, we should refine what this can be used for before implementing it. I was mostly thinking of case c) where the UUID just allows one HDF5 file to be differentiated from any other one (and the same idea for each object, if the application chooses to store a UUID for the object). Of course, a user could just copy the file entirely and then modify it, but that would be outside of the "system", since it was [at least partially] done outside of the HDF5 library's knowledge.
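One possible way to picture the options in the list above is the contrast between an ID assigned once at creation and an ID derived from the contents. The sketch below uses only Python's standard library and is not tied to HDF5; `identity_uuid`, `content_uuid` and `CONTENT_NAMESPACE` are hypothetical names chosen for the example, and hashing the bytes with SHA-256 is an assumption, not something proposed in this thread.

```python
import hashlib
import uuid

# Hypothetical namespace for content-derived IDs; any fixed UUID would do.
CONTENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def identity_uuid() -> uuid.UUID:
    """ID assigned once at creation; it says nothing about the contents."""
    return uuid.uuid4()


def content_uuid(data: bytes) -> uuid.UUID:
    """ID derived only from the bytes; equal contents give equal IDs."""
    digest = hashlib.sha256(data).hexdigest()
    return uuid.uuid5(CONTENT_NAMESPACE, digest)


payload = b"dataset bytes"
file_a = {"id": identity_uuid(), "data": payload}
file_b = {"id": identity_uuid(), "data": payload}   # created separately

# Separately created files differ by identity even with equal contents,
# while the content-derived ID changes as soon as the bytes change.
same_content = content_uuid(file_a["data"]) == content_uuid(file_b["data"])
changed = content_uuid(payload + b"!") != content_uuid(payload)
```

An identity-style ID survives modification of the file but is duplicated by a byte-for-byte copy made outside the library; a content-derived ID behaves the other way around.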

I'd be interested in (d) since it opens the scope for domain specific
use cases. I think the first 3 can be accommodated in some way by OS
level data?

  Hmm, I'm not certain if the first three could be handled easily when the file moves around. Case d) seems like it would be fraught with uncertainty for applications that attempted to use the UUID, unless it was an attribute that an application stored on the file/object.

  Quincey

···

On Mar 24, 2009, at 7:05 PM, Mark V wrote:

On Wed, Mar 25, 2009 at 9:23 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

On Mar 24, 2009, at 5:14 PM, Dimitris Servis wrote:

Mark

               Quincey

thanks!

-- dimitris

2009/3/24 Quincey Koziol <koziol@hdfgroup.org>
Hi Dimitris,

On Mar 24, 2009, at 3:41 PM, Dimitris Servis wrote:

Hi Quincey

if it would be possible to copy it, I would also be interested. My use
case is this: when deleting large objects from the file, I copy all objects
to a new file and delete the old one. But some applications may depend on
the time stamp of the file and therefore I would have to modify the time
stamp of the new file using OS functions. Any unique ID would solve this.

      Are you wanting to copy objects' access/modification/etc times, or
the UUID for the file?

              Quincey

thanks!

-- dimitris

2009/3/24 Quincey Koziol <koziol@hdfgroup.org>
Hi Matthew,

On Mar 24, 2009, at 1:32 PM, Matthew Dougherty wrote:

Is there a unique identifier internally within an HDF file when it is
created?

Something that would distinguish it from another HDF file created two
seconds later, and cannot be disabled or modified.
Do not want to use the operating system/file system creation date or
parameters in fstat, prefer a HDF api.

     That's a good idea... Would a UUID fit what you are thinking?
(http://en.wikipedia.org/wiki/Uuid)

             Quincey

On Mar 24, 2009, at 5:11 AM, Quincey Koziol wrote:

     BTW, tracking these times for an object can be disabled by calling
H5Pset_obj_track_times() with the 'track_times' parameter set to FALSE.