provenance

It would be good to put two UUIDs in the datasets as part of

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfo

the first UUID would be written once, during creation.
(UUID-DS-birth, corresponding to birth time)

This is a good idea.

The second UUID would be updated every time the dataset is written.
(UUID-DS-modified, corresponding to mod or change time)

It would be nice if this were optional. On high-usage datasets,
keeping it updated might add unwanted drive head movement.
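
Nothing stores per-dataset UUIDs in the library today, but the convention is easy to prototype from the application side. A minimal sketch, assuming libuuid; the attribute names come from this thread, and the helper itself is hypothetical:

#include <hdf5.h>
#include <uuid/uuid.h>

/* Write (or overwrite) a 36-character UUID string attribute on an object. */
static void set_uuid_attr(hid_t obj, const char *name)
{
    uuid_t uu;
    char   text[37];                 /* 36 characters plus NUL */

    uuid_generate(uu);
    uuid_unparse(uu, text);

    hid_t space = H5Screate(H5S_SCALAR);
    hid_t type  = H5Tcopy(H5T_C_S1);
    H5Tset_size(type, sizeof(text));

    if (H5Aexists(obj, name) > 0)    /* replace any previous value */
        H5Adelete(obj, name);
    hid_t attr = H5Acreate2(obj, name, type, space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, type, text);

    H5Aclose(attr);
    H5Tclose(type);
    H5Sclose(space);
}

/* At creation:       set_uuid_attr(dset, "UUID-DS-birth");
 * after each write:  set_uuid_attr(dset, "UUID-DS-modified");
 * the latter call is the optional, I/O-adding part. */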

···

This all might be unnecessary if the time_t value is
inherently unique.


Hi Matt,

3) Not sure I agree with the desirability implied by the objection noted. Having it non-opaque would be desirable for provenance.
I'm thinking of a version 4 UUID in addition to the creation date, MAC/computer name, & user account.

  We could allow an application to choose which type of UUID to store. I've filed a bug for adding a UUID to a file and will amend it to suggest giving the application the choice of which version of the UUID to store.
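
For what it's worth, libuuid already distinguishes the two flavors under discussion; a small sketch (how the HDF5 library would expose this choice is still undefined, so this is only an illustration):

#include <stdio.h>
#include <uuid/uuid.h>

int main(void)
{
    uuid_t uu;
    char   text[37];

    uuid_generate_random(uu);     /* version 4: opaque, purely random */
    uuid_unparse(uu, text);
    printf("v4: %s\n", text);

    uuid_generate_time(uu);       /* version 1: encodes timestamp and MAC, */
    uuid_unparse(uu, text);       /* i.e. the non-opaque, traceable kind   */
    printf("v1: %s\n", text);
    return 0;
}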

It would be good to put two UUIDs in the datasets as part of

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfo

the first UUID would be written once, during creation. (UUID-DS-birth, corresponding to birth time)

The second UUID would be updated every time the dataset is written. (UUID-DS-modified, corresponding to mod or change time)

  Hmm, I'm still reluctant to add a UUID to each object... Wouldn't the offset of the object in the file combined with its birth/change time be unique?

This all might be unnecessary if the time_t value is inherently unique.

  Unfortunately, time_t's are in units of seconds, so it's certainly possible to create many objects that would have the same timestamp. :-/

If I copy the HDF5 "internal" UUID from one file to another along with the accessible contents, I do not need to manipulate the time stamp of the target file to be the same as that of the source file, so that applications that depend on the stamp will not tell the difference. I ask all applications to check the UUID.

  Hmm, I don't think copying the UUID to the new file is a good idea - the UUID for each file created by the HDF5 library should be unique. The new file should get its own UUID...

I agree if you are doing it through the HDF API.

If you are doing it with a cp a.hdf b.hdf, then it is unavoidable.

  Yup. :frowning:

So there are several potential states of an HDF file UUID:

1) the original file with the original UUID
2) a duplicate file with the original UUID (e.g. cp) that is byte-for-byte identical.
3) a duplicate file with a different UUID that is otherwise byte-for-byte identical.
4) a duplicate file with a different UUID that has been modified (e.g. new datasets added).
5) the original file with the original UUID that has later been modified (e.g. new datasets added).

The main concern is ending up with HDF files that have the same UUID but different data, with no way to figure out the relationship between them.

  Yes, that would be my primary concern also.

A lesser concern is having two files that are identical except for the UUID, with no way to figure out the relationship between them.

  Agree.

Possible methods to manage the states:

1) keep a log within the HDF file, noting the date every time it was opened
2) keep a log within the HDF file, noting the date every time it was modified.

  I imagine that just keeping the last modification time would be sufficient to distinguish between files with the same UUID. (Although the other information might be useful in certain circumstances where a stronger log of provenance was desired).
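
Such a log is also something an application can keep today, e.g. as an append-only, extensible 1-D dataset of timestamps grown by one element per open/modify event. A sketch; the dataset name and layout are assumptions, not an HDF5 feature:

#include <hdf5.h>
#include <time.h>

static void append_log_time(hid_t file, const char *logname)
{
    hid_t dset;

    if (H5Lexists(file, logname, H5P_DEFAULT) > 0) {
        dset = H5Dopen2(file, logname, H5P_DEFAULT);
    } else {                          /* first event: create the empty log */
        hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = 16;
        hid_t space = H5Screate_simple(1, &dims, &maxdims);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, &chunk);
        dset = H5Dcreate2(file, logname, H5T_NATIVE_LLONG, space,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Pclose(dcpl);
        H5Sclose(space);
    }

    /* Grow the log by one element and write the current time into it. */
    hid_t fspace = H5Dget_space(dset);
    hsize_t n;
    H5Sget_simple_extent_dims(fspace, &n, NULL);
    H5Sclose(fspace);

    hsize_t newsize = n + 1, one = 1;
    H5Dset_extent(dset, &newsize);
    fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &n, NULL, &one, NULL);
    hid_t mspace = H5Screate_simple(1, &one, NULL);

    long long now = (long long)time(NULL);
    H5Dwrite(dset, H5T_NATIVE_LLONG, mspace, fspace, H5P_DEFAULT, &now);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
}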

If I am doing a high-level copy of a dataset from one HDF file to another (not sure if there is an API call for this)...

  Yes, we introduced H5Ocopy() in HDF5 1.8.0. :slight_smile:

...it would be useful to maintain the original UUID-DS-birth, but the UUID-DS-modified would change.

  Hmm, I think we should make it optional which times to preserve - similar to the '-p' flag for UNIX cp, only possibly with some more control about what aspects to preserve.

As opposed to creating a new dataset, reading the data out of the old dataset, writing it into the new dataset, and doing a close.

  Yes, the HDF5 library wouldn't have any idea of what you were doing in that case.

    Quincey

···

On Apr 3, 2009, at 5:02 PM, Matthew Dougherty wrote:

Hi Quincey & Matthew,

3) Tracking the provenance of an HDF file might be accomplished by
logging a UID (e.g. a time) when the HDF file is opened.
The HDF file then has a collection of open times, which are unique to
that file.
If any write activity occurs after opening, then the open UID is
flagged as such.
When a file is copied, the files diverge and are identified by
different open UIDs.

Such a provenance scheme should be automatic and optional.
In some instances you don't want the overhead, such as when you might
be doing a million opens.
In such a case you get one UID when the HDF file was created.

  Good ideas toward provenance features, yes.

I'm wondering how "doable" it is to add/implement something like a list
of UUID's instead of a single one. Maybe it's not more effort to add many
than to add a single one. Especially with copied files I'm thinking about
something like a hierarchical scheme here, where there is a unique UUID
per file, but there are also "traces" of other UUID's from previous files
from which this one has been copied. Of course, such a copy operation would
need to be done by some HDF5 API call; a unix copy would not do that
(a unix copy would screw up any uniqueness of HDF5 file UUID's anyway...).

I could imagine a creation property of some H5Fcreate() call, where you
add some id's of HDF5 source files, and all their UUID's are included
in the newly created HDF5 file as "children" of the new-born unique one.

Maybe it's too much effort, maybe it's easy...?
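
Pending any such H5Fcreate() property, one way the "traces" could look as a pure application convention: copy the parents' UUID strings into a string-array attribute on the new file's root group. All names here are made up for illustration:

#include <hdf5.h>

/* Record the UUIDs of the files this one was derived from. */
static void record_parents(hid_t newfile, const char **parent_uuids, int n)
{
    hsize_t dims = (hsize_t)n;
    hid_t space = H5Screate_simple(1, &dims, NULL);
    hid_t type  = H5Tcopy(H5T_C_S1);
    H5Tset_size(type, H5T_VARIABLE);            /* variable-length strings */

    /* An attribute created on the file id lands on the root group. */
    hid_t attr = H5Acreate2(newfile, "UUID-parents", type, space,
                            H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, type, parent_uuids);         /* buffer is an array of char* */

    H5Aclose(attr);
    H5Tclose(type);
    H5Sclose(space);
}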

  We could allow an application to choose which type of UUID to
store. I've filed a bug for adding a UUID to a file and will amend
it to suggest giving the application the choice of which version of
the UUID to store.

Sounds good. I would like to have one to choose from that is not
opaque; it should include time, computer, and username.
Audit trails are key to provenance.

  I think we are working to different purposes here. I'm just trying
to get a unique ID into the file (and perhaps for each object) and
don't want to tie it into any provenance effort. I also want to
pursue the provenance idea, but it should be a separate, probably
higher-level, project (which might use the UUID for some purpose).

Anonymization of data is as important as keeping personalized information
as long as possible, especially for medical data. I'd think an anonymous
UUID can always be stored; a personal one that allows tracing down
the source and generator (personal computer, IP/MAC address, exact time
of creation) might then be an optional addition. During an anonymization
process, the personalized UUID could be removed from the file (e.g. during
a copy process) and stored in a high-security external database. Just some
ideas...

  Werner

···

--
___________________________________________________________________________
Dr. Werner Benger <werner@cct.lsu.edu> Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Hi Ray,

···

On Apr 3, 2009, at 7:23 PM, Ray Burkholder wrote:

It would be good to put two UUIDs in the datasets as part of

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfo

the first UUID would be written once, during creation.
(UUID-DS-birth, corresponding to birth time)

This is a good idea.

The second UUID would be updated every time the dataset is written.
(UUID-DS-modified, corresponding to mod or change time)

It would be nice if this were optional. On high-usage datasets,
keeping it updated might add unwanted drive head movement.

  Definitely. Something like the 'noatime' option for mounting file systems in UNIX, only applying to the access/change/modification times on objects in the HDF5 file.

    Quincey

Hi all, my 2 pennies again on that. To start with, my main 2 concerns besides
provenance are file deflation and ACID properties. For provenance a UUID is
needed mostly at the file level.

It would be good to put two UUIDs in the datasets as part of

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfo

the first UUID would be written once, during creation. (UUID-DS-birth,
corresponding to birth time)

The second UUID would be updated every time the dataset is written.
(UUID-DS-modified, corresponding to mod or change time)

       Hmm, I'm still reluctant to add a UUID to each object... Wouldn't
the offset of the object in the file combined with its birth/change time be
unique?

I also agree and maybe don't see the point in that... an application cannot
fully rely on that because there is room for improvement in the way write
failures are dealt with at the moment.

This all might be unnecessary if the time_t value is inherently unique.

       Unfortunately, time_t's are in units of seconds, so it's certainly
possible to create many objects that would have the same timestamp. :-/

Still I think either the offset of the object in the file (which is somewhat
volatile, I guess) or a hash of the creation path at creation time plus the
object header can uniquely identify the object.

If I copy the HDF5 "internal" UUID from one file to another along with the
accessible contents, I do not need to manipulate the time stamp of the
target file to be the same as that of the source file, so that
applications that depend on the stamp will not tell the difference. I ask
all applications to check the UUID.

       Hmm, I don't think copying the UUID to the new file is a good idea
- the UUID for each file created by the HDF5 library should be unique. The
new file should get its own UUID...

Then, please provide a function in H5F that deep-copies the root object
along with the UUID to a new file in an atomic transaction... this is
currently the only, albeit expensive, way to deflate files.

So there are several potential states of an HDF file UUID:

1) the original file with the original UUID
2) a duplicate file with the original UUID (e.g. cp) that is byte-for-byte identical.
3) a duplicate file with a different UUID that is otherwise byte-for-byte identical.
4) a duplicate file with a different UUID that has been modified (e.g. new datasets added).
5) the original file with the original UUID that has later been modified (e.g. new datasets added).

The main concern is ending up with HDF files that have the same UUID but
different data, with no way to figure out the relationship between
them.

       Yes, that would be my primary concern also.

There are drawbacks that come with the benefits of having stand-alone files
that don't come with a full-fledged management system. I guess what's
even worse is ending up with any number of files with the same UUID and
different data. If there is no system to actually keep track of the files in
its own space, and of course audit trails, I am afraid it is difficult to
guarantee consistency...

Best regards and have a nice Sunday (finally without snow)

--dimitris

Quoting Dimitris Servis <servisster@gmail.com>:

There are drawbacks that come with the benefits of having stand-alone files
that don't come with a full-fledged management system. I guess what's
even worse is ending up with any number of files with the same UUID and
different data. If there is no system to actually keep track of the files in
its own space, and of course audit trails, I am afraid it is difficult to
guarantee consistency...

So with all the talk of UUID's and where to put them (or not to put them), it
may be acceptable to have a single UUID at the root level, assigned at file
creation time, and then perhaps to have application-specific object-level
UUID's created as optional object attributes and created/managed in an
application-specific manner.
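
That minimal scheme is already expressible with the current API; a sketch, assuming libuuid and a hypothetical root-level attribute name:

#include <hdf5.h>
#include <uuid/uuid.h>

int main(void)
{
    uuid_t uu;
    char   text[37];

    uuid_generate_random(uu);          /* one v4 UUID, assigned once */
    uuid_unparse(uu, text);

    hid_t file  = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate(H5S_SCALAR);
    hid_t type  = H5Tcopy(H5T_C_S1);
    H5Tset_size(type, sizeof(text));

    /* An attribute on the file id lands on the root group. */
    hid_t attr = H5Acreate2(file, "UUID-file", type, space,
                            H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, type, text);

    H5Aclose(attr);
    H5Tclose(type);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}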

···


Hi Ray,

There are drawbacks that come with the benefits of having stand-alone files
that don't come with a full-fledged management system. I guess what's
even worse is ending up with any number of files with the same UUID and
different data. If there is no system to actually keep track of the files in
its own space, and of course audit trails, I am afraid it is difficult to
guarantee consistency...

So with all the talk of UUID's and where to put them (or not to put them), it
may be acceptable to have a single UUID at the root level, assigned at file
creation time, and then perhaps to have application-specific object-level
UUID's created as optional object attributes and created/managed in an
application-specific manner.

In principle the whole issue can be handled in an application-specific
manner using attributes or comments. However, the library can, to some
extent, guarantee atomicity and consistency when generating ids and
contents. Maybe to that extent it would be more useful to leave open ends
for a third-party system to manage transactions, track UUIDs and maintain
audit trails.

Regards

-- dimitris

Hi Dimitris,

Hi all, my 2 pennies again on that. To start with, my main 2 concerns besides provenance are file deflation and ACID properties. For provenance a UUID is needed mostly at the file level.

It would be good to put two UUIDs in the datasets as part of

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfo

the first UUID would be written once, during creation. (UUID-DS-birth, corresponding to birth time)

The second UUID would be updated every time the dataset is written. (UUID-DS-modified, corresponding to mod or change time)

      Hmm, I'm still reluctant to add a UUID to each object... Wouldn't the offset of the object in the file combined with its birth/change time be unique?

I also agree and maybe don't see the point in that... an application cannot fully rely on that because there is room for improvement in the way write failures are dealt with at the moment.

  FYI - we are making progress (somewhat slowly at the moment) on adding metadata journaling to the HDF5 library, which might relieve some of your concerns on this aspect.

This all might be unnecessary if the time_t value is inherently unique.

      Unfortunately, time_t's are in units of seconds, so it's certainly possible to create many objects that would have the same timestamp. :-/

Still I think either the offset of the object in the file (which is somewhat volatile, I guess) or a hash of the creation path at creation time plus the object header can uniquely identify the object.

  The offset of an object isn't volatile, strictly speaking, but it is possible to delete an object and then create another, which might end up getting assigned the same offset in the file. Probably a hash of the object header address and the creation time would be useful enough as a unique identifier of objects in a file (as long as objects weren't being created & deleted & re-created faster than once a second :slight_smile:
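
A sketch of that fingerprint, using the 1.8-era H5Oget_info() signature; the FNV-1a hash is an arbitrary choice here, and note that btime may be zero unless birth-time tracking is enabled on the object's file:

#include <stddef.h>
#include <stdint.h>
#include <hdf5.h>

/* Derive a per-object identifier from header address + creation time. */
static uint64_t object_fingerprint(hid_t obj)
{
    H5O_info_t info;
    H5Oget_info(obj, &info);

    uint64_t words[2] = { (uint64_t)info.addr, (uint64_t)info.btime };
    const unsigned char *p = (const unsigned char *)words;

    uint64_t h = 1469598103934665603ULL;        /* FNV-1a 64-bit offset */
    for (size_t i = 0; i < sizeof(words); i++) {
        h ^= p[i];
        h *= 1099511628211ULL;                  /* FNV-1a 64-bit prime  */
    }
    return h;
}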

If I copy the HDF5 "internal" UUID from one file to another along with the accessible contents, I do not need to manipulate the time stamp of the target file to be the same as that of the source file, so that applications that depend on the stamp will not tell the difference. I ask all applications to check the UUID.

      Hmm, I don't think copying the UUID to the new file is a good idea - the UUID for each file created by the HDF5 library should be unique. The new file should get its own UUID...

Then, please provide a function in H5F that deep-copies the root object along with the UUID to a new file in an atomic transaction... this is currently the only, albeit expensive, way to deflate files.

  Have you looked at the new H5Ocopy() routine, introduced with 1.8.0? That might do what you want.

So there are several potential states of an HDF file UUID:

1) the original file with the original UUID
2) a duplicate file with the original UUID (e.g. cp) that is byte-for-byte identical.
3) a duplicate file with a different UUID that is otherwise byte-for-byte identical.
4) a duplicate file with a different UUID that has been modified (e.g. new datasets added).
5) the original file with the original UUID that has later been modified (e.g. new datasets added).

The main concern is ending up with HDF files that have the same UUID but different data, with no way to figure out the relationship between them.

      Yes, that would be my primary concern also.

There are drawbacks that come with the benefits of having stand-alone files that don't come with a full-fledged management system. I guess what's even worse is ending up with any number of files with the same UUID and different data. If there is no system to actually keep track of the files in its own space, and of course audit trails, I am afraid it is difficult to guarantee consistency...

  I agree. But, that's not what the HDF5 library is providing. The iRODS project (www.irods.org) might fit the bill though, and we've been working on integrating HDF5 into their infrastructure.

Best regards and have a nice Sunday (finally without snow)

  Hah! We just got another snowfall here in Champaign this morning. :slight_smile:

    Quincey

···

On Apr 5, 2009, at 8:15 AM, Dimitris Servis wrote:

Hi Quincey,

in principle I agree with all your comments, and the news on journaling is
definitely welcome.

       The offset of an object isn't volatile, strictly speaking, but it is
possible to delete an object and then create another, which might end up
getting assigned the same offset in the file. Probably a hash of the object
header address and the creation time would be useful enough as a unique
identifier of objects in a file (as long as objects weren't being created &
deleted & re-created faster than once a second :slight_smile:

I meant it may be volatile if one uses H5Ocopy.

       Have you looked at the new H5Ocopy() routine, introduced with 1.8.0?
That might do what you want.

Yes, I have implemented deflation like this: iterate over all objects under
root and use H5Ocopy(). However, this will not retain the UUID, and one
cannot simply write

H5Ocopy(handle_to_file_1,"/",handle_to_file_2,"/",H5P_DEFAULT,H5P_DEFAULT);

It would be useful to have a function like H5Fclose_deflate(hid_t f_id)
that would close the file and internally copy all objects to a new file,
retain UUIDs, and swap filenames in one atomic transaction.
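
For reference, the iterate-and-copy deflation described above looks roughly like this with the existing 1.8 API. The final rename of the new file over the old one is the non-atomic step that a hypothetical H5Fclose_deflate() would need to make safe (and note that root-level attributes, UUID included, are not carried over):

#include <hdf5.h>

/* H5Literate callback: copy one root-level object into the target file. */
static herr_t copy_one(hid_t group, const char *name,
                       const H5L_info_t *info, void *op_data)
{
    hid_t dst = *(hid_t *)op_data;
    (void)info;
    return H5Ocopy(group, name, dst, name, H5P_DEFAULT, H5P_DEFAULT);
}

int deflate_file(const char *src_name, const char *dst_name)
{
    hid_t src = H5Fopen(src_name, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dst = H5Fcreate(dst_name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    herr_t ret = H5Literate(src, H5_INDEX_NAME, H5_ITER_NATIVE, NULL,
                            copy_one, &dst);

    H5Fclose(src);
    H5Fclose(dst);
    return (ret < 0) ? -1 : 0;   /* caller renames dst over src afterwards */
}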

There are drawbacks that come with the benefits of having stand-alone
files that don't come with a full-fledged management system. I guess
what's even worse is ending up with any number of files with the same UUID
and different data. If there is no system to actually keep track of the
files in its own space, and of course audit trails, I am afraid it is
difficult to guarantee consistency...

       I agree. But, that's not what the HDF5 library is providing. The
iRODS project (www.irods.org) might fit the bill though, and we've been
working on integrating HDF5 into their infrastructure.

My point exactly. It should definitely be a 3rd-party system, warehouse DB or
other. Maybe HDF5 could leave those open ends so that transactions are
controlled by an external application. If some kind of journaling is being
implemented, this could be part of it.

Best regards and have a nice Sunday (finally without snow)

       Hah! We just got another snowfall here in Champaign this morning.
:slight_smile:

:open_mouth: I thought we had the worst weather possible over here... wishing you all the
best (weather)

-- dimitris

Since stamps (UUID, etc.) cannot be guaranteed without going through the API, this does not guarantee provenance, and the whole idea seems fundamentally flawed. The question to me is whether application domains can support provenance within the current API (e.g. using a consistent attribute convention) or not. In other words, provenance support is a responsibility above HDF, not something intrinsic to it.

-SM-

···


Since stamps (UUID, etc.) cannot be guaranteed without going through the API,
this does not guarantee provenance, and the whole idea seems
fundamentally flawed. The question to me is whether application domains
can support provenance within the current API (e.g. using a consistent
attribute convention) or not. In other words, provenance support is a
responsibility above HDF, not something intrinsic to it.

I don't know enough to say it is a flawed idea, but I had found myself
thinking maybe HDF5 just wants to _conveniently_, and with good
performance, provide the functionality that some provenance tracking
system needs.

Before settling on anything it may be worthwhile discussing this
further with the Kepler provenance interest group, or at least seeing
if they have known use cases to form the basis for an initial feature
description:

https://kepler-project.org/developers/interest-groups/provenance-interest-group/archive/kepler-provenance-framework

I'm not aware of any other 'provenance' efforts.

HTH

Mark

···

On Tue, Apr 7, 2009 at 2:41 AM, Scott Murman <smurman@segosha.net> wrote:


Hi Mark,

Since stamps (UUID, etc.) cannot be guaranteed without going through the API,
this does not guarantee provenance, and the whole idea seems
fundamentally flawed. The question to me is whether application domains
can support provenance within the current API (e.g. using a consistent
attribute convention) or not. In other words, provenance support is a
responsibility above HDF, not something intrinsic to it.

I don't know enough to say it is a flawed idea, but I had found myself
thinking maybe HDF5 just wants to _conveniently_, and with good
performance, provide the functionality that some provenance tracking
system needs.

  I agree with you, but I think it's a fair amount of work that needs some concrete thought and effort (read: needs some funding :-). So, unless we can make some moves in that direction, it would take work by a group of HDF5 users or an outside organization to make some progress on provenance in HDF5.

Before settling on anything it may be worthwhile discussing this
further with the Kepler provenance interest group, or at least seeing
if they have known use cases to form the basis for an initial feature
description:

https://kepler-project.org/developers/interest-groups/provenance-interest-group/archive/kepler-provenance-framework

I'm not aware of any other 'provenance' efforts.

  Thanks for the pointer, it looks useful!

  Quincey

···

On Apr 6, 2009, at 7:17 PM, Mark V wrote:

On Tue, Apr 7, 2009 at 2:41 AM, Scott Murman <smurman@segosha.net> wrote:


Hi Mark,

Since stamps (UUID, etc.) cannot be guaranteed without going through the API,
this does not guarantee provenance, and the whole idea seems
fundamentally flawed. The question to me is whether application domains
can support provenance within the current API (e.g. using a consistent
attribute convention) or not. In other words, provenance support is a
responsibility above HDF, not something intrinsic to it.

I don't know enough to say it is a flawed idea, but I had found myself
thinking maybe HDF5 just wants to _conveniently_, and with good
performance, provide the functionality that some provenance tracking
system needs.

   I agree with you, but I think it's a fair amount of work that needs some concrete thought and effort (read: needs some funding :-). So, unless we can make some moves in that direction, it would take work by a group of HDF5 users or an outside organization to make some progress on provenance in HDF5.

Yep, I know the story well.
Provenance is not a need I have, but it appears the Kepler users have the
need and are already moving down that path. They do seem to mention data
provenance as a distinct facet of the provenance problem.

Perhaps they have the resources, if HDF5 scratches their itch?

HTH

Mark

···

On Wed, Apr 8, 2009 at 2:05 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

On Apr 6, 2009, at 7:17 PM, Mark V wrote:

On Tue, Apr 7, 2009 at 2:41 AM, Scott Murman <smurman@segosha.net> wrote:

Looking over the Kepler documents, it would appear that any provenance methods developed by HDF would operate at a different level and thus not be in conflict.

Kepler is more of a process-oriented system to manage scientific workflow, which needs to draw upon provenance methods of the data inputs.
If the data files do not have things like UUIDs, it will have limited methods to test the provenance of files.
It would be good to get the views of the Kepler developers on how HDF could pass this information on to Kepler, or more generically what would be desirable for a design.
Anyone want to contact them?

Regarding UUIDs, I would urge that they be automatically created not only at the file creation level, but also in the data object structure.

Because HDF files can contain more than a single dataset, these datasets can be extracted and transferred to other HDF files; so it would be good to know the origin/UUID of a dataset.

Unless there is a significant overhead or compatibility problem, what is the downside?

Keep it simple and stick with random UUIDs, unless someone has a convincing argument why the other methods have advantages requiring the additional complexity.

It would be good to have an RFC on this.
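
In the meantime, object-level stamping can be approximated from the application side with H5Ovisit() (part of the 1.8 H5O API); a sketch with a hypothetical attribute name:

#include <hdf5.h>
#include <uuid/uuid.h>

/* H5Ovisit callback: add a random UUID attribute where one is missing. */
static herr_t stamp(hid_t root, const char *name,
                    const H5O_info_t *info, void *op_data)
{
    (void)info; (void)op_data;
    hid_t obj = H5Oopen(root, name, H5P_DEFAULT);
    if (obj < 0)
        return -1;

    if (H5Aexists(obj, "UUID-object") <= 0) {
        uuid_t uu;
        char   text[37];
        uuid_generate_random(uu);
        uuid_unparse(uu, text);

        hid_t space = H5Screate(H5S_SCALAR);
        hid_t type  = H5Tcopy(H5T_C_S1);
        H5Tset_size(type, sizeof(text));
        hid_t attr = H5Acreate2(obj, "UUID-object", type, space,
                                H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, type, text);
        H5Aclose(attr);
        H5Tclose(type);
        H5Sclose(space);
    }
    H5Oclose(obj);
    return 0;
}

/* Usage: H5Ovisit(file_id, H5_INDEX_NAME, H5_ITER_NATIVE, stamp, NULL); */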

Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA

···

-----Original Message-----
From: Mark V [mailto:mvyver@gmail.com]
Sent: Tue 4/7/2009 6:57 PM
To: Quincey Koziol
Cc: hdf-forum Forum
Subject: Re: [hdf-forum] provenance/followup

Hm, it might actually create some performance overhead if, at each
modification of even a single number (e.g. a hyperslab write) into some
dataset, a new "modified" UUID is created on each write call...

But if we had UUID's on datasets, we might also want to have UUIDs
shared among datasets, since a bunch of datasets might be connected
in one way or another. For instance, some mesh data structure
requires at least two datasets for the coordinates and the triangle
information. One could also place both into the same group, and give a
UUID to that group.

  Werner

···

On Tue, 21 Apr 2009 05:41:35 -0500, Dougherty, Matthew T. <matthewd@bcm.tmc.edu> wrote:
