
[CF-metadata] Indicating data lineage or provenance

From: John Graybeal <graybeal>
Date: Sat, 6 Jan 2007 18:04:46 -0800

Roy,

Based on our experience so far with provenance-aware data systems, I suspect it is a very good (read: powerful) solution to this problem.

There are multiple standards that encode provenance information. Would the CF project support an evaluation of the application of those standards; or are you looking for an embedded (into netCDF) solution; or is that a question to be discussed on the TWiki?

Note also that one aspect of Mike's requirement, namely referencing a subset of a data set, is not as fully addressed (as far as I know); participants in a recent AGU session hope to kick off a discussion on this topic. But we have imagined some reasonably effective approaches using existing encoding standards.

John


At 4:20 PM +0000 1/6/07, Roy Lowry wrote:
>Dear All,
>
>This issue is also of great concern to the SeaDataNet project, particularly in the case where multiple operational centres have grabbed a common raw dataset off the GTS and processed it independently, creating 'near duplicates' that are difficult to identify. Standardised encoded provenance metadata has occurred to me as a possible solution to this problem.
>
>We all seem to need the same thing, so I think collaboration is the order of the day. Could this be a candidate for a CF Twiki project advertised to other interested communities?
>
>Cheers, Roy.
>
>>>> John Graybeal <graybeal at mbari.org> 01/05/07 12:47 AM >>>
>To provide some data in response to Mike's question, and then a question of my own:
>
>I, along with Maureen Edwards of the UK, am tasked by OceanSITES with presenting a nominal solution to provenance in netCDF. How far we can get, and how quickly, is definitely TBD, but the notion I have devolves to separate files. (Yes, I do hate that, but provenance on a whole mooring system is pretty complicated to put into a netCDF file.) So I'd probably suggest a link (URL) from netCDF to a registered SensorML instance (registrations of which are being pursued on another project I'm involved with). Similar to Mike's solution but with important differences.
>
>One point being, this is a more general problem than just model provenance. Observation and processing provenance is also desirable to represent in netCDF files.
>
>So the question is, how much of this does the CF standard want to take on directly, and how much does it want to defer to other standards or efforts?
>
>(No I really didn't put Mike up to this, and he really is only 8 doors from me. But neither of us knew...)
>
>John
>
>At 4:31 PM -0800 1/4/07, Godin, Michael wrote:
>>
>>I am heartened by all the work this group has put into standardizing the metadata for representing multiple models as an ensemble. However, a particularly thorny issue has been for the most part ignored (I think it has been called a "nightmare"), so I'd like to see if some of the list participants would be willing to work together to form a proposal for indicating the provenance of derived data (for example, initial conditions, larger nested grids, and assimilated data that go into models).
>>
>>So here are the (draft) requirements that I believe need to be addressed:
>>- derived data users need to be provided the information they need to understand the differences between data (covering the same temporal/spatial region) from different models and different realizations of the same model.
>>- skeptics (public, governmental, other modelers, observationalists) should be able to request specific observational data that went into a model realization (granted, the request may be for data that would not otherwise be made publicly available).
>>- the specification of source data should not only indicate the source data files (or URLs) and variables, but also the temporal/spatial/realization bounds on the supplied data.
>>
>>I don't know if such a set of requirements can be addressed in a netCDF file, or if it would require a link to an external XML (or other format) file. I am also unsure if any other community has solved the above set of requirements - both the OGC's Layer definition within their Web Map Context Document standard, and the FGDC's Lineage definition within their Content Standard for Digital Geospatial Metadata allow one to specify a lot of metadata about lineage and provenance, but neither really meets the requirements above.
>>
>>My initial thought for doing this within a netCDF file would be to specify a global multi-line string attribute called something like "lineage" or "provenance" and populate it with a series of DAP2.0-like URIs (of course, this would not be global in the case of ensembles -- it would have to be a 3D set of strings!). The DAP2.0 URIs would not have to be publicly accessible, and the syntax would have to allow combinations of hyperslab operators and queries -- which I do not believe any DAP server actually allows -- but would allow one to specify precise data ranges.
>>
>>Thanks for your consideration,
>>Mike
>>
>>_____________________________________________
>>
>>Michael A. Godin
>>
>>Software Engineer
>>
>>Monterey Bay Aquarium Research Institute
>>
>>Phone: 831-775-2063
>>http://www.mbari.org
>>
>>
>>
>>_______________________________________________
>>CF-metadata mailing list
>>CF-metadata at cgd.ucar.edu
>>http://www.cgd.ucar.edu/mailman/listinfo/cf-metadata
>
>
>--
>----------
>John Graybeal <mailto:graybeal at mbari.org> -- 831-775-1956
>Monterey Bay Aquarium Research Institute
>Marine Metadata Initiative: http://marinemetadata.org || Shore Side Data System: http://www.mbari.org/ssds


-- 
----------
John Graybeal   <mailto:graybeal at mbari.org>  -- 831-775-1956
Monterey Bay Aquarium Research Institute
Marine Metadata Initiative: http://marinemetadata.org   ||  Shore Side Data System: http://www.mbari.org/ssds
Received on Sat Jan 06 2007 - 19:04:46 GMT
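
Mike's suggestion quoted above, a global multi-line string attribute called something like "provenance" and populated with DAP2.0-like URIs, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `SourceRef` class, the `provenance_attribute` helper, the example.org URLs, and the variable names are all hypothetical, and appending a selection clause to a hyperslab projection is the kind of combination that, as Mike notes, DAP servers may not actually support.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Assumption: one (start, stride, stop) index triple per dimension,
# rendered in DAP2 hyperslab form as [start:stride:stop].
Hyperslab = Tuple[int, int, int]


@dataclass(frozen=True)
class SourceRef:
    """One source dataset that fed a derived product (hypothetical type)."""
    url: str                          # dataset URL; need not be publicly reachable
    variable: str                     # variable name within that dataset
    slabs: Sequence[Hyperslab] = ()   # index bounds, one triple per dimension
    selection: Optional[str] = None   # hypothetical query clause, e.g. "realization=3"

    def to_uri(self) -> str:
        """Render the reference as a DAP2-like constraint-expression URI."""
        projection = self.variable + "".join(
            f"[{start}:{stride}:{stop}]" for start, stride, stop in self.slabs
        )
        uri = f"{self.url}?{projection}"
        if self.selection:
            uri += f"&{self.selection}"
        return uri


def provenance_attribute(sources: Sequence[SourceRef]) -> str:
    """Join the source URIs into one newline-separated attribute value."""
    return "\n".join(src.to_uri() for src in sources)


# Made-up sources: an observational subset and a model initial condition.
sources = [
    SourceRef("http://example.org/dap/sst_obs.nc", "sst",
              [(0, 1, 23), (10, 1, 40), (5, 1, 60)]),
    SourceRef("http://example.org/dap/model_ic.nc", "temp",
              [(0, 1, 0)], selection="realization=3"),
]
attribute_value = provenance_attribute(sources)
print(attribute_value)
```

The resulting string could then be stored as a global attribute with whatever netCDF library is in use (for instance `Dataset.setncattr("provenance", attribute_value)` in netCDF4-python). For ensembles, where Mike points out that a single global attribute is not enough, one such string per member would be needed.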
