
[CF-metadata] Indicating data lineage or provenance

From: Roy Lowry <rkl>
Date: Sat, 06 Jan 2007 16:20:15 +0000

Dear All,

This issue is also of great concern to the SeaDataNet project, particularly in the case where multiple operational centres have grabbed a common raw dataset off the GTS and processed it independently, creating 'near duplicates' that are difficult to identify. Standardised encoded provenance metadata has occurred to me as a possible solution to this problem.

We all seem to need the same thing, so I think collaboration is the order of the day. Could this be a candidate for a CF Twiki project advertised to other interested communities?

Cheers, Roy.

>>> John Graybeal <graybeal at mbari.org> 01/05/07 12:47 AM >>>
To provide some data in response to Mike's question, and then a question of my own:

I, along with Maureen Edwards of the UK, am tasked by OceanSITES with presenting a nominal solution to provenance in netCDF. How far we can get, and how quickly, is definitely TBD, but the notion I have devolves to separate files. (Yes, I do hate that, but provenance on a whole mooring system is pretty complicated to put into a netCDF file.) So I'd probably suggest a link (URL) from netCDF to a registered SensorML instance (registrations of which are being pursued on another project I'm involved with). Similar to Mike's solution but with important differences.

One point being, this is a more general problem than just model provenance. Observation and processing provenance is also desirable to represent in netCDF files.

So the question is, how much of this does the CF standard want to take on directly, and how much does it want to defer to other standards or efforts?

(No I really didn't put Mike up to this, and he really is only 8 doors from me. But neither of us knew...)

John

At 4:31 PM -0800 1/4/07, Godin, Michael wrote:
>
>I am heartened by all the work this group has put into standardizing the metadata for representing multiple models as an ensemble. However, a particularly thorny issue has been for the most part ignored (I think it has been called a "nightmare"), so I'd like to see if some of the list participants would be willing to work together to form a proposal for indicating the provenance of derived data (for example, initial conditions, larger nested grids, and assimilated data that go into models).
>
>So here are the (draft) requirements that I believe need to be addressed:
>- derived data users need to be provided the information they need to understand the differences between data (covering the same temporal/spatial region) from different models and different realizations of the same model.
>- skeptics (public, governmental, other modelers, observationalists) should be able to request specific observational data that went into a model realization (granted, the request may be for data that would not otherwise be made publicly available).
>- the specification of source data should not only indicate the source data files (or URLs) and variables, but also the temporal/spatial/realization bounds on the supplied data.
>
>I don't know if such a set of requirements can be addressed in a netCDF file, or if it would require a link to an external XML (or other format) file. I am also unsure if any other community has solved the above set of requirements - both the OGC's Layer definition within their Web Map Context Document standard, and the FGDC's Lineage definition within their Content Standard for Digital Geospatial Metadata allow one to specify a lot of metadata about lineage and provenance, but neither really meets the requirements above.
>
>My initial thought for doing this within a netCDF file would be to specify a global multi-line string attribute called something like "lineage" or "provenance" and populate it with a series of DAP2.0-like URIs (of course, this would not be global in the case of ensembles -- it would have to be a 3D set of strings!). The DAP2.0 URIs would not have to be publicly accessible, and the syntax would have to allow combinations of hyperslab operators and queries -- which I do not believe any DAP server actually allows -- but would allow one to specify precise data ranges.
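
A minimal sketch of the attribute idea above, assuming a DAP2-style constraint syntax of the form `url?variable[start:stride:stop]&selection`. The `SourceRef` class, the `build_lineage` helper, and the example.org URLs are hypothetical names introduced only for illustration; nothing here is an agreed CF convention, and, as noted above, a real DAP server may not accept hyperslabs and selections combined in this way.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SourceRef:
    """One upstream dataset that fed a derived product (hypothetical structure)."""
    url: str       # dataset URL; it does not have to be publicly accessible
    variable: str  # variable name inside that dataset
    # One (start, stride, stop) index triple per dimension, DAP2 hyperslab style.
    hyperslab: List[Tuple[int, int, int]] = field(default_factory=list)
    # Optional relational selection clauses, e.g. "time>=100".
    selections: List[str] = field(default_factory=list)

    def to_uri(self) -> str:
        """Render the reference as a single DAP2-like URI string."""
        slab = "".join(f"[{a}:{s}:{b}]" for a, s, b in self.hyperslab)
        uri = f"{self.url}?{self.variable}{slab}"
        for clause in self.selections:
            uri += f"&{clause}"
        return uri


def build_lineage(sources: List[SourceRef]) -> str:
    """Join the URIs into one multi-line string, one source per line."""
    return "\n".join(src.to_uri() for src in sources)


# Assumed example inputs: two source datasets with index and value bounds.
sources = [
    SourceRef("http://example.org/dap/sst.nc", "sst",
              hyperslab=[(0, 1, 9), (10, 1, 20)],
              selections=["time>=100"]),
    SourceRef("http://example.org/dap/wind.nc", "u10",
              hyperslab=[(0, 2, 8)]),
]
lineage = build_lineage(sources)
print(lineage)
```

The resulting string could then be stored as a global attribute named "lineage" or "provenance" with whatever netCDF library is in use; for ensembles it would need to be a per-realization array of such strings rather than a single global value.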
>
>Thanks for your consideration,
>Mike
>
>_____________________________________________
>
>Michael A. Godin
>
>Software Engineer
>
>Monterey Bay Aquarium Research Institute
>
>Phone: 831-775-2063 http://www.mbari.org
>
>
>
>_______________________________________________
>CF-metadata mailing list
>CF-metadata at cgd.ucar.edu
>http://www.cgd.ucar.edu/mailman/listinfo/cf-metadata


-- 
----------
John Graybeal   <mailto:graybeal at mbari.org>  -- 831-775-1956
Monterey Bay Aquarium Research Institute
Marine Metadata Initiative: http://marinemetadata.org   ||  Shore Side Data System: http://www.mbari.org/ssds
Received on Sat Jan 06 2007 - 09:20:15 GMT
