Opened 13 years ago

Last modified 10 years ago

#24 new enhancement

Common Concept: vocabulary with mapping to CF attributes

Reported by: frato
Owned by: cf-conventions@…
Priority: medium
Milestone:
Component: cf-conventions
Version: 1.0
Keywords: Common Concept, attributes, namespaces, semantics
Cc:

Description

1. Title

common concept (vocabulary) available from CF website

2. Proposing

new optional CF variable attribute: common_concept = "{namespace}scope_name;URN"

and

common concept (vocabulary) available from CF website

3. Motivation and Benefits

Different scientific communities work in different semantic domains and need to keep their legacy vocabularies, while sharing common software concepts and data. At the same time, within these domains they need to define the characteristics of specific phenomena by well-defined combinations of CF attributes and/or utilise commonly known abbreviations and synonyms. As stated earlier (cf. CF mailing list, 2007-07-12), the proposed vocabulary would be used for indexing and searching the contents of files.

This proposal outlines a simple (optional) extension to the CF standard to accommodate a new CF variable attribute, to be called the common_concept. The common_concept should bundle together properties of a CF variable (e.g. the standard_name and specific cell_methods or scalar coordinate variable values) with a scoped name and a CF registered universal resource name (URN).

For example, the British Atmospheric Data Centre might propose a scope_name of 2m_air_temperature, which bundles the CF standard_name air_temperature and a 2 m scalar coordinate variable into the common_concept

  • {badc.nerc.ac.uk}2m_air_temperature

which would be identified by the CF common concept registry as urn:cf-cc:blah123. At the same time, the World Data Center for Climate might wish to use the common_concept scope_name air_temperature_at_2m, and GFDL might want to call it airTemp2m.

All of these could be registered with CF, which would map the common concept onto the same URN (urn:cf-cc:blah123) under the namespaces and scoped names

  • {badc.nerc.ac.uk}2m_air_temperature
  • {wdc-climate.de}air_temperature_at_2m
  • {gfdl.noaa.gov}airTemp2m.

Use of the common_concept will reduce the need for the proliferation of some classes of standard names, allow multiple communities to reuse the same bundle of CF standard_name and other CF attributes under different scope_names (and/or abbreviations), and enable simple searches and indices on complex (but common) combinations of CF attributes.
Software would be able to use the URN to ensure that a search using any one of the synonyms returns matches to datasets using any of the common_concept scope_names. Furthermore, the common_concept attribute in the NetCDF header allows data sharing a common URN to be compared and analysed directly, independently of any semantic domain.
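
To illustrate the matching just described, here is a minimal Python sketch (not part of the proposal itself) of how analysis software might split the proposed "{namespace}scope_name;URN" string and compare variables by URN; the attribute values are the hypothetical examples used in this ticket.

def parse_common_concept(attr):
    """Split the proposed '{namespace}scope_name;URN' string into its parts."""
    scoped, _, urn = attr.partition(";")
    namespace, _, scope_name = scoped.lstrip("{").partition("}")
    return namespace, scope_name, urn.strip()

# Two files labelled with different scoped names but the same registered URN:
badc = parse_common_concept("{badc.nerc.ac.uk}2m_air_temperature;urn:cf-cc:blah123")
wdcc = parse_common_concept("{wdc-climate.de}air_temperature_at_2m;urn:cf-cc:blah123")

# A search keyed on the URN treats both variables as the same concept.
assert badc[2] == wdcc[2]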

4. Technical proposal

The common concept would simply appear as an optional variable attribute, and include the namespace, the scoped name, and the CF-maintained URN.

float temperature(time,realization,lat,lon) ;
    temperature: standard_name = "air_temperature" ;
    temperature: long_name = "annual mean 2m temperature" ;
    temperature: common_concept =
           "{badc.nerc.ac.uk}2m_air_temperature;urn:cf-cc:blah123";

5. Maintenance

We would expect communities to propose common concepts to the CF mailing list, and for the standard-names secretary (or software acting on her behalf) to provide a new (or existing) URN for identifying that concept.

In the case of existing URNs, the proposers could

  • choose to withdraw the proposal, or
  • choose to register their common_concept anyway, perhaps after minor modifications. In this case the registry would include multiple synonyms.

As long as the common_concept scope_name did not use inappropriate language or introduce spurious ambiguity, it would be assumed that acceptance of the proposal would be automatic after a short period of time. The proposers should choose a scoping namespace for which they might reasonably have authority, and the standard-names secretary might reject a proposal for which such authority is not clear.

If, after discussion in the CF community, it became clear that a larger community would own the common concept and propose it accordingly inside their own namespace, the original proposers might choose

  • to withdraw the proposal, or
  • to keep it, with both scoped-name versions recorded alongside one URN in the registry.
  • Alternatively, the CF secretary might propose to put the new concept in a "common CF namespace" as, e.g., {cf_common}<name>.

The common concept registry (list) would be available from the CF website. Proposals for common concepts should take a machine readable form. They should contain

  • a suggested name in the format of {namespace}scope_name
  • a list of pseudo-cdl formatted constraints
  • a plain text description of the constraints.

See below for some worked examples.

Once a URN has been assigned and registered by CF, the URN will never be withdrawn, although the common_concept scope_names associated with it may be withdrawn by the namespace owners after application to the CF standard names secretary.
Data files using a deprecated common concept would still be valid, since the URN still exists. The reason for withdrawing a scope_name might be that the namespace owners no longer wish to use that specific scoped name in new data files, as might happen, for example, should the concept gain wider prevalence and an international scoped name be registered for the concept. No new URN can be assigned to the deprecated namespace/scope_name pair.

The scoping mechanism specifically allows not only multiple different common_concepts (as expressed in their scope_names) to share the same URN (and hence the same combination of CF attributes), but also common concepts sharing the same scope_name to point to different URNs (and hence different CF attribute combinations), depending on the accompanying namespaces.

The same scope_name might have different URNs. For example: {badc.nerc.ac.uk}precip might point to urn:cf-cc:234 (standard_name lwe_precipitation_rate) and {wdc-climate.de}precip might point to urn:cf-cc:235 (standard_name precipitation_amount).

It should be understood that CF is operating as a registry: holding the mappings between community common concepts (as expressed in their scoped names), the CF URN, and the cdl description which defines the concept. Of course, this is a convenience activity. The actual definitions in the file headers conform to the underlying CF attributes; the common concept simply minimises the difficulty of comparison and provides easy-to-use, commonly interpretable short-form names.
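
As a purely illustrative Python sketch (not part of the proposal), the registry described above amounts to two look-up tables: scoped names map to URNs, and each URN maps to the constraints that define the concept. All entries below are the hypothetical examples from this ticket.

# scoped name -> CF-maintained URN
SCOPED_NAMES = {
    ("badc.nerc.ac.uk", "2m_air_temperature"):    "urn:cf-cc:blah123",
    ("wdc-climate.de",  "air_temperature_at_2m"): "urn:cf-cc:blah123",
    ("gfdl.noaa.gov",   "airTemp2m"):             "urn:cf-cc:blah123",
    ("badc.nerc.ac.uk", "precip"):                "urn:cf-cc:234",
    ("wdc-climate.de",  "precip"):                "urn:cf-cc:235",
}

# CF URN -> plain-text summary of the pseudo-cdl constraints defining the concept
DEFINITIONS = {
    "urn:cf-cc:blah123": "standard_name=air_temperature with a 2 m scalar height coordinate",
    "urn:cf-cc:234":     "standard_name=lwe_precipitation_rate",
    "urn:cf-cc:235":     "standard_name=precipitation_amount",
}

def resolve(namespace, scope_name):
    """Translate a community scoped name into its URN and defining constraints."""
    urn = SCOPED_NAMES[(namespace, scope_name)]
    return urn, DEFINITIONS[urn]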

The introduction of common_concepts is expected to lead to the necessity for fewer standard name proposals, but not to preclude them.

common_concepts are expected to refer to at least a standard name. If a concept can be described solely by a new standard name, then a standard name proposal should be put forward. If the concept can be described using a combination of existing CF attributes, then a common_concept should be used (but standard name proposals in this situation are not forbidden). If the concept is already described in either CF standard_names or CF common_concepts, but a community wishes to introduce a new synonym, then a common_concept should be used (and CF will ensure a common underlying URN).

6. Use Cases

  1. Communities wish to define a common short form name (see ticket 11). For example, a GFDL modelling group might want to define high cloud. They would propose

Common Concept: {gfdl.noaa.gov}high_cloud
Defined as:

dimensions:
    hgt=1;
variables:
    float x(unconstrained);
        x:standard_name = "cloud_area_fraction" ;
        x:units="1";
        x:coordinates="height + unconstrained";
        x:common_concept = "{gfdl.noaa.gov}high_cloud;tbd" ;
    float height (unconstrained);
        height:units="m";
        height:valid_min=7000.;
        height:valid_max=14000.;

Note that in this case the height variable may be geospatially varying or not; the common concept does not limit this aspect.

  2. A very similar example is near-surface temperature. Some parts of the world use 2 m air temperature, some 6 foot air temperature, and some work with 1.5 m. Near-surface temperature is commonly understood to be a temperature measured at a height of less than 10 m.

Here we want simply to associate a coordinate variable, but one with some restricted properties.

Common Concept: {badc.nerc.ac.uk}near_surface_air_temperature
Defined as:

variables:
    float tsurf(unconstrained) ;
        tsurf:standard_name = "air_temperature" ;
        tsurf:units = "K" ;
        tsurf:coordinates = "height + unconstrained" ;
        tsurf:common_concept = 
          "{badc.nerc.ac.uk}near_surface_air_temperature; urn:cf-cc:blah123" ;
    float height;
        height:units="m";
        height:valid_min=0.0;
        height:valid_max=10.0;

Valid data files simply indicate which height they actually use, but are limited to one!
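
For illustration only, the following sketch writes such a conforming file with the netCDF4-python library; the 1.5 m height is an arbitrary choice, and urn:cf-cc:blah123 is the placeholder URN used throughout this ticket.

import numpy as np
from netCDF4 import Dataset

nc = Dataset("near_surface_tsurf.nc", "w")
nc.Conventions = "CF-1.0"
nc.createDimension("lat", 2)
nc.createDimension("lon", 3)

height = nc.createVariable("height", "f4")   # scalar coordinate variable
height.standard_name = "height"
height.units = "m"
height.positive = "up"
height.assignValue(1.5)                      # the single height actually used

tsurf = nc.createVariable("tsurf", "f4", ("lat", "lon"))
tsurf.standard_name = "air_temperature"
tsurf.units = "K"
tsurf.coordinates = "height"
tsurf.common_concept = (
    "{badc.nerc.ac.uk}near_surface_air_temperature;urn:cf-cc:blah123")
tsurf[:] = np.full((2, 3), 288.15, dtype="f4")

nc.close()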

  3. The IPCC AR5 wishes to declare a number of new common variables, including climate statistics such as frost days and tropical days. These are not easily amenable to introduction as standard names, yet they are frequently used concepts.

While they could be introduced as standard names, the addition of common concepts, one new standard name, and one new cell method would allow a wide range of these types of data to be included without difficulty.

The AMS glossary says: frost day: An observational day on which frost occurs; one of a family of climatic indicators (e.g., thunderstorm day, rain day).
The definition is somewhat arbitrary, depending upon the accepted criteria for a frost observation. Thus, it may be
1) a day on which the minimum air temperature in the thermometer shelter falls below 0degC (32degF);
2) a day on which a deposit of white frost is observed on the ground;
3) in British usage, a day on which the minimum temperature at the level of the ground or on the tops of low, close-growing vegetation falls to -0.9degC (30.4degF) or below (also called a "day with ground frost"); and perhaps others. The present trend is to drop such terms in favor of something less ambiguous, such as "day with minimum temperature below 0degC (32degF)".

Potentially these could all be introduced as standard names, but they all depend on two underlying factors: something we could characterise with a standard name of "number_of_occurrences", and something we could characterise as a cell_method_modifier of threshold crossing.
It would seem simpler and more scalable to introduce these new features and use the common concepts, rather than proliferate standard names. However, the discussion on statistical standards in CF has not yet finished.

Specifically, we need to cope with the temperature threshold, the period over which that threshold might have occurred, and where the temperature is measured.
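
Purely as a hypothetical Python sketch of what such a proposal record might contain (none of the names below existed when this ticket was written, and the unfinished statistics discussion would have to settle them), a frost-days submission in the name/constraints/description form of section 5 could look like:

FROST_DAYS_PROPOSAL = {
    "name": "{wdc-climate.de}frost_days",   # hypothetical namespace and scope_name
    "constraints": [
        'x:standard_name = "number_of_occurrences" ;      // assumed new standard name',
        'x:cell_methods = "time: minimum within days time: sum over days '
        '(threshold: air_temperature < 0 degC)" ;         // assumed threshold modifier',
        'height:valid_min = 1.25 ; height:valid_max = 2.0 ;  // assumed screen-height range',
    ],
    "description": ("Number of observational days in the period on which the "
                    "screen-level daily minimum air temperature falls below 0 degC."),
}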

  4. An organisation holds millions of legacy data files that are either not CF or not in NetCDF format. CF common concepts could be used to map the variable names used in those files to a CF definition of what they mean, for use in an interoperable catalogue.

7. Anticipated Problems

  1. It could be argued that having both standard_name plus other attributes (cell_methods, etc.) and common_concept attached to the same variable introduces redundancy into the metadata, and therefore also the possibility of inconsistency. While CF in general abhors redundancy, it is expected that the advantages of this proposal outweigh the problems introduced by the redundancy.

A valid CF file should not have an incompatibility between a labeled common concept and the underlying attributes!

So, the following would not be valid CF:

float height ;
    height:long_name = "height regarded as 'near surface'" ;
    height:standard_name = "height" ;
    height:units = "m" ;
float tsurf(lat,lon) ;
    tsurf:standard_name = "air_temperature" ;
    tsurf:units = "K" ;
    tsurf:common_concept = "{badc.nerc.ac.uk}near_surface_air_temperature;
          urn:cf-cc:blah123" ;
    tsurf:coordinates = "height" ;
data:
    height = 20. ;

However, it is recommended that applications processing files which are invalid in this way respect the underlying attribute definitions.

  2. We have introduced the (unconstrained) notation to cdl, as there needs to be a method for indicating what part of the definition is constrained and what is not. The specific reason for it is that where the constraints do not limit dimensionality, this needs to be stated explicitly. For example, in the high-cloud and near-surface-temperature cases, one might allow the height variables to be functions of lat-lon or to be fixed to a specific value. The common concepts as defined in these examples are restricted to a scalar coordinate in the case of near_surface_air_temperature and are unconstrained for high cloud.

8. Status Quo

The only option currently available is to use the long_name and overload it with commonly understood semantics.

Currently it is difficult for data producers to build valid CF files. The use of common concepts will provide a ready set of "pre-defined" valid combinations of CF attributes for many applications, mitigating this difficulty.

Change History (58)

comment:1 Changed 13 years ago by apamment

I have volunteered to act as moderator for the discussion of this ticket.

Alison Pamment

comment:2 Changed 13 years ago by jonathan

Dear Frank

Thank you for this proposal. I support the introduction of the common_concept attribute to name a combination of CF metadata. I believe there is a need for this. Other conventions, such as PCMDI and GRIB, identify some quantities which could be equivalenced to common concepts, and could not be described by a standard name alone e.g. your example of surface air temperature, called "tas" by PCMDI.

However, I tend to think the proposal as it stands is somewhat more complicated than it needs to be, and permits unnecessary diversity which introduces potential for confusion.

  • What is the need for the namespace? Obviously it is the case that BADC and GFDL (following your example) both recognise the concept of surface air temperature, but why does it help for them to be able to register different names for the same concept? It seems to me that it can be named in the way the first proposer suggests, and then others could use that name for it. Furthermore, if two different centres give a different definition to "precip", I think that will simply be confusing (despite the different namespaces), tending to undermine the clarity which standard names try to enforce.
  • What is the need for the URN? Would it not be sufficient just to give it a name? The URN is a human-unfriendly string, and it is not necessary for identification of the concept; a registered name would be unique.

I feel that common_concept="surface_air_temperature" would be fine, for instance. Perhaps there should be a requirement for the name to be self-explanatory, like that. On the other hand, since the metadata into which it translates is self-explanatory, that might be unnecessary, in which case "tas" would be acceptable. Some standard names are rather long, but the mechanism of common concepts is not intended as a way to introduce short equivalents for them. Personally, I think that problem would be better addressed by providing string-matching facilities in analysis software that uses standard names (but that's another subject).

I think your technical proposal needs a complete and precise description of how the common_concept definition will be recorded. Your examples are indicative, but I don't think they are sufficient to be definitive.

The problem of redundancy leading to inconsistency could be mitigated if these definitions could be processed by software, because the CF-checker (for example) could then detect errors.

Best wishes

Jonathan

comment:3 Changed 13 years ago by bnl

Jonathan

It's not just Frank who subscribed to this suggestion: Alison, Balaji, myself, Roy Lowry, Michael Lautenschlager and Heinke Hoeck were all involved in the drafting ... not surprisingly, therefore, I support it as it stands. (I hasten to add that Alison's involvement was in helping us write it properly; I believe she can still operate as an impartial moderator.)

OK, taking your points:

  • I would argue that you are flat out wrong, on this thread and on previous threads, to assume that it is possible in all cases to find an unambiguous short-form text string that will suffice for all possible consumers of netcdf files. There will be cases where the same name means different things to different communities, and the namespace is the standard mechanism for dealing with this issue in all metadata problems. (I will return to the issue of previous threads at a later date.)

The reality is that we do want to use short strings within communities: we (the proposers) have many use cases (only some of which are in the proposal), and arguing that it undermines clarity is not sufficient, because if you achieve clarity at the expense of utility you achieve nothing.

  • Your resistance to URNs is well known, but again, this is a standard mechanism to avoid overloading semantics (which can be misunderstood) onto something which is simply an unambiguous identifier. In our use cases we specifically want to support many different things mapping onto the same "thing". If "thing" were that easily and unambiguously definable, and of use in its own right, then of course we wouldn't need to do this. But there is plenty of previous work that shows that a semantic-free identifier (URN) is the appropriate way to go here. Indeed, our use case of allowing common concept names to evolve while keeping the same identifier is a great example of the utility of this approach.

Note also this methodology also supports internationalisation quite naturally.

comment:4 Changed 13 years ago by lowry

Hello Jonathan,

Just to make public my total support for the proposal as submitted by Frank and Bryan's response to your comments.

Cheers, Roy.

comment:5 Changed 13 years ago by jonathan

Dear Bryan

I said "Frank" because the email was distributed with Frank's name on it, and no-one is named in the proposal, so I didn't know who else was responsible - that's all. Sorry! I extend my thanks to you and the rest of its authors.

I agree it is possible there may be some situations where we need to give different names to the same concept for use in different communities. I don't think that the majority of the discussions about standard names arise because of different terminology in different communities, however; they are generally about being clear to non-specialists. Often we conclude the discussion by agreeing not to use terms which have restricted usage or which are "jargon" in a particular community. I don't think that means utility has been lost. If standard names were not useful, people wouldn't keep requesting them to be defined, as they do.

The proposal as it stands also does not give examples of the need you describe. The example given of three different climate centres proposing minor variants on the same name (2m_air_temperature, air_temperature_at_2m, airTemp2m) doesn't look to me like diversity which is really needed. The other example, of two centres giving the same name (precip) to two different quantities, looks like diversity which could be potentially confusing. These aren't cases of different specialist usages. Perhaps you can give examples which motivate your first point?

Common concepts aren't about aliases for standard names, because they are proposed only for cases where the concept is a combination of a standard name with other metadata. I agree with that need, as I said.

As for your second point, it is not known to me that I am opposed to URNs on principle! But URNs are not easy for humans to deal with, and one aim of CF is to provide human-readable metadata. Hence I think strings, even if not obviously meaningful (like "tas"), are better than URNs in CF attributes. Strings can be registered and unique within CF, and that serves the purpose, doesn't it?

Best wishes

Jonathan

comment:6 Changed 13 years ago by benno

  • Keywords semantics added

I believe making semantically-precise statements about data is what CF is all about, and this is an important proposal. From my point of view, this proposal has a lot of parts, some of which might be worth modifying to fit into a grander scheme of things.

  1. add an attribute common_concept that allows one to connect a CF variable to a concept represented as a URI
  2. use a special parsing scheme {namespace}scoped-name; URI to (redundantly) specify that concept
  3. establish a CF URN space urn:cf-cc: to label a set of agreed-to concepts
  4. establish a CF concept registry to maintain a list of these concepts and mappings to legacy and alternate representations. This should be available in at least one machine-readable representation.

I think establishing a CF concept registry is a natural extension of CF's original goal of providing netcdf metadata, particularly in light of its governance mechanism and its extended discussions on standard names, but also because of the need to establish interoperability with other standards. There are many ways of documenting data (OGC standards come to mind), and it only enhances CF's goals to provide clear mappings between its variables documented with CF attributes and other standard ways of documenting variables with the same information. Such a registry could hold alternate projection representations as well (e.g. ticket 18). I would hope/insist that one of the machine readable representations would be OWL/XML, and would volunteer to see that happen.

Establishing a CF URI space is essential for creating machine-readable documentation for CF (in the sense that the machines can actually manipulate the information); doing it as a URN is a matter of taste. But I can think of at least three additional subdomains that the CF namespace should have: the attributes themselves in cf1.0, the standard names, and the underlying concepts, such as dataset, variable, Non-Coordinate_Variable, and variable with location, that our datasets are trying to express. There is a technical glitch in that the urn registry has reserved two-letter names for countries, so they would not let CF register urn:cf:, but urn:cfns:sn:air_temperature would be clear enough, as would urn:cfns:att1p0:grid_mapping or urn:cfns:obj:Non-Coordinate_Variable. And we could have the above urn:cfns:cc:surface_air_temperature. As a side note, people are already making OWL/XML statements about CF concepts (MMI has an ontology of standard name objects, I have ontologies that explicate CF attributes and objects), so not establishing URI/URNs for CF concepts just means that someone else will, in a less-than-controlled fashion.

Just to be explicit, establishing a URN space is very much about readability, and somewhat about permanence. Alternatively, CF could simply establish a URI, e.g. http://cf-pcmid.llnl.gov/concepts/sn#air_temperature which also gets the job done, or one could stick with MMI's http://marinemetadata.org/cf#air_temperature, at some cost in human readability. Of course, if you don't think you can keep a server going forever, there is a problem.

Note that since MMI was only interested in standard names, they used cf to refer to the set of standard names, ignoring all the other concepts in CF. This is part of what happens if CF does not take responsibility for its identifiers.

To summarize, establishing CF URI/URN identifiers is essential for writing down relations between CF and other metadata standards, even without extending CF so that it can connect data to a larger semantic space than it spans at present.

As for the special parsing scheme -- I am not a fan. I would much prefer a simple comma-separated list of URI/URNs (multiple lines in OPeNDAP and future versions of netcdf). I think a URI like http://badc.nerc.ac.uk/cc#2m_air_temperature is not so horrible, and urn:cfns:ccns:badc.nerc.ac.uk:2m_air_temperature would also be both readable and a legal URI. And if one has a CF registry that establishes the equivalence of urn:cfns:ccns:badc.nerc.ac.uk:2m_air_temperature to urn:cfns:cc:2m_air_temperature, then it seems unnecessary to require a provider to give exactly two versions in the netcdf file.

An attribute common_concept is an important idea, but I think some clarification is in order here. Since the proposers clearly have semantic ideas in mind, it would be nice to be explicit about the properties of common_concept, partly by relating it to existing concepts like skos:subject (which SKOS might drop, demonstrating a remarkable lack of interest in implementation). I certainly see the need for such a property: I have been using my own term:isDescribedBy, which connects an item (possibly a dataset or variable) to a term (possibly a cf:standard_name), clearly (I hope) similar.

What particularly needs clarification is the range of common_concept. I think it is clear that the domain of common_concept is intended to be a variable (though the example is of a Non-Coordinate_Variable), and it is possible one might want to apply it to a dataset. More important is the range: is it any concept represented by a URI, or just concepts that are equivalent to objects in the CF concept registry? Many concepts could be attached to datasets, possibly before CF has set standards for the concepts, and a property that allows specifying a namespace as well as a tag allows the data to be correctly tagged and later correctly interpreted (once the standard mapping has been established). This is important for the scientific enterprise, where the conceptual space is hopefully permanently expanding, and we would like to include cutting-edge data in our data framework.

In any case, if the range is only objects already in the CF concept registry, then Jonathan has a point: the cfcc identifier is sufficient, because the range of common_concept is defined explicitly to be within that namespace. But that destroys the proposal: moving the controlled vocabulary, a.k.a. namespace, into the attribute value means that the data can be correctly labelled at the time of generation while the negotiation of mapping to a standard proceeds. A fixed range also prevents labelling of data with a vast array of semantics that is beyond the current scope of CF, e.g. particular algorithms for calculating derived quantities in a model, or literature references.

Please keep in mind that it is human nature to name things, and having a single name that corresponds to a collection of CF attributes and their values is a great enhancement in human readability for most of us, as long as we can look up the one-to-one correspondence. From that point of view, even the weak version of common_concept would be useful.

Just trying to help,

Benno

comment:7 Changed 13 years ago by Heinke

Dear Jonathan,

I made some comments to your reply and try to answer some questions:

  • What is the need for the namespace? Obviously it is the case that BADC and GFDL (following your example) both recognise the concept of surface air temperature, but why does it help for them to be able to register different names for the same concept? It seems to me that it can be named in the way the first proposer suggests, and then others could use that name for it. Furthermore, if two different centres give a different definition to "precip", I think that will simply be confusing (despite the different namespaces), tending to undermine the clarity which standard names try to enforce.

With this restriction, mapping of different standards to CF is not possible. We want to bring domains together without deleting their vocabulary. Different domains have different vocabulary, e.g. see our discussions about SST: http://www.cgd.ucar.edu/pipermail/cf-metadata/2007/001828.html

  • What is the need for the URN? Would it not be sufficient just to give it a name? The URN is a human-unfriendly string, and it is not necessary for identification of the concept; a registered name would be unique.

It is a technical index and therefore it need not be a human-friendly string.

Some standard names are rather long, but the mechanism of common concepts is not intended as a way to introduce short equivalents for them.

That is not true. This was one of our intentions.

I think your technical proposal needs a complete and precise description of how the common_concept definition will be recorded.

That is true. But I think that is a technical problem and we are open to discuss this. (table, XML ...)

Your examples are indicative, but I don't think they are sufficient to be definitive.

Sorry, but I don't understand what you mean. Examples are never definitive. (?)

The problem of redundancy leading to inconsistency could be mitigated if these definitions could be processed by software, because the CF-checker (for example) could then detect errors.

Good idea.

I agree it is possible there may be some situations where we need to give different names to the same concept for use in different communities.

I gave the SST example above. Do you need more examples ?

I don't think that the majority of the discussions about standard names arise because of different terminology in different communities, however; they are generally about being clear to non-specialists.

I agree, but to make it 'human-readable' and unique with 'definitions' in the name is not so easy.

Often we conclude the discussion by agreeing not to use terms which have restricted usage or which are "jargon" in a particular community. I don't think that means utility has been lost. If standard names were not useful, people wouldn't keep requesting them to be defined, as they do.

The standard names are the basis for the cc. We don't intend to replace the CF system. We try to build a translation based on the definitions of the standard names, for better communication.

The proposal as it stands also does not give examples of the need you describe. The example given of three different climate centres proposing minor variants on the same name (2m_air_temperature, air_temperature_at_2m, airTemp2m) doesn't look to me like diversity which is really needed.

In our system we use air_temperature_at2m http://cera-www.dkrz.de/WDCC/ui/BrowseTopics.jsp?topiid=2002808

To find the same cf+metadata combination to compare the data with data at badc we need a central translation. The point is not the minor variants. For searching we need the mapping. This is an example for 'sharing common software concepts and data'.

The other example, of two centres giving the same name (precip) to two different quantities, looks like diversity which could be potentially confusing. These aren't cases of different specialist usages. Perhaps you can give examples which motivate your first point?

see our SST problem.

Common concepts aren't about aliases for standard names, because they are proposed only for cases where the concept is a combination of a standard name with other metadata. I agree with that need, as I said.

I don't agree. I would like to give an example: humidity_mixing_ratio is the standard name, but we would like to use water_vapor_mixing_ratio. So, with the cc system we are able to use water_vapor_mixing_ratio without leaving the CF standard.

As for your second point, it is not known to me that I am opposed to URNs on principle! But URNs are not easy for humans to deal with, and one aim of CF is to provide human-readable metadata.

see above: index

Best wishes

Heinke

comment:8 Changed 13 years ago by jonathan

Dear all

Reading the postings again, I see that I was wrong to describe this proposal as not being about aliases for standard names - sorry about that. In fact the proposal has more than one purpose in mind for common_concept. It seems to me that we can distinguish these two purposes:

  • Provide an identifier for a combination of a standard name with other CF metadata.
  • Provide aliases for standard names or existing common concepts to support alternative terminologies.

As I've said, I think the first of these would definitely be valuable. I agree that it would be helpful to users of data to label common concepts such as daily maximum temperature, windspeed at 10 m, high cloud amount and frost days. This is redundant, but it could be checked for inconsistency. Heinke questioned my remark:

I think your technical proposal needs a complete and precise description of how the common_concept definition will be recorded. Your examples are indicative, but I don't think they are sufficient to be definitive.

By this I meant that the technical proposal needs a syntax definition for the translation of a common concept into other CF metadata. This would be part of the CF standard document and have corresponding entries in the conformance document. You give a couple of examples but don't say exactly what the proposed syntax is or what could be included.

I remarked that URNs are human-unfriendly because the ones I have seen are (such as in the discussion about projections in Phil's ticket). However, following Benno's contribution I understand they don't have to be cryptic. I agree with him that if we had a common concept of (say) surface_air_temperature it could have a URN of (say) urn:cfcc:surface_air_temperature. But if the URN could be derived like that by an obvious rule, there would be no need to include the URN in the attribute as well as the name of the common concept.

The reason why the proposal includes the URN is for the second purpose, to support alternative names for common concepts and hence standard names. I am still not convinced by this need. I appreciate of course that there are many naming conventions in use. No doubt most centres do have their own sets of names or codes, like Heinke's example from the DKRZ CERA system. Nothing prevents these identifiers from being included as additional attributes in CF-netCDF files if that is convenient for local use. But the main purpose of CF metadata is to facilitate exchange of data. Surely it must be better for interoperability if all centres use the same name for a given common concept, just as they use the same string for a given standard name.

It could be argued that with the scope and URN specified, software will be able to recognise a given common concept, regardless of which alternative name is used for it. However, I expect that the majority of users of the data will not have software available to do that. They will be analysing the data by inspecting the attributes using the netCDF library, or with some existing package which doesn't have automatic translation of common concepts. For such users it would be much easier if surface_air_temperature always had the same common concept name. On the other hand, if an institute chooses to do so, it can write software which would recognise the standardised common concept names and present them to its users translated into their own local vocabulary. Thus, DKRZ users could have familiar CERA names for common concepts, or maybe CDAT could support automatic translation to PCMDI variable names.

As I've said, I agree there could be situations where there is a true divergence of terminology and a need to have alternate standard names or names for common concepts. But I really cannot see that various climate centres' different names for common concepts such as surface air temperature fall into that category. They are only different because they grew up separately, not because we don't understand one another's vocabulary. It doesn't seem to be a great hardship to decide a common name for a common concept, as we have done with standard names. Nor do I think that previous debates about standard names have exposed cases where different communities have different specialist terminology. Our discussions are usually concerned with deciding on names which are unambiguous and as far as possible intelligible to non-specialists.

Best wishes

Jonathan

comment:9 Changed 13 years ago by bnl

Hi Jonathan

One of those days when we need to discuss something in public for the sake of due process ...

I think we've (you, the proposers, and Benno) converged on acceptance of the bundling characteristic (your first bullet). What we're now discussing is (a) the naming convention for bundles and, for your second bullet, (b) why and how a local name should occur in a CF-compliant file.

Let me give you one more example of why we want b, and why it adds value.

We (BADC) hold data which we mandate to be CF-compliant. But we also want to put in the correct local names for multiple different communities who want access to the same file. Having the namespace means your DKRZ and PCMDI users would have their familiar abbreviations (linguistic versions or whatever) available to them via their software from the same file which we hold ... because there would be a mechanism for them to find it. (This use case is real!)

Before I get back to a). Benno actually gave an example that we didn't anticipate, and one which requires thinking about. Registering a common concept in advance of support within CF, allowing one to write files before the standard name is decided upon. Given recent history (how long we take to approve things) I think that's a VERY important use case. Do others think this is an important use case that CF should find a mechanism to support? If so, is it a different thread, or does it belong here?

OK, and now a). If we (CF) were to want an unambiguous (to all communities) human readable identifier for a bundled common concept I think we would end up with yet another list which would be as hard to maintain as the standard names. I think that would be a retrograde step since we have dealt with the semantics already in the underlying concepts and we all want to avoid nugatory work. So having an opaque identifier removes that problem. But yes, they're impossible to understand, so the consumer needs something so that they would *never* have to see an opaque identifier, and we namespace scope it so the governance of that name is clear, but we support multiple governances (namespaces) so that the fact we haven't agreed on the name is not a problem - we can all use what we want FOR OUR DISPLAY AND QUERY SOFTWARE. Our understanding and manipulation is predicated on the underlying concepts.

Enough for this one ... :-)

comment:10 Changed 13 years ago by balaji

Jonathan Gregory writes:


  • Provide an identifier for a combination of a standard name with other CF metadata.
  • Provide aliases for standard names or existing common concepts to support alternative terminologies.

Dear Jonathan, you've argued that the first is valuable, but you're less convinced of the value of the second. I'll try to convince you of the value of both, which is why I support common_concept. (I also think it supersedes my proposal #11, "standard short name", as it achieves all of those goals, and more besides).

Let me show how I think consumers of data might actually use common_concept, standard_name, etc., for that would highlight the need for common_concept. The use case assumes a user who wants to perform an analysis on a set of datasets of diverse provenance in a multi-model ensemble, such as the PCMDI IPCC archive. We anticipate some amount of poking around, and then a "batch processing" step in which she repeats the analysis on several datasets and collates them.

  • Provide an identifier for a combination of a standard name with other CF metadata.

As we've all argued, this is necessary because you could have several variables in the dataset that have the same standard_name attribute, but are different variables (the high/middle/low cloud amount example). There is thus no way to map standard_name->name, resulting in a problem in name resolution.

This highlights the fact that the *name* of the variable is special and not on par with any *attribute*. The name is unique to a namespace, and is what gets used to ID a variable in any batch processing of data. If a user wanted to compare "high cloud amount" from several datasets in a multi-model ensemble, the steps would involve some scripting like this:

foreach dataset in archive
    var = scan_for_variable( "high cloud amount" )
    run_analysis -v var
end

... and it's not obvious how to build scan_for_variable() with the current combination of standard_name and other attributes. However common_concept could be a holder for an additional name: we could propose that a variable with a certain combination of attributes is called "high cloud amount": data producers would all agree on this common concept and make the variable straightforward to consume with an appropriate scan_for_variable() function.
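
As a sketch of what such a function might look like (Python with the netCDF4 library; not from the comment itself), matching on the URN part of the proposed attribute would make scan_for_variable() independent of whichever local scope_name each centre chose; the URN below is the placeholder used earlier in this ticket.

from netCDF4 import Dataset

def scan_for_variable(path, urn):
    """Return the name of the variable whose common_concept URN matches, else None."""
    with Dataset(path) as nc:
        for name, var in nc.variables.items():
            concept = getattr(var, "common_concept", "")
            if concept.rsplit(";", 1)[-1].strip() == urn:
                return name
    return None

# e.g. the loop above becomes, for each dataset in the archive:
#     var = scan_for_variable(dataset, "urn:cf-cc:blah123")
#     run_analysis(dataset, var)   # run_analysis is the user's own routine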

The added versatility comes if you allow different communities to come up with agreed-upon "soft conventions" built for a specific purpose. As a data producer, I find it inevitable that we'll be contributing model output to campaigns where we have little or no control over metadata standards. "Soft conventions" not encoded in any standard are already in use, you'll find, if you review how users actually interact with the data. Most of our IPCC archive users rely on the "PCMDI-AR4 short name" rather than standard_name in their actual practice: that's a "soft convention" and therefore fragile. It's fragile because the next experiment we contribute model output to may not use the same "PCMDI short name": in fact, I can already point to other experiments where the AR4 soft convention is violated.

I think we should leave behind the mindset of a single over-arching standard but instead figure out how to live in a world of intersecting overlapping imprecise conventions. This brings us to:

  • Provide aliases for standard names or existing common concepts to support alternative terminologies.

I really believe this would add tremendously to the viability of CF: I would feel much more confident about using CF for everything if there were in fact a placeholder for expressing synonymity of a term between two independently developed vocabularies. There isn't now: which is why I find common_concept with URNs such an appealing idea.

I've gone on longer than I intended: in summary:

  • common concept supersedes my short name proposal #11
  • standard_name by itself does not uniquely specify a variable
  • different coordinated experiments may have different terminologies beyond what our community could hope to mandate
  • in the absence of mechanisms to specify a variable, and synonymity of variables in different datasets, we are currently relying on soft conventions
  • common_concept breaks this deadlock.

Thanks, --

V. Balaji
Head, Modeling Systems Group, GFDL
Princeton University
Office: +1-609-452-6516
Home: +1-212-253-6662
Email: v.balaji@…

comment:11 Changed 13 years ago by jonathan

Dear Bryan and Balaji

Thanks for your points and patience.

I have to say I don't see the point of putting the local names and namespaces into the attribute if we have the URN in there. The URN alone would be sufficient for software to provide the user with the translation into their own familiar name (CERA, PCMDI, UM stash code). Bryan wants to serve various user groups in their own vocabularies, but the attribute can't hold all possible alternative names, so why pick just one of them? If different centres want to store their own local names in the file, they can do that in the long_name or non-standardised attributes of their own invention.

A simple approach would be to provide just a URN, which would imply (in the external table maintained by CF) a translation into standard name and other attributes. It is up to software to do that translation. As Bryan says, the URN could be assigned provisionally to something before a standard name is agreed, if it is an opaque URN. This approach would certainly work. It would be just like using a code table, and I feel that that is a retrograde step, but maybe it is not so bad, since there is other metadata in the file that the user can read and understand. But in practice users may get quite good at recognising opaque URNs and choose to use them just as they use PCMDI variable names. I agree with Balaji that that is the main way people identify CMIP3 quantities.

A common concept which translates to a standard name alone is possible, so we may as well assign a URN to each standard name as well as to the combinations of metadata. I would propose that the PCMDI variable names be adopted as URNs where applicable. Most PCMDI variable names do translate to a standard name alone; some translate to a combination of standard name and other metadata. These names are "opaque" in that they are not obvious and not very systematic, but they are short and somewhat memorable. I think they are easier for humans to remember than a cryptic string like a typical DOI, and on that ground preferable.

Balaji also questions (in the email list) whether standard names are useful at all. We could indeed abandon them and just have definitions of quantities in the URN definition. That would save the work which has to do with the exact choice of words or phrasing, and that is sometimes difficult. However, I would argue

  • that if we did not have standard names, the file would not have any human-readable information in it about what the physical quantity was, and I definitely think that would be a retrograde step, as I don't think we can depend on all or most users having access to analysis software that will translate the URNs. To repeat myself, I think the files should be self-describing to a reasonable extent. I do not think this is a mirage at all; I think it's a welcome oasis of information that I can ncdump a CF file to see what is in it without having to look anything up, unlike a GRIB or a UM file. An alternative would be to put the definition in the file instead of the standard name, of course.
  • that in deciding on standard names, quite a lot of what we are actually doing is clarifying what the quantity really is and deciding whether some of the name belongs in other CF metadata (such as coordinates). If we didn't have the discipline of deciding on a name, we would still have to compare the new proposals with all the existing definitions to avoid redundancy and achieve consistency of approach. That task might even be more difficult than it is now without the relative short systematic summary expressed in the standard name.
  • that despite some of them being a bit long, they are actually pretty useful as phrases that one can list to describe quantities succinctly, when discussing what data to collect, etc.

But maybe I am out of line with general opinion. I speak principally as an analyst of data, not as a producer or archivist of data or a provider of software. I also speak as someone who is spending far too much time in CF debates!

Best wishes

Jonathan

comment:12 Changed 13 years ago by bnl

Hi Jonathan

I'll say it again. While you have the time and energy to disagree with me (and the reasoning to do so), I want you to do it ... I'm wrong often enough to value disagreement ... and yes, this proposal still needs some fine tuning, and your objections are exposing those issues very nicely. (I hope that eventually we'll dispose of them all though :-)

I think your new points/questions boil down to :

  1. I for one have been ambiguous about how many common concept attributes could be applied to one variable. Personally, I'd be happy to have a variable with multiple common_concept entries, each with different scoped names but the same urn. This would address my use case above. The point here is that *more* metadata is good. The easier it is in CF to add more information, the better. Yes, we could use CF long_names, but it's not so obvious what the content is meant to represent.
  2. Q: Why have the scoped name and the identifier? A: To avoid having to use an external resolver (unless the scoped name is not suitable for some reason).
  3. Q: Do "opaque" identifiers have to be inscrutable strings like DOIs? A: No, but we should avoid putting semantics into them. Putting things which help make them memorable is fine.
  4. We could use the common_concept for a full definition of PCMDI variables (and also the HTAP ones being asked about elsewhere I suppose).
  5. A defence of standard names which I mostly agree with (modulo the difference between identifiers and definitions which is running in a different thread).

Sometime soon someone (maybe even me :-) ought to try and summarise this thread ... because there are some very fair criticisms, new use cases, and clarifications (thanks Benno) that need to go back into a revised proposal.

comment:13 Changed 13 years ago by graybeal

Attempted Summary

Steve Hankin's post pointed to this thread, which I should have read before responding to his post. As penance, I am attempting to summarize the thread. And attempting is the operative word; this is the most challenging set of materials I have worked with in a long time, with lots of complexity, sophistication, and underlying assumptions.

As best I can tell, these are the direct functional goals identified by the discussion so far. Some of them are also ways of achieving other goals; but I have deliberately left out requests that I deemed strictly methods for achieving other goals (they follow, in case someone wants to argue).

Note that just because something is an identified goal, does not mean everyone agrees it should be a goal, nor (especially) that it should be achieved by CF mechanisms.

Functional/Technical Goals

  1. Provide an identifier for a combination of a standard name with other CF metadata
  2. Provide aliases for standard names or existing common concepts to support alternative terminologies and short names.
  3. Provide a way to establish a correspondence between CF terms and other vocabularies.
  4. Provide a way to register a term in advance of support within CF, allowing one to write files before the standard name is decided upon
  5. Provide a way to reference terms from other vocabularies.
  6. Establish a term (concept) registry dedicated to CF
  7. Provide the ability to create computable identifiers for CF standard names.
  8. Establish URI namespaces for CF terms.

Methods for achieving these goals (that I left out)

  1. Create a proposed parsing scheme. (The final scheme, and need for same, is TBD.)
  2. Use URNs. (Alternative mechanisms are proposed.)
  3. Introducing the (unconstrained) notation to cdl (Need for this is TBD, pending design.)
  4. Provide a *central* translation across multiple systems. (System architecture is TBD.)

Indirect, non-functional, or non-technical goals

  1. Break the current deadlocks in these areas, which currently lead to soft conventions.
  2. Simplify the creation of CF-compliant files.
  3. Allocate to CF the responsibility to referee and maintain (a) alternative vocabularies, and (b) mappings between CF and those vocabularies.
  4. Centralize in CF the responsibility to maintain and serve its own vocabulary information.
  5. Improve readability of NetCDF terms/concepts.
  6. Improve permanence of NetCDF terms/concepts.

This may not be exactly right, but it's not a bad start. Now, where are we? (I provide my own perspective on where we *should* be in the following post, which is somewhat at odds with the status as summarized here.)

Agreement on Goals (So Far)

The word 'yet' could be applied to most of these statements....

  1. Identifier for standard names/concepts + metadata
    Most agree with this goal (in addition to the core proposers); none have disagreed.
  2. Aliases for standard names/common concepts
    Most support the idea of aliases, but not everyone.
  3. Link/map between CF terms and other vocabularies
    General agreement on the value of mapping between CF terms and other vocabularies.
  4. Registering a term before CF agreement
    Some support; no disagreement.
  5. Reference terms in other vocabularies.
    This goal is inferred from use cases; several imply their support.
  6. Term/concept registry for CF.
    Stated by one and (by implication) supported by several.
  7. Computable identifiers for CF standard names.
    Agreed by several, some initial concern but may be heading toward consensus.
  8. URI namespaces for CF standard names.
    Agreed by several, not clear if everyone has fully engaged on this item.

If you accept my summary of the goals and their status, at least to a first approximation, then the discussion of the best technical approach to deal with these issues seems very much in progress. Several variations on the initial proposal have been put forward, and several details (URN v URL, existence/format of parseable name string, opaque vs memorable vs semantic identifiers, number of attributes...) are open.

At a broader yet equally fundamental level, the appropriate architectural role of CF in addressing the above goals has not been explicitly addressed, though its architectural role has been implicitly defined in the proposal and several follow-on comments. This seems to me a recipe for later regrets.

With this in mind, I cannot formulate an integrated recommendation. I hope the above, and the questions I raise in my next posting, help someone else formulate one or more recommendations for further discussion. Because many of these goals are indeed very pressing.

comment:14 Changed 13 years ago by graybeal

Some observations about all the above. I include my assumptions (at the end), in case, as a relative newbie, they are off-base.

I apologize for the length of these posts, but hey, I read all the posts that came before, and this stuff ain't trivial.

I am wildly enthusiastic about almost all the goals. But I am not sure exactly what makes CF the obvious way to achieve some of those goals, or to what degree such an approach may increase the scope of CF work beyond what is sustainable.

  • Goals 6, 7, and 8 have been needed for a while; I agree that if CF does not step up, others will. The implication is that CF has lost control of representations like MMI's, but I think it's in the interest of MMI and the community to support CF's control and satisfaction as much as possible; and MMI can 'easily' change what it does and how it does it. Put another way, there are multiple ways to give CF the best possible vocabulary services it can have; CF personally running those services may be a good option, but so might having another provider do so. (Beyond CF, think about the less well funded vocabulary publishers -- do they all have to run comprehensive registries and know about all the best practices? A common service provider has to be considered a reasonable model for at least some vocabulary providers.) But agreed that CF should participate actively in this destiny.
  • Goal 5 is essential, and I like the proposed parseable format, but would be equally or more happy with a URI, any URI. Note that if you use a URI for your own terms, you can choose to make it a legible one if superficial recognition is another goal (I'm aware of the semantic vs opaque argument but choose to defer it, unless prodded); or if not a legible URI, just an associated string, because at that point the relevant namespace is known. (URN v URL: A toss-up. In the end I bet we have to allow both.) (Multiple URIs: Yes, absolutely.)
  • Goal 4, Registering a Term: The issue with registering a term (implicitly with CF) is what you want to achieve in the registration process. A 'preliminary' term registry, for terms pending CF approval, seems useful. There are 3 ways one might do this: (a) the proposed way, (b) CF instituting a provisional status in some way, for immediate use of proposed terms, or (c) registering the term in another vocabulary, e.g., one of your own, without involving CF (this depends on Goal 3, of course). But surely a goal of registration is community acceptance? But community acceptance can only come through review and discussion, and we're back to the waiting period. Sooo, to enable community awareness and common usage, my suggestion is that proposed CF terms can go through 2-stage submission, and while in the first (review) stage, are available via a separate CF vocabulary, each term and its unique ID is versioned by date, and deprecated (but not deleted) once the decision is made. We can talk about technical details of how that would work if anyone likes the idea.
  • Goal 3, Link/map between terms. 3 parts here: the CF standard, the CF standard names, and who owns the process. Of course this goal is valuable, but should it be thrust into the CF architecture and process? Is it uniquely a CF problem? Per the proposed solution, and its initial linkage to CF standard names+metadata, maybe so. And all the exclusive users of CF may like that approach, too. But the exclusive users of GCMD will want them to do it (already do), and the exclusive users of SeaDataNet platform codes will want them to do it, and...we're on a slippery slope. This design represents a seriously localized solution, and the proposed processes (particularly requiring any ownership or review by CF) do not scale to the hundreds of vocabularies that will be mapped to CF.
    In the best case scenario mapping is facilitated by a single technique and limited number of distributed hosts; a possible scenario is one or two techniques and a whole lot of hosts; and an untenable scenario, from an interoperability standpoint, is every vocabulary owner managing the mappings from their vocabulary to all the other vocabularies. No?
  • Goal 2: Aliases (particularly to terms in other vocabularies) are inevitable and CF has to accommodate them. Of course, it already does, indirectly, whenever external systems map terms to and from CF. Whatever solution is adopted for Goal 5 will effectively provide most of the stated functions for an alias.
  • Goal 1: Identifier for standard names/concepts + metadata. I'm having a real problem with this one. Unfortunately I can only put it in terms of questions, which relate to My Assumptions below: A) Given that CF will not subsume all terms that anyone ever has a need to create and exchange, should the design solution require a connection to CF variables -- or to the CF processes -- for any of these additional concepts? B) Re the declaration "It would seem simpler and more scalable to introduce these new features, and use the common concepts, rather than proliferate standard names" -- what is the essential difference between proliferating common concepts and proliferating standard names? C) If there is a common concept called air_temperature_at_2m, will the definition specify the precise variances in elevation which are acceptable, so everyone can tell exactly whether their data fits? What about precisely specifying acceptable variances in all the other attributes, and in properties that aren't even in CF attributes yet? (Benno's comment about precisely characterizing properties may be relevant here; I'm not entirely sure.) D) What will happen when you want an essentially equivalent concept, but with less precise or more precise variances in one or more attributes? Semantic naming is likely to run aground here, at least to a first order. E) What is the tradeoff between manual review of each contribution of such an identifier, and ability to keep up with an ever-increasing number of variables? Especially when you realize that many of these variables could be generated automatically, in response to individual observation results or model runs?

In short, embedding the mechanism for a very narrowly defined identifier deep within the CF processes and services can only be supported in an automated, computable way, which raises the question of what CF adds to that architecture. I suspect CF can play a key role, but the role needs to be well understood programmatically and technically.

Best Guess

If forced to come up with answers, not just issues, I'd say:

  1. If CF wants an identifier for standard names/controlled concepts with metadata, enable an automated submission process that validates submissions before acceptance. First come first served on all names. This solution is really about registering URIs, not about community adoption per se, because of scaling issues.
  2. See #5.
  3. Perform mapping between CF and other vocabularies entirely outside of CF technology. If CF wants to create some of those mappings, great; but don't embed it within custom CF technical implementations.
  4. An interim/proposed vocabulary registration capability would be useful, and might encourage more submissions.
  5. The proposed approach seems constructive; I'd vote for multiple URIs, each with optionally a string label (no namespace).
  6. CF should enter into an agreement to serve its content via a term/concept registry; either itself, or via an agreement with another organization (or two).
  7. CF should create computable identifiers for CF standard names. My vote is to create both URNs and URLs. Work with the partners in #6 to establish requirements for both the service and for CF.
  8. See #7.

My Assumptions

CF, and the standard name conventions, were created to establish a common data format that everyone could agree on and that would still be useful for exchanging data more effectively. Toward this end, adopting a shared vocabulary was seen as a mechanism to ensure meaningful agreement on terms; not just "we'll use this term", but "and here's what we mean by it." A decision was made to agree on both definitions (for detailed understanding of the concepts) and terms (for rapid assimilation of meaning). Whether or not these are the ideal strategies or have been flawlessly carried out, they are our starting point and operating context. (And I think most of us agree we are better off for it; try comparing CF to just about any other vocabulary out there when it comes to community involvement, computability, and value. I've been doing it and not much is comparable.)

It was, and is, predictable that this vocabulary will not serve all purposes, even within the initially targeted communities. Mapping will therefore be necessary, whether explicit or implicit, and whether computable or human-focused (free text). Moreover, it will not always be appropriate for the mapping to be to CF terms, as those terms can not possibly encompass all possible domains or depths of meaning; sometimes two non-CF vocabularies should just be mapped directly to each other.

Further, CF attributes can not possibly encompass all the attributes I might want to search on, and therefore can not encompass all use cases. No matter how much knowledge we pack into it, there will be more that is not encoded (or not yet widely agreed by the community, even if it is encoded). And observation data will invariably produce a larger number of potential attribute combinations than humans can reasonably keep up with.

John

comment:15 in reply to: ↑ 14 Changed 13 years ago by graybeal

Replying to graybeal:

Oops, hit Submit by mistake. The links to #5 and #7 are bogus, and Best Guess belongs on a new line. Sorry!

comment:16 Changed 13 years ago by caron

Let me first say that I support the motivation for this proposal, and agree with much of the thinking behind it. But I have concerns about how to best implement it.

One problem with this proposal is that the people who write data files are not necessarily in the position of knowing (or having the time to find out) how to map their data onto CF "concepts". What's also worrisome is that concept definitions have to be frozen once they are actually used.

It seems to me better to separate the naming of the variables in the file from the mapping into a concept ontology. (I realize this errs in the direction of the "external table" problem. In this case I think the indirection may be considered a plus).

So instead of explicitly mapping to a fixed-forever URN as in

temperature: common_concept =
           "{badc.nerc.ac.uk}2m_air_temperature;urn:cf-cc:blah123";

just name the variable in your namespace:

temperature: concept_id = "{badc.nerc.ac.uk}2m_air_temperature";

or to make use of ticket #27 notation:

temperature: cf\:concept_id = "badc:2m_air_temperature" ;

:namespaces = "badc=badc.nerc.ac.uk cf=cf.org";

and badc.nerc.ac.uk is now responsible for registering and maintaining their mapping of the 2m_air_temperature into the cf concept space (and to other ontologies).

Obviously the format of that mapping still needs to be determined. Which discussion might be a good thing, as I am concerned that "{badc.nerc.ac.uk}2m_air_temperature;urn:cf-cc:blah123" might only be able to express identity with no qualifications ("is-a"), whereas in the real world we might also have "is the same with the additional attribute x", "is the same except for y", "is-sorta-like" and of course "is the best I could figure out in the time I had to spend on this problem".

Now the problem is separated into 2 tasks, possibly done by separate people: 1) name the variables (using your own namespace) in your files unambiguously, uniquely and consistently. 2) map your vocabulary into the cf concept ontology, refining your mapping as you understand things better.

All of this is akin to Benno's desire to allow concepts not yet defined at the time the file is written.

comment:17 follow-up: Changed 13 years ago by caron

With respect to the representation of concepts:

My first thought about using CDL to define "concepts" is that CDL is inappropriate because it's not semantically rich enough. Here is the example using unconstrained to try to represent which coordinates are constrained and which not:

dimensions:
    hgt=1;
variables:
    float x(unconstrained);
        x:standard_name = "cloud_area_fraction" ;
        x:units="1";
        x:coordinates="height + unconstrained";
        x:common_concept = "{gfdl.noaa.gov}high_cloud;tbd";
    float height (unconstrained);
        height:units="m";
        height:valid_min=7000.;
        height:valid_max=14000.;

This is certainly a good way to guide file writers by example, and it would be great if it really worked, but I fear it's not very precise or machine-readable. For example, the units probably don't have to match; they only have to be udunits convertible.

The use of unconstrained means that it's not parseable CDL. If we are augmenting CDL, I might prefer:

dimensions:
    hgt=1;
variables:
    float x(hgt,...);
        x:standard_name = "cloud_area_fraction" ;
        x:units="1";
        x:coordinates="height";
        x:common_concept = "{gfdl.noaa.gov}high_cloud;tbd";
    float height(hgt);
        height:units="m";
        height:valid_min=7000.;
        height:valid_max=14000.;

but again the semantics are not precise. I also guess that there will be semantics that it can't express (I will try to think of some). But I'm willing to be persuaded if I can't come up with any deal-breakers.

It would be useful to gather a dozen or more different examples to work with before deciding CDL can really do an adequate job.

The alternative is obviously RDF and its variants. I personally haven't been impressed by the usability of RDF; it seems to me to be an expert-only system. Where are the killer apps and the Web 3.0 sites? However, in order to do automatic reasoning, I assume that whatever we do use will have to be expressed in RDF. So is it possible to write an automatic translator of the proposed CDL notation into RDF? I will stop and let Benno and John and others more knowledgeable about this get in a word if they want, instead of continuing to bark up the wrong flagpole.

comment:18 in reply to: ↑ 17 ; follow-up: Changed 13 years ago by russ

Replying to caron:

...

It would be useful to gather a dozen or more different examples to work with before deciding CDL can really do an adequate job.

The alternative is obviously RDF and its variants. I personally haven't been impressed by the usability of RDF; it seems to me to be an expert-only system. Where are the killer apps and the Web 3.0 sites? However, in order to do automatic reasoning, I assume that whatever we do use will have to be expressed in RDF. So is it possible to write an automatic translator of the proposed CDL notation into RDF? I will stop and let Benno and John and others more knowledgeable about this get in a word if they want, instead of continuing to bark up the wrong flagpole.

There is a "Linked Data" community (Stephen Pascoe mentioned this in the "what are standard names for" CF thread in referring to the http://linkeddata.org/ site), who have proposed simpler alternatives than RDF/XML for representing triples in the RDF data model. The alternatives being used currently include Turtle and TriX.

From Linked Data Tutorial

... There are various ways to serialize RDF descriptions. Your data source should at least provide RDF descriptions as RDF/XML which is the only official syntax for RDF. As RDF/XML is not very human-readable, your data source could additionally provide Turtle descriptions when asked for MIME-type application/x-turtle. In situations where you think people might want to use your data together with XML technologies such as XSLT or XQuery, you might additionally also serve a TriX serialization, as TriX works better with these technologies than RDF/XML.

The RDF data model might be expressible in an augmented CDL or NcML using ideas or notation from Turtle (or N3 or TriX) ...

comment:19 in reply to: ↑ 18 Changed 13 years ago by benno

Replying to russ:

There are a number of rdf formats. The help from rapper, a commonly used RDF utility, currently shows

  -i FORMAT, --input FORMAT   Set the input format to one of:
    rdfxml                  RDF/XML (default)
    ntriples                N-Triples
    turtle                  Turtle Terse RDF Triple Language
    rss-tag-soup            RSS Tag Soup
    grddl                   GRDDL over XHTML/XML using XSLT
    guess                   Pick the parser to use using content type and URI
  -o FORMAT, --output FORMAT  Set the output format to one of:
    ntriples                N-Triples (default)
    rdfxml-xmp              RDF/XML (XMP Profile)
    rdfxml-abbrev           RDF/XML (Abbreviated)
    rdfxml                  RDF/XML
    rss-1.0                 RSS 1.0

GRDDL is particularly interesting -- it is a way of instrumenting an XML file (or its schema file), so that RDF applications can find a file of XSLT transformations to translate the XML to RDF. So one file can serve both the original XML clients and RDF clients.

comment:20 Changed 13 years ago by jonathan

Dear all

John Caron has suggested that the file should have an attribute which gives the common concept its local name e.g. its name in the BADC namespace, but not its URN. I'd like to make the opposite suggestion, that the common_concept attribute should give only the URN (or URL, or a plain string from which the URI can be worked out), and not the name in a local namespace. The reasons for suggesting this are:

  1. A purpose of the proposal is to provide a way for users to use their own familiar vocabulary to refer to common concepts. However, software cannot depend on a given user's vocabulary (PCMDI names, Met Office stashcodes, CERA names etc.) being recorded in the file, since the data comes from many different sources. No centre will write all the possible current names in the file, and even if it did, equivalents in other namespaces might be invented after the file had been written. Hence any software which is going to support access by local names must instead depend on translating the URN. If it can translate the URN, it can always do that, and there is no need for any local translations to be stored in the file. Indeed, if they were stored in the file, the software would have to be extended to use them. No existing software will look for its local names in the common_concept attribute, since this is a new convention.
  2. If both the local name and the URN are given in the file, they could be inconsistent.
  3. If local names are not stored in the file, it allows them to be updated, and that is under the control of the centre which owns them and their mapping to CF metadata; John Graybeal makes a similar point.
  4. It simplifies the syntax.
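
To make this concrete, here is a minimal CDL sketch of what a variable would then carry (the URN is the placeholder from the earlier example, and the dimensions are illustrative only):

float temperature(time,lat,lon) ;
    temperature:standard_name = "air_temperature" ;
    temperature:common_concept = "urn:cf-cc:blah123" ;

Any mapping from that URN to a local scoped name such as {badc.nerc.ac.uk}2m_air_temperature would then live outside the file, in a table or service maintained by the centre that owns the namespace.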

I think there are potential problems with preliminary registration of new common concepts. Common concepts may translate to a range of CF metadata, not just standard names e.g. they may require an entry in cell_methods (like daily mean temperature does) or a particular coordinate variable (like surface air temperature does). If preliminary registration of concepts were possible, it might lead to files being written with only the common concept identified, lacking other important CF metadata, not just the standard name. This would make the metadata in the file incomplete and hence less useful, the opposite of what CF has aimed to do. If preliminary registration is suggested because the CF process is slow, I would say that the right solution is to speed up the process by investing more resources in it. If we could achieve decisions on common concepts within six weeks, say, would that be good enough?

Cheers

Jonathan

comment:21 follow-up: Changed 13 years ago by graybeal

I thought John C's proposal was not to give a local name for a given common_concept, but to specify a mechanism to unambiguously refer to local variable naming. It wasn't clear to me how the relationship to the CF standard name, or to the common_concept, is maintained in the context of CF; I thought he was arguing it is *not* maintained in the context of CF. Presumably the same variable could have a local name, and the CF standard name, but that doesn't mean the two are sameAs each other, only that both describe this variable.

That said, your point 1 still makes sense to me -- a URI is much more identifiable and resolvable than simply a local name.

Another advantage of URIs in this context: Those communities that prefer semantically meaningful terms can embed them in the URN; those that prefer semantically neutral codes can use those instead. That may be more important than it seems.

You are right about preliminary registration in general: the registered terms/definitions can always be Bad (in various senses). But let me clarify the intended model: These are meant as an upgradeable handle for the user to reference the *perceived* meaning of the term. Over time, the term's name and intended meaning may evolve, as understanding grows, to address issues like the ones you raised. The point of an interim mechanism is that it can be updated to point to future, better versions; and the resources that point to the Bad initial handle can be auto-forwarded to the better term/definition. The handles themselves persist, but never contaminate the central, approved concepts until they are reviewed.

With my data system developer hat on, for optimum progress, I would like to be able to think of a term now, and put it in my data system immediately. I know that *sounds* extreme (and a very different model than the service CF currently provides), but if you think of the role of the term provider as a facilitator for that need, it is not hard to serve the need. Any unique ID registrar is doing the same thing. So, no, 6 weeks is definitely better but for some of my uses, not good enough.

comment:22 in reply to: ↑ 21 Changed 13 years ago by jonathan

Dear John

Replying to graybeal:

I thought John C's proposal was not to give a local name for a given common_concept, but to specify a mechanism to unambiguously refer to local variable naming.

Yes, that's right. I didn't express myself clearly enough. I think John was proposing to record the local name/namespace in the file, but not the URI, whereas the original proposal records both of them. On the other hand, I was suggesting recording just the URI and not the local name/namespace.

the registered terms/definitions can always be Bad (in various senses). But let me clarify the intended model: These are meant as an upgradeable handle for the user to reference the *perceived* meaning of the term.

The problem is that the perceived meaning of the term is also stored explicitly as the other CF metadata in the file, to which it translates i.e. standard name, parts of cell_methods, possible coordinates (and maybe other things). Hence if the translation of the URI changes, as contained within a mapping stored outside the file, it becomes inconsistent with the file. That doesn't seem an acceptable situation to me. The problem I was suggesting was that preliminary registration might lead to incomplete CF metadata in the file; it might also lead to incorrect metadata in this way.

With my data system developer hat on, for optimum progress, I would like to be able to think of a term now, and put it in my data system immediately.

Yes, I can see you would like to do that, but I think it is inconsistent with the need to consider and decide the right CF metadata to describe a given concept. That is bound to take some time, even though we ought to be able to reduce the time it currently takes. You say that for some purposes this delay is OK for you. It does generally take quite a lot of time from initiating a project to creating datasets, for example - long enough for these decisions to be made. Which are the purposes for which you would like an immediate definition to be possible?

Best wishes

Jonathan

comment:23 Changed 13 years ago by graybeal

(Not sure how you get the cool red line...)

"The problem is that the perceived meaning of the term is also stored explicitly as the other CF metadata in the file, to which it translates i.e. standard name, parts of cell_methods, possible coordinates (and maybe other things). Hence if the translation of the URI changes, as contained within a mapping stored outside the file, it becomes inconsistent with the file."

I'm not following you here, I hope I am not astray. Let's take a specific example. I measure water_temperature with my TempSense9000 sensor. The CF standard name is water_temperature, and many other CF attributes also apply. My internal URI is urn:myorg.org:variables:water_temp_inductive_perfecto, reflecting the critical value I put on the quality and method of sensor I'm using. There is no mapping per se between my URN and the standard_name + attributes; they are quasi-independent. (I hope that's OK; I know it violates Goal 1 in my summary, for which I can't envision a clean solution.) However, there IS a concept that I share with other temperature sensor fanatics, and by following a 'standard CF mechanism' we can jointly utilize our concept. If it turns out that someday I realize that a better name is water_temp_inductive_imperfecto, or that I misspelled a word in the definition for that matter, I haven't made any change to the mapping of my term to the CF name + attributes -- they are still quasi-independent.

"I think it [immediate registration and use of just-requested terms] is inconsistent with the need to consider and decide the right CF metadata to describe a given concept"

It's a technological bridge to facilitate CF metadata creation and consideration; nothing in it directly affects the consideration of CF metadata terms/definitions. But it provides a pathway (see below).

"Which are the purposes for which you would like an immediate definition to be possible?"

Anything involving actual implementation of data systems and products that are immediately publicly accessible. When trying to create end-to-end systems that integrate data from multiple sources, or when creating knowledge products like a sensor ontology, one values very fast turnaround on the knowledge components that the development relies on. From a sociological perspective, if the community process can be a part of my development process, I am much more likely to engage in the community process, and the converse is also true. (Put another way, just like in-line documentation is rarely done "later", people rarely want to go back and "do the right community thing" after they have finished implementing that component.)

I am open to moving this sub-topic off-line or to another proposal, we may be too deep in it at the expense of the primary question(s). It's a little hard for me to disentangle the goals though!

comment:24 Changed 13 years ago by bnl

Just an attempt at catching the current issues ...

If we wind back to John Caron's posts, we see he raised two substantive issues: 1) persistence and governance for common concept mappings, and 2) syntax for description.

As far as the former goes, Jonathan and John have argued two sides of it.

I'd like to restate the position of the proposers (en masse): common concept registration should be *very* lightweight: the underlying semantics are the bit that CF can and should argue about. If that's so, then CF can govern it, and if someone registers a common concept and wants to change it slightly, then register another one (because the URN is opaque there is no risk of semantic drift, except within the local namespace, where it is up to the registering people to deal with it, not CF). Given that CF is doing the maintenance (and we think we can automate this pretty much), both the URN and the local name can and should appear in the file.

The issue which has been well exposed is how what we proposed would work in the case where someone wants to register a common_concept in advance of the underlying CF support. We do need to resolve that. My gut feeling is that we should allow the use of common_concepts to point to new definitions sans standard names in anticipation of registration, but no other underlying CF convention change should be supported in this way. This could be cleanly implemented, cleanly defined, and would have great utility - but wouldn't solve all the problems with CF's glacial pace. So what? Let's solve the 80-20 problem first.
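
As a sketch of what that 80-20 case might look like in a file, a variable could carry a common_concept but no standard_name while the concept awaits CF approval (the quantity and scope name below are hypothetical, and the URN is left as "tbd" following the notation used earlier in this thread):

float x(time,lat,lon) ;
    // no standard_name yet: the concept is registered ahead of CF approval
    x:long_name = "quantity awaiting a CF standard name" ;
    x:units = "1" ;
    x:common_concept = "{badc.nerc.ac.uk}some_new_quantity;tbd" ;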

As far as syntax description goes, I had hoped we could use CDL (possibly with a minor extension or two), since the great majority of Netcdf users find it easy to understand. I don't know whether RDF per se provides us with a constraint language as such, and that appears to be the problem here. Can Benno or anyone else give us an example of how we can use RDF to express constraints? (The formal alternative might be OCL, but I think that would be an anathema to most of us, and we'd end up wanting to limit the syntax ... mind you if we have to invent something to add to CDL it might be a good place to start).

(Note to John and Jonathan: by all means move the discussion to another topic, but please keep it in the CF domain. It's an important discussion).

comment:25 Changed 13 years ago by bnl

I should have pointed out that whatever constraint mechanism we want to add will have to go into CF since CDL won't include it. In practice that will limit the concepts we can support to only those we can define constraints for. We shouldn't try to think of every possible use case up front. Once we have the principle established, then we can add functionality as necessary. (This also means the CF checker can implement it with confidence since it will be well defined.)

comment:26 follow-up: Changed 13 years ago by jonathan

Dear all

In response to points of Bryan's and John Graybeal's.

The common concept, indicated by the URN, translates into other CF metadata (isn't that the proposal?) such as standard name, coordinates, cell_methods. This metadata will be recorded in the file. That means if you change the definition of the concept in terms of these metadata, the metadata in the file will become incorrect, and inconsistent with the URN also recorded in the file. That is my objection to provisional registration.

For example, suppose someone provisionally registers the concept of daily-maximum surface air temperature. They suggest that this concept translates to standard_name="daily_maximum_surface_air_temperature" and it is assigned an opaque URN with that translation. They write files containing this standard name and URN. Then we debate the proposed new standard name and it is pointed out that in CF metadata we would describe this concept with standard_name="air_temperature", a size-one coordinate variable of height with a value in the range 1.5-2.0 m, and a cell_methods entry for time of maximum within days. But it is too late. The files have been written with an invalid standard name and lacking the other required metadata. The problem is that files cannot usually be provisionally written. Once written, they last forever. Hence I think provisional registration of a concept is only acceptable on the condition that no data will be written until the standard name has been agreed. Perhaps I have misunderstood your arguments why this is not a problem?
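
For concreteness, the CF description referred to above would look roughly like this in CDL (a sketch only: the variable name, the choice of 2 m within the 1.5-2.0 m range, and the use of plain "time: maximum" with time bounds spanning each day are illustrative):

float tasmax(time,lat,lon) ;
    tasmax:standard_name = "air_temperature" ;
    tasmax:units = "K" ;
    tasmax:cell_methods = "time: maximum" ;   // maximum within each day, given daily time bounds
    tasmax:coordinates = "height" ;
float height ;
    height:standard_name = "height" ;
    height:units = "m" ;
    height:positive = "up" ;
data:
    height = 2. ;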

I also didn't follow Bryan's argument why the URN and the local name could and should both appear in the file. To summarise my arguments:

  • I think only the URN needs to be in the file, because if software exists that can translate the URN it doesn't need the local name to be in the file ...
  • ... and such software must be written if the point is to allow analysts to use these local names for common concepts, whatever the provenance of the data, since in general the local name in your particular namespace will not have been recorded in the file by the person who generated the data.
  • I think the local name should not be in the file because it might be inconsistent with the URN ...
  • ... and because if it is not recorded in the file, the local name for a given URN can be modified without causing any problem for existing files.

I'm willing to be convinced these arguments are wrong.

Best wishes

Jonathan

comment:27 in reply to: ↑ 26 Changed 13 years ago by graybeal

Responding to Jonathan:

The common concept, indicated by the URN, translates into other CF metadata (isn't that the proposal?) such as standard name, coordinates, cell_methods. This metadata will be recorded in the file. That means if you change the definition of the concept in terms of these metadata, the metadata in the file will become incorrect, and inconsistent with the URN also recorded in the file. That is my objection to provisional registration.

You can't change the definition of _that_ concept, because that concept was registered with a given definition. If you want to change the definition of the concept, you can only mark it 'deprecated in favor of new_concept', but anything that uses the original reference will have to choose whether it wants to replace the original concept with the new_concept. Although the concept has a persistent name, in my scenario it is understood that a better definition (and/or name) may be produced in the near future. So the 'concept identifier' would typically embed (explicitly or otherwise) a version ID. (This concerns the need to update terms in *any* vocabulary, but I don't think we want to start that discussion here.)

If the registered name is considered permanent, I agree with everything you said. For the reasons you give, I should perhaps restate my objection to the original proposal more bluntly (sorry): I do not see the original proposal, as written, being the appropriate approach to address its use cases. My two concerns are that (I) the fundamental purposes for my restated goal 1 (Identifier for standard names/concepts + metadata) seem to me to be unachievable by the method identified, and (II) CF should not be the arbiter, or even registrar, of mappings from/to CF, unless the mapping is integral to the CF mission, or can not be disentangled from the CF process.

I suppose the argument is that this is integral to the original purpose of CF, and so I'd rephrase the question as: What kinds of mappings or reuse of CF should *not* be incorporated into the CF process and ownership? How do we know if something is important enough to CF that it should be added to the CF workload? Given the lack of resources, I'd put this bar pretty high, requiring high community benefit compared to CF team cost, and typically only put things in CF that can not be done another way.

So it's fine if others find a value in creating unique terms that give them the capability to exercise goal 1. I am not convinced (so far) that it is good use of CF time to be an explicit part of that process; even more so if some manual processes are required. I remain open-minded, as there may be as-yet-unposted answers to the issues I raised previously.

I also didn't follow Bryan's argument why the URN and the local name could and should both appear in the file. To summarise my arguments:

I endorse your arguments on this point.

comment:28 follow-up: Changed 13 years ago by benno

Replying to bnl:

Well I started to write down the mapping in RDF/OWL (see http://iridl.ldeo.columbia.edu/ontologies/sampleconceptclass.owl, subject to revision), but I realized that the example I was trying to reproduce (gfdl:high_cloud) was not what I thought it was.

So my basic strategy is to define an OWL class that has two different necessary-and-sufficient conditions, i.e. either one is sufficient to put a variable in the class. On the one hand, it can have a class characterized as common_concept has gfdl:high_cloud. On the other hand, there is the equivalent class standard_name has "cloud_area_fraction", units has "1", and cfobj:hasCoordinate is some a_high_height. (cfobj:hasCoordinate is an explicit property from my attempt to write down the concepts behind CF, i.e. what CF is trying to express about data, and the rules necessary to deduce the abstract relationships from what is found in a netcdf/CF file). The plan is that once a reasoner sees multiple necessary-and-sufficient conditions for a class, having one implies the other(s), and one gets reasoning, i.e. it can associate all the different ways of labeling the variable as being high_cloud with the particular variable at hand.

However, defining the class "a_high_height" does not seem quite right in the example. First of all, it has to have the standard_name "height" -- variable names do not have meaning in CF. But secondly, this fragment says there is some specified height, it just has to be between the two limits. That is more specific than common_concept has gfdl:high_cloud, which does not specify the height. So that would lead to a number of nested classes, with standard_name has "cloud_area_fraction" and units has "1" containing common_concept has "gfdl:high_cloud" containing the specific class standard_name has "cloud_area_fraction" and units has "1" and ncobj:hasCoordinate some a_high_height.

To express the concept of an ambiguous height between two limits in CF would be more of a challenge. I would think you could specify the two limits as the bounds of the height variable, with a cell_method of 'point' -- ideally you would leave the height value as missing, but I suppose you won't do too much damage by picking a value between the limits. I guess this brings up the general question of what the coordinate values mean if bounds are specified, i.e. if they are just the center of the interval, with no commitment as to where the "point" measurement was taken.
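
In CDL that might look something like the following sketch (the value of 10500 m is just a placeholder between the limits; whether 'point' plus bounds really conveys "somewhere in this layer, unspecified" is exactly what I am unsure about):

dimensions:
    hgt = 1 ;
    bnds = 2 ;
variables:
    float x(hgt) ;
        x:standard_name = "cloud_area_fraction" ;
        x:units = "1" ;
        x:cell_methods = "hgt: point" ;
    float hgt(hgt) ;
        hgt:standard_name = "height" ;
        hgt:units = "m" ;
        hgt:bounds = "hgt_bnds" ;
    float hgt_bnds(hgt,bnds) ;
data:
    hgt = 10500. ;
    hgt_bnds = 7000., 14000. ;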

If someone could tell me how to do this (vaguely specify height), I could make a set of CF statements precisely equivalent to common_concept has gfdl:high_cloud.

I am also not totally happy with the surface example, for similar reasons (particularly because it names a coordinate variable that does not share a dimension -- seems to me you want to specify a height dimension of length 1).

Can someone straighten me out?

Benno

comment:29 follow-up: Changed 13 years ago by bnl

We so need to simplify this thread. Even I might disagree with myself if I knew which point was being criticised :-(

Benno: I'll need a bit more time to understand your point, so meanwhile, I'll let someone else, Balaji?, take you thru the high cloud example.

So two points:

Firstly, let me agree that I'm not quite sure how this should work in the case of provisional things; I don't think we had anticipated that issue, and so I for one have yet to fully think it through. Much of the argument above is reactive rather than proactive. So, what I suggest we do is split the support for common concepts describing things that do not yet exist in CF into a separate ticket, and try to focus first on the original use cases (something I anticipated the need for doing above).

Secondly, as far as the URN goes. Jonathan: if it is really an opaque identifier are you happy not to have any local scoped name in the file?

comment:30 follow-up: Changed 13 years ago by stevehankin

I'll echo Bryan's last remark: "We so need to simplify this thread." Which, if true, means that it is time for the moderator(s) to restate the problem based upon what has been learned.

But before they do so, I'd like to offer my 2 cents (muddying the waters further?). I believe that under the single title of "Common Concept" we are actually wrestling with two rather distinct requirements. The resulting ambiguities (plus the necessary discussions of syntax) have contributed to the length and opacity of this discussion.

The 2 topics:

  1. The semantics of the standard_names are not rich enough to capture all that needs to be captured. We see this in the case of high cloud, which needs to combine the semantics of "cloud_area_fraction" with information known through the Z axis of the variable. We also should see very similar connections in the relationship between (say) the standard variable sea_skin_temperature and the standard variable sea_surface_temperature, but we currently fail to capture the relationship between these variables in a machine-accessible way. Conclusion: We need to add richness (ontological information) to our standard name framework.
  2. Private communities have already agreed upon names for variables -- names that differ from the CF standard names. This situation will always be the case. We need a way to embed optional, ancillary information into CF files, so that these private communities can create fully standard CF files, while still finding and recognizing their private community terms within those files. All that CF need provide is a standard encoding for doing this. CF should accept no responsibility for quality of the information so-encoded or for "registering" it.

I will stop here in the interest of brevity. I will not provide specific suggestions for how to accomplish these two goals, because the purpose of this message is different -- it is to offer a suggestion to the moderators of this trac ticket -- a way in which they might reformulate the current "common concept" proposal into 2 new and more focussed tickets.

comment:31 in reply to: ↑ 30 ; follow-up: Changed 13 years ago by bnl

Replying to stevehankin:

  1. The semantics of the standard_names are not rich enough to capture all that needs to be captured. We see this in the case of high cloud, which needs to combine the semantics of "cloud_area_fraction" with information known through the Z axis of the variable. We also should see very similar connections in the relationship between (say) the standard variable sea_skin_temperature and the standard variable sea_surface_temperature, but we currently fail to capture the relationship between these variables in a machine-accessible way. Conclusion: We need to add richness (ontological information) to our standard name framework.

... without introducing a large overhead in agreeing names for "bundles" of connections ... since the semantics have already been agreed ... and the only way to avoid name arguments is an opaque URI, but we thought some folk wouldn't like that on its own ... which did lead us to suggesting that the same mechanism could support registering local name aliases, since it would be sufficiently lightweight.

All the discussion so far leads me to suggest splitting this FOUR ways:

  1. Semantic richness aka bundles (with relatively opaque URIs?).
    • The problems here are a) how to define the bundles, and b) how to name them (URN, opaque or otherwise). (And Benno is probably leading us with a).
  2. Standard encoding for local names mapping onto URIs (both those defined as bundles and those which are just pre-existing standard names).
    • NB: Should probably introduce URIs for versioned standard names ...
  3. Should CF provide a common URI registration service and resolver?
    • Probably yes if we go for opaque URIs for bundles, unless we demand that a local name appear pointing to the bundle URI whenever we see a bundle URI ...
  4. What if anything should we do to provide pre-registration? (Not so much a proposal per se as a discussion which may be informed by the outcome of the first three points).

(NB: Alison has offered to moderate this, but she'll be away for a bit yet).

comment:32 Changed 13 years ago by lowry

Thanks Bryan,

That list brings the thread back to manageable proportions. Items 1&2 are what I signed up for with the original proposal and (3) I would see as pretty much an inevitable consequence of going forward with (1) and (2).

However, (4) wasn't on my radar and I'm pretty sure that I don't want it to be. Providing a short-circuit to the Standard Name discussion process might remove CF user frustration, but as Jonathan so rightly pointed out it will inevitably lead to files carrying invalid Standard Names. I know this problem can be addressed semantically by concept deprecation and mapping, but I feel this isn't the best use of the resources that would be required for its management.

comment:33 in reply to: ↑ 29 Changed 13 years ago by jonathan

Replying to bnl:

Secondly, as far as the URN goes. Jonathan: if it is really an opaque identifier are you happy not to have any local scoped name in the file?

Yes, thanks. I would be happy with an opaque (though possibly memorable) URN if there are no locally scoped names in the file, just the URN. Let me compare this with my original position:

  • I did not like the proposal of local names in the file, since those local names are just local dialect equivalents and it seems to me it does not really help and actually causes inflexibility to have them there (for arguments given above). Like John Graybeal, I think the equivalence between local names and CF concepts should be outside the file because it's not central to CF. It's a convenience (a valuable one, no doubt) for analysts used to particular names. Hence I would be happier with just the URN.
  • I originally didn't like an opaque URN but I have changed my mind because I see the advantage of not implying any extra semantic information, and because the CF metadata to which it translates, which is fairly intelligible to humans and can be processed by analysis programs without external lookup, is also in the file. Hence I am happy with the opaque URN providing a label that external services can use to translate to a familiar name, and that CF can record as indicating a particular bundle of CF metadata.

I also agree that objectives 1 and 2 are distinct. I think this is the same distinction as I made on 04/06/08 14:07:38. Regarding a point of Steve's, I think an idea of the proposal is that there is already more semantic richness in CF, but it does not reside all in the standard name. The common concepts combine standard names with other aspects of CF metadata. I agree with what Bryan says, that URIs should be introduced for standard names by themselves as well. Thus, while continuing to provide self-describing CF metadata, we also provide a hook for translation of CF metadata into other vocabularies.

Cheers

Jonathan

comment:34 in reply to: ↑ 31 Changed 13 years ago by benno

Replying to bnl:

Replying to stevehankin:

I think Bryan just succeeded in splitting part of Steve's first point into four: the second part of that first point is important as well (and the example is particularly telling: adding sea_skin_temperature invalidates all the earlier correct-at-the-time files that labelled sea_skin_temperature data as sea_surface_temperature. Stating that sea_skin_temperature implies sea_surface_temperature would solve that problem). Also, Steve's second point is a clean distillation of part of the original proposal (common_concept, or possibly just plain concept): give people a standard way to make their local statements, with mapping between these local concepts and their full CF representation a separate ticket. Both points are important.

comment:35 Changed 13 years ago by bnl

Thanks Benno, when I made the list of four points, I had meant to point out that I thought that the second part of Steve's first point was fodder for a completely new CF activity (one that I obviously think would be important), but not part of the common concept per se.

I also thought Steve's second point was rather bigger than the common concept alone. It has rather a lot in common with ticket:27 and my point 3 (if the answer to the latter is no, then Steve's second point is mandatory for the common_concept, but even if the answer to my point 3 is yes, Steve's point probably has useful application and would need a ticket in its own right).

comment:36 in reply to: ↑ 28 Changed 13 years ago by benno

Replying to benno:

I'll take a shot at clarifying my question. As an oceanographer guessing as to what gfdl:high_cloud means, I offer the following alternate CF representation

dimensions:
    hgt=1;
variables:
    float x(hgt + unconstrained);
        x:standard_name = "cloud_area_fraction" ;
        x:units = "1" ;
        x:cell_methods = "hgt: maximum" ;
        x:common_concept = "{gfdl.noaa.gov}high_cloud;tbd" ;
    float hgt(hgt);
        hgt:units = "m" ;
        hgt:standard_name = "height" ;
        hgt:bounds = "hgt_bounds" ;
    float hgt_bounds(hgt,2);
data:
    hgt_bounds = 7000., 14000. ;

i.e. that gfdl:high_cloud is the maximum cloud_area_fraction between 7000 and 14000m. This CF representation differs from the example presented in the original proposal in that a realization of this template does not additionally specify the height of the cloud_area_fraction within the 7000-14000m layer.

I think in this case the set of CF elements other than common_concept, on the one hand, and common_concept=gfdl:high_cloud, on the other, would be considered equivalent.

This assumes, of course, that I have the meaning of gfdl:high_cloud correct.

comment:37 follow-up: Changed 13 years ago by bnl

Hi Benno

I've got ten minutes ... so to help kick this along, here is my train of thought on the background high cloud use case ...

Strictly I can't comment on gfdl:high_cloud, but I could on a possible badc:high_cloud which might be cloud_area_fraction above 7km (strictly I might have no upper boundary, but let's put it at an arbitrary tropopause for the purpose of this example, so say 11 km to make the maths easier)

... Now as an aside, let me say that for this thought experiment that it is a model product and what is actually done is use a specific assumption - let's say random_overlap - and integrate the cloud_area_fraction on each model level between these bounds (this might be done to compare with a satellite derived product that can only give broad height bands) ...

Now I somehow want to get random_overlap into my cell methods ... which may be a topic for another day, so let's revert to the satellite case ... but both are the same from the following point of view: it's not from an arbitrary height per se ... each instance has a specific height ... we could write this as data from 9km +/- 2km ... (which I can get into CF without too much hassle) ...

... but I might have another model with data from 7-9km called high_cloud marked at 8 km, but I might be happy to have that marked as badc:high_cloud too ...

So the common concept would have to allow me to write the data from both instances in the normal way, but to mark both as high_cloud, and the cf_checker would have to parse the "constraints" on high_cloud and check that both are valid high_cloud things. I don't think any data processing software would need to know anything other than it could use the high_cloud as a label (for visualisation rather than cloud_area_fraction, or for selection a la Balaji's example which cannot otherwise be done automagically with CF data files) ...
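
A sketch of the first instance (9 km plus or minus 2 km), written in the normal way but also carrying the label, might look like this in CDL (writing badc as badc.nerc.ac.uk as earlier in the thread, and leaving the URN as "tbd"; the variable and dimension names are illustrative):

dimensions:
    hgt = 1 ;
    bnds = 2 ;
variables:
    float cloud(hgt) ;
        cloud:standard_name = "cloud_area_fraction" ;
        cloud:units = "1" ;
        cloud:common_concept = "{badc.nerc.ac.uk}high_cloud;tbd" ;
    float hgt(hgt) ;
        hgt:standard_name = "height" ;
        hgt:units = "m" ;
        hgt:bounds = "hgt_bnds" ;
    float hgt_bnds(hgt,bnds) ;
data:
    hgt = 9000. ;
    hgt_bnds = 7000., 11000. ;

The second instance (7-9 km marked at 8 km) would differ only in the height value and bounds, yet both would carry the same high_cloud label, and it would be the cf_checker's job to verify that each satisfies the registered constraints.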

Hope that helps.

Cheers Bryan

comment:38 in reply to: ↑ 37 Changed 13 years ago by jonathan

Dear Bryan

Replying to bnl:

but I might have another model with data from 7-9km called high_cloud marked at 8 km, but I might be happy to have that marked as badc:high_cloud too ... So the common concept would have to allow me to write the data from both instances in the normal way, but to mark both as high_cloud, and the cf_checker would have to parse the "constraints" on high_cloud and check that both are valid high_cloud things. I don't think any data processing software would need to know anything other than it could use the high_cloud as a label (for visualisation rather than cloud_area_fraction, or for selection a la Balaji's example which cannot otherwise be done automagically with CF data files)

Yes, I agree with this, and I would add that standard names alone (in the cases where a standard name is sufficient) have the same kind of role as common concepts. The definitions of standard names allow some vagueness, though some are more precise than others, because their role is to indicate which things should validly be regarded as the same thing by visualisation and processing software, like common concepts. Selection of common concepts could be done according to the other metadata they imply (standard name, coordinates, cell methods etc.); software could be written which used the definition of a common concept, just as the CF checker could use it. However it is obviously easier just to inspect the common_concept attribute than to look for a combination of metadata.

Cheers

Jonathan

comment:39 Changed 13 years ago by Heinke

Dear all

In response to points of Jonathan's.

The common concept, indicated by the URN, translates into other CF metadata (isn't that the proposal?) such as standard name, coordinates, cell_methods.

Yes, one part.

This metadata will be recorded in the file. That means if you change the definition of the concept in terms of these metadata, the metadata in the file will become incorrect, and inconsistent with the URN also recorded in the file. That is my objection to provisional registration.

This should never happen. If the header is inconsistent, the data is not CF-standard. This was not part of our proposal: we do not allow updates of the URN links to the standard names and metadata. The URN is a persistent identifier, and persistent identifiers should have persistent objects. URNs without a standard name are not allowed because the objects are not persistent; in the worst case we would produce URNs without standard names which will never get a standard name. URNs which do not resolve correctly should never be allowed; with them we would be abandoning the concept of persistent identifiers.

I agree that pre-registration should have its own thread.

For example, suppose someone provisionally registers the concept of daily-maximum surface air temperature. They suggest that this concept translates to standard_name="daily_maximum_surface_air_temperature" and it is assigned

For standard_name we should never use {namespace}scope names or URNs. Only the standard_names themselves should be allowed.

I also didn't follow Bryan's argument why the URN and the local name could and should both appear in the file. To summarise my arguments:

  • I think only the URN needs to be in the file, because if software exists that can translate the URN it doesn't need the local name to be in the file ...

We (Frank, Michael and I) would like to have both in the header because we want to have the '{namespace}scope_name' part human-readable. The other point is that for in-house (local) interpretation this part could be used without a connection to the central common concept URN server.

  • I think the local name should not be in the file because it might be inconsistent with the URN ...

Inconsistency is a problem that can always happen, but the files are written with computer programs. If they create an inconsistency in the common concept, they should be repaired. Nobody creates the header by hand, which is where inconsistency would be a big problem.

  • ... and because if it is not recorded in the file, the local name for a given URN can be modified without causing any problem for existing files.

We don't want to allow modifying the URN! A new entry should be made instead.

Best wishes

Heinke

comment:40 Changed 13 years ago by stevehankin

my 2 cents on seeing the continuing discussions about registration and stability:

As I said earlier (04/16/08 14:47:57), I think this thread should be divided into two separate pieces. I suggested that because one aspect of the proposal seems solid, while the other (registration) seems questionable. The following two bullets attempt to explain those views. The bullets follow the same numbering as my bullets above.

  1. I agree with the spirit of the common_concept attribute. It is optional in the CF file. It defines a clear and simple syntax to permit a community to connect their local vocabulary with the CF standard_name vocabulary. It is a solid proposal, e.g.
    x:common_concept = "{gfdl.noaa.gov}high_cloud";
    
  2. I have great reservations about Section 5 (Maintenance) in the original proposal ("We would expect communities to propose common concepts to the CF mailing list, and for the standard-names secretary to provide a URN"). IMHO the contents of the common_concept attributes should be the private responsibility of the organizations that insert them. There should be no "registration" of the URNs in the CF standard. I understand that this would open a potential for namespace collisions. But it is a remote potential.

The motivation for registration is given as "reduce the necessity for the proliferation of some classes of standard names". There is an alternative, and I think better, way to address this: namely, create a formal ontology for the standard names list (a.k.a. "add semantic richness"). Then there is no longer a problem with having the standard names list grow to include many terms that are only subtly different concepts from one another.

comment:41 follow-up: Changed 13 years ago by jonathan

commenting on Heinke's contribution

We agree about the problems of pre-registration.

We disagree about the local names being recorded in the attribute. You argue that this is a good idea for human-readability of the data. I agree with that principle, of course, but I feel that the common_concept local names will not help so much with readability. I think this because:

  • Although they will look familiar to the users of their own namespace, there will be many namespaces so most data will not contain your familiar names.
  • As I understand it, these local names will be locally assigned, without the long negotiations we sometimes have about standard names (that is an advantage that common concepts will have), but consequently they might not be systematically constructed, and may not be intelligible at all. The PCMDI names are not very systematic or self-explanatory, for instance.
  • As I've commented before, I think the local names could even be misleading and hence reduce human-readability, since it could happen that different centres might give the same or similar local names to different common_concepts (like your precip example). That would be confusing to readers of the file at a different institution where they are accustomed to a different namespace.

Secondly, you argue that local names would be helpful to avoid having to connect to a central server to translate the URN. But that would only help if the file contained your own local names, which it probably will not (if it comes from another institution). Also the translation doesn't have to be done by a central server. It's probably a better idea if your particular namespace mapping to the URNs is defined by your own local server anyway because it belongs to your institution.

Regarding your last remark, I agree that the URN should not be modified once assigned. What I mean is that if the local name is not recorded in the file, the local name can be modified.

commenting on Steve's contribution

My understanding of the proposal is that 2 is essential. Just as standard names are centrally registered, common concepts are to be registered as well, which represent the combination of standard names with other CF metadata. You are right that we can add more richness to standard names, and we certainly do that by creating new distinctions among them, but by design there are many aspects of CF metadata that are not in standard names, such as cell_methods, and some commonly discussed quantities involve these other metadata in their definition. So I support 2.

On the other hand, I don't like 1 so much, as you see above. I appreciate the wish to use local names to refer to common concepts, but I think that the translation should be a local matter, and that these local names should not be recorded in the file. Instead, there can be local translation tables or servers for converting URNs for common concepts into their local names.

Best wishes

Jonathan

comment:42 in reply to: ↑ 41 ; follow-up: Changed 13 years ago by frato

Replying to jonathan:

Dear Jonathan:

... about the local names ... I feel that the common_concept local names will not help so much with readability. I think this because:

  • Although they will look familiar to the users of their own namespace, there will be many namespaces so most data will not contain your familiar names.

I do not know what it is like in other institutes, but here >98% of the data handled in-house is of in-house origin. So from the scientists' point of view the files do contain familiar names.

  • As I understand it, these local names will be locally assigned ... might not be systematically constructed, and may not be intelligible at all.

I am afraid facts have already overtaken us. Legacy names (often neither systematic nor self-explanatory) do not only exist at ECMWF and PCMDI. They exist in hundreds of places, and a non-negligible fraction of their shepherds would like to map them to a more systematic system. This is exactly why we put forward this proposal.

  • ..I think the local names could even be misleading.., since it could happen that different centres might give the same.. local names to different common_concepts.

Definitely yes. But let's not write could happen - it has already happened many times, and as data centres we have to make arrangements for it. So mapping to a systematic standard might help.

Secondly, you argue that local names would be helpful to avoid having to connect to a central server to translate the URN. But that would only help if the file contained your own local names, which it probably will not (if it comes from another institution).

As above: it probably will: ... >98% of the data handled in-house is of in-house origin.

Also the translation doesn't have to be done by a central server.

I agree that this is not necessary, but it helps a lot. If you have registered names on a central server, you can not only ask this machine what the attributes of ... are. You can furthermore ask: is there an entity {XY}nameXY that maps to the same attributes as {mySemanticDomain}myLocalName? Offering this possibility is what it means to cope with the different semantic domains we face.

Best wishes... frank

comment:43 in reply to: ↑ 42 ; follow-up: Changed 13 years ago by jonathan

Dear Frank

I do not know what it is like in other institutes, but here >98% of the data handled in-house is of in-house origin. So from the scientists' point of view the files do contain familiar names.

That is true for us too. But our local data contains our local identifiers (stash codes) in non-CF attributes (often not in netCDF at all). We already have software which can handle local data. Existing local software could therefore not access common concepts without modification. If you are going to have to modify the software anyway, why not use the URN instead? Then your software will be able to handle other institutions' data as well, so you can analyse any data as if it were locally generated.

Legacy names (often neither systematic nor self-explanatory) do not only exist at ECMWF and PCMDI. They exist in hundreds of places, and a non-negligible fraction of their shepherds would like to map them to a more systematic system.

I agree that this is the situation, but I think it will be more manageable and less of a burden on the maintenance of CF for each institute to look after the mapping of its own names to the CF standard. Local names are in general not of interest to any other institution (GRIB and PCMDI names are an exception). Hence supporting them in the CF standard doesn't really help interoperability and exchange of data, which is the main goal of CF.

Best wishes

Jonathan

comment:44 in reply to: ↑ 43 Changed 13 years ago by frato

Replying to jonathan:

Dear Jonathan:

...our local data contains our local identifiers.. in non-CF attributes..

Here the situation is no different. But how will we ever be able to discuss shared software if all local information is kept in non-CF local attributes?

We already have software which can handle local data. Existing local software could therefore not access common concepts without modification.

Right. This is why nobody is forced to include CC in his/her file headers.

If you are going to have to modify the software anyway, why not use the URN instead?

Because communication is not just between machines. Think about someone who has received some external data that look strange. When writing a mail to the data producer, he would probably like to know what the producer calls the data they are talking about. He would probably not like to refer to the URN instead. And perhaps just at this moment he has no access to a URN resolver - just to some netCDF tools.

Legacy names..do not only exist at ECMWF and PCMDI. ..their shepherds would like to map them to a more systematic system.

I agree that this is the situation, but I think it will be more manageable and less of a burden on the maintenance of CF for each institute to look after the mapping of its own names to the CF standard.

I fully agree - the responsibility for managing (laying out) the mapping should stay in the sphere of the more specialised semantic domain, i.e. not with CF. However, once the mapping of an item is fixed, it should be registered by CF to make sure that

  • it is openly accessible at a central server, and
  • the assignment will not undergo further changes.

This will not be much of a burden.

Local names are in general not of interest to any other institution...

I do not understand - please explain. When I exchange data with somebody who calls one or more parameters differently, at least one of us should be interested in what the other one calls the quantities we handle. Otherwise we will not know what we are going to exchange, will we? Doesn't getting data from somebody include being interested in his/her metadata, because these are going to be my metadata?

Best wishes ... Frank

comment:45 follow-up: Changed 13 years ago by jonathan

Dear Frank

In these cases:

When writing a mail to the data producer, he would probably like to know what the producer calls the data they are talking about. He would probably not like to refer to the URN instead.

When I exchange data with somebody who calls one or more parameters differently, at least one of us should be interested in what the other one calls the quantities we handle. Otherwise we will not know what we are going to exchange

I do actually think the URN would be fine. I would use the URN to be precise about which common concept I mean. Alternatively, if I want to explain it in human-readable terms, I can give its standard name, cell_methods and so on, which are also in the netCDF file. I expect that in setting up a new project to exchange data, we would give lists of quantities by standard name etc. and by URN. Both methods refer to CF metadata that is the same everywhere. Thus, we avoid having to deal with one another's local names by adopting a common vocabulary.

Best wishes

Jonathan

comment:46 in reply to: ↑ 45 Changed 13 years ago by frato

Replying to jonathan: Dear Jonathan:

When writing a mail to the data producer, he would probably like to know what the producer calls the data they are talking about. He would probably not like to refer to the URN instead.

I do actually think the URN would be fine.

I don't understand. Do you mean that future scientists will say "urn:cf-cc:blah123" instead of saying, e.g., "precipitation"? Should data centres try to teach them to do so?

When I exchange data with somebody who calls one or more parameters differently, at least one of us should be interested in what the other one calls the quantities we handle. Otherwise we will not know what we are going to exchange

I do actually think the URN would be fine. I would use the URN to be precise about which common concept I mean.

I would agree if the data were handled by mathematicians or exchanged between data centres only. Our problem is that most of our clients (i.e. scientists) live in much less organised and well-defined semantic worlds.

Alternatively, if I want to explain it in human-readable terms, I can give its standard name, cell_methods and so on, which are also in the NetCDF file.

I am afraid most users will not agree that a standard name like tendency_of_atmosphere_mass_content_of_particulate_organic_matter_dry_aerosol_due_to_net_production_and_emission, plus a set of (attribute: value) pairs, is what they want to use when, in their local semantic domain, they simply call it foobar. They will use foobar as the axis label, in their mails and in their discussions. And as a data centre we will not be able to communicate with our local scientists if we do not handle foobar.

I expect that in setting up a new project to exchange data, we would give lists of quantities by standard name etc. and by URN.

This is fine between two (or more) data centres. However, on both sides you have (earth system) scientists who live in their own semantic worlds. They won't learn to use CF instead, but we should link them together as well.

Both methods refer to CF metadata that is the same everywhere. Thus, we avoid having to deal with one another's local names by adopting a common vocabulary.

Again - this makes CF a tool for data interchange between data centres, not between scientists. They won't let us teach them how they have to describe their data.

Best wishes... Frank

comment:47 follow-up: Changed 13 years ago by jonathan

Dear Frank

I fear we are not going to agree about this.

Of course I don't mean that scientists will talk to each other in terms of URNs. I agree that when talking about science, people will always use their own ordinary words, in various languages, and they will clarify what they mean in conversation. They will label their plots in whatever way they find scientifically helpful.

When people do find themselves in a possible confusion because they are not sure what one another's words mean, they use more words to clarify what they mean. Obviously people do not normally talk about the "tendency of atmosphere mass content of particulate organic matter dry aerosol due to net production and emission". However they might use that kind of phrase to clarify their meaning. As discussed in another thread, I don't think standard names should necessarily be regarded as "names" in the usual sense. Some of them are just names, but many of them are more self-explanatory than terms used in conversation. That is because in conversation you can ask questions and request clarification from the person you are talking to, whereas you cannot ask a file what it means to say.

I thought you were asking how people in different centres would make sure their files contained the same quantities. That is a more precise situation. In that situation, in order to be perfectly clear, I think scientists (involved in setting up intercomparison projects, for instance), would use standard names, cell_methods etc and the URNs of common concepts. They already do use existing CF metadata for that purpose.

When you are analysing data yourself, with software you are used to, it is convenient to use your software's familiar terms. By translating URNs, your software would be able to analyse anyone's data in those terms. We agree that mappings of local terms to CF terms are useful to enable this.

I don't agree, however, that it helps to store anyone's local terms in the files. I do not think this will help scientists to understand one another. It would not help you if the file said it had a quantity with Met Office UM stashcode 3236 or PCMDI name tas, if you didn't know that those mean surface air temperature, for which you have another term in your database I expect. When we talk about the data, we will not use terms like those. Moreover, if (as in your proposal) one centre uses a local name of "precip" to mean precipitation rate, and another uses "precip" to mean precipitation amount - quantities which have different units and different standard names - inspection of the file, and talking about its contents in those terms, will actually cause confusion rather than assisting mutual understanding, I think.

Best wishes

Jonathan

comment:48 in reply to: ↑ 47 Changed 13 years ago by frato

Replying to jonathan:

Dear Jonathan

I think we all agree with your first four paragraphs. This is why CF is regarded as a good standard for data interchange.

I don't agree, however, that it helps to store anyone's local terms in the files.

Jonathan, what's wrong with a few more bytes in the file header? It helps to avoid the need for many connections to a central server mapping URNs to names and vice versa.

However, if the problem is the confusion caused by putting local names and the URN into one common CF attribute, we are perhaps better off with two attributes, e.g.:

x:common_concept_urn   = "urn:cf-cc:blah123";
x:common_concept_local = "{badc.nerc.ac.uk}near_sfc_air_temperature";

In this case the URN stays nice and pure, and anybody may add local semantic flavours if it helps - perhaps more than one, as sketched below.
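
For illustration only - a sketch, not part of the formal proposal - a variable carrying both attributes might then look as follows. The second local flavour, its namespace ({another-centre.example}) and the blank-separated encoding of the list are invented for this example:

float temperature(time,lat,lon) ;
    temperature: standard_name = "air_temperature" ;
    temperature: common_concept_urn = "urn:cf-cc:blah123" ;
    // hypothetical: more than one local flavour, blank-separated, for illustration only
    temperature: common_concept_local =
        "{badc.nerc.ac.uk}near_sfc_air_temperature {another-centre.example}t2m" ;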

Best wishes... Frank

comment:49 follow-up: Changed 13 years ago by jonathan

Dear Frank

Yes, personally I would find the proposal more attractive if the local names and URN were in different attributes. Thank you for that suggestion.

There is nothing wrong with a few more bytes. I am sure that many centres already identify their quantities in their own ways, with non-CF attributes. We do that and so do you. Such attributes are used by existing analysis software and that's fine. My reservation is about including them in the CF standard (by providing a CF attribute for them), as I'm still not convinced that it will help data exchange (sorry to be obstinate). You and I would not talk about quantities using CERA names or UM stashcodes. Those attributes are only of local interest. Therefore they can remain in the non-CF attributes where they currently are, can't they?

Thanks for the discussion. I think we are at least clear about what we disagree on!

Best wishes

Jonathan

comment:50 in reply to: ↑ 49 Changed 13 years ago by frato

Replying to jonathan

Dear Jonathan:

There is nothing wrong with a few more bytes. I am sure that many centres already identify their quantities in their own ways, with non-CF attributes.

Right, but don't you think that this is somewhat suboptimal?

  • Software cannot be interchanged if it relies on these attributes;
  • in case of misunderstandings, there will be no CF guidance and no way to cross-reference at least the most important vocabulary;
  • finally, the chance to cooperate with CF in this way would strongly encourage institutes to care about the definitions of their own parameters. More than just a few local vocabularies might benefit.

We do that and so do you. Such attributes are used by existing analysis software and that's fine. (sorry to be obstinate).

No problem - so am I. I still think that writing these attributes/mappings in a defined way is better than doing it in a different manner at every location.

You and I would not talk about quantities using CERA names or UM stashcodes. Those attributes are only of local interest.

Let me tell you about a project for which we have been the main data centre for model data for a couple of years now. In this project about 20 scientific/research partners had to agree on a common set of some tens of variables. As most of them had only ever handled in-house data, there was a semantic mess around some of the physical parameters. And - believe it or not - the discussion was conducted using your attributes "of local interest".

NB: These problems are usually not as big for data centres, as they are aware of them and/or use standards like CF. However, in the wild (of research and science) things are different.

Therefore they can remain in the non-CF attributes where they currently are, can't they?

I am sure that would be the wrong decision.

Best wishes... Frank

comment:51 Changed 13 years ago by Heinke

Dear Jonathan & all,

Frank's idea to put the local names and the URNs into two attributes looks much better to me than the original proposal. And it is not just a compromise.

When you wrote...

"Yes, personally I would find the proposal more attractive if the local names and URN were in different attributes. Thank you for that suggestion."

...can we interpret this as a 'yes'?

I think at this point we can start to discuss the technical problems (ticket #29). Do you and the others agree?

Best wishes Heinke

comment:52 Changed 13 years ago by jonathan

Dear Heinke

Frank's idea to put the local names and the URNs into two attributes looks much better to me than the original proposal.

I agree.

I like the URN being standardised, but I don't like local names being given a place in the CF standard, as I don't think it helps with interoperability. I read Frank's arguments and of course they are reasonable, but I am not convinced. Hence, I would be interested to see what more people think. I'd agree to the majority view if it is in favour of creating an attribute for local names.

Just to repeat, my main point is that analysis software can support familiar local names by translating URNs. It does not need the local names to be recorded in the file, and if it can translate URNs, it can then be applied to data from other centres, which probably would not contain your local names anyway.

Another point has occurred to me, which is that a single string attribute for "local names" may not be appropriate for every centre's needs, if they do want to record their own local metadata. The Met Office model identifies quantities by two integers, for instance, not one string. Obviously one could put two integers into a string, but it's a more convenient solution in software to use two non-CF integer attributes for this purpose.
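
(Purely as an illustrative sketch, not a proposal: such non-CF integer attributes might look like the following. The attribute names are hypothetical, and the stashcode 3236 mentioned earlier is split into a section and an item number only for the example.)

float temperature(time,lat,lon) ;
    temperature: standard_name = "air_temperature" ;
    // hypothetical non-CF attributes carrying a local two-integer identifier
    temperature: um_stash_section = 3 ;
    temperature: um_stash_item = 236 ;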

Best wishes

Jonathan

comment:53 Changed 13 years ago by Heinke

Dear Jonathan,

I like the URN being standardised, but I don't like local names being given a place in the CF standard, as I don't think it helps with interoperability. I read Frank's arguments and of course they are reasonable, but I am not convinced. Hence, I would be interested to see what more people think. I'd agree to the majority view if it is in favour of creating an attribute for local names.

Yes, I hope that more people will give us their opinion.

Another point has occurred to me, which is that a single string attribute for "local names" may not be appropriate for every centre's needs, if they do want to record their own local metadata. The Met Office model identifies quantities by two integers, for instance, not one string. Obviously one could put two integers into a string, but it's a more convenient solution in software to use two non-CF integer attributes for this purpose.

This is a question of your ontology - and ontologies consist of terms, not of term sets. So this should not be part of the common concept. Bringing your two integers together at the CF-standard level should be the duty of your namespace and should not be managed centrally. A link from every namespace to the ontologies would be very nice, though; that information could well be stored centrally.

Best regards Heinke

comment:54 Changed 13 years ago by apamment

As moderator of this ticket I would first like to thank everyone for the many excellent and thoughtful contributions to the discussion. Clearly this proposal has excited a lot of interest. Thanks in particular to Bryan, John Graybeal and Steve for summarizing at intervals and providing suggestions on how to proceed with the discussion.

I would like first to express my own opinion before moving on to my summary of the ticket. From my point of view, an important strength of this proposal is that it could help to speed up the agreement of standard names. I'm not really convinced that it will reduce the number of standard name proposals because a common_concept will still require a standard_name to point to when describing a physical quantity. However, if a scientific community is able to make use of its own familiar terms via the common_concept I think that will tend to reduce the tension between the community's wish/requirement to have their particular terms accepted as standard and the need for standard_names to be constructed according to CF's accepted rules and guidelines. That is my reason for supporting this proposal.

I will turn now to the role of moderator and try to give as unbiased a summary as possible of the discussion so far. I have tried to concentrate on the main areas of agreement and disagreement while attempting not to restate too much of the detailed technical arguments.

  1. There is agreement that common_concepts may be used to:-

a) draw together a combination of CF attributes, to include standard name and other attributes such as cell_methods, valid values and/or valid ranges of coordinate variables and names of grid mappings. (This list is indicative and not necessarily exhaustive);
b) provide synonyms for the standard name attribute alone.

  2. All but one contributor seem to be agreed that a CF "concept registry" should be established within which user communities can register their own common_concepts. Steve Hankin has raised the concern that registering URIs could lead to namespace collisions.
  3. There is agreement on the following potential benefit:-

The introduction of common_concepts will allow mappings to be created between CF and an unlimited number of other vocabularies, whether they be local to a single institution or widely adopted in other metadata standards.
This will address the following use cases:-
a) the requirement for data centres to serve files that contain a user's own familiar vocabulary;
b) the requirement for short names as an alternative to standard_names + additional metadata;
(N.B. this should not be taken to imply that all are agreed on _how_ the mapping should be achieved in practice - see later)

  4. Other potential benefits and topics that have emerged during the discussion:-

a) The possibility of registering a concept even while some of its constituent parts, for example the standard name, are still under discussion and have not been fully agreed within CF. This would require some form of registration system that caters for newer versions of common_concepts, and a proposal to be put forward under another trac ticket.
b) Common_concept may provide a mechanism for CF metadata to reference terms from other vocabularies/metadata standards. In particular, this would require the development of a syntax or "parsing scheme" for referencing metadata outside the CF standard. Benno has opened ticket #27 for the discussion of this topic. The form of the URIs used for external referencing could also be adopted for referencing common_concept bundles of CF attributes.
c) Bryan has raised the question of how the local names that map onto the common_concept URIs should be encoded. This is an important subtopic for the discussion of the current ticket and should be opened under another trac ticket.
d) John Caron, in discussing the content of the common_concept attribute, raised the question of mapping not just identities but other types of relationships between common concepts and local names. I think this point is very much akin to Steve Hankin's identification of the need to "add semantic richness" to standard names, for example, describing the relationship between sea_surface_temperature and sea_surface_skin_temperature.

In order to keep this discussion to manageable proportions these topics should all be pursued under separate trac tickets or mailing list discussion threads as appropriate.

  5. This ticket should continue to be used for the discussion of the original proposal which concentrates on defining common_concepts based on metadata attributes that already form part of the CF conventions. Even this narrower view gives rise to a number of questions that need to be addressed before a second draft of the proposal can be prepared.

a) What language should be used to describe the definition?

It is clear that, if a common_concept is to be defined, there must necessarily be a way of describing that concept. The original proposal attempts to use CDL as a language for describing the constraints on a common concept, i.e., what values its constituent attributes may be permitted to take. It has been demonstrated during the discussion of this ticket that CDL does not currently lend itself to this use. Some of the examples in the original proposal are ambiguous in their interpretation. Ticket #29 has been opened for the discussion of the technical aspects of using CDL to describe the common_concept. Other possible methods of describing a common_concept have been raised during the discussion of this ticket. These include OWL/XML, RDF and OCL. The resolution of this issue is clearly essential if a second draft of the common_concept proposal is to be developed and I would encourage all interested parties to contribute to the discussion under ticket #29.
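
To make the difficulty concrete, here is a hedged sketch - not taken from the original proposal, and using placeholder variable names - of a "2 metre air temperature" concept expressed as CDL-like constraints. Read on its own, this CDL does not say whether the height coordinate must equal exactly 2 m, must merely be present, or may lie within some range; that is precisely the kind of ambiguity to be resolved under ticket #29:

// hypothetical constraint sketch only; "x" and "height" are placeholders
float x(time,lat,lon) ;
    x: standard_name = "air_temperature" ;
    x: coordinates = "height" ;
float height ;
    height: units = "m" ;
    height: axis = "Z" ;
data:
    height = 2. ;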

b) What will be the procedure for registering a common concept definition?

There is some question as to whether the registration of a common_concept should be an entirely automated process or whether it would/should require some manual intervention. I would note that, although the actual procedure for registering a common_concept may not need to be included in the CF conventions (in the same way that the procedure for proposing a new standard name is not spelt out within the document itself), we nevertheless need to agree the procedure and put any necessary software/services in place before the common_concept can be put to practical use. I think that a second draft of the proposal would need to contain some further clarification of the registration process.

c) What CF metadata attributes are allowed to be included in the common_concept definition?

This hasn't really come out in the discussion, but I think that for the purpose of clearly defining the common_concept within the CF conventions it may be necessary to list those attributes that can be used as part of a combination. Does it make sense to allow the use of any attribute or are there some that should be excluded from use in common_concepts?

  6. There needs to be a means of attaching the information represented by a common_concept to a CF variable. The original proposal is that this will be achieved by introducing a single additional attribute, called common_concept, whose value will take the form namespace:scoped_name;URI. The value that this attribute should take, and indeed whether a single attribute can suffice, has proved to be the point of greatest debate within this discussion.

To recap, the intended purpose of each of the proposed components of the common_concept attribute is as follows:
scoped_name - this is the name that is used within a scientific community to refer to a data variable such as 2 metre temperature;
namespace - in essence, identifies the community or institution that registered the scoped name;
URI - a machine readable identifier for the registered definition of the common_concept. All namespace:scoped_name identifiers that reference the same common_concept would be associated with the same unique and unchanging URI.

The proposed design incorporates two distinct elements:
i) The local name that is familiar to the scientist using the data
ii) The means of mapping that local name to the registered common_concept and, by implication, to other synonymous local names.

a) The use of URIs.
There seems to be agreement that any URIs should be opaque, i.e., contain no semantic information additional to the CF attributes of the common_concept to which the URI points. There is no clear preference for the use of URNs or URLs - John Graybeal has suggested that both might be included. Ticket #27 includes much discussion of the form that URIs should take and the outcome of that ticket should be used to inform the second draft of the common_concept proposal.

b) The local name vs the URI.
The original proposal was to include both these elements in the common_concept attribute. However, a number of contributors have questioned this point:

John Caron suggested including the local name only - he was concerned that a fixed URI would unduly limit possible mappings between vocabularies;

Steve Hankin also suggests using the local name only and does not support the registration of URIs within CF;

Jonathan has suggested including the common_concept URI only - he has argued that external software should translate the URI to a local name;

On behalf of the proposers, Frank and Bryan have continued to argue that both elements should be included, so that the URI can point to the bundle of attributes forming the concept while the inclusion of local names is convenient for scientists accessing data from within a particular institution.

Frank has further suggested that the two elements could be split between two new attributes - common_concept_urn and common_concept_local.

The decision as to whether to include one or both elements depends very much on how the mapping process between URI and local name (and by implication between one local name and another) is to be achieved (see point (c) below).

c) Should the mapping process take place within CF supported processes or by an external mechanism?

According to the original proposal the mapping would be achieved by registering both a common_concept metadata bundle and a namespace:scoped_name with CF. This, coupled with the proposed automated registration procedure, would require CF to be responsible for maintaining a machinable list of both these elements. The case for CF providing this service from a central server has been further argued by Frank and Bryan.

John Graybeal, while supporting the establishment of a CF common_concept registry, asked whether it would be appropriate for the mapping mechanism from common_concept to local name to be entrenched within the CF process. He pointed out that the method of mapping from a CF registered scoped name to another is a solution that is very local to the CF community and asked whether the mapping from one vocabulary to another should be done in cooperation with other organisations.

John Caron asks whether mapping via an immutable URI is the best approach because it allows only for the identity mapping between common_concept/standard_name and another vocabulary. The current proposal does not address the construction of more complex relationships such as finding broader or narrower terms than a particular scoped name. However, the important point for this discussion is that John also proposes splitting the job of naming the common_concept from that of mapping and that the attribute should not attempt to encapsulate the mapping mechanism.

Jonathan suggests that the mapping between the URI and the local name should be performed by servers within each institution. He prefers the suggestion of giving the URI and the local name in separate attributes, but does not support registering the local names in the CF standard.

As mentioned in my point 2, Steve Hankin has expressed the view that CF should not act as a registry even for the URIs, because of the potential for namespace collision. Individual institutions/data centres would then be responsible for mapping their own names to the CF attributes.

  7. Conclusion

The most important point to draw out is that we have a unanimous consensus that the common_concept, as a means of bundling together a number of attributes or as a synonym for standard names, will be a useful addition to the CF conventions. We must therefore work to resolve the outstanding issues that have been raised during the discussion.

Progress now rests on reaching a decision on whether CF should act as a registry for the common_concept attribute bundles (and presumably associate them with a URI), local scoped names, or both. I would say that we are very close to achieving consensus that CF should register the common_concept bundles but we are rather further away from consensus on whether to register the local names. Making a decision on this point will also clarify what the content of the common_concept attribute should be.

In any case, I think it will not be possible to finalise all the details of a second draft proposal until the outcomes of ticket #29 (CDL as a constraint language) and #27 (on namespace tags) are decided. However, I hope that this summary will provide a starting point for developing a second draft. The second draft should make clear:
a) its scope (i.e., bundling together attributes that already form part of the agreed CF conventions);
b) that a common_concept can consist of a standard_name only;
c) the registration process for the common concept.

Best wishes

Alison

comment:55 Changed 12 years ago by graybeal

Alison, et al,

I apologize, I have been buried in other activities and have now reread this thread. Before I analyze it all in detail...

Can you confirm that, so far, we do not have any new proposals on the table from this thread that restate the basic elements? (I think #27 is a relevant piece but not directly responsive; #29 forms but a piece, and neither has been moving toward resolution recently.)

Also, is the gist of your summary that any further discussion on any of these topics should be introduced on a separate thread? There are a few things that I want to clarify (and maybe move forward), but looking at the frequency and specifics of the last few postings, maybe people don't want further input in this context. Advice?

comment:56 Changed 10 years ago by jonathan

Reposting Steve Hankin's comment from ticket #65:

I wonder whether we have adequately thought through the interplay between the cell methods and the standard_names. The discussion that follows is a general concern that applies to other values of cell_methods, as well as what is proposed here. (Perhaps it deserves to be in a separate ticket.)

The proposal contained in this trac ticket suggests that, for example, diurnal variation in sea surface temperature might best be represented as

float sstrng(dimensions)
   sstrng: standard_name = "sea_surface_skin_temperature";
   sstrng: cell_methods = "time: range";

The cell_methods attribute has altered the fundamental concept of this parameter, rendering the standard_name incorrect as a stand-alone description. Most likely, for example, data discovery systems will present this variable under the concept "sea_surface_skin_temperature" and lead users to data discovery blind alleys. Even if the authors of data discovery systems wanted to improve the search fidelity, CF is not providing them with the tools they need -- a concept name for the parameter that this file actually contains. If they wanted to synthesize a name they would need to consult the standard_name, the cell_methods AND (in order to capture the diurnal concept) look at the time axis coordinates, as well.
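
(To make the point concrete, here is a hypothetical completion of the example above; the time coordinate and its bounds are invented for illustration. The "diurnal" aspect is visible only in the one-day cell bounds, not in the standard_name or the cell_methods.)

double time(time) ;
    time: units = "days since 2001-01-01" ;
    time: bounds = "time_bnds" ;
double time_bnds(time,nv) ;
data:
    time = 0.5 ;
    time_bnds = 0., 1. ;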

Tools like the emerging ncISO from NGDC (which will shortly be embedded into both TDS and HYRAX servers) may provide the way out of this fix. These tools can have in-built CF-aware smarts to enable them to synthesize more fully descriptive search terms when generating metadata records. The CF standard could define the algorithm for doing so. In this example, perhaps the algorithm would generate "time-interval-ranged_sea_surface_skin_temperature". Or, if it had been a range over the spatial dimensions, perhaps "lat-long-interval-ranged_sea_surface_skin_temperature".

More discussion seems warranted.

comment:57 Changed 10 years ago by stevehankin

In my opinion this ticket has progressed to a stage from which it can never (and should never) recover. That the discussion is so exceptionally long and complex speaks for itself: the ticket is attempting to bite off more than CF can chew. I propose that we close this ticket. I also argue that much of the problem space this ticket hoped to address (think "80:20") has already been solved.

During the 3 years that have intervened since this ticket was originally proposed, a game-changer has occurred: the CF file-internal metadata, including coordinate ranges and cell methods, is now being promoted into THREDDS-level visibility by ncISO (and its cousins). Thus a geo-spatially-aware data discovery engine can in principle find CF variables based upon their vertical positions. (There is need for improvement in the discovery engines, so that they understand Z coordinates better. I hope we agree that is someone else's problem -- not CF's.)

The genesis of the data discovery problem outlined in this ticket is found in the canonical use case of "2 meter air temperature" from Section 4 (top of this ticket). The proposed encoding does not include the vertical coordinate position; the file is not self-describing! (It could trivially be made so, using a scalar Z coordinate.) Similarly for most of the use cases (again, think "80:20") in this ticket. Here it is, again, in "discoverable" (self-documenting) form:

float temperature(time,lat,lon) ;
    temperature: standard_name = "air_temperature" ;
    temperature: long_name = "annual mean 2m temperature" ;
    temperature: coordinates = "time hgt lat lon" ;
    temperature: keywords = "badc_2m_air_temperature, foobar1, foobar2";
    temperature: units = "blah";
float hgt;
    hgt:units="m";
    hgt:axis="Z";

data:
  hgt = 2. ;

("keywords" was proposed at GO-ESSP, CF Day. If attractive, perhaps it should become a new ticket.)

comment:58 Changed 10 years ago by lowry

Dear All,

The timing of Steve's message on this ticket is a little unfortunate. Two things of relevance to this ticket have happened.

1) Dom Lowe at BADC has developed an XML schema based on SWE standards to describe a CF variable.
2) Additional requirements for this work have developed beyond the '2m temperature' use case: there is an increasing need for semantic interoperability between CF and the rapidly expanding SeaDataNet data holdings.

The ticket stalled for two reasons:

1) Lack of a formalised way to fully describe a CF variable.
2) Good intentions on my part, but a lack of realism about what I could achieve in 'spare time'.

Dom has tackled (1), and at long last I have tackled (2) by getting the CF/SeaDataNet mapping included in the work plan of an EU FP7 project for the latter part of this year. Apologies for not being more communicative in the CF arena about these developments.

I like the "keywords" idea, particularly if the keywords could be specified as URIs, which would sweep aside one of the blockers to the wide scale adoption of CF in SeaDataNet?. Therefore, I would strongly support taking it forward as a new ticket.

However, it doesn't solve all of my problems as it doesn't provide sufficient information for the WPS chaining semantic support needed in NETMAR. In your example, how do I find out what 'foobar1' means? There's also the issue of legacy data: motivating large-scale incorporation of keywords into existing data might be easier said than done.

Therefore, could I beg a little more patience and ask that this ticket be left open until I can post what I think NETMAR needs (probably late September or early October) and then develop it in collaboration with the CF community?
