[CF-metadata] standard naming from Jonathan Gregory on 2007-02-16 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Jonathan Gregory <j.m.gregory>
Date: Fri, 16 Feb 2007 08:41:47 +0000

Dear Steve

(I've changed the subject to make a different thread.)

The issue of which aspects of description to put in the standard name, versus
other attributes, is as old as the CF standard, as you know. Unlike GRIB and
PCMDI names, for example, we regard air_temperature as a standard name, but
its level (1.5 m above the ground) and time processing (daily minimum) are
contained in other metadata items (coordinates and cell_methods respectively).
That's an arbitrary choice, like everything in the convention.
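
A minimal sketch of that split, assuming plain Python dictionaries stand in for
a netCDF variable and its scalar height coordinate (the dictionary layout and
the describe helper are illustrative, not part of CF). In a real file the daily
interval would normally be conveyed by the bounds of the time coordinate.

```python
# Illustrative only: dictionaries stand in for netCDF variable attributes.
temperature = {
    "standard_name": "air_temperature",  # what the quantity is
    "units": "K",
    "cell_methods": "time: minimum",     # time processing, kept out of the name
    "coordinates": "height",             # points at the coordinate holding the level
}

# Scalar coordinate carrying the level, also kept out of the standard name.
height = {"standard_name": "height", "units": "m", "value": 1.5}


def describe(var, level):
    """Combine the separately stored metadata into one readable description."""
    return "{} at {} {} ({})".format(
        var["standard_name"], level["value"], level["units"], var["cell_methods"]
    )


summary = describe(temperature, height)
print(summary)
```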

I feel that it is useful to "factorise" out those categories which (a) are
relevant in many or most cases and/or (b) have a large or infinite number of
values. If (a) applies, then the extra complexity of the metadata
design (more attributes etc. being needed) is outweighed by the convenience of
not having to parse the standard_name. If (b) applies, the factorisation
greatly reduces the number of standard_names we have to define. If we put
every possible distinction which is currently represented by standard names
into its own category, requiring a separate attribute or variable for it, that
would be in effect setting up a colossal matrix of metadata categories which
would be extremely sparsely populated. I don't think that would be so
convenient for users.
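
To illustrate point (b), here is a rough count under stated assumptions: three
quantities and three kinds of time processing. The combined names built below
(such as daily_minimum_air_temperature) are hypothetical and are not CF
standard names; they only show how the table would grow without factorisation.

```python
from itertools import product

quantities = ["air_temperature", "sea_water_temperature", "wind_speed"]
methods = ["maximum", "minimum", "mean"]

# Unfactorised: one (hypothetical) name for every quantity/method combination.
combined_names = [
    "daily_{}_{}".format(method, quantity)
    for quantity, method in product(quantities, methods)
]

# Factorised: one standard name per quantity; the method goes in cell_methods.
factorised_names = list(quantities)

print(len(combined_names), "names without factorisation")
print(len(factorised_names), "names with factorisation")
```

The unfactorised count grows as the product of the two lists, the factorised
count only as the number of quantities.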

For example, "air" and "temperature" could be separate attributes, but I think
it would be inconvenient to have to search a dataset for the combination of
those two attributes rather than the standard name of air_temperature. On the
other hand, I think it is sensible to have "daily maximum", "daily minimum"
and "daily mean" described by cell_methods rather than in the standard name
as this distinction could apply to many quantities.
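
As a sketch of the search this makes possible, assume a dataset is represented
as a list of attribute dictionaries (a hypothetical layout, with the find
helper invented for illustration). One lookup on standard_name finds every air
temperature variable, and cell_methods then narrows the selection.

```python
# Hypothetical dataset: each entry holds the attributes of one variable.
variables = [
    {"standard_name": "air_temperature", "cell_methods": "time: maximum"},
    {"standard_name": "air_temperature", "cell_methods": "time: minimum"},
    {"standard_name": "air_temperature", "cell_methods": "time: mean"},
    {"standard_name": "precipitation_flux", "cell_methods": "time: mean"},
]


def find(dataset, standard_name, cell_methods=None):
    """Return the indices of variables matching the given metadata."""
    matches = []
    for index, attrs in enumerate(dataset):
        if attrs["standard_name"] != standard_name:
            continue
        if cell_methods is not None and attrs["cell_methods"] != cell_methods:
            continue
        matches.append(index)
    return matches


all_air_temperature = find(variables, "air_temperature")
daily_minimum = find(variables, "air_temperature", "time: minimum")
print(all_air_temperature, daily_minimum)
```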

> some other attribute of the variable (cell_methods, long_name, or maybe the
> name of the variable, itself) is used to sort out the distinctions.

NB long_name is not standardised, and we say in the standard that the name of
the variable itself has no significance. But the general point is that you
do need sufficient categories to be able to distinguish all the variables in
your dataset.

> we also have families of standard names
> based upon "...transport_due_to ...", "...thickness_defined_by_...",
> and "...tendency_of_..._due_to_...". This latter approach might
> suggest adding a family of names "realization_weight_based_upon_...".

I would agree with that kind of suggestion in this case, because I suspect that
this category (method of derivation of realization weights) is one which is
particular to a small set of variables (realization weights) and hence does
not need a separate attribute. But if Jamie's use-cases suggest, on the other
hand, that it has a very large number of possible values, it would be a case
of (b) above and would need a separate attribute to prevent unwieldy expansion
of the standard name table. The families you mention result from our adopting
fairly systematic guidelines for constructing standard names. The reason for
that is to make them easier to understand, as they have a consistent style and
terminology as far as possible.

> P.S. We also have "product_of_..." variables, which points to the need
> for another level of naming that is above, rather than below the
> fundamental quantities.

I don't think so, because there are not very many of these. Moreover, many
quantities *could* be described as products. For example, specific humidity
could be named as the product of relative humidity and saturation specific
humidity, or conversely relative humidity could be named as the ratio of
specific humidity to saturation specific humidity. We don't do that because
they aren't the usual names. In some cases, however, describing something as
a product has seemed to be the clearest choice. It's just an arbitrary choice.
The product rule is one of the guidelines for construction.

> But future software that tries to work with standard names is going to
> find itself parsing our current names to discover the underlying
> families and sifting through other attributes to understand the deeper
> semantics. That seems to defeat our intent that the semantics of CF
> variables are captured in the standard names for easy usage by
> clients. In my opinion, these overloading techniques taken
> collectively seem to be crying out for a more general, multi-level
> naming solution.

I sympathise with the sentiment but for practical reasons I am not convinced,
personally. I think this is an instance of trying to foresee needs that we
have not identified in practice, and setting up an abstract structure that
may not really be required. I think use-cases and the balance of convenience
among different situations should determine what factorisation we do. There
are many ways in which standard names could be grouped, for example, but only
one such hierarchy could be implemented in the names themselves. A more
flexible approach, I think, is to use additional databases which categorise
the names in different ways for different purposes. There is no reason why we
should not record commonly needed categories in the standard name table itself,
but not in the standard names. For instance, we could mark those standard names
which are components of vectors, or record this information in some other
table, if there is a common need for that information.
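
A minimal sketch of such a side table, assuming a simple mapping from standard
name to a set of category labels. The labels ("vector_component",
"atmosphere") and the names_in helper are hypothetical; no such table is
defined by CF.

```python
# Hypothetical categorisation kept outside the standard names themselves.
name_categories = {
    "eastward_wind": {"vector_component", "atmosphere"},
    "northward_wind": {"vector_component", "atmosphere"},
    "air_temperature": {"atmosphere"},
}


def names_in(category, table):
    """Return the standard names carrying the given category label, sorted."""
    return sorted(name for name, labels in table.items() if category in labels)


vector_components = names_in("vector_component", name_categories)
atmospheric = names_in("atmosphere", name_categories)
print(vector_components)
print(atmospheric)
```

The same names can be grouped in several ways at once, which a single
hierarchy built into the names could not do.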

Best wishes

Jonathan
Received on Fri Feb 16 2007 - 01:41:47 GMT
