[CF-metadata] a different (but perhaps unoriginal) approach to standard name construction from Karl Taylor on 2008-10-28 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Karl Taylor <taylor13>
Date: Mon, 27 Oct 2008 18:05:21 -0700

Dear all,

It seems to me that the issue of possibly wanting to store several
different chemical species in a single array (with a coordinate variable
identifying the species) is only one limitation of the current
constraints placed by standard names. We've also run up against the
following difficulties:

1) Currently it is impossible to identify with a single standard name
closely related variables that one might want to store in a single
array. For example, such quantities as:

     * temperatures measured with several different instruments.

     * precipitation separated into categories of snow and ice

     * concentrations of a molecule (e.g., CO2) separated into
components defined by its source (fossil fuel combustion, volcanic
emissions, respiration, decay, etc.) [some have talked about
distinguishing these contributions by an artificial "color" label]

     * the contributions of various "processes" to a particular
quantity (e.g., temperature tendency due to advection, deep_convection,
short_wave_radiation, etc.)

     * a variable, as simulated by several different models

     * and the like

2) thresholds and similar things

3) various combinations of variables or operations/transformations, such as:

     * anomalies and more generally differences

     * products (e.g., transports, correlations, etc.)

     * and the like

I think it is perhaps time to consider devising an alternative way of
providing the information that is currently in the standard name
(instead of forcing all the information into a single attribute). As
you will see, this will eliminate the above limitations, but perhaps
more importantly it provides a way of more quickly converging on new
standard names.

The idea, which I'm sure must have been discussed at length already (but
I've forgotten by now or I've missed it entirely), is to parse the
quantity identification information into separate elements (or
"categories" or "components"). We already do this to a certain extent
by providing some information in the cell_methods attribute. I would
build on the bits of independent information already listed in the
Guidelines for Construction of CF Standard Names
(http://cf-pcmdi.llnl.gov/documents/cf-standard-names/guidelines).
We might, for example, parse air_temperature and sea_water_temperature
into two independent attributes: medium="sea_water" or "air", and
quantity="temperature".

These independent bits of information could be automatically assembled
together to create the "standard name". The current standard names
would in some cases be identical to the names created from the elements,
and in other cases we could establish aliases. This would make it
obvious in many cases how to construct new standard names, and in any
case would impose a structure on the standard names.
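
To make the assembly step concrete, here is a minimal sketch in Python.
The component order, the alias table, and the function name are all
assumptions made for illustration; nothing in CF defines them.

```python
# Hypothetical assembly order and alias table; neither is defined by CF.
COMPONENT_ORDER = ["medium", "quantity"]

# Maps an assembled name to an existing standard name when the two differ.
# The "atmosphere" entry is invented to show how an alias would work.
ALIASES = {
    "atmosphere_temperature": "air_temperature",
}


def assemble_standard_name(components):
    """Join the components that are present, in a fixed order, with underscores."""
    unknown = set(components) - set(COMPONENT_ORDER)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    parts = [components[key] for key in COMPONENT_ORDER if key in components]
    name = "_".join(parts)
    # Fall back to the assembled name when no alias is registered.
    return ALIASES.get(name, name)


print(assemble_standard_name({"medium": "sea_water", "quantity": "temperature"}))
print(assemble_standard_name({"medium": "air", "quantity": "temperature"}))
print(assemble_standard_name({"medium": "atmosphere", "quantity": "temperature"}))
```

The first two calls reproduce existing names directly; the third shows
an assembled name being mapped to an existing one through an alias.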

The main job of the standard name committee would be to agree on when a
new *component* should be added and agree on the list of acceptable
values for each component. This would force everyone to think about
whether a new variable can be distinguished from others simply by adding
a new value to one of the components, or if an entirely new category
(i.e., component) is needed.

As a first step, it might be useful to consider the following components
(many of which appear in the "guidelines" document referred to above):

1. quantity: the fundamental quantity (e.g., temperature, pressure,
geopotential_height, precipitation_rate, concentration)

2. medium: where the quantity is "measured" (e.g., sea, atmosphere (or
air?), sea_ice, troposphere, lake, stream, land_ice, cloud,
ocean_surface_mixed_layer)

3. constituent: e.g., hydrometeor, ice, snow, rain, CO2, SO4, ozone,
aerosol, sulfate_aerosol, soot.

4. specie_color: for when we want to distinguish constituents by what
produced them (e.g. the sulfate aerosol in the atmosphere that comes
from different sources: anthropogenic, natural, fossil_fuel, etc.)

5. surface: a quasi-horizontal surface that cannot easily be described
by a vertical coordinate (e.g., sea_floor, top_of_atmosphere,
tropopause, adiabatic_condensation_level, surface)

6. process: identifying what process is responsible for the quantity
(e.g., for temperature tendencies: radiation, convection,
latent_heating, etc.) [I wonder if specie_color might be combined with
"process" into a single category?]

7. vector_component: indicating the component of a vector and its
positive direction (e.g., eastward, northward, upward)

8. radiative_flux_component: indicating whether only the downwelling
(incoming) or upwelling (outgoing) or net radiative flux is stored

9. tensor_component: ????

10. assumption: indicating that the quantity has been calculated under
some assumption (e.g., assuming_clear_sky, assuming_no_snow)

11. threshold: indicating that the quantity has been calculated only
when certain conditions are satisfied. The form of this attribute would
have to be worked out, but presumably it would identify both the
condition(s) and the values (or variables containing the values) of the
thresholds.

The remaining 6 categories might not be considered part of the
"standard_name" information, but might better be defined as new variable
attributes:

12. formula (or transformation?): indicating that in some sense the
quantity is a "compound" quantity derivable from more fundamental
quantities. surface_net_downward_radiative_flux would have a
formula="sw + lw", and the data writer would also store in the file a
dummy variable for each term (i.e., each would be either a scalar or an
array with possibly only one element, which would be set to
missing_value), and the attributes associated with these two variables
would define the quantity stored (e.g., in this example, "sw" would have
a standard name of surface_net_downward_shortwave_radiative_flux, and
similarly for "lw").
As another example, a temporal correlation of quantity "a" and quantity
"b" could be indicated by formula="correlation(a,b)". As a third
example, an "anomaly" could be represented as the difference between two
variables, and the attributes associated with the variable representing
the "base" state could explicitly indicate how it was calculated (e.g.,
for a climatology, the climatological period). For the formula
attribute, we might consider adopting the syntax for the formula from
something like MATLAB, I guess. Note that the formula attribute makes
it possible to express many different quantities without agreeing
explicitly on their standard names (just the standard_names of their
formula terms). Note also that it is possible that the threshold
information (#11 above) might be represented instead by an appropriate
formula.

13. measurement_method: indicating what type of sensor was used to
measure the quantity (e.g., for sea surface temperature observations,
bucket or ship_intake_temperature) or, for models where there are
multiple methods of defining cloud radiative forcing, specifying which
of two well-known procedures, known as "method 1" and "method 2", is
used.

14. area_type: indicating that instead of applying to the whole grid
cell (which would be the default), the quantity applies only to a
certain portion, as in the current "where_type" construction (e.g.,
where_land would be indicated by "land", and where_sea_ice would be
indicated by "sea_ice")

15. region: specifying the geographic region from which the quantity is
extracted (e.g., asia, africa, australia)

16. experiment: containing the name of the experiment that produced the
output.

17. source: containing some indication of the source of the data,
whether it be from observations (e.g., ERBE) or from a model (e.g.,
CCSM3). A variable containing output from a multi-model ensemble
(regridded to a common grid) could be stored with "source" as a
dimension and the names of the models recorded as coordinate labels.
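
The formula idea in item 12 can be sketched as follows. This is only an
illustration under stated assumptions: the file is modelled as a plain
dictionary rather than a real netCDF file, the formula is parsed with
Python expression syntax (not MATLAB), the variable name "rad" and the
helper names are hypothetical, and the longwave standard name is
inferred from the "similarly for lw" remark.

```python
import ast


def formula_terms(formula):
    """Return the variable names used in a formula, excluding function names."""
    tree = ast.parse(formula, mode="eval")
    # Names that appear as the callee of a call, e.g. "correlation".
    function_names = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(names - function_names)


# Hypothetical stand-in for a file: variable name -> attributes.
variables = {
    "rad": {"formula": "sw + lw"},
    "sw": {"standard_name": "surface_net_downward_shortwave_radiative_flux"},
    "lw": {"standard_name": "surface_net_downward_longwave_radiative_flux"},
}


def resolve_formula(variables, name):
    """Map each term of a variable's formula to the standard name of its dummy variable."""
    resolved = {}
    for term in formula_terms(variables[name]["formula"]):
        if term not in variables:
            raise KeyError(f"formula term {term!r} has no dummy variable")
        resolved[term] = variables[term]["standard_name"]
    return resolved


print(formula_terms("correlation(a,b)"))
print(resolve_formula(variables, "rad"))
```

A reader of the file needs only the standard names of the terms to
interpret "rad"; no standard name for the compound quantity is required.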

Any of these components of the standard_name might be omitted if either
unnecessary, *or* if they themselves appeared as standard_names attached
to one of the coordinate variables of the quantity. Thus, for example,
if the "process" were left unspecified for a variable containing
"tendency_of_air_temperature", but one of the coordinates of that
variable had the standard_name "process", then one would find stored in
this variable all the different processes identified by the coordinate
labels for that coordinate. This allows us to store many different
tendencies in a single variable, but allows us to identify each of them
through the "process" dimension of that variable.
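
A minimal sketch of that lookup rule, assuming a variable and its
coordinates are modelled as plain dictionaries (the layout and the
function name are hypothetical, and the process labels are taken from
the examples earlier in this message):

```python
# Hypothetical in-memory model of one variable and its coordinates.
tendency = {
    "attributes": {"standard_name": "tendency_of_air_temperature"},
    "dimensions": ["process", "time"],
}
coordinates = {
    "process": {
        "standard_name": "process",
        "labels": ["advection", "deep_convection", "short_wave_radiation"],
    },
    "time": {"standard_name": "time", "labels": []},
}


def component_values(variable, coordinates, component):
    """Return the values a component takes for this variable.

    An attribute on the variable wins; otherwise look for a coordinate
    whose standard_name equals the component; otherwise it is omitted.
    """
    attrs = variable["attributes"]
    if component in attrs:
        return [attrs[component]]
    for dim in variable["dimensions"]:
        coord = coordinates.get(dim)
        if coord is not None and coord["standard_name"] == component:
            return list(coord["labels"])
    return []


print(component_values(tendency, coordinates, "process"))
print(component_values(tendency, coordinates, "medium"))
```

The first call returns the three process labels from the coordinate;
the second returns an empty list because "medium" is neither an
attribute nor a coordinate of this variable.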

Turning now to the procedure for constructing new standard names:

When constructing a new name, one would fill in the appropriate
information for each of the components listed above (omitting those that
are not needed). If information seeming to lie outside the categories
already listed in the table were necessary to fully define the quantity,
then the requester would propose that a new category be adopted. Within
each category there would be a limited set of accepted designations
(i.e. values), and again if none of the current acceptable values was
appropriate, the requester would suggest a new one.

The standard_name discussion would focus on 1) whether a new category
was indeed needed and what the new category should be called, 2) whether
a new value under a given category was needed and what that value should
be, and 3) in many cases simply whether the user had correctly filled
out the table. [Alison could make a decision about 3) on her own in
most cases, I suspect.]
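
The three questions above could be sorted mechanically before any
discussion starts. The sketch below assumes a small, hypothetical table
of accepted values (a subset of the examples listed earlier) and an
invented request; the function name and result layout are illustrative
only.

```python
# Hypothetical table of accepted values per category.
ACCEPTED = {
    "quantity": {"temperature", "pressure", "geopotential_height"},
    "medium": {"sea", "atmosphere", "sea_ice", "lake"},
    "surface": {"sea_floor", "top_of_atmosphere", "tropopause"},
}


def review_request(request, accepted=ACCEPTED):
    """Split a request into the points the discussion would need to settle."""
    # 1) categories that do not exist yet
    new_categories = sorted(key for key in request if key not in accepted)
    # 2) values not yet accepted within an existing category
    new_values = sorted(
        (key, value)
        for key, value in request.items()
        if key in accepted and value not in accepted[key]
    )
    # 3) nothing new: only a check that the table was filled in correctly
    routine = not new_categories and not new_values
    return {
        "new_categories": new_categories,
        "new_values": new_values,
        "routine": routine,
    }


# Invented request: "habitat" and "estuary" are placeholders, not proposals.
print(review_request({"quantity": "temperature", "medium": "estuary", "habitat": "reef"}))
print(review_request({"quantity": "temperature", "medium": "sea"}))
```

A request flagged as routine is the kind that could be decided without
a list discussion.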

The second step would be to form a standard_name from the information in
the table, but this should be nearly automatic, following some simple
construction rules.

If someone outside the current focus of CF wanted to use CF to store
data (say someone from the biological community), they might begin by
augmenting the components of the standard name with additional ones
needed by their community. They would be required to adopt existing
"categories" when applicable to their discipline.

Sorry for the length of this and sorry if it duplicates material in
related discussions, but I think this standard_name business seems a bit
out of control and perhaps there are alternatives out there that might
make it more straightforward to propose and adopt new names.

Anyway, I hope someone out there cares enough to comment on or improve
on or suggest alternatives to this proposal. Whatever we do, we should
be mindful that we must be able to determine when existing standard
names are equivalent to any future representation of the standard name
information.

Best regards,
Karl
Received on Mon Oct 27 2008 - 19:05:21 GMT