
[CF-metadata] a different (but perhaps unoriginal) approach to standard name construction

From: Robert Muetzelfeldt <r.muetzelfeldt>
Date: Tue, 28 Oct 2008 23:38:45 +0000

Hello,

Karl's email has prompted me to mention some work I've done over the
last week or so in representing variable names as compound terms. This
is highly experimental, and not at all definitive, but merely meant for
exploring possibilities.

As Karl also suggests, it is based on the representation of the
individual terms that go to make up a variable name, grouped into
separate "categories" or "components". A grammar is used to specify
how the terms can be combined together. This approach is already
standard in other areas. For example, MathML marks up the individual
elements of a mathematical expression, and UnitsML is an XML-based
notation for scientific units. Granted, the actual grammar needed for
environmental variables is not a given as it is in these other areas,
but it still seems worthwhile trying to see how far we can push the
approach.

For this exercise, I've taken the CF-metadata Guidelines and expressed
them as grammar rules. You can read a brief note on this work at
http://envarml.pbwiki.com/Prototype+grammar+for+CF-metadata+%22standard+names%22.
(This is a wiki, with public read-only access.)

I should mention that this note is a rough draft of an internal working
document in a project I'm involved in. Plasmo (the Plant Systems
Biology Modelling project) aims to develop a web-based portal for plant
growth models, including the terrestrial biosphere models used in Earth
System modelling. Regardless of what anyone else does, we will
definitely be pursuing this approach, since we need a
standard naming scheme for the variables in the various models, and we
are committed to a compound naming approach.

Best wishes,
Robert

Karl Taylor wrote:
> Dear all,
>
> It seems to me that the issue of possibly wanting to store several
> different chemical species in a single array (with a coordinate
> variable identifying the species) is only one limitation of the
> current constraints placed by standard names. We've also run up
> against the following difficulties:
>
> 1) Currently it is impossible to identify, with a single standard
> name, closely related variables that one might want to store in a
> single array. For example, such quantities as:
>
> * temperatures measured with several different instruments.
>
> * precipitation separated into categories of snow and ice
>
> * concentrations of a molecule (e.g., CO2) separated into
> components defined by its source (fossil fuel combustion, volcanic
> emissions, respiration, decay, etc.) [some have talked about
> distinguishing these contributions by an artificial "color" label]
>
> * the contributions of various "processes" to a particular
> quantity (e.g., temperature tendency due to advection,
> deep_convection, short_wave_radiation, etc.)
>
> * a variable, as simulated by several different models
>
> * and the like
>
> 2) thresholds and similar things
>
> 3) various combinations of variables or operations/transformations,
> such as:
>
> * anomalies and more generally differences
>
> * products (e.g., transports, correlations, etc.)
>
> * and the like
>
> I think it is perhaps time to consider devising an alternative way of
> providing the information that is currently in the standard name
> (instead of forcing all the information into a single attribute). As
> you will see, this will eliminate the above limitations, but perhaps
> more importantly it provides a way of more quickly converging on new
> standard names.
>
> The idea, which I'm sure must have been discussed at length already
> (but I've forgotten by now or I've missed it entirely), is to parse
> the quantity identification information into separate elements (or
> "categories" or "components"). We already do this to a certain extent
> by providing some information in the cell_methods attribute. I would
> build on the bits of independent information already listed in the
> Guidelines for Construction of CF Standard Names
> (http://cf-pcmdi.llnl.gov/documents/cf-standard-names/guidelines).
> We might, for example, parse air_temperature and sea_water_temperature
> into two independent attributes: medium="sea_water" or "air", and
> quantity="temperature".
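As a rough illustration of the split described above, the following Python sketch separates a name into "medium" and "quantity". The vocabularies MEDIA and QUANTITIES and the function split_name are hypothetical and hold only the values mentioned in this message. They are not an agreed CF list.

```python
# Hypothetical vocabularies, limited to the values named in the message above.
MEDIA = ("sea_water", "air")
QUANTITIES = ("temperature",)


def split_name(standard_name):
    """Split a name of the form <medium>_<quantity> into its two components.

    Returns None when the name does not fit that simple pattern.
    """
    for medium in MEDIA:
        prefix = medium + "_"
        if standard_name.startswith(prefix):
            quantity = standard_name[len(prefix):]
            if quantity in QUANTITIES:
                return {"medium": medium, "quantity": quantity}
    return None


print(split_name("air_temperature"))
print(split_name("sea_water_temperature"))
```

A name that does not match any known medium and quantity returns None, which is the point at which a new component value would need to be discussed.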
>
> These independent bits of information could be automatically assembled
> together to create the "standard name". The current standard names
> would in some cases be identical to the names created from the
> elements, and in other cases we could establish aliases. This would
> make it obvious in many cases how to construct new standard names, and
> in any case would impose a structure on the standard names.
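The automatic assembly and the alias idea in the paragraph above might look something like this minimal Python sketch. The component order, the alias entry, and the function assemble_name are assumptions made for illustration. The real ordering rules and alias table would have to be agreed by the list.

```python
# Assumed order in which components are joined; only two components are shown.
COMPONENT_ORDER = ("medium", "quantity")

# Hypothetical alias table: assembled name -> existing standard name.
# The single entry is an invented example, not an agreed alias.
ALIASES = {
    "sea_temperature": "sea_water_temperature",
}


def assemble_name(components):
    """Join the supplied component values with underscores, then apply aliases."""
    parts = [components[key] for key in COMPONENT_ORDER if components.get(key)]
    name = "_".join(parts)
    return ALIASES.get(name, name)


print(assemble_name({"medium": "air", "quantity": "temperature"}))
print(assemble_name({"medium": "sea", "quantity": "temperature"}))
```

Components that are omitted are simply skipped, so the same rule covers names with fewer elements.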
>
> The main job of the standard name committee would be to agree on when
> a new *component* should be added and agree on the list of acceptable
> values for each component. This would force everyone to think about
> whether a new variable can be distinguished from others simply by
> adding a new value to one of the components, or if an entirely new
> category (i.e., component) is needed.
>
> As a first step, it might be useful to consider the following
> components (many of which appear in the "guidelines" document referred
> to above):
>
> 1. quantity: the fundamental quantity (e.g., temperature, pressure,
> geopotential_height, precipitation_rate, concentration)
>
> 2. medium: where the quantity is "measured" (e.g., sea, atmosphere
> (or air?), sea_ice, troposphere, lake, stream, land_ice, cloud,
> ocean_surface_mixed_layer)
>
> 3. constituent: e.g., hydrometeor, ice, snow, rain, CO2, SO4, ozone,
> aerosol, sulfate_aerosol, soot.
>
> 4. specie_color: for when we want to distinguish constituents by what
> produced them (e.g. the sulfate aerosol in the atmosphere that comes
> from different sources: anthropogenic, natural, fossil_fuel, etc.)
>
> 5. surface: a quasi-horizontal surface that cannot easily be described
> by a vertical coordinate (e.g., sea_floor, top_of_atmosphere,
> tropopause, adiabatic_condensation_level, surface)
>
> 6. process: identifying what process is responsible for the quantity
> (e.g., for temperature tendencies: radiation, convection,
> latent_heating, etc.) [I wonder if specie_color might be combined with
> "process" into a single category?]
>
> 7. vector_component: indicating the component of a vector and its
> positive direction (e.g., eastward, northward, upward)
>
> 8. radiative_flux_component: indicating whether only the downwelling
> (incoming) or upwelling (outgoing) or net radiative flux is stored
>
> 9. tensor_component: ????
>
> 10. assumption: indicating that the quantity has been calculated under
> some assumption (e.g., assuming_clear_sky, assuming_no_snow)
>
> 11. threshold: indicating that the quantity has been calculated only
> when certain conditions are satisfied. The form of this attribute
> would have to be worked out, but presumably it would identify both the
> condition(s) and the values (or variables containing the values) of
> the thresholds.
>
> The remaining 6 categories might not be considered part of the
> "standard_name" information, but might better be defined as new
> variable attributes:
>
> 12. formula (or transformation?): indicating that in some sense the
> quantity is a "compound" quantity derivable from more fundamental
> quantities. surface_net_downward_radiative_flux would have a
> formula="sw + lw", and the data writer would also store in the file
> two dummy variables (i.e., each would be either a scalar or an array
> with possibly only one element, which would be set to missing_value),
> and the attributes associated with these two variables would define the
> quantity stored (e.g., in this example, "sw" would have a standard
> name of surface_net_downward_shortwave_radiative_flux, and similarly
> for "lw"). As another example, a temporal correlation of quantity "a"
> and quantity "b" could be indicated by formula="correlation(a,b)".
> As a third example, an "anomaly" could be represented as the
> difference between two variables, and the attributes associated with
> the variable representing the "base" state could explicitly indicate
> how it was calculated (e.g., for a climatology, the climatological
> period). For the formula attribute, we might consider adopting the
> syntax for the formula from something like matlab, I guess. Note that
> the formula attribute makes it possible to express many different
> quantities without agreeing explicitly on their standard names (just
> the standard_names of their formula terms). Note also that it is
> possible that the threshold information (#11 above) might be
> represented instead by an appropriate formula.
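To make the formula idea concrete, here is a minimal Python sketch that extracts the terms of a formula and looks up the standard name of each. It uses Python expression syntax as a stand-in for whatever formula syntax might eventually be chosen. The dictionary layout and the functions formula_terms and describe are hypothetical, and the "lw" name is filled in by analogy with the "sw" name given above.

```python
import ast


def formula_terms(formula):
    """Return the variable names used in a formula, excluding function names."""
    tree = ast.parse(formula, mode="eval")
    functions = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(names - functions)


# Stand-in for the dummy variables stored in the file (attributes only).
variables = {
    "sw": {"standard_name": "surface_net_downward_shortwave_radiative_flux"},
    "lw": {"standard_name": "surface_net_downward_longwave_radiative_flux"},
}


def describe(formula, variables):
    """Map each formula term to the standard name of its dummy variable."""
    return {term: variables[term]["standard_name"] for term in formula_terms(formula)}


print(formula_terms("sw + lw"))            # ['lw', 'sw']
print(formula_terms("correlation(a, b)"))  # ['a', 'b']
print(describe("sw + lw", variables))
```

Only the terms need agreed standard names. The compound quantity itself is described by the formula string.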
>
> 13. measurement_method: indicating what type of sensor was used to
> measure the quantity (e.g., for sea surface temperature observations,
> bucket or ship_intake_temperature, and for models where there are
> multiple methods of defining cloud radiative forcing, specifying which
> of the two well-known procedures, "method 1" or "method 2", is used).
>
> 14. area_type: indicating that instead of applying to the whole grid
> cell (which would be the default), the quantity applies only to a
> certain portion, as in the current "where_type" construction (e.g.,
> where_land would be indicated by "land", and where_sea_ice would be
> indicated by "sea_ice")
>
> 15. region: specifying the geographic region from which the quantity
> is extracted (e.g., asia, africa, australia)
>
> 16. experiment: containing the name of the experiment that produced
> the output.
>
> 17. source: containing some indication of the source of the data,
> whether it be from observations (e.g., ERBE) or from a model (e.g.,
> CCSM3). A variable containing output from a multi-model ensemble
> (regridded to a common grid) could be stored with "source" as a
> dimension and the names of the models recorded as coordinate labels.
>
>
> Any of these components of the standard_name might be omitted if
> either unnecessary, *or* if they themselves appeared as standard_names
> attached to one of the coordinate variables of the quantity. Thus,
> for example, if the "process" were left unspecified for a variable
> containing "tendency_of_air_temperature", but one of the coordinates
> of that variable had the standard_name "process", then one would find
> stored in this variable all the different processes identified by the
> coordinate labels for that coordinate. This allows us to store many
> different tendencies in a single variable, but allows us to identify
> each of them through the "process" dimension of that variable.
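The "process" coordinate described above can be pictured with a small Python sketch. The variable is represented as a plain dictionary rather than a real netCDF variable, the numbers are invented placeholders, and select_process is a hypothetical helper. The process labels are the ones mentioned earlier in this message.

```python
# A stand-in for one variable holding several tendencies along a "process" axis.
# The data values are invented placeholders, not real model output.
tendency = {
    "standard_name": "tendency_of_air_temperature",
    "dims": ("process", "time"),
    "coords": {
        "process": ["advection", "deep_convection", "short_wave_radiation"],
    },
    "data": [
        [0.1, 0.2],   # advection
        [-0.3, 0.0],  # deep_convection
        [0.5, 0.4],   # short_wave_radiation
    ],
}


def select_process(variable, label):
    """Return the slice of the data belonging to one process label."""
    index = variable["coords"]["process"].index(label)
    return variable["data"][index]


print(select_process(tendency, "deep_convection"))  # [-0.3, 0.0]
```

The variable itself carries no "process" component. Each tendency is identified only through the coordinate label.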
>
> Turning now to the procedure for constructing new standard names:
>
> When constructing a new name, one would fill in the appropriate
> information for each of the components listed above (omitting those
> that are not needed). If information seeming to lie outside the
> categories already listed in the table were necessary to fully define
> the quantity, then the requester would propose that a new category be
> adopted. Within each category there would be a limited set of
> accepted designations (i.e. values), and again if none of the current
> acceptable values was appropriate, the requester would suggest a new one.
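The checking step in this procedure could be sketched in Python as below. The table ACCEPTED and the function review_request are hypothetical, and the accepted values shown are just a few of those listed earlier, not a complete or agreed vocabulary.

```python
# Hypothetical table of categories and their currently accepted values.
ACCEPTED = {
    "quantity": {"temperature", "pressure", "concentration"},
    "medium": {"air", "sea_water", "sea_ice"},
}


def review_request(components):
    """List what a request would require the committee to discuss."""
    findings = []
    for category, value in components.items():
        if category not in ACCEPTED:
            findings.append(f"new category proposed: {category}")
        elif value not in ACCEPTED[category]:
            findings.append(f"new value proposed for {category}: {value}")
    return findings


print(review_request({"medium": "air", "quantity": "temperature"}))
print(review_request({"medium": "lake", "quantity": "temperature"}))
```

An empty list means the request uses only existing categories and values, which is the case where a name could be formed without further discussion.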
>
> The standard_name discussion would focus on 1) whether a new category
> was indeed needed and what the new category should be called, 2)
> whether a new value under a given category was needed and what that
> value should be, and 3) in many cases simply whether the user had
> correctly filled out the table. [Alison could make a decision about
> 3) on her own in most cases, I suspect.]
>
> The second step would be to form a standard_name from the information
> in the table, but this should be nearly automatic, following some
> simple construction rules.
>
> If someone outside the current focus of CF wanted to use CF to store
> data (say someone from the biological community), they might begin by
> augmenting the components of the standard name with additional ones
> needed by their community. They would be required to adopt existing
> "categories" when applicable to their discipline.
>
> Sorry for the length of this and sorry if it duplicates material in
> related discussions, but I think this standard_name business seems a
> bit out of control and perhaps there are alternatives out there that
> might make it more straightforward to propose and adopt new names.
>
> Anyway, I hope someone out there cares enough to comment on or improve
> on or suggest alternatives to this proposal. Whatever we do, we
> should be mindful that we must be able to determine when existing
> standard names are equivalent to any future representation of the
> standard name information.
>
> Best regards,
> Karl
>


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
Received on Tue Oct 28 2008 - 17:38:45 GMT
