
[CF-metadata] Getting back to ensembles

From: Jonathan Gregory <j.m.gregory>
Date: Sat, 16 Dec 2006 10:02:19 +0000

Dear all

I think Steve is right that when making a proposal for change we have to start
with a requirement. This is of course normally done in some informal way as a
motivation, but sometimes being more explicit would help. I would suggest this
as a requirement:

A data variable may have a dimension which serves as an index over the members
of an ensemble, in which the ensemble members are derived from different
models, integrations, institutions supplying the data, etc., and the data from
each ensemble member is a function of the same spatiotemporal coordinates
and/or other physical independent variables. A means is needed to provide
metadata identifying the ensemble members. This metadata serves as the
coordinate data for the index dimension over ensemble members.

Here's my summary of how we have proposed to meet this requirement:

We propose to allow auxiliary coordinate variables with the ensemble index
dimension to contain metadata identifying the institution, source,
experiment_id and realization of the data. institution and source are strings,
with the same meaning as the attributes of those names (CF 2.6.2).
experiment_id is a string describing the design or intent of the
integration. realization is either numeric or a string, and it distinguishes
integrations of a model which in other respects are identically specified and
constitute members of a statistical sample. Since these are auxiliary
coordinate variables, their values do not have to be unique or ordered.
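As an illustration only (not part of the proposal), the layout described above could be sketched in plain Python as follows. The four auxiliary coordinate names come from the text; the dimension name "ensemble", the helper select_members, and all the metadata and data values are made up for the example.

```python
# Hypothetical sketch: an ensemble index dimension with auxiliary
# coordinate variables, modelled with plain Python containers.

# Each auxiliary coordinate has one entry per member of the (hypothetical)
# "ensemble" index dimension. Values may repeat and need not be ordered.
aux_coords = {
    "institution":   ["centre_a", "centre_a", "centre_b"],
    "source":        ["model_x", "model_x", "model_y"],
    "experiment_id": ["control", "control", "control"],
    "realization":   [1, 2, 1],
}

# One number per ensemble member stands in for a full spatiotemporal field.
data = [287.1, 287.4, 286.9]


def select_members(aux, **criteria):
    """Return the ensemble indices whose auxiliary coordinates match all criteria."""
    size = len(next(iter(aux.values())))
    return [
        i for i in range(size)
        if all(aux[name][i] == wanted for name, wanted in criteria.items())
    ]


# Pick out the members supplied by one institution.
members = select_members(aux_coords, institution="centre_a")
subset = [data[i] for i in members]
```

Because the values are not unique, a single auxiliary coordinate generally selects several members; combining criteria (for example source and realization together) narrows the selection to one.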

(Ensemble members might also be distinguished by different analysis or
verification times, in the forecasting sense of these terms. These can be
supplied as auxiliary coordinate variables with standard_names of
forecast_reference_time and time, and could be dimensioned with the ensemble
index dimension. This is not a new convention.)

For consistency and completeness, I would propose also to allow experiment_id
and realization as attributes of data variables and as global attributes (i.e.
include them in CF 2.6.2), with the same meanings as they have for auxiliary
coordinate variables.

The string-valued metadata should be intelligible to humans, since the file
should be self-describing. If the purposes for which the data are intended
require the values of the string-valued metadata to be chosen from a standard
set, a table specifying the possible set should be made publicly available,
and a string-valued "vocabulary" attribute of the auxiliary coordinate
variable should supply its URL.
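A minimal sketch of how software might use such a "vocabulary" attribute, assuming (since the form of the table is not settled) that the published table reduces to a flat set of permitted strings. The URL, the table contents and the function name invalid_values are hypothetical, and the table is held in memory rather than fetched.

```python
# Hypothetical auxiliary coordinate variable carrying a "vocabulary" attribute.
experiment_id = {
    "values": ["control", "doubled_co2", "contrl"],
    "attributes": {"vocabulary": "http://example.org/experiment_ids.txt"},
}

# Stand-in for the table published at that URL; its form is an assumption.
published_tables = {
    "http://example.org/experiment_ids.txt": {"control", "doubled_co2"},
}


def invalid_values(variable, tables):
    """Return the values not found in the table named by the vocabulary attribute."""
    url = variable["attributes"].get("vocabulary")
    if url is None:
        return []  # no vocabulary declared, so there is nothing to check against
    allowed = tables[url]
    return [v for v in variable["values"] if v not in allowed]


# Values that are not in the declared standard set (here, a misspelling).
bad = invalid_values(experiment_id, published_tables)
```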

Do we want to mandate the form of this table?

I think/hope the above is agreeable to all in general, though there may be
some details in dispute. The main outstanding issue, as Alison and others have
said, is whether institution, source, and experiment_id should be the
standard_names of their auxiliary coordinate variables, or recorded in a new
attribute. I'd say the case that Steve has raised concerning names
for structural aspects of the grid is not part of this requirement for
identifying ensemble members; it's a different question.

I think they should be standard_names, because

* I can't think of a good distinction to be drawn between this kind of
information and other coordinate variables, so I suspect we would end up
spending time trying to draw an illusory distinction which can't be made
consistently. Referring to Steve's idea, it doesn't seem to me that the
requirement indicates they should be treated differently from realization,
forecast_reference_time and time, which are already allowed as standard_names.

* if we use a different attribute, it means that software will have to look at
both alternatives, although it's usually going to treat the contents in the
same way.

Here are some distinctions which *don't* seem to work well:

* We could say string-valued things are not given standard names. But we have
region as a standard name with string values, identifying geographical areas
such as atlantic_ocean, for labelling the ocean overturning streamfunction.
This function is very similar to numerical geographical coordinates. It would
be strange to use standard names for latitude and longitude to describe a
rectangular region, but to say that the label for a non-rectangular region is
not a standard name. Similarly, land cover types have standard names. They are
labelling parts of gridboxes which aren't geographically delineated, so again
are acting as a sort of spatial coordinate.

* Things which you aren't going to "operate" on are not standard names i.e.
they're just labels. However, possibly the commonest thing to do with the
numerical spatiotemporal coordinates is to subset them, and you may well do
that with the ensemble metadata too. It is also possible that you might combine
them or process them in other ways.

* Quantities which could never be data variables should not be given standard
names. It does seem quite unlikely that source, institution and experiment_id
might form the contents of a data variable. (In fact it would be hard to do
until netCDF-4, since they are string-valued, but it could be done using
flag_values and flag_meanings.) But I don't think it's impossible. I can
imagine constructing a lat-lon field which indicates which model, at each
point, had the most realistic value of a quantity, for example.
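The flag_values and flag_meanings workaround mentioned in the last point could be sketched as below, purely as an illustration. The model names, the flattened six-point "field" and the helper functions encode_flags and decode_flags are made up; the only thing taken from CF is that flag_values lists the codes and flag_meanings is a blank-separated string of the corresponding labels.

```python
def encode_flags(labels):
    """Map string labels to integer codes plus flag_values/flag_meanings attributes.

    Labels must not contain blanks, because flag_meanings is blank-separated.
    """
    meanings = sorted(set(labels))  # one entry per distinct label
    code_of = {name: code for code, name in enumerate(meanings)}
    codes = [code_of[name] for name in labels]
    attributes = {
        "flag_values": list(range(len(meanings))),
        "flag_meanings": " ".join(meanings),
    }
    return codes, attributes


def decode_flags(codes, attributes):
    """Recover the string labels from the integer codes and the two attributes."""
    names = attributes["flag_meanings"].split()
    lookup = dict(zip(attributes["flag_values"], names))
    return [lookup[code] for code in codes]


# Hypothetical field (flattened lat-lon grid): the best model at each point.
best_model = ["model_x", "model_y", "model_y", "model_x", "model_z", "model_y"]
codes, attrs = encode_flags(best_model)
```

The integer codes can be stored in an ordinary numeric data variable, and the two attributes are enough to turn them back into the model names.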

But perhaps someone can think of a better distinction which would naturally
seem part of the requirement.

Best wishes

Jonathan
Received on Sat Dec 16 2006 - 03:02:19 GMT
