[CF-metadata] Getting back to ensembles from Steve Hankin on 2006-12-18 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Steve Hankin <Steven.C.Hankin>
Date: Mon, 18 Dec 2006 15:40:21 -0800

Jonathan Gregory wrote:
> Dear all
>
> I think Steve is right that when making a proposal for change we have to start
> with a requirement. This is of course normally done in some informal way as a
> motivation, but sometimes being more explicit would help. I would suggest this
> as a requirement:
>
> A data variable may have a dimension which serves as an index over the members
> of an ensemble, in which the ensemble members are derived from different
> models, integrations, institutions supplying the data, etc., and the data from
> each ensemble member is a function of the same spatiotemporal coordinates
> and/or other physical independent variables. A means is needed to provide
> metadata identifying the ensemble members. This metadata serves as the
> coordinate data for the index dimension over ensemble members.
>
Hi Jonathan et al.,

IMHO the above paragraph cannot really be said to be a requirement. :-\
It begins with a proposed implementation ("may have a dimension which
serves as an index") and follows up with some of the functionality that
can (and cannot) easily be achieved through that implementation. By
approaching our requirements in this way we risk failing to
systematically expose the functionality that is driving the discussion.
And we make it difficult to assess the trade-offs of solutions that may
be proposed.

Stripping the functionality away from the implementation it seems to me
that our Requirement runs something along these lines:

    Requirement: CF must support the following concepts and
    operations with respect to ensembles of model outputs:

       1. Needs to create an association by which a number, N, of model
          outputs may be easily and unambiguously recognized as an
          ensemble (a special type of collection).
       2. Needs to provide a means by which each individual ensemble
          element can be readily identified in terms of metadata
          properties (often more than one property) -- institution,
          model code, etc.
       3. Needs *only* to support ensemble members that share identical
          spatiotemporal coordinates. (Note: this means that native
          multi-model ensembles, in which ensemble members may utilize
          different grids, need not be represented. Such ensembles must
          always be regridded to a common spatiotemporal coordinate
          system in order to be regarded as a CF ensemble.)
       4. Needs to provide support to clients to easily implement
          the following operations:
             1. accessing data from any individual member of the ensemble
             2. computing the average across all ensemble members
             3. computing anomalies between individual members and the
                ensemble mean
             4. computing the variance across all ensemble members
             5. [computing probability density functions? just a start
                so far -- other operations should be named, too ....]
       5. Needs to support the ensemble operations just described over
          arbitrary subsets of the collection of N ensemble members.
          For example, one may want to regard the models that assimilate
          sea surface height as a sub-ensemble ("A"), and make
          comparisons with those that do not (sub-ensemble "B").
       6. (??) Needs to support participation in the ensemble by
          distributed institutions without central management of the
          data. And potentially support dynamic inclusion of
          contributed ensemble members. (For example, all participating
          institutions may not have their model outputs ready at the
          same time. Ensemble analysis should not have to wait until
          the slowest member has provided its output.)
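The operations named in items 4 and 5 can be sketched in a few lines. This is only an illustration under stated assumptions, not a proposed CF mechanism: the member names, the toy values, and the helper functions (ensemble_mean, ensemble_variance, anomalies, sub_ensemble) are all hypothetical, and every member is assumed to sit on the same coordinates, as item 3 requires.

```python
from statistics import fmean, pvariance

# Hypothetical toy ensemble: three members sampled on the same four points.
ensemble = {
    "model_a": [1.0, 2.0, 3.0, 4.0],
    "model_b": [2.0, 3.0, 4.0, 5.0],
    "model_c": [3.0, 4.0, 5.0, 6.0],
}

def ensemble_mean(members):
    """Point-by-point mean across the given members (item 4.2)."""
    return [fmean(values) for values in zip(*members.values())]

def ensemble_variance(members):
    """Point-by-point population variance across the members (item 4.4)."""
    return [pvariance(values) for values in zip(*members.values())]

def anomalies(members):
    """Each member minus the ensemble mean (item 4.3)."""
    mean = ensemble_mean(members)
    return {
        name: [v - m for v, m in zip(values, mean)]
        for name, values in members.items()
    }

def sub_ensemble(members, names):
    """Restrict the ensemble to an arbitrary subset of members (item 5)."""
    return {name: members[name] for name in names}

# Item 4.1: access to an individual member is a plain lookup here.
one_member = ensemble["model_b"]

# Item 5: the same operations applied to a sub-ensemble.
mean_ab = ensemble_mean(sub_ensemble(ensemble, ["model_a", "model_b"]))
```

Note that the sub-ensemble is chosen by name rather than by position, so nothing in the sketch depends on the members being ordered along an axis.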

These 6 elements are just my off-the-cuff attempt to capture the
requirements. Hack and chop on 'em and by all means argue about 'em,
please. Writing them down has already illustrated areas where the
proposed implementation is perhaps less flexible than we might hope for:

   1. We are excluding native, multi-coordinate system ensembles (#3) in
      the style of the AR5 data management plans. Admittedly,
      non-uniformity of the spatiotemporal coordinate system makes the
      ensemble analysis problem much, *much* harder. (Perhaps a
      separate problem altogether ... though it deserves some discussion in
      this context.) It requires us to think in terms of operations
      that are "owned" by datasets, e.g. we ask the data set for a time
      series at a point; or for the zonal average of temperature over
      some range; or for the integrated Pacific ocean heat content.
      Then we compare the results, independent of the native coordinate
      system.
   2. Operations over dynamically defined sub-ensembles (#5) do not map
      well onto the notion of an index-ordered ensemble axis in the
      style of a netCDF dimension. The advantages of netCDF derive from
      its ability to access a contiguous range of elements, m:n, in a
      single operation.
          * This limitation can be overcome if we provide a means to
            dynamically define new ensembles (through external
            metadata?). (Note that this is an area where *archive
            requirements and analysis requirements may diverge*.)
          * If as just proposed we are going to depend upon external
            metadata in order to achieve the required level of
            flexibility, we should explore this metadata machinery
            before finalizing the CF proposal. Use NcML?
   3. The notion of a shopping basket of metadata to identify each
      ensemble member (#2) maps awkwardly (though not unworkably) onto a
      netCDF dimension. (This has been the topic of email discussions
      and is discussed as an unsolved challenge in the final sentence of
      Jonathan's last paragraph above.)
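To make points 2 and 3 concrete, here is a small sketch of selecting sub-ensembles by metadata. Everything in it is an assumption made for illustration: the metadata keys (including the assimilates_ssh flag), the institution and model names, and the helpers select_indices and is_contiguous. It shows that a metadata-defined sub-ensemble generally corresponds to a scattered set of indices along the ensemble axis rather than a single m:n range.

```python
# Hypothetical per-member metadata, listed in ensemble-axis order.
members = [
    {"institution": "inst_1", "source": "model_x", "assimilates_ssh": True},
    {"institution": "inst_2", "source": "model_y", "assimilates_ssh": False},
    {"institution": "inst_3", "source": "model_z", "assimilates_ssh": True},
    {"institution": "inst_4", "source": "model_w", "assimilates_ssh": False},
]

def select_indices(members, predicate):
    """Return ensemble-axis indices whose metadata satisfies the predicate."""
    return [i for i, meta in enumerate(members) if predicate(meta)]

def is_contiguous(indices):
    """True if the indices form a single m:n range with no gaps."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

# Sub-ensemble "A": members that assimilate sea surface height.
sub_a = select_indices(members, lambda m: m["assimilates_ssh"])
# Sub-ensemble "B": members that do not.
sub_b = select_indices(members, lambda m: not m["assimilates_ssh"])

# Neither selection is a contiguous range, so neither can be read
# with one m:n slice along the ensemble axis.
a_is_range = is_contiguous(sub_a)
b_is_range = is_contiguous(sub_b)
```

Reordering the members could make one particular split contiguous, but no single ordering can do that for every metadata property at once.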

    - Steve

========================
> Here's my summary of how we have proposed to meet this requirement:
>
> We propose to allow auxiliary coordinate variables with the ensemble index
> dimension to contain metadata identifying the institution, source,
> experiment_id and realization of the data. institution and source are strings,
> with the same meaning as the attributes of those names (CF 2.6.2).
> experiment_id is a string describing the design or intent of the
> integration. realization is either numeric or a string, and it distinguishes
> integrations of a model which in other respects are identically specified and
> constitute members of a statistical sample. Since these are auxiliary
> coordinate variables, their values do not have to be unique or ordered.
>
> (Ensemble members might also be distinguished by different analysis or
> verification times, in the forecasting sense of these terms. These can be
> supplied as auxiliary coordinate variables with standard_names of
> forecast_reference_time and time, and could be dimensioned with the ensemble
> index dimension. This is not a new convention.)
>
> For consistency and completeness, I would propose also to allow experiment_id
> and realization as attributes of data variables and as global attributes (i.e.
> include them in CF 2.6.2), with the same meanings as they have for auxiliary
> coordinate variables.
>
> The string-valued metadata should be intelligible to humans, since the file
> should be self-describing. If the purposes for which the data are intended
> require the values of the string-valued metadata to be chosen from a standard
> set, a table specifying the possible set should be made publicly available,
> and a string-valued "vocabulary" attribute of the auxiliary coordinate
> variable should supply its URL.
>
> Do we want to mandate the form of this table?
>
> I think/hope the above is agreeable to all in general, though there may be
> some details in dispute. The main outstanding issue, as Alison and others have
> said, is whether institution, source, and experiment_id should be the
> standard_names of their auxiliary coordinate variables, or recorded in a new
> attribute. I'd say the case that Steve has raised concerning names
> for structural aspects of the grid is not part of this requirement for
> identifying ensemble members; it's a different question.
>
> I think they should be standard_names, because
>
> * I can't think of a good distinction to be drawn between this kind of
> information and other coordinate variables, so I suspect we would end up
> spending time trying to draw an illusory distinction which can't be made
> consistently. Referring to Steve's idea, it doesn't seem to me that the
> requirement indicates they should be treated differently from realization,
> forecast_reference_time and time, which are already allowed as standard_names.
>
> * if we use a different attribute, it means that software will have to look at
> both alternatives, although it's usually going to treat the contents in the
> same way.
>
> Here are some distinctions which *don't* seem to work well:
>
> * We could say string-valued things are not given standard names. But we have
> region as a standard name with string values, identifying geographical areas
> such as atlantic_ocean, for labelling the ocean overturning streamfunction.
> This function is very similar to numerical geographical coordinates. It would
> be strange to use standard name for latitude and longitude to describe a
> rectangular region, but to say that the label for a non-rectangular region is
> not a standard name. Similarly, land cover types have standard names. They are
> labelling parts of gridboxes which aren't geographically delineated, so again
> are acting as a sort of spatial coordinate.
>
> * Things which you aren't going to "operate" on are not standard names i.e.
> they're just labels. However, possibly the commonest thing to do with the
> numerical spatiotemporal coordinates is to subset them, and you may well do
> that with the ensemble metadata too. It is also possible you might combine
> them or process them in other ways.
>
> * Quantities which could never be data variables should not be given standard
> names. It does seem quite unlikely that source, institution and experiment_id
> might form the contents of a data variable. (In fact it would be hard to do
> until netCDF-4, since they are string-valued, but it could be done using
> flag_values and flag_meanings.) But I don't think it's impossible. I can
> imagine constructing a lat-lon field which indicates which model, at each
> point, had the most realistic value of a quantity, for example.
>
> But perhaps someone can think of a better distinction which would naturally
> seem part of the requirement.
>
> Best wishes
>
> Jonathan
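As an aside on the flag_values and flag_meanings remark above, the sketch below shows one way string labels such as model names could be carried as integer codes. It is an illustration only: the helper names encode_flags and decode_flags and the model labels are hypothetical, and it assumes the labels contain no spaces, since flag_meanings is a blank-separated string.

```python
def encode_flags(labels):
    """Map string labels to integer codes plus CF-style flag attributes."""
    meanings = sorted(set(labels))
    code = {name: i for i, name in enumerate(meanings)}
    values = [code[label] for label in labels]
    flag_values = list(range(len(meanings)))
    flag_meanings = " ".join(meanings)  # blank-separated list of labels
    return values, flag_values, flag_meanings

def decode_flags(values, flag_values, flag_meanings):
    """Recover the string labels from the integer codes."""
    lookup = dict(zip(flag_values, flag_meanings.split()))
    return [lookup[v] for v in values]

# Hypothetical field: which model was "most realistic" at each of four points.
best_model = ["model_b", "model_a", "model_b", "model_c"]
values, flag_values, flag_meanings = encode_flags(best_model)
round_trip = decode_flags(values, flag_values, flag_meanings)
```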
Received on Mon Dec 18 2006 - 16:40:21 GMT