[CF-metadata] original_ensemble_size from Hedley, Mark on 2015-07-29 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Hedley, Mark <mark.hedley>
Date: Wed, 29 Jul 2015 08:10:55 +0000

Hello Karl

I agree with your analysis that it is unlikely that a data variable will ever vary with ensemble_size, so having ensemble_size as a scalar coordinate is slightly odd, in that we'd not expect it to be anything other than scalar.
It would meet my use case, but I can see the interest in other options.

It seems to me that the rest of your thoughts are centred around cell_methods.
The conventions describe Cell Methods as:
7.3. Cell Methods

To describe the characteristic of a field that is represented by cell values, we define the cell_methods attribute of the variable.

It earlier describes Cells:
7. Data Representative of Cells

When gridded data does not represent the point values of a field but instead represents some characteristic of the field within cells of finite "volume," a complete description of the variable should include metadata that describes the domain or extent of each cell, and the characteristic of the field that the cell values represent.

It is not clear to me that the case of defining an ensemble fits with this model as described. What is the 'characteristic of the field' within 'cells of finite volume' in this case?

Is there appetite to extend the scope of Cell Methods to define such characteristics? What are the risks in doing this? Is this proposal extending Cell Methods into realms which are already nearly covered by other CF concepts?

I don't have coherent answers to these queries, but I think they are worth a little thought before we delve too far into the details of encoding.

many thanks
mark

________________________________
From: CF-metadata [cf-metadata-bounces at cgd.ucar.edu] on behalf of Karl Taylor [taylor13 at llnl.gov]
Sent: 24 July 2015 01:42
To: cf-metadata at cgd.ucar.edu
Subject: Re: [CF-metadata] original_ensemble_size

Hi all,

This addresses the issue of how to associate an ensemble size with a variable. It also suggests an alternate way of proceeding that is more general and will allow us to record, for example, which models were included in a multi-model mean.

First to consider Jim's suggestion:
I agree with Jim that you might want to indicate which member (or members) of an ensemble were represented by the variable so you might want to include a coordinate variable of "realization". You could then also define an *attribute* of that coordinate as "ensemble_size" which would record the size, but currently that approach is not standardized (but of course is permitted) by our conventions.

Now Mark's suggestion:
Mark's alternative approach to make "ensemble_size" a coordinate variable (presumably in addition to possibly including "realization") would also relate it to the variable of interest, but this would be a bit unconventional since a variable would normally be considered to be a *function* of its (independent) coordinates. I don't think T(x,realization,ensemble_size) is a proper function, since T depends on x and realization, but should be independent of ensemble size in most cases.

Jonathan's suggestion:
I think Jonathan suggested including ensemble_size in a cell_methods attribute. For example

dimensions:
    lon=72
    lat=96
    e_size=5

variables:
    float precip(lon,lat)
        precip: cell_methods="realization: point (sample_size: e_size)"

where because "realization" is a standard name, it does not need to be explicitly declared with a "coordinates" attribute. Jonathan originally used "dimension" rather than "sample_size", but I prefer "sample_size". If this approach were followed, then CF would need to be modified so that "sample_size" (along with "interval") was designated to be one of the options for providing "standardized" extra information in the cell_methods attribute. Note that the variable "pointed to" by original_domain would not necessarily be a coordinate variable; it need not be monotonic and it could be a character variable (i.e., a list).
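
As a rough illustration of how a reader of such a file might pick the proposed extra information out of a cell_methods string, here is a minimal Python sketch. The helper name parse_cell_methods and the regular expression are assumptions made for this illustration, not part of CF or of any netCDF library, and the sketch only handles the simple "name: method (key: value)" form used in the examples in this message.

```python
import re

# Hypothetical pattern: "name: method", optionally followed by "(key: value)".
_ENTRY = re.compile(
    r"(?P<name>\w+):\s*(?P<method>\w+)"
    r"(?:\s*\((?P<key>\w+):\s*(?P<value>[^)]+)\))?"
)

def parse_cell_methods(text):
    """Return one dict per 'name: method (key: value)' entry in text."""
    entries = []
    for m in _ENTRY.finditer(text):
        entry = {"name": m.group("name"), "method": m.group("method")}
        if m.group("key"):
            # Extra information such as "interval" or the proposed "sample_size".
            entry[m.group("key")] = m.group("value").strip()
        entries.append(entry)
    return entries

parsed = parse_cell_methods("realization: point (sample_size: e_size)")
```

Here parsed holds a single entry whose "sample_size" key carries the name of the dimension, e_size, so the ensemble size itself would still have to be looked up from that dimension.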

Alternative "new approach"

An approach that is a slight variant on Jonathan's and would allow even more information to be provided concerning the ensemble is illustrated by the following example:

dimensions:
    lon=72
    lat=96
    members=5

variables:
    float precip(lon,lat)
        precip: cell_methods="member: point (sample_pool: members)"
    int member
        member: standard_name="realization"
    int members(members)
        members: standard_name="realization"

data:
    member = 3
    members = 1, 3, 5, 6, 10

This would tell you T was from the realization labeled 3 of a 5-member ensemble (with labels 1, 3, 5, 6, and 10). If this approach were adopted, then CF would need to be modified so that "sample_pool" (along with "interval") was designated to be one of the options for providing "standardized" extra information in the cell_methods attribute.

Under Jonathan's approach and also the "new approach", there wouldn't be a need to define the standard_name "ensemble_size" because that would be provided by the dimension size (5 in the above).
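
To make the point about the dimension size concrete, here is a small sketch that assumes the contents of the "new approach" example have already been read into plain Python values. The dict layout and the function name describe_member are hypothetical stand-ins for illustration, not the interface of any netCDF library.

```python
# Plain-Python stand-in for the "new approach" example; the layout is an
# assumption for illustration only.
dataset = {
    "member": 3,                  # scalar realization label for precip
    "members": [1, 3, 5, 6, 10],  # labels of every realization in the pool
}

def describe_member(ds, member_var="member", pool_var="members"):
    """Return (label, ensemble_size, position_in_pool) for the sampled member."""
    pool = ds[pool_var]
    label = ds[member_var]
    if label not in pool:
        raise ValueError(f"realization {label} is not in the sample pool {pool}")
    # The ensemble size is simply the length of the pool dimension.
    return label, len(pool), pool.index(label)

label, ensemble_size, position = describe_member(dataset)
```

No separate "ensemble_size" quantity is stored anywhere in this sketch; the size comes out of the length of the pool, which is the point made above.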

Note that the new approach could also be used to record a multi-model ensemble mean (I'm not absolutely sure this example complies with the current convention, but I think it would if the option to designate the "original_domain" were added to CF):

dimensions:
    lon=72
    lat=96
    models=5
    max_len = 10

variables:
    float precip(lon,lat)
        precip: cell_methods="realization: mean (sample_pool: models)"
    char models(models, max_len)

data:
    models = "CanESM2", "CESM1", "CNRM-CM5", "HadGEM2", "MIROC-ESM"
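
A sketch of how software might follow the proposed "sample_pool" reference from the data variable to the list of contributing models is given below. It assumes the file has been loaded into a plain dict of variables; that layout and the function name sample_pool_members are hypothetical, and "sample_pool" itself is only the proposal discussed here, not an existing CF feature.

```python
import re

# Hypothetical in-memory stand-in for the multi-model example above.
variables = {
    "precip": {"cell_methods": "realization: mean (sample_pool: models)"},
    "models": {"data": ["CanESM2", "CESM1", "CNRM-CM5", "HadGEM2", "MIROC-ESM"]},
}

def sample_pool_members(variables, name):
    """Return the contents of the variable named by sample_pool, or None."""
    cell_methods = variables[name].get("cell_methods", "")
    match = re.search(r"\(\s*sample_pool:\s*(\w+)\s*\)", cell_methods)
    if match is None:
        return None  # no sample pool recorded for this variable
    return variables[match.group(1)]["data"]

contributors = sample_pool_members(variables, "precip")
```

The returned list names the models behind the mean, and its length gives the ensemble size without any further metadata.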

Note also that the flexibility of this new approach could be useful for dimensions other than realization when, for example, the sampling interval for a spatial mean is from scattered stations. If one were computing a spatial mean from 5 stations, for example, this could be recorded as follows:

dimensions:
    stations=5
    max_len=16

variables:
    float precmean
        precmean: cell_methods="area: mean (sample_pool: stations)"
    char stations(stations,max_len)
        stations: coordinates="lat lon"
    float lat(stations)
        lat: standard_name="latitude"
    float lon(stations)
        lon: standard_name="longitude"

data:
    stations = "Oakland", "San Francisco", "Livermore", "San Jose", "Palo Alto"
    lat = 37.62, 37.77, ...
    lon = -122.27, -122.42, ....

I would find it very nice to be able to specify the models contributing to a multi-model mean using the above approach. Anyone else think so? It would also satisfy Mark's use case of wanting to record the size of the ensemble.

Best regards,
Karl

Received on Wed Jul 29 2015 - 02:10:55 BST