[CF-metadata] original_ensemble_size

From: Karl Taylor
Date: Thu, 23 Jul 2015 17:42:03 -0700

Hi all,

This addresses the issue of how to associate an ensemble size with a
variable. It also suggests an alternate way of proceeding that is more
general and will allow us to record, for example, which models were
included in a multi-model mean.

First to consider Jim's suggestion:
I agree with Jim that you might want to indicate which member (or
members) of an ensemble were represented by the variable so you might
want to include a coordinate variable of "realization". You could then
also define an *attribute* of that coordinate as "ensemble_size" which
would record the size, but currently that approach is not standardized
(but of course is permitted) by our conventions.

Now Mark's suggestion:
Mark's alternative approach to make "ensemble_size" a coordinate
variable (presumably in addition to possibly including "realization")
would also relate it to the variable of interest, but this would be a
bit unconventional since a variable would normally be considered to be a
*function* of its (independent) coordinates. I don't think
T(x,realization,ensemble_size) is a proper function, since T depends on
x and realization, but should be independent of ensemble size in most cases.

Jonathan's suggestion:
I think Jonathan suggested including ensemble_size in a cell_methods
attribute. For example


     float precip(lon,lat)
         precip: cell_methods="realization: point (sample_size: e_size)"

where because "realization" is a standard name, it does not need to be
explicitly declared with a "coordinates" attribute. Jonathan originally
used "dimension" rather than "sample_size", but I prefer
"sample_size". If this approach were followed, then CF would need to
be modified so that "sample_size" (along with "interval") was designated
to be one of the options for providing "standardized" extra information
in the cell_methods attribute. Note that the variable "pointed to" by
original_domain would not necessarily be a coordinate variable; it need
not be monotonic and it could be a character variable (i.e., a list).
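
As a rough illustration of how software might pick up such extra
information, here is a minimal Python sketch that splits one cell_methods
entry into its dimension name, its method, and the key/value pair inside
the parentheses. The helper name parse_cell_method and the regular
expression are assumptions made for illustration; they are not part of CF
or of any netCDF library, and the sketch only handles a single entry of
the form shown above.

```python
import re

# Hypothetical helper, not part of CF or any netCDF library.
# Handles one entry of the form "name: method (key: value)".
_ENTRY = re.compile(
    r"^\s*(?P<name>\w+):\s*(?P<method>\w+)"
    r"(?:\s*\(\s*(?P<key>\w+):\s*(?P<value>[^)]+?)\s*\))?\s*$"
)

def parse_cell_method(text):
    """Return a dict with name, method and an optional (key, value) pair."""
    match = _ENTRY.match(text)
    if match is None:
        raise ValueError(f"unrecognised cell_methods entry: {text!r}")
    extra = None
    if match.group("key") is not None:
        extra = (match.group("key"), match.group("value"))
    return {
        "name": match.group("name"),
        "method": match.group("method"),
        "extra": extra,
    }

parsed = parse_cell_method("realization: point (sample_size: e_size)")
print(parsed)
```

An entry without parentheses, such as "area: mean", comes back with
"extra" set to None, so existing files would not be affected by the
additional option.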

Alternative "new approach"

An approach that is a slight variant on Jonathan's and would allow even
more information to be provided concerning the ensemble is illustrated
by the following example:


     float precip(lon,lat)
         precip: cell_methods="member: point (sample_pool: members)"
     int member
         member: standard_name="realization"
     int members(members)
         members: standard_name="realization"

     member = 3
     members = 1, 3, 5, 6, 10

This would tell you precip was from the realization labeled 3 of a 5-member
ensemble (with labels 1, 3, 5, 6, and 10). If this approach were
adopted, then CF would need to be modified so that "sample_pool" (along
with "interval") was designated to be one of the options for
providing "standardized" extra information in the cell_methods attribute.

Under Jonathan's approach and also the "new approach", there wouldn't be
a need to define the standard_name "ensemble_size" because that would be
provided by the dimension size (5 in the above).
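
To make that concrete, here is a minimal Python sketch, assuming the
values from the "new approach" example have already been read into plain
Python objects. The dictionary layout and the names dataset, pool_name
and ensemble_size are illustrative only; no netCDF library is used.

```python
# Plain-Python stand-in for the "new approach" example; the dictionary
# layout is an assumption made for illustration.
dataset = {
    "member": 3,
    "members": [1, 3, 5, 6, 10],
    "precip_cell_methods": "member: point (sample_pool: members)",
}

# Take the text inside the parentheses and split it into key and value.
inside = dataset["precip_cell_methods"].split("(", 1)[1].rstrip(")")
key, pool_name = (part.strip() for part in inside.split(":", 1))

pool = dataset[pool_name]
ensemble_size = len(pool)            # size of the "members" dimension
is_in_pool = dataset["member"] in pool

print(key, pool_name, ensemble_size, is_in_pool)
```

The ensemble size falls out of the length of the pool variable, so no
separate "ensemble_size" quantity has to be stored.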

Note that the new approach could also be used to record a multi-model
ensemble mean (I'm not absolutely sure this example complies with the
current convention, but I think it would if the option to designate the
"original_domain" were added to CF):

     max_len = 10

     float precip(lon,lat)
         precip: cell_methods="realization: mean (sample_pool: models)"
     char models(models, max_len)

     models = "CanESM2", "CESM1", "CNRM-CM5", "HadGEM2", "MIROC-ESM"
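
For readers less familiar with character variables, the plain-Python
sketch below shows one way the five model names could be laid out as a
fixed-width 2-D array of single characters and then read back. Padding
unused positions with NUL characters is an assumption about how a writer
might fill them, and the helper names to_char_rows and from_char_rows are
hypothetical.

```python
MAX_LEN = 10  # matches max_len in the example above

models = ["CanESM2", "CESM1", "CNRM-CM5", "HadGEM2", "MIROC-ESM"]

def to_char_rows(names, width):
    """Pad each name to `width` characters and split it into single chars."""
    rows = []
    for name in names:
        if len(name) > width:
            raise ValueError(f"{name!r} is longer than {width} characters")
        rows.append(list(name.ljust(width, "\0")))
    return rows

def from_char_rows(rows):
    """Join each row back into a string and drop the NUL padding."""
    return ["".join(row).rstrip("\0") for row in rows]

char_rows = to_char_rows(models, MAX_LEN)
print(len(char_rows), len(char_rows[0]))
print(from_char_rows(char_rows))
```

The number of rows is the size of the "models" dimension, which is what a
reader would use as the number of models in the mean.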

Note also that the flexibility of this new approach could be useful for
dimensions other than realization when, for example, the sampling
interval for a spatial mean is from scattered stations. If one were
computing a spatial mean from 5 stations, for example, this could be
recorded as follows:


     float precmean
         precmean: cell_methods="area: mean (sample_pool: stations)"
     char stations(stations,max_len)
         stations: coordinates="lat lon"
     float lat(stations)
         lat: standard_name="latitude"
     float lon(stations)
         lon: standard_name="longitude"

     stations = "Oakland", "San Francisco", "Livermore", "San Jose", "Palo Alto"
     lat = 37.62, 37.77, ...
     lon = -122.27, -122.42, ...

I would find it very nice to be able to specify the models contributing
to a multi-model mean using the above approach. Anyone else think so?
It would also satisfy Mark's use case of wanting to record the size of
the ensemble.

Best regards,

Received on Thu Jul 23 2015 - 18:42:03 BST

