[CF-metadata] statistic indices

From: Jonathan Gregory <j.m.gregory>
Date: Mon, 4 Jun 2007 12:00:13 +0100

Dear Heinke and Alison

This is a response to Alison's posting. I expect and hope that we will be able
to discuss the general point in Paris, but we will be short of time there. As
this requires a lot of thinking, I thought it would be useful to send this now.

> Forty-one new names are proposed for indices that describe aspects of
> climate variability. The proposed names describe rather different
> quantities to any that currently exist in the table.

Alison's analysis of this proposal is very useful, I think, and I generally
agree with her. In the following, I have attempted to digest the names a bit
more. Many of them have definitions which require the combination of the
specification of a physical quantity with the value of a parameter, where the
parameter appears to Alison and me to be most naturally expressed as a
coordinate variable or scalar coordinate variable in CF. This isn't a new
kind of situation, but it is an issue worth thinking about.

It is rather similar to the definition of "surface air temperature". In CF we
describe this with a standard_name of air_temperature and a vertical
coordinate of e.g. 1.5 m or 2.0 m. In many other tables of parameters, such as
GRIB and PCMDI, "surface air temperature" is a possible value of a single
attribute, and the definition of the quantity isn't split up as it is in CF.
We decided on the CF approach because it is more general and it reduces the
size of the standard name table by "factorising" the information. We could
have taken this approach further and made the standard name table more
hierarchical. The approach we have followed is a middle way, I think, which
splits off specifications of the values of continuously variable parameters
(such as height), and attributes which could be combined with *any* standard
name (such as the indication of a time-mean).

If people are manipulating data from PCMDI, for instance, they might well look
for the data variable whose name is "tas", which they know to be surface air
temperature. If the data is CF-compliant, it will be described by standard
name and vertical coordinate. The name "tas" thus labels a "bundle" of
metadata which defines surface air temperature. It seems that CF metadata is
principally useful for giving a precise description of the quantity; it allows
us to distinguish between variables or decide whether they are comparable. CF
metadata may not be the most convenient way to identify which variable you
want, however. Heinke's proposal has many more examples where the CF metadata,
as Alison and I see it, does not offer the simplest way to *find* the quantity
in the dataset, because it requires searching on a combination of metadata.

Personally I don't think this calls for reengineering CF, because we already
have a lot of investment (human time and data written) in our current way of
doing things, which has advantages. Moreover, it doesn't appear to be a
general problem as there have not been complaints about it up to now. There
are two other ways to deal with the issue:

* Make the program cleverer which reads the data. I would guess it is not very
hard to offer the user a menu of quantities defined as bundles of metadata
specifications such as tas or Heinke's indices, and then search for these
combinations. It would also not be difficult to scan a dataset against the
definitions of the bundles in order to display its contents in these terms.

* Introduce some other attribute of data variables for well-known bundles of
metadata, such as the definition of tas. This would be redundant and hence a
possible pitfall as it might be inconsistent with the other metadata in the
file. Hence I don't like this idea so much. I think it is better to keep these
definitions of bundles outside the netCDF file.
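
As a rough illustration of the first option, the matching of variables against
externally held bundle definitions might be sketched as follows. Everything
named here is an assumption made for the example: the BUNDLES table, the
dictionary layout standing in for a dataset, the function names, and the
1.5-2.0 m height range used to define "tas". It is not an existing CF tool.

```python
# Hypothetical bundle table kept outside the netCDF file. Each entry lists
# the CF metadata that a variable must carry to count as that bundle.
BUNDLES = {
    "tas": {
        "standard_name": "air_temperature",
        "coordinate": "height",
        "coordinate_range": (1.5, 2.0),  # metres; assumed range for the example
    },
}


def matches(variable, bundle):
    """Return True if the variable's metadata satisfies the bundle."""
    if variable.get("standard_name") != bundle["standard_name"]:
        return False
    value = variable.get("scalar_coordinates", {}).get(bundle["coordinate"])
    if value is None:
        return False
    low, high = bundle["coordinate_range"]
    return low <= value <= high


def label_variables(dataset):
    """Map each variable name to the list of bundle labels it satisfies."""
    return {
        name: [label for label, bundle in BUNDLES.items() if matches(var, bundle)]
        for name, var in dataset.items()
    }


# Plain dictionaries stand in for variables read from a file.
dataset = {
    "t_screen": {
        "standard_name": "air_temperature",
        "scalar_coordinates": {"height": 1.5},
    },
    "t_850": {
        "standard_name": "air_temperature",
        "scalar_coordinates": {"air_pressure": 85000.0},
    },
}
print(label_variables(dataset))
```

Because the table lives in the reading program, it cannot become inconsistent
with the metadata stored in the file.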

Now, on to Heinke's proposal!

> "index_per_time_period" names. These are counts of days on which
> particular meteorological conditions existed.

Alison remarks, "I wonder if we even need to include the word "index" in
all the names". I agree. In fact the word "index" obscures what the quantity
is - a number of days per time period - and as Alison says, the time period
is indicated by the time bounds.

The AMetSoc glossary says

frost day: An observational day on which frost occurs; one of a family of
climatic indicators (e.g., thunderstorm day, rain day). The definition is
somewhat arbitrary, depending upon the accepted criteria for a frost
observation. Thus, it may be 1) a day on which the minimum air temperature in
the thermometer shelter falls below 0degC (32degF); 2) a day on which a
deposit of white frost is observed on the ground; 3) in British usage, a day
on which the minimum temperature at the level of the ground or on the tops of
low, close-growing vegetation falls to -0.9degC (30.4degF) or below (also
called a "day with ground frost"); and perhaps others. The present trend is to
drop such terms in favor of something less ambiguous, such as "day with
minimum temperature below 0degC (32degF)".

To me, these remarks seem to support the CF approach of providing explicit
metadata, because "frost day" isn't actually a well-defined quantity. I think
we should take the last sentence as advice, and introduce standard names of
number_of_days_with_minimum|maximum_air_temperature_below|above_threshold, and
specify the threshold with a standard name of air_temperature_threshold, which
is already in the table, having been previously introduced for a similar
purpose (see below). We can assume that the first day starts at the minimum
time-bound i.e. the time-bounds define both the individual days, and the
number of days, a bit like CF climatological time. The vertical level e.g. 1.5
m can be given as a scalar coordinate variable of the data variable, as
usual. These names should also deal with summer days, ice days and tropical
nights, which I think are even less commonly agreed terms; Google doesn't
find them (at least not near the top of its list).
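
To make the proposed quantity concrete, here is a minimal sketch of counting
days whose minimum air temperature falls below a threshold. The function name
and the sample values are invented for illustration, and the strict "<"
comparison is an assumption; whether equality should count is not settled in
this message.

```python
def number_of_days_below_threshold(daily_minimum, threshold):
    """Count days whose daily minimum is strictly below the threshold."""
    return sum(1 for value in daily_minimum if value < threshold)


# Daily minimum air temperature in K for a five-day period (made-up values).
daily_minimum_k = [271.2, 274.0, 272.9, 273.15, 275.5]
air_temperature_threshold_k = 273.15  # would be a scalar coordinate in CF

print(number_of_days_below_threshold(daily_minimum_k, air_temperature_threshold_k))
```

Changing the threshold value gives ice days, summer days and so on from the
same function, which is the point of splitting the threshold off as a
coordinate.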

This approach, and most of the rest of these proposed names, have "days" as a
defining characteristic. This is natural because a day is of course not an
arbitrary unit of time, and that's why it also has special treatment in
climatological time (CF 7.4). However, most of the quantities being proposed
could also be defined for arbitrary time-intervals e.g. 6 h. It would not be
sensible to introduce a set of standard names specifying "six_hour_interval".
We would need to generalise, I think, and we might then supersede all the
day-based names we are discussing now.

Further generality could be achieved if the "minimum" or "maximum" (or "mean")
didn't have to appear in the standard name. Usually it is in the cell_methods,
but in this case the statistical method applies to the coordinate variable
(the threshold) rather than the data variable (the number of days). We could
remove it from the standard name if we allowed the cell_methods to have a
slightly broader interpretation in which it could be involved in describing
how the quantity was calculated (number of days) from some other quantity (air
temperature). I think that would be a reasonable thing to do; in fact similar
suggestions have been made before regarding other statistical quantities such
as correlation coefficients. If we made that generalisation of cell_methods,
the standard names to be introduced would be number_of_days_with_air_
temperature_below|above_threshold.

The indices with "consecutive days" meeting a condition could be called
maximum_spell_length_of_days_with_air_temperature_below|above_threshold. Is
that right? Here, the "maximum" also implies a statistical processing, but
it is an intrinsic part of the definition of the quantity itself. First you
have to identify all the spells, then find the longest one.
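
The two steps (identify the spells, then take the longest) can be sketched in
a single pass, as below. The function name, the "above" flag and the sample
data are assumptions for the example only.

```python
def maximum_spell_length(daily_values, threshold, above=True):
    """Length in days of the longest run of consecutive qualifying days."""
    longest = 0
    current = 0
    for value in daily_values:
        qualifies = value > threshold if above else value < threshold
        # Extend the current spell, or reset it when the condition fails.
        current = current + 1 if qualifies else 0
        longest = max(longest, current)
    return longest


values = [1, 5, 6, 2, 7, 8, 9, 1]  # made-up daily values
print(maximum_spell_length(values, 4))               # spells above 4
print(maximum_spell_length(values, 4, above=False))  # spells below 4
```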

> > wet_days_index_per_time_period
> > heavy_precipitation_days_index_per_time_period
> > very_heavy_precipitation_days_index_per_time_period
> These all essentially describe the same concept, i.e., the number of
> days when the daily accumulated precipitation exceeds a threshold.
> The thresholds defining wet, heavy precipitation and very heavy
> precipitation are 1 mm, 10 mm and 20 mm respectively. Perhaps we
> could combine these names and use a scalar coordinate variable to
> specify the threshold:
> number_of_days_accumulated_precipitation_exceeds_threshold

I agree with that. Consistent with temperature, the standard names would be
number_of_days_with_precipitation_amount_below|above_threshold, referring to a
threshold of precipitation_amount in kg m-2. That's numerically the same as
mm, of course, but if we want mm we should have lwe_thickness_of_
precipitation_amount instead of precipitation_amount. Although "wet days" and
"dry days" are commonly used phrases, various thresholds are used in practice
(at least, that is my recollection from when I used to work on that).

> > strong_breeze_days_index_per_time_period
> > strong_gale_days_index_per_time_period
> > hurricane_days_index_per_time_period
> This is another case where the three names are essentially
> describing the same concept, i.e., the number of days when the
> maximum wind speed exceeds a particular threshold.

Exactly. Consistent with temperature and precipitation, I would choose
number_of_days_with_maximum_wind_speed_above_threshold (almost the same as
Alison) with a threshold having standard name of wind_speed. As for
temperature, the "maximum" could be moved from the standard name to the
cell_methods if we allowed cell_methods a wider interpretation.

In all these cases, the threshold does not have to be a scalar coordinate
variable; it could have a dimension exceeding 1 and a multivalued coordinate
variable, so that one data variable could contain the values for several
thresholds.
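
For instance, with a multivalued threshold coordinate, one data variable could
hold the wet, heavy and very heavy precipitation counts together. The sketch
below uses an invented function name and invented daily amounts; the 1, 10 and
20 kg m-2 thresholds are the ones quoted from the proposal.

```python
def number_of_days_above_thresholds(daily_amount, thresholds):
    """Return one count per threshold: days with amount strictly above it."""
    return [
        sum(1 for amount in daily_amount if amount > threshold)
        for threshold in thresholds
    ]


# Daily precipitation_amount in kg m-2 (made-up values).
daily_amount = [0.0, 0.5, 3.0, 12.0, 25.0, 10.0, 1.0]
thresholds = [1.0, 10.0, 20.0]  # multivalued threshold coordinate

print(number_of_days_above_thresholds(daily_amount, thresholds))
```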

> (2) "index_wrt_Nth_percentile|mean_of_reference_period" names, where N
> is a number in the range 1-100 and the reference period is defined in
> a variable attribute. These are counts of days when particular
> meteorological conditions existed relative to a threshold defined from
> the reference period.

While working on the IPCC AR4, I found it quite hard to find out what the
"heat wave index" etc. quantities meant, because it was not stated as
explicitly as this. In addition, it turns out that there is a variety of
colloquial phrases, such as "heat wave" and "hot spell", and this causes
confusion. Both of these problems indicate to me that clarity is useful. That
would suggest number_of_days_with_[maximum_]air_temperature_above_percentile_
and_spell_length_above_threshold, with coordinates of cumulative_probability_
of_air_temperature to specify the level of the percentile e.g. 0.9 or 90% and
spell_length_threshold e.g. 6 days. As Alison remarks, the climatology is not
specified wrt which the percentiles are calculated.
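
A sketch of such a two-threshold quantity follows. It assumes the percentile
has already been converted into a temperature value from some climatology,
which this message leaves unspecified, and it counts spells whose length is at
least the spell-length threshold; whether that comparison should be inclusive
is also an assumption. The function name and data are invented.

```python
def number_of_days_in_long_spells(daily_values, value_threshold,
                                  spell_length_threshold):
    """Count days above value_threshold that lie in sufficiently long spells."""
    total = 0
    current = 0
    # A trailing None acts as a sentinel so the final spell is closed off.
    for value in list(daily_values) + [None]:
        if value is not None and value > value_threshold:
            current += 1
        else:
            if current >= spell_length_threshold:
                total += current
            current = 0
    return total


# Daily maximum air temperature in degC (made-up values).
daily_maximum = [30, 31, 32, 20, 33, 34, 35, 36, 20, 31]
print(number_of_days_in_long_spells(daily_maximum, 29, 3))
print(number_of_days_in_long_spells(daily_maximum, 29, 4))
```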

This name is rather long. We could make a further generalisation to avoid it,
by using a standard_name of number_of_days alone in all cases and
distinguishing them solely by their coordinate variables. That is consistent
with the obvious fact that standard names don't indicate all the independent
variables e.g. we don't have a standard name of air_temperature_as_a_function_
of_latitude_and_longitude to indicate that's what it depends on. We expect
software to examine the coordinate variables to work that out. A more likely
case is that software has to examine the vertical coordinate to find out if
it's air_temperature on model levels, pressure levels, height levels or
whatever. Likewise, you could argue that the quantity of number_of_days is
always the same, with different independent variables - but I don't think
that argument is convincing.

A counter-argument is that probability_density_of_air_temperature and
probability_density_of_precipitation_amount must have different standard names
because they have different units, of K-1 and (kg m-2)-1. So for consistency
probability_of_air_temperature and probability_of_precipitation_amount, with
coordinate bounds specifying the ranges of the independent variables, should
have different standard names, even though they have the same unit of 1. If
these have different names, so should number_of_days with different
independent variables, as number of days is a probability multiplied by a
length of time.
 
> (3) "percent of time wrt_to_Nth_percentile|mean_of_reference_period"
> names, where N is a number in the range 1-100 and the reference period
> is defined in a variable attribute. These names express the percent
> of time per time period (specified by time bounds) that conditions
> exist above or below a threshold calculated from the reference period.

The examples given are not exactly percentage of time, since the
discretisation into days is an essential element of the definition.

> Another general comment is that I would prefer to use
> "fraction" rather than "percent" as it is more consistent with
> existing standard names.

I agree with that, and support Alison's proposal, which I'd modify slightly to
fraction_of_days_with_(minimum|maximum|mean_)(air_temperature|precipitation_
amount)_above_threshold, with threshold coordinate variables of air_
temperature and precipitation_amount. As before, the minimum|maximum|mean
would not be needed if we could transfer it to cell_methods.

> (4) "per cent of amount per time period" names. These names express
> the percentage of a quantity, e.g., precipitation, that has occurred
> during a time period (defined by time bounds) from days when
> particular meteorological conditions existed.

Again, I essentially agree with Alison's proposal. I would suggest
fraction_of_precipitation_amount_on_days_with_precipitation_amount_
above_percentile, where I used "on" because "due to" has a specified meaning
in the guidelines, and the threshold would have a standard name of cumulative_
probability_of_precipitation_amount. The repetition of precipitation_amount
might seem clumsy, but is necessary because you could have a quantity of
fraction_of_precipitation_amount_from_days_with_air_temperature_above_
percentile, for example.
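
As an illustration, the fraction could be computed as below, again assuming
the percentile has already been turned into an amount threshold from a
reference climatology. The function name and sample values are invented, and
raising an error for a completely dry period is a choice made for the sketch.

```python
def fraction_of_amount_on_days_above(daily_amount, threshold):
    """Fraction of the period total that fell on days above the threshold."""
    total = sum(daily_amount)
    if total == 0:
        raise ValueError("no precipitation in the period; fraction undefined")
    return sum(amount for amount in daily_amount if amount > threshold) / total


# Daily precipitation_amount in kg m-2 (made-up values).
daily_amount = [0.0, 2.0, 10.0, 30.0, 8.0]
print(fraction_of_amount_on_days_above(daily_amount, 9.0))
```

The condition and the accumulated quantity are separate inputs in principle,
which is why the name has to mention precipitation_amount twice.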

(5) Miscellaneous

> growing_season_length_index

I would call this growing_season_duration, as I don't think "index" makes it
clearer, and because existing "length" names are distances, while we have one
"duration" name, which is also a time. But I wonder whether "growing season"
is sufficiently well defined. Web definitions are divided between those saying
it means the duration of favourable conditions, and those saying it means the
time between the last frost of spring and the first of autumn. Which is
intended here?

> > heating_degree_days_per_time_period
> This could be:
> number_of_heating_degree_days
> I do not have sufficient expertise to comment on the definition of
> heating degree days - is there a single standard for this or do
> alternative definitions exist?

There appears to be a commonly used threshold of 65degF, but even so I would
advocate the more general approach. Also, degree-days are not the same as
numbers of days. In fact we have dealt with degree-days previously (that's
when the temperature threshold was introduced), and according to that earlier
decision the name is integral_of_air_temperature_deficit_wrt_time, canonical
units K s, where the deficit is wrt the air_temperature_threshold specified as
a coordinate. This name suggests an integral over continuous time, rather than
using daily means. I don't know whether both are in use and whether we should
be more precise.
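
To show how the daily-mean version relates to the canonical units, here is a
small sketch that approximates the integral from daily means and returns K s.
The function name and sample temperatures are assumptions for the example; a
true continuous-time integral would generally give a different value on days
when the temperature crosses the threshold.

```python
SECONDS_PER_DAY = 86400.0


def integral_of_deficit_from_daily_means(daily_mean_k, threshold_k):
    """Approximate the time integral of temperature deficit, in K s."""
    # Days at or above the threshold contribute no deficit.
    deficit_k_days = sum(max(threshold_k - t, 0.0) for t in daily_mean_k)
    return deficit_k_days * SECONDS_PER_DAY


daily_mean_k = [289.0, 292.0, 290.5]  # made-up daily mean air temperature
threshold_k = 291.0                   # air_temperature_threshold coordinate

# 2.0 K + 0.0 K + 0.5 K = 2.5 K days = 216000 K s
print(integral_of_deficit_from_daily_means(daily_mean_k, threshold_k))
```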

> frost_days_where_no_snow_index_per_time_period
Alison has asked for clarification of this one.
 
> > intra_period_extreme_temperature_range; K; difference between the
> > absolute extreme temperatures in the observation period.
>
> I think that perhaps we need to define a new cell_method of "range".
> The standard name would then simply be:
> air_temperature
> (which, of course, is already in the table) with cell_methods =
> "time: range". Do others agree?

Yes.

> > highest_one_day_precipitation_amount_per_time_period; kg m-2; highest
> > one day precipitation is the maximum of one day precipitation amount
> > in a given time period.
> > highest_five_day_precipitation_amount_per_time_period; kg m-2;
> > highest precipitation amount for five day interval (including the
> > calendar day as the last day).
>
> I would not include "per_time_period" as the period information
> should come from the time bounds. To describe the maximum of
> precipitation over different time intervals we can use cell_methods
> with the additional information in parenthesis (see CF1.0 7.3). Hence
> we would have just one name of:
> precipitation_amount
> (already in the standard name table) which would be specified with
> appropriate time bounds and
> cell_methods = "time: maximum (interval: 1 day)" or
> cell_methods = "time: maximum (interval: 5 days)" as appropriate.

I think Alison is right, and this raises an issue which hasn't come up before.
It is rather like climatological time, where we have different statistical
methods for nested periods. Here, we are computing first the sum over 5 days,
then the maximum of those sums. That is analogous to computing first the mean
precipitation amount over days in the month, then finding the maximum over
corresponding months in different years. If we wanted to record the maximum of
5-day-mean temperatures, I don't think we'd have a way to do it at present,
because it likewise involves two statistical operations, "time: mean (within
5-day periods) time: maximum (over 5-day periods)". We could introduce a
general method for recording double processing (not necessarily time) e.g.
"time: mean within intervals time: maximum over intervals", with interval
being a new attribute of the time coordinate variable, and the first interval
assumed to start at the minimum time-bound.
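
The double processing can be sketched as below, with consecutive
non-overlapping intervals starting at the first value, as suggested above.
The function name is invented, and dropping an incomplete final interval is
an assumption. Note that a running five-day window, which the quoted
definition ("including the calendar day as the last day") may intend, would
be a different calculation.

```python
def maximum_over_intervals(daily_values, interval_days, within=sum):
    """Apply `within` to each complete interval, then take the maximum."""
    blocks = [
        daily_values[start:start + interval_days]
        for start in range(0, len(daily_values) - interval_days + 1, interval_days)
    ]
    # max() raises ValueError if there is no complete interval.
    return max(within(block) for block in blocks)


def mean(block):
    return sum(block) / len(block)


daily_amount = [1, 0, 2, 0, 3, 4, 4, 0, 0, 1]  # made-up values, two 5-day blocks

print(maximum_over_intervals(daily_amount, 5))               # maximum of sums
print(maximum_over_intervals(daily_amount, 5, within=mean))  # maximum of means
```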

> > simple_daily_intensity_index_per_time_period; kg m-2; simple daily
> > intensity index is the mean of precipitation amount on wet days. A
> > wet day is a day with precipitation sum exceeding 1 mm.
>
> Again I would exclude "per_time_period". "Intensity" sounds to me
> like a precipitation rate rather than an amount. I think we could
> again use:
> precipitation_amount
> with appropriate time bounds and
> cell_methods = "time: mean (over days with precipitation thickness
> > 1 mm)".

For this, I'd propose something different, but consistent with other cases
above viz. a standard name of precipitation_amount_on_days_with_precipitation_
amount_above_threshold. Although this is not climatological time, it again
needs a new concept of double time-processing, because it is dealing with a
discontinuous time-interval. First we find the value for each qualifying day,
then we average them. That could be recorded as "time: mean over days", which
is currently allowed only for climatological time.
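
A minimal sketch of that "mean over days" processing: select the qualifying
days, then average their amounts. The function name and data are invented,
and returning None when no day qualifies is a choice made for the example.

```python
def mean_amount_on_days_above(daily_amount, threshold):
    """Mean of the daily amounts on days strictly above the threshold."""
    qualifying = [amount for amount in daily_amount if amount > threshold]
    if not qualifying:
        return None  # no qualifying days, so the mean is undefined
    return sum(qualifying) / len(qualifying)


# Daily precipitation_amount in kg m-2 (made-up values).
daily_amount = [0.0, 2.0, 0.5, 4.0, 6.0]
print(mean_amount_on_days_above(daily_amount, 1.0))
```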

> > wind_chill_temperature
> "Wind_chill" is a commonly used term and I am happy to accept it.
> For consistency with other temperature names perhaps we should have:
> wind_chill_air_temperature
I think we could call it just wind_chill_temperature as proposed, because I
don't think it could be the apparent temperature of anything except air.

> Please can someone tell
> me if it is OK to use the "add_offset" attribute to specify a
> conversion between Kelvin and Celsius if that is required?
No, we allow it only for packing, but it is OK to use degC as a udunit; the
conversion is built in.

Best wishes

Jonathan
Received on Mon Jun 04 2007 - 05:00:13 BST
