Opened 9 years ago

Last modified 19 months ago

#82 new enhancement

Extend cell_methods attribute to document multi-step operations on a variable

Reported by: mgschultz Owned by: cf-conventions@…
Priority: medium Milestone:
Component: Philip Cameron-Smith Version:
Keywords: cell_methods, averaging Cc:

Description

1. Title

Extend cell_methods attribute to document multi-step operations on a variable

2. Moderator

Philip Cameron-Smith (TBC)

3. Requirement

Air quality data (and possibly other data sets) need to store more complex arithmetic information than cell_methods currently allows for. One example is "daily maximum 8-hour-running-average concentrations". Other examples include the current US EPA standard for ozone: "0.075 ppm over 8 hour period: To attain this standard, the 3-year average of the fourth-highest daily maximum 8-hour average ozone concentrations measured at each monitor within an area over each year must not exceed 0.075 ppm." Or, from a new standard proposal (http://www.epa.gov/air/ozonepollution/fr/20100119.pdf): "cumulative, seasonal standard expressed as an annual index of the sum of weighted hourly concentrations, cumulated over 12 hours per day (8 am to 8 pm) during the consecutive 3-month period within the O3 season with the maximum index value, set at a level within the range of 7 to 15 ppm-hours".
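To make the chain of operations in the EPA ozone example concrete, here is a minimal Python sketch. It assumes the daily maximum 8-hour averages have already been computed, and it ignores the data-completeness and rounding rules of the actual regulation. The function names and the use of ppb as the unit are illustrative assumptions, not part of the proposal.

```python
def fourth_highest(daily_maxima):
    """Return the fourth-highest value of one year's daily maxima."""
    if len(daily_maxima) < 4:
        raise ValueError("need at least four daily values per year")
    return sorted(daily_maxima, reverse=True)[3]


def three_year_average(daily_maxima_by_year):
    """Average the fourth-highest daily maximum over three years.

    daily_maxima_by_year maps a year to the list of that year's
    daily maximum 8-hour average concentrations (hypothetical input layout).
    """
    if len(daily_maxima_by_year) != 3:
        raise ValueError("expected exactly three years of data")
    yearly = [fourth_highest(v) for v in daily_maxima_by_year.values()]
    return sum(yearly) / len(yearly)
```

Each step (8-hour mean, daily maximum, fourth-highest per year, 3-year mean) acts on the time axis, which is why a single method per axis cannot describe the result.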

The proposed changes to the cell_methods attribute shall allow for a description of multiple arithmetic operations while preserving backward compatibility as much as possible.

Multiple averaging times (e.g. "max 1-hourly average" and "max 8-hourly average") need to be stored in the same file. There is a need in principle to be able to specify the three relevant time quantities:

1) the length of the running-mean kernel (1-hour, 8-hour, annual),

2) the time interval within which the maximum of the running mean is searched for (daily, monthly, yearly),

3) the point/time in the running-mean kernel that is used to decide whether an 8-hour mean falls within the time interval for no.2, since an 8-hour mean will often include values from outside the time interval.

In practice no.3 is often not mentioned because there is usually a diurnal cycle to air-quality that helps to avoid confusion in given instances (ozone peaks in the afternoon, and smoke particles peak at night).
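The three quantities can be illustrated with a small Python sketch of a "daily maximum running mean" on hourly data. This is only one possible reading of the requirement: the function name, the `anchor` parameter and its three choices are assumptions made for illustration, and the search interval is fixed to one calendar day.

```python
from datetime import datetime, timedelta


def daily_max_running_mean(times, values, window=8, anchor="start"):
    """Return {date: maximum running mean} for contiguous hourly data.

    times  -- sorted list of datetime objects, one per hour
    values -- list of floats, same length as times
    window -- length of the running-mean kernel in hours (quantity 1)
    anchor -- "start", "centre" or "end": the hour of the kernel that
              decides which day a window is assigned to (quantity 3)
    The search interval (quantity 2) is one calendar day.
    """
    offsets = {"start": 0, "centre": window // 2, "end": window - 1}
    k = offsets[anchor]
    result = {}
    for i in range(len(values) - window + 1):
        mean = sum(values[i:i + window]) / window
        day = times[i + k].date()  # day that this window counts towards
        if day not in result or mean > result[day]:
            result[day] = mean
    return result
```

For an event that straddles midnight, `anchor="start"` and `anchor="end"` assign the peak 8-hour mean to different days, which is the ambiguity described in no.3.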

4. Initial Statement of Technical Proposal

We propose to enhance the cell_methods attribute, because its rationale (section 1.3 of CF-1.6) reads "An important application of this attribute is to describe climatological and diurnal statistics." This is exactly what we want to do.

Specifically, the following changes should be made to section 7.3.2, "Recording the spacing of the original data and other information" (additions marked by {+TEXT+}, deletions by [-TEXT-]):

To indicate more precisely how the cell method was applied, extra information may be included in parentheses () after the identification of the method. This information includes standardized and non-standardized parts. [-Currently the only s-]{+S+}tandardized information [-is to provide-]{+includes+} the typical interval between the original data values to which the method was applied {+or the period (length of time) which was considered in the arithmetic operation.+}[-, in the situation where the present data values are statistically representative of original data values which had a finer spacing.-]

The syntax is (interval: value unit{+[ period: value unit]+}), where value is a numerical value and unit is a string that can be recognized by UNIDATA's Udunits package [UDUNITS]. The unit will usually be dimensionally equivalent to the unit of the corresponding dimension, but this is not required (which allows, for example, the interval for a standard deviation calculated from points evenly spaced in distance along a parallel to be reported in units of length even if the zonal coordinate of the cells is given in degrees). Recording the original interval is particularly important for standard deviations. For example, the standard deviation of daily values could be indicated by cell_methods="time: standard_deviation (interval: 1 day)" and of annual values by cell_methods="time: standard_deviation (interval: 1 year)".

{+The (period: value unit) syntax can be used for example to express the length of an averaging interval. One such example is the recording of "daily maximum 8-hour average concentrations". In this case the cell_methods attribute would read "time:maximum time:mean (interval: 1 hour, period: 8 hours)" - this indicates processing from right to left, i.e. first averaging over eight hourly values, then computing the maximum of these averaged values. If the period is given, an interval attribute must also be given.+}

(the rest of the section remains unchanged)

Implication for section 7.3: "Furthermore, it should be noted that if any method other than 'point' is specified for a given axis, then cell_bounds should also be provided for that axis (except for the relatively rare exceptions described in Section 7.3.4, 'Cell methods when there are no coordinates')." - This statement may contradict the new formulation for running averages. The time coordinates will generally be recorded daily at regular intervals (e.g. 12:00 h). A time_bounds variable would give false information.

Implications on Appendix E: no changes are necessary in Appendix E.

5. Benefits

This enhancement would be of immediate value to the community of air quality research, but we expect that others can also benefit from it.

6. Status Quo

The only way this information can currently be conveyed is via the use of non-standard attributes (e.g. long_name) or by defining lengthy (non-standard) variable names (e.g. o3_daily_max_8hourly_running_mean). This has the disadvantage that either no standard_name can be used, or two different fields (instantaneous output and averaged quantity) would use the same standard_name.

Change History (9)

comment:1 Changed 9 years ago by jonathan

Dear Martin

Thanks for the proposal. This approach looks good to me. It is a fairly simple extension and backward compatible. I think, however, that it actually needs a new subsection, because processing the same axis more than once is a significant new idea for cell_methods. Therefore I would suggest some rearrangement:

Rename 7.3.2 as "Recording the spacing and range of the original coordinates"

In 7.3.2, delete the text "To indicate more precisely how the cell method was applied, extra information may be included in parentheses () after the identification of the method. This information includes standardized and non-standardized parts." and the entire paragraph beginning "If there is both standardized and non-standardized ...".

Begin 7.3.2 thus: "Standardized extra information in () after the method is used to provide the typical interval between the original ....".

Add a new paragraph in the introductory part of 7.3 (following the paragraph "Note that in this example ..."), as follows:

To indicate more precisely how the cell method was applied, extra information may be included in parentheses () after the identification of the method. This information includes standardized and non-standardized parts. The standardized information takes the form "keyword: words" where words is one or more blank-separated words, and this pattern may be repeated to provide further information. If there is both standardized and non-standardized information, the non-standardized follows the standardized information and the keyword comment:. If there is no standardized information, the keyword comment: should be omitted. For instance, an area-weighted mean over latitude could be indicated as lat: mean (area-weighted) or lat: mean (interval: 1 degree_north comment: area-weighted). See Section 7.3.2 "Recording the spacing and range of the original coordinates" and Section 7.3.3 "Statistics requiring more than one method".

Insert a new section 7.3.3 "Statistics requiring more than one method" and renumber the following sections. The text for this new section is as follows:

It is possible to record a succession of statistical operations on the same axis by describing their methods arranged in the order they were applied. The left-most operation is assumed to have been applied first. If this is done, all the methods except the last must use standardized extra information to record the period (length of time) which was considered in the operation.

The syntax is (period: value unit). One such example is the recording of "daily maximum 8-hour average concentrations". In this case the cell_methods attribute would read time: mean (interval: 1 hour period: 8 hours) time: maximum. This indicates first averaging over eight hourly values, then computing the maximum of these averaged values. The interval information is optional. The periods considered by the final operation are recorded in the coordinate bounds.

Note that this differs from what you wrote, in that the left-most operation comes first, which is consistent with Section 7.3.1, and that I suggest the interval is optional - is there a reason why it should be mandatory? It seems to me that the bounds are the right way to record the days, in your application. I don't follow the concern you have noted regarding the bounds - please could you explain?
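As an illustration of the left-to-right reading, the following Python sketch splits a cell_methods string into an ordered list of steps. It is a simplified parser written for this example only: it does not handle "where", "over" or "within" clauses, it assumes comments contain no colons, and the function names are hypothetical.

```python
import re

# One entry: one or more "axis:" names, a method, optional "(extra information)".
_ENTRY = re.compile(r"((?:\w+:\s*)+)(\w+)(?:\s*\(([^)]*)\))?")


def parse_extra(extra):
    """Split 'keyword: words keyword: words' into a dict."""
    parts = re.split(r"(\w+):\s*", extra.strip())
    if len(parts) == 1:
        # No keywords at all: treat the whole text as a non-standardized comment.
        return {"comment": parts[0]} if parts[0] else {}
    return {key: value.strip(" ,") for key, value in zip(parts[1::2], parts[2::2])}


def parse_cell_methods(text):
    """Return the methods in the order written (left-most first)."""
    steps = []
    for axes, method, extra in _ENTRY.findall(text):
        steps.append({
            "axes": re.findall(r"(\w+):", axes),
            "method": method,
            "extra": parse_extra(extra),
        })
    return steps
```

Applied to "time: mean (interval: 1 hour period: 8 hours) time: maximum", this yields the mean step first and the maximum step second, so a reader can replay the operations in the order they were applied.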

It would be helpful if you could also propose the changes required to the conformance document.

Best wishes

Jonathan

comment:2 Changed 9 years ago by painter1

  • Component changed from cf-conventions to Philip Cameron-Smith
  • Owner changed from cf-conventions@… to pjc@…

comment:3 Changed 9 years ago by cameronsmith1

This is a test. (Ignore)

comment:4 Changed 9 years ago by cameronsmith1

This is another test. (Ignore)

comment:5 Changed 9 years ago by painter1

  • Owner changed from pjc@… to cf-conventions@…

comment:6 Changed 6 years ago by davidhassell

Dear Martin,

I have just reread this proposal and support it. Many thanks.

I agree with Jonathan's points about the order in which operations are recorded (i.e. the left-most operation comes first); that the interval keyword should not be mandatory; and that there is no problem with the bounds (perhaps noting that the operations are done left to right makes this clearer?).

I would add a personal preference that period should be called duration instead. This is, as I understand it, more consistent with the language used by ISO 8601 (representation of dates and times).

All the best,

David

comment:7 Changed 6 years ago by jonathan

Dear David and Martin

duration sounds fine to me and better if it is consistent with another standard. Thanks. I still support the proposal.

Jonathan

comment:8 Changed 4 years ago by cofino

Dear David and Cameron,

Because duration is linked to the concept of time, I would suggest a more general term like size or window, which are used in the mathematical concept of moving averages.

My vote is for size.

Antonio

comment:9 Changed 19 months ago by martin.juckes

Dear All,

I support the concept here, but Lars has pointed out a potential link with the climatologies concept (#197), and I think there is a case for combining the two discussions. This syntax introduces an alternative approach to expressing some of the quantities which are currently expressed using the climatology construct of the convention. This could introduce confusion, but if done well, I think it could be a big improvement.

I suggest we take up the discussion in github issue (#197),

regards, Martin
