Opened 7 years ago
Last modified 5 years ago
#117 new enhancement
add example to 5.7 for multi-time dimension data
Reported by: | graybeal | Owned by: | cf-conventions@…
---|---|---|---
Priority: | medium | Milestone: |
Component: | cf-conventions | Version: |
Keywords: | | Cc: |
Description
In his post http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2006/001008.html, Jonathan thoroughly summarizes possible scenarios that require multiple time axes in a CF file (for example, handling both the forecast (valid) times and the run time of the forecast).
He concludes with the proposal to add an example as described below.
From a look at section 5.7, it doesn't appear the example has been added. This ticket proposes adding it.
Reference also ticket #104, though I don't believe the changes resulting from that ticket affect this one.
I think what we need to do is add an example of structure (b), and here is one (an instance of case v) from the earlier discussion.
20030101 12:00 analysis (at 00hr) and 12hr,36hr forecasts
20030101 00:00 analysis 6hr,12hr,18hr,24hr forecasts
20030101 06:00 analysis 6hr,18hr forecasts
could be expressed as follows (ordering the time samples as they are above)
variables:
  double reftime(record);
    reftime:standard_name = "forecast_reference_time" ;
    reftime:units = "hours since 2003-01-01 00:00" ;
  double valtime(record);
    valtime:standard_name = "time" ;
    valtime:units = "hours since 2003-01-01 00:00" ;
  float temp(record,level,lat,lon);
    temp:long_name = "Air temperature on model levels" ;
    temp:standard_name = "air_temperature" ;
    temp:units = "K" ;
    temp:coordinates = "valtime reftime" ;
data:
  reftime = 12., 12., 12., 0., 0., 0., 0., 6., 6. ;
  valtime = 12., 24., 48., 6., 12., 18., 24., 12., 24. ;
Change History (13)
comment:1 Changed 7 years ago by graybeal
comment:2 Changed 7 years ago by jonathan
Dear John
Thank you for bringing this subject back. It has often been asked about and I think it is sufficiently important to include it in the standard document as well as the FAQ. Section 5.7 is about scalar coord variables, so I don't think that's the right place for the new example about multiple-valued time coordinates. I would suggest that we add a new subsection 4.4.2. Here is some proposed text (drawing on your draft FAQ and the email discussion).
Best wishes
Jonathan
4.4.2. Time coordinates for forecast data
In forecast applications, two kinds of time are distinguished: the time to which the forecast applies ("validity", "valid" or "forecast" time), and the time of the analysis from which the forecast was made ("analysis", "run", "data" or "reference" time). These kinds of time coordinate are recorded in separate variables, identified by standard_name. For the validity time, the standard_name of time is used, as is usual for the time coordinate of observed or simulated data. The standard_name for the analysis time is forecast_reference_time.
A single-valued analysis time or validity time can be stored in either a size-one coordinate variable or a scalar coordinate variable. If either the analysis time or the validity time is multiple-valued, but the other one is single-valued, it is recommended that there should be a coordinate variable (with dimension greater than one) for the multiple-valued one, and a size-one or a scalar coordinate variable for the other one. There could thus be two time dimensions for the data variable. Example 5.12 shows a case of multiple forecasts from a single analysis, where the analysis time is a scalar coordinate variable, and there is a dimension (of size greater than one) for validity time.
If both analysis time and validity time are multiple-valued, it is recommended to introduce a discrete axis (Section 4.5), and store both analysis time and validity time as one-dimensional auxiliary coordinate variables of this axis. This method is preferred because it is flexible, and can be used for the cases of multiple validity and analysis times where all combinations exist, multiple forecast periods from various analyses where all combinations exist, and multiple validity and analysis times where not all combinations exist. Example 4.7 illustrates the last of these cases. It is not recommended to have two multiple-valued time dimensions for the data variable.
Example 4.7. Multiple validity and analysis times
dimensions:
  record = 9 ;
variables:
  double reftime(record);
    reftime:standard_name = "forecast_reference_time" ;
    reftime:units = "hours since 2003-01-01 00:00" ;
  double valtime(record);
    valtime:standard_name = "time" ;
    valtime:units = "hours since 2003-01-01 00:00" ;
  double period(record);
    period:standard_name = "forecast_period" ;
  float temp(record,level,lat,lon);
    temp:long_name = "Air temperature on model levels" ;
    temp:standard_name = "air_temperature" ;
    temp:units = "K" ;
    temp:coordinates = "valtime reftime" ;
data:
  reftime = 12., 12., 12., 0., 0., 0., 0., 6., 6. ;
  valtime = 12., 24., 48., 6., 12., 18., 24., 12., 24. ;
  period  =  0., 12., 36., 6., 12., 18., 24., 6., 18. ;
In this example, forecasts of air temperature have been made from analyses at 2003-01-01 12:00, 2003-01-01 00:00 and 2003-01-01 06:00. From the analysis of 2003-01-01 12:00, forecasts have been made for validity times of 2003-01-01 12:00 (i.e. the analysis time), 2003-01-02 00:00 (12 h later) and 2003-01-03 00:00 (36 h later). The example also includes an auxiliary coordinate variable of forecast_period, which is the difference between validity time and analysis time for each forecast. This information is redundant, but may be convenient.
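The redundancy of forecast_period noted above is easy to check: for this discrete-axis layout it is just the element-wise difference of the validity and analysis time coordinates. A minimal sketch in plain Python (array values copied from Example 4.7; the variable names are illustrative, not part of any API):

```python
# Coordinate values from Example 4.7, all in hours since 2003-01-01 00:00
reftime = [12., 12., 12., 0., 0., 0., 0., 6., 6.]
valtime = [12., 24., 48., 6., 12., 18., 24., 12., 24.]

# forecast_period is redundant: validity time minus analysis time
period = [v - r for v, r in zip(valtime, reftime)]
print(period)  # [0.0, 12.0, 36.0, 6.0, 12.0, 18.0, 24.0, 6.0, 18.0]
```

This reproduces the `period` data values in the example, confirming that storing it is a convenience rather than new information.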
comment:3 Changed 7 years ago by jonathan
I've noticed two omissions in the above.
(1) The forecast period should be named as an auxiliary coordinate variable:
temp:coordinates = "valtime reftime period" ;
(2) Add the following to the conformance document:
4.4.2. Time coordinates for forecast data
Recommendations:
- The data variable should not have more than one coordinate variable with dimension greater than one having any of the standard_names of time, forecast_reference_time or forecast_period. It is recommended instead to have a single dimension, with auxiliary coordinates for these quantities all having this same dimension.
Jonathan
comment:4 Changed 7 years ago by graybeal
Looks pretty good to me. Minor comments:
"It is not recommended to have two multiple-valued time dimensions for the data variable." This sentence would be more useful if it ends ", because that <does something bad>." The implicit reason as it stands is "because that isn't flexible", which seems a little weak. There is a great power in always being able to find a 'time' coordinate, and it would be the natural choice for the naive. The argument for not organizing the file that way should be a strong one.
To avoid the odd standard_names construct in the wording of 4.4.2, I suggest the wording "having a standard_name of any of time, ...".
comment:5 Changed 7 years ago by markh
It is not recommended to have two multiple-valued time dimensions for the data variable.
I am concerned about this 'recommendation'. Having a time dimension with a time(time) coordinate and a forecast_period dimension with a forecast_period(forecast_period) coordinate is a useful case which is common in many of our data sets.
In this case there is generally a 2D auxiliary coordinate of forecast_reference_time(time, forecast_period); this quite often degenerates to a single value.
Similarly we have numerous data sets which have a time dimension with a time(time) coordinate and a forecast_reference_time dimension with a forecast_reference_time(forecast_reference_time) coordinate.
In this case there is generally a 2D auxiliary coordinate of forecast_period(time, forecast_reference_time).
Both of these cases are deemed practical and are in widespread use in the communities I am in contact with. There are a number of data analysis processes which are assisted by the data sets being structured in this way.
What is the intent of 'not recommended'?
Is this proposal intending to tell users not to encode their data sets in this way?
Is this proposal intending to have a validation rule for CF to indicate that this is invalid?
mark
comment:6 follow-up: ↓ 8 Changed 7 years ago by jonathan
Dear John and Mark
You have both commented on the recommendation to use a discrete axis in the case where both forecast time and analysis time are multivalued. First let me say that this is proposed only as a recommendation, not as a requirement. Stating it like this means it is not invalid to do things in the other ways which Mark describes. It means that the CF checker, for instance, would give a warning, but not an error.
Cases like the ones which Mark mentions are considered in the CF email which John refers to, from many years ago! The case with multiple validity and analysis times, where all combinations exist, is case (iii) in the list from 2006, which reports a discussion from 2003. But the outcome of the discussion wasn't adopted into the convention at that time so it's not surprising data has been written in other ways since then.
I agree with Mark that the structure with time and forecast_period dimensions, for forecasts for a set of validity times each made with various forecast periods, could be convenient. That one is not in the list from 2006, which does on the other hand include forecasts made from a set of analysis times for a set of forecast periods. I don't follow why the forecast_reference_time in Mark's case often degenerates to a single value. Wouldn't it be more obvious to treat this case as a single set of forecasts from a single analysis with a range of different validity times (and corresponding forecast periods)? This is then not a case which needs two multivalued time dimensions.
There is no reason for the proposed recommendation except that it makes life easier for data-users if the number of cases they have to be able to deal with is minimised. The case with the discrete axis is the only way to handle situations where there are missing combinations. For instance, you may have a "triangular" set, with forecasts from a range of analysis times, for a given validity time and later. Forecasts from earlier analysis times will have validity times going less far into the future (if the forecast system uses the same set of forecast periods on each run). Or you might just have randomly missing combinations or ones you were not interested in. Hence the data-analyst has to be able to deal with this case (v in the list from 2006). This structure will also work for the cases which could be represented by two multivalued axes, but it would be easier if the data-analyst doesn't have to be prepared to deal with all three of those possible cases (iii and iv in the list from 2006, and the case Mark describes).
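The "triangular" case described above can be sketched by flattening only the combinations that actually exist onto a single discrete axis, as in Example 4.7. This is an illustrative Python sketch, not CF code; the analysis times, forecast periods, and the cutoff rule are invented for the example:

```python
# Hypothetical runs: each analysis uses the same forecast periods, but we
# keep only forecasts valid at hour 24 or later (a "triangular" subset).
analysis_times = [0., 6., 12.]          # hours since some reference epoch
forecast_periods = [6., 12., 18., 24.]  # the same set for every run

reftime, period, valtime = [], [], []
for a in analysis_times:
    for p in forecast_periods:
        v = a + p
        if v >= 24.:                    # drop the combinations we don't want
            reftime.append(a)
            period.append(p)
            valtime.append(v)

# The three lists share one "record" dimension, exactly as in Example 4.7;
# a 2D (analysis x period) layout would need missing values for the gaps.
print(list(zip(reftime, period, valtime)))
```

Because the subset is ragged, the single discrete axis stores only the six existing records, whereas two multivalued time dimensions would force a 3 x 4 array with missing data.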
In summary: there is no justification other than that this appears to be the simplest convention which is sufficiently flexible. Other conventions make life more complicated for data-users, but they are not prohibited. What do you and others think?
Your wording change is fine with me, John. Thanks.
Cheers
Jonathan
comment:7 Changed 7 years ago by taylor13
Dear all,
I agree that the proposed change is a flexible and simple approach to handling forecasts, but if there are conventional ways of doing this that involve two coordinates, perhaps we should recommend that they be used for common special cases. For example, for the SPECS Project, which focuses on seasonal to decadal prediction, they require two time variables as described in their data specifications document (authored by Pierre-Antoine Bretonniere). Here is an excerpt:
One of the novelties of the SPECS conventions is the requirement for two time variables: one being called time(time), which corresponds to the verification time of the forecast, and one called leadtime. Both of these variables are mandatory in the files and must have the following:
double time(time) ;
  time:units = "days since 1850-01-01" ;
  time:bounds = "time_bnds" ;
  time:long_name = "Verification time of the forecast" ;
  time:standard_name = "time" ;
  time:axis = "T" ;
double leadtime(time) ;
  leadtime:units = "days" ;
  leadtime:long_name = "Time elapsed since the start of the forecast" ;
  leadtime:standard_name = "forecast_period" ;
Note that they also store the forecast_reference_time as a global attribute or it can be calculated as the difference between "time" and "leadtime".
I recall there was much discussion about how to do this some time ago, and I will write to Pierre-Antoine and Paco Doblas-Reyes to see why they preferred this approach rather than doing something along the lines that Jonathan suggested above (i.e., using a single time dimension with standard name "time" to indicate verification time and a singleton dimension to indicate the forecast_reference_time).
[sorry that I haven't reviewed the emails from 2003 and 2006, which might contain the answer.]
cheers, Karl
comment:8 in reply to: ↑ 6 Changed 7 years ago by markh
Replying to jonathan:
Thank you for your comments, Jonathan. A few thoughts:
You have both commented on the recommendation to use a discrete axis in the case where both forecast time and analysis time are multivalued. First let me say that this is proposed only as a recommendation, not as a requirement. Stating it like this means it is not invalid to do things in the other ways which Mark describes. It means that the CF checker, for instance, would give a warning, but not an error.
I find it odd that these cases would return a warning from a CF checker. If this is the intent of a recommendation then I think that this proposal is too restrictive in approach. I would like data sets with two temporal dimensions to be recognised as valid CF-netCDF, not reported with a warning of potential bad practice.
I agree with Mark that the structure with time and forecast_period dimensions, for forecasts for a set of validity times each made with various forecast periods, could be convenient. That one is not in the list from 2006, which does on the other hand include forecasts made from a set of analysis times for a set of forecast periods. I don't follow why the forecast_reference_time in Mark's case often degenerates to a single value. Wouldn't it be more obvious to treat this case as a single set of forecasts from a single analysis with a range of different validity times (and corresponding forecast periods)? This is then not a case which needs two multivalued time dimensions.
You are right, I don't think the degenerate case helps the discussion here.
mark
comment:9 Changed 7 years ago by jonathan
Dear Mark and Karl
I don't feel strongly about limiting the number of structures. That we agreed on this years ago isn't a crucial argument, since we didn't put it in the convention then. The CF convention should be as simple as possible, but no simpler!
If the case with forecast_period and validity time axes is recommended elsewhere (by SPECS) that's an argument for including it, with an example like the one Karl gave. The 2D case with forecast_period and analysis time axes is equally natural and it would be inconsistent to deprecate one and not the other. What about the case with axes of analysis time and validity time? It seems to me that you are less likely to have all possible combinations in that case: some of the validity times will be too far into the future or will be already in the past for some of the analyses, given that it would be usual to use a consistent set of forecast periods.
Best wishes
Jonathan
comment:10 Changed 5 years ago by martin.juckes
Hello All,
I've come to this ticket after it was mentioned in recent correspondence. While it makes perfect sense to have a recommended approach to dealing with arbitrary combinations of reference time and forecast period coordinate values, I feel that this is not a good solution for the case of a block which can be expressed as the product of two linear coordinates. That is, we might have data for a set of reference times and validation times (all forecasts for a given day for a range of lead times), or for a set of reference times and lead times or, conceivably, for a set of lead times and validation times. In each case users may reasonably expect to find two independent dimensions on the data array. I would prefer to see a recommendation for dealing with such regular cases. E.g.
dimensions:
  t1 = 9 ;
  t2 = 8 ;
variables:
  double t1(t1);
    t1:standard_name = "<sn1>" ;
    t1:units = "<un1>" ;
  double t2(t2);
    t2:standard_name = "<sn2>" ;
    t2:units = "<un2>" ;
  double t3(t1,t2);
    t3:standard_name = "<sn3>" ;
    t3:units = "<un3>" ;
  float temp(t1,t2,level,lat,lon);
    temp:long_name = "Air temperature on model levels" ;
    temp:standard_name = "air_temperature" ;
    temp:units = "K" ;
    temp:coordinates = "t3" ;
where {(sn1, un1)} etc. can be any permutation of the following:
(time, hours since 2003-01-01 00:00)
(forecast_reference_time, hours since 2003-01-01 00:00)
(forecast_period, hours)
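In this layout the third time quantity t3 is fully determined by the other two. For the permutation where t1 is the validity time and t2 is forecast_reference_time, the 2D forecast_period coordinate is an element-wise difference over the product grid; a minimal Python sketch (small invented coordinate values, names following Martin's t1/t2/t3):

```python
# t1: validity time, t2: forecast_reference_time,
# both in hours since 2003-01-01 00:00 (invented example values)
t1 = [12., 24., 48.]
t2 = [0., 6., 12.]

# t3: forecast_period in hours, shape (len(t1), len(t2));
# every (validity, reference) combination exists in this regular case
t3 = [[v - r for r in t2] for v in t1]
print(t3)  # [[12.0, 6.0, 0.0], [24.0, 18.0, 12.0], [48.0, 42.0, 36.0]]
```

The same pattern gives the 2D validity time (sum rather than difference) when t1 and t2 are reference time and forecast period.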
Regards, Martin
comment:11 Changed 5 years ago by caron
Funny this topic comes up now, as I've been meaning to propose a new convention that is exactly this topic, and indeed fits precisely into Martin's proposal.
The gist of the issue is that when we combine model output from multiple runs, we have both a reftime and a set of time offsets from each reference time. The common case is that each reftime has the same set of offsets, which can be expressed succinctly as:
double reftime(reftime);
  reftime:standard_name = "forecast_reference_time" ;
  reftime:units = "hours since 2003-01-01 00:00" ;
double offset(offset);
  offset:long_name = "time offset from reference" ;
  offset:units = "hours" ;
and data values can then use:
float temp(reftime,offset,level,lat,lon);
There are two big advantages of this over the recommendation here of adding a single dimension that lists all the coordinates in a 1D array.
- it's easy to understand what set of coordinates is possible, namely {reftime} x {offset}
- the set of coordinates is m + n instead of m x n. I've been dealing with very large datasets, e.g. reanalyses that have upwards of 100,000 reference times and 10-100 offsets.
So the additional convention needed is to create a new type of time coordinate which has units of time (not datetime), and can only be used in conjunction with a reference time:
double offset(offset);
  offset:standard_name = "forecast_offset_time" ;
  offset:long_name = "time offset from reference" ;
  offset:units = "hours" ;
Then the "forecast time" is a simple calculation from the offset time and reference time coordinates.
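The calculation John refers to is an outer sum over the two factored axes: each (reftime, offset) pair yields one forecast valid time. A Python sketch with invented coordinate values (not CF code; it just shows that the m + n stored values recover all m x n forecast times):

```python
# Factored coordinates, as in the proposed reftime(reftime) / offset(offset)
# layout; values are hours since 2003-01-01 00:00 and hours, respectively.
reftime = [0., 6., 12.]   # m = 3 reference times
offset = [6., 12.]        # n = 2 offsets, shared by every run

# Forecast valid time for every combination: shape (m, n) from m + n values
valid = [[r + o for o in offset] for r in reftime]
print(valid)  # [[6.0, 12.0], [12.0, 18.0], [18.0, 24.0]]
```

With the reanalysis sizes John quotes (100,000 reference times, up to 100 offsets) the factored form stores about 100,100 coordinate values against 10,000,000 for the flattened discrete axis.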
comment:12 Changed 5 years ago by martin.juckes
Hello John,
I think that the standard name "forecast_period" can be used as an offset time in the forecast context, but I agree that it would be good to have an option for a generic "offset_time" for other applications,
regards, Martin
comment:13 Changed 5 years ago by jonathan
Dear John
As Martin says, the existing forecast_period standard name is your time offset. I would not be in favour of a generic name, unless a generic use-case comes up which requires it - being more specific is generally more informative.
I agree that two time dimensions are a more efficient way to handle this situation if all combinations are present. We could have examples of this too. However, in a separate email discussion, I think (if I have understood correctly - not having read the emails very carefully) it's been suggested that some software would be upset by having more than one multivalued time dimension. The arrangement with a single time dimension and multiple auxiliary coordinate variables is more flexible, and especially it can deal with the case of missing combinations, so I think we should add it (the main purpose of this ticket).
Best wishes
Jonathan
Modify the FAQ answer for "How can I describe a file with multiple time coordinates (e.g., run time AND valid or forecast time)" when this ticket is implemented in the standard.