
[CF-metadata] Encoding Errors on variables in CF

From: Jonathan Gregory <jonathan.gregory>
Date: Sun, 30 Mar 2003 22:58:01 +0100

Dear All

I agree with Brian that if we include prefixes like "error_on" in the standard
name, we could end up with a tremendous number of standard names - it is quite
possible that every data variable could have a statistically derived standard
error, for instance. As Brian says, that is a reason why we record statistical
processing like "mean" and "standard deviation" in cell_methods rather than in
the standard name. Therefore I retract my suggestion we should distinguish
these error quantities through the standard name.

The idea of making links between variables depends on the assumption that the
error variable can only exist in the company of the data variable it belongs
to, because the links are the identification of its role. This idea is like
bounds on coordinate variables, which are subsidiary to the coordinate
variables and have no attributes of their own. A problem with this, in my view,
is that we might choose to store the error variables apart from the variables
they belong to. I have come across situations, for instance, in which separate
files contained corresponding sets of variables, one file for the data values,
one for the measurement uncertainty, one for the data quality flags, etc.
Links between files would require encoding filenames in files, which feels
unsafe to me. Moreover, since the link is on the main variable, if you look at
the standard error variable, for instance, you can't tell what its purpose is.
You can't even tell what it belongs to unless it has a link pointing back to
the main variable, as in Ag's example. Creating reverse links doubles the work
and the potential for things to go wrong.

I tend to feel it is more reliable to identify a quantity by labelling it,
rather than by labelling something else to point to it. That's why I proposed
modifying the standard name. However, in view of the problem with that, perhaps
a better idea would be to put the extra information into a separate attribute
whose purpose is to indicate how this data variable affects the interpretation
of another data variable. For the sake of argument, call this attribute
"intent". For example:

  float no2(time) ;
    no2:standard_name = "nitrogen_dioxide_volume_mixing_ratio" ; // UNOFFICIAL
    no2:units = "1e-9" ;
  float no2_error_limit(time) ;
    no2_error_limit:standard_name = "nitrogen_dioxide_volume_mixing_ratio" ;
    no2_error_limit:units = "1e-9" ;
    no2_error_limit:intent = "standard_error" ;

To find the standard error, you search the file, or files, for a data variable
with the same standard name (and other metadata, if necessary, to distinguish)
as the main variable and the appropriate intent attribute. This method means
the variables are all identifying themselves and carry around their own
metadata. There can be more than one kind of error variable for each data
variable - this presents no problem. By the way, how does the detection limit
differ from the standard error?
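
The search described above can be sketched as follows. This is only an
illustration of the lookup, assuming the variables' attributes have already
been read into plain dictionaries; the function name find_companions and the
dictionary layout are hypothetical, and "intent" is the attribute proposed in
this message, not an existing CF attribute.

```python
def find_companions(variables, main_name, intent):
    """Return names of variables that share the main variable's
    standard_name and carry the requested (proposed) intent attribute."""
    target = variables[main_name]["standard_name"]
    return [
        name
        for name, attrs in variables.items()
        if name != main_name
        and attrs.get("standard_name") == target
        and attrs.get("intent") == intent
    ]


# Attributes as they might look after reading one or more files.
variables = {
    "no2": {
        "standard_name": "nitrogen_dioxide_volume_mixing_ratio",
        "units": "1e-9",
    },
    "no2_error_limit": {
        "standard_name": "nitrogen_dioxide_volume_mixing_ratio",
        "units": "1e-9",
        "intent": "standard_error",
    },
    "current_speed_qc": {
        "standard_name": "sea_water_speed",
        "intent": "data_quality",
    },
}

matches = find_companions(variables, "no2", "standard_error")
```

Because each variable identifies itself, the same lookup works whether the
dictionaries were filled from one file or from several.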

The possible values of the intent attribute would be standardised and defined.
This is much like cell_methods. As with cell_methods, they might modify the
units implied by the standard name. For instance, a percentage error and a data
quality flag are dimensionless. Standardisation is necessary so that you know
what you've got. In Ag's example, it was 2 * standard error that was
supplied. It is important to know whether it's once, twice, three times s.e.,
the width of the 5-95% confidence interval, or whatever. A generalisation of
the above is needed. We could define intents such as "2_standard_errors" or we
could have a further attribute e.g.

    no2_error_limit:intent = "standard_error" ;
    no2_error_limit:error_multiplier = 2.0 ;
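
A reader of such a variable could recover one standard error along these
lines. This is a sketch only: the function name is hypothetical, and treating
a missing error_multiplier as 1.0 is an assumption of the example, not
something proposed above.

```python
def one_standard_error(values, attrs):
    """Scale stored error values back to a single standard error."""
    if attrs.get("intent") != "standard_error":
        raise ValueError("variable is not marked as a standard error")
    # Assumption: absent multiplier means the values are 1 * s.e.
    multiplier = attrs.get("error_multiplier", 1.0)
    return [value / multiplier for value in values]


attrs = {"intent": "standard_error", "error_multiplier": 2.0}
scaled = one_standard_error([0.5, 1.0], attrs)
```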

Another intent could be "data_quality", the example given by John Evans. This
raises the interesting issue of how to describe what the possible values are.
John has recorded the values in special attributes e.g.

    current_speed_qc:standard_name = "sea_water_speed" ;
    current_speed_qc:intent = "data_quality" ;
    current_speed_qc:quality_good = 0b ;
    current_speed_qc:sensor_nonfunctional = 1b ;
    current_speed_qc:outside_valid_range = 2b ;

This is nice because it makes the file self-describing, but is perhaps awkward
for translating from the values to their meanings? Also it would lead to
proliferation of different kinds of attribute.

I guess it would be hopeless to try to define standard values or names for the
quality states, so the best we can do is to provide a standardised way of
tabulating what the quality states mean. What about gathering John's attributes
together in this kind of way:

    current_speed_qc:flag_values = 0b, 1b, 2b ;
    current_speed_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;

where the values and their meanings are associated one-to-one? This would be
equally easy to use for translations either way, and would define just two
attributes which could be standardised, even though their values could not be.
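
To illustrate the two-way translation, here is a minimal sketch assuming the
two attributes have been read as a list of integers and a blank-separated
string; the function name flag_tables is hypothetical.

```python
def flag_tables(flag_values, flag_meanings):
    """Build value->meaning and meaning->value lookups from the
    one-to-one pairing of flag_values with flag_meanings."""
    meanings = flag_meanings.split()
    if len(meanings) != len(flag_values):
        raise ValueError("flag_values and flag_meanings differ in length")
    value_to_meaning = dict(zip(flag_values, meanings))
    meaning_to_value = dict(zip(meanings, flag_values))
    return value_to_meaning, meaning_to_value


value_to_meaning, meaning_to_value = flag_tables(
    [0, 1, 2],
    "quality_good sensor_nonfunctional outside_valid_range",
)

# Translate a short run of stored quality codes into their meanings.
decoded = [value_to_meaning[code] for code in [0, 2, 0]]
```

The length check is there because the pairing only makes sense if the two
attributes list the same number of items.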

Best wishes

Jonathan
Received on Sun Mar 30 2003 - 14:58:01 BST

