Dear Jonathan,
Firstly, having digested your message below, I am beginning to get an idea
of just what a mammoth task you have taken on in trying to put together the
CF Convention.
Here are my thoughts on what you have suggested.
- I agree with the notion that linking variables opens a can of worms in
terms of linking files and the variables getting separated at some point in
the future.
>>By the way, how does the detection limit differ from the standard error?
- The limit of detection is derived from background or blank readings
(i.e. baseline noise) taken on the instrument, using both their mean and
their spread:
x(L) = x(bi) + k.s(bi)
where x(bi) is the mean of the blank measures, s(bi) is the standard
deviation of the blank measures, and k is a numerical factor chosen
according to the confidence level desired. For example, k could be 3 for a
confidence limit of > 99.5%.
Whereas, standard error is (usually) defined as the standard deviation
divided by the square root of the number of samples:
S.E.=Standard deviation / (n)^0.5
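
To make the contrast concrete, here is a minimal Python sketch of the two
formulas above. The function names and the blank readings are made up for
illustration, and the use of the sample (n-1) standard deviation is an
assumption, since the formulas do not say which form is intended.

```python
import math
import statistics


def detection_limit(blank_readings, k=3.0):
    """x(L) = x(bi) + k.s(bi), computed from a set of blank readings."""
    mean_blank = statistics.mean(blank_readings)
    # Assumption: sample (n-1) standard deviation of the blank readings.
    sd_blank = statistics.stdev(blank_readings)
    return mean_blank + k * sd_blank


def standard_error(samples):
    """S.E. = standard deviation / sqrt(n)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))


# Hypothetical blank readings, for illustration only.
blanks = [0.10, 0.12, 0.08, 0.11, 0.09]
print("detection limit:", detection_limit(blanks, k=3.0))
print("standard error: ", standard_error(blanks))
```

The detection limit sits above the mean blank level and does not shrink as
more readings are taken, whereas the standard error falls as n grows.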
- In your 'intent' examples you have used the same standard_name for the
variable and the associated error variable. I can see this causing problems
for software that is only looking for standard_names to identify variables
so I would not recommend this route. And I am concerned that it will be
difficult to define all the possibilities for uncertainty/error types and
that you will end up with a separate error table like the standard name
table that continues to grow. Because of this I find I am still favouring
the more simplistic "error_on_" or "uncertainty_on_" prefix on the
standard_name with the information held in the comment field regarding the
actual content.
I think my key concern here is that people report errors in so many ways,
it's preferable to give a generic method of describing them which avoids you
having to try and catch them all in advance.
- However, addressing some of your other points:
>>> no2_error_limit:intent = "standard_error" ;
>>> no2_error_limit:error_multiplier=2.0;
I think this is a sensible option if intent is used in the above way.
- In terms of error flags, I would favour the second version where you
include the 'flag_values' and 'flag_meanings' pairing:
>>> current_speed_qc:flag_values=0b, 1b, 2b;
>>> current_speed_qc:flag_meanings="quality_good sensor_nonfunctional "
>>> "outside_valid_range";
- So, overall, my personal opinion is that we are in danger of seriously
complicating things by adding a whole load of new possibilities. But equally
my suggested route is not ideal. I'm still thinking about it.
Ag
-----Original Message-----
From: Jonathan Gregory [mailto:jonathan.gregory at metoffice.com]
Sent: 30 March 2003 22:58
To: cf-metadata at cgd.ucar.edu
Cc: jonathan.gregory at metoffice.com
Subject: [CF-metadata] Encoding Errors on variables in CF
Dear All
I agree with Brian that if we include prefixes like "error_on" in the standard
name, we could end up with a tremendous number of standard names - it is quite
possible that every data variable could have a statistically derived standard
error, for instance. As Brian says, that is a reason why we record statistical
processing like "mean" and "standard deviation" in cell_methods rather than in
the standard name. Therefore I retract my suggestion we should distinguish
these error quantities through the standard name.
The idea of making links between variables depends on the assumption that the
error variable can only exist in the company of the data variable it belongs
to, because the links are the identification of its role. This idea is like
bounds on coordinate variables, which are subsidiary to the coordinate
variables and have no attributes of their own. A problem with this, in my view,
is that we might choose to store the error variables apart from the variables
they belong to. I have come across situations, for instance, in which separate
files contained corresponding sets of variables, one file for the data values,
one for the measurement uncertainty, one for the data quality flags, etc.
Links between files would require encoding filenames in files, which feels
unsafe to me. Moreover, since the link is on the main variable, if you look at
the standard error variable, for instance, you can't tell what its purpose is.
You can't even tell what it belongs to unless it has a link pointing back to
the main variable, as in Ag's example. Creating reverse links doubles the work
and the potential for things to go wrong.
I tend to feel it is more reliable to identify a quantity by labelling it,
rather than by labelling something else to point to it. That's why I proposed
modifying the standard name. However, in view of the problem with that, perhaps
a better idea would be to put the extra information into a separate attribute
whose purpose is to indicate how this data variable affects the interpretation
of another data variable. For the sake of argument, call this attribute
"intent". For example:
float no2(time);
  no2:standard_name = "nitrogen_dioxide_volume_mixing_ratio" ; ### UNOFFICIAL
  no2:units = "1e-9" ;
float no2_error_limit(time) ;
  no2_error_limit:standard_name = "nitrogen_dioxide_volume_mixing_ratio";
  no2_error_limit:units = "1e-9" ;
  no2_error_limit:intent = "standard_error" ;
To find the standard error, you search the file, or files, for a data variable
with the same standard name (and other metadata, if necessary, to distinguish)
as the main variable and the appropriate intent attribute. This method means
the variables are all identifying themselves and carry around their own
metadata. There can be more than one kind of error variable for each data
variable - this presents no problem. By the way, how does the detection limit
differ from the standard error?
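
As a rough illustration of the search described above, here is a minimal
Python sketch. It assumes the variables' attributes have already been read
into plain dictionaries; the function name find_companions and the "o3" entry
with its standard name are hypothetical, and "intent" is only the attribute
proposed here, not an agreed part of the convention.

```python
def find_companions(variables, main_name, intent):
    """Return names of variables that share the main variable's
    standard_name and carry the requested (proposed) intent attribute."""
    main = variables[main_name]
    return [
        name
        for name, attrs in variables.items()
        if name != main_name
        and attrs.get("standard_name") == main.get("standard_name")
        and attrs.get("intent") == intent
    ]


# Attributes as they might be read from one or more files (illustrative).
variables = {
    "no2": {"standard_name": "nitrogen_dioxide_volume_mixing_ratio"},
    "no2_error_limit": {
        "standard_name": "nitrogen_dioxide_volume_mixing_ratio",
        "intent": "standard_error",
    },
    # Hypothetical unrelated variable, to show it is not picked up.
    "o3": {"standard_name": "ozone_volume_mixing_ratio"},
}

print(find_companions(variables, "no2", "standard_error"))
```

Because each variable identifies itself, the same lookup works whether the
dictionaries come from one file or from several.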
The possible values of the intent attribute would be standardised and defined.
This is much like cell_methods. As with cell_methods, they might modify the
units implied by the standard name. For instance, a percentage error and a data
quality flag are dimensionless. Standardisation is necessary so that you know
what you've got. In Ag's example, it was a 2 * standard error that was
supplied. It is important to know whether it's once, twice, three times s.e.,
the width of the 5-95% confidence interval, or whatever. A generalisation of
the above is needed. We could define intents such as "2_standard_errors" or we
could have a further attribute e.g.
no2_error_limit:intent = "standard_error" ;
no2_error_limit:error_multiplier=2.0;
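
A reader of such a variable could recover one standard error by dividing by
the multiplier. The following Python sketch shows the idea; the function name
is made up, and treating a missing error_multiplier as 1.0 is an assumption
rather than anything agreed in this thread.

```python
def one_standard_error(values, attrs):
    """Scale stored error values back to a single standard error,
    using the proposed intent and error_multiplier attributes."""
    if attrs.get("intent") != "standard_error":
        raise ValueError("variable is not marked as a standard error")
    # Assumption: a missing multiplier means the values are 1 x s.e.
    multiplier = float(attrs.get("error_multiplier", 1.0))
    return [v / multiplier for v in values]


attrs = {"intent": "standard_error", "error_multiplier": 2.0}
print(one_standard_error([0.4, 1.0], attrs))
```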
Another intent could be "data_quality", the example given by John Evans. This
raises the interesting issue of how to describe what the possible values are.
John has recorded the values in special attributes e.g.
current_speed_qc:standard_name="sea_water_speed";
current_speed_qc:intent="data_quality";
current_speed_qc:quality_good = 0b ;
current_speed_qc:sensor_nonfunctional = 1b ;
current_speed_qc:outside_valid_range = 2b ;
This is nice because it makes the file self-describing, but is perhaps awkward
for translating from the values to their meanings? Also it would lead to
proliferation of different kinds of attribute.
I guess it would be hopeless to try to define standard values or names for the
quality states, so the best we can do is to provide a standardised way of
tabulating what the quality states mean. What about gathering John's attributes
together in this kind of way:
current_speed_qc:flag_values=0b, 1b, 2b;
current_speed_qc:flag_meanings="quality_good sensor_nonfunctional "
"outside_valid_range";
where the values and their meanings are associated one-to-one? This would be
equally easy to use for translations either way, and would define just two
attributes which could be standardised, even though their values could not be.
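
To show how the one-to-one pairing could be used in both directions, here is
a minimal Python sketch. It assumes flag_meanings is a single string of
blank-separated words, as in the example above; the function name flag_table
is made up for illustration.

```python
def flag_table(flag_values, flag_meanings):
    """Pair each flag value with its meaning, one-to-one and in order."""
    meanings = flag_meanings.split()
    if len(meanings) != len(flag_values):
        raise ValueError("flag_values and flag_meanings differ in length")
    return dict(zip(flag_values, meanings))


value_to_meaning = flag_table(
    [0, 1, 2],
    "quality_good sensor_nonfunctional outside_valid_range",
)
# The reverse lookup is just the inverted dictionary.
meaning_to_value = {m: v for v, m in value_to_meaning.items()}

print(value_to_meaning[1])
print(meaning_to_value["outside_valid_range"])
```

The length check is there because a mismatch between the two attributes
would silently mislabel the flags.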
Best wishes
Jonathan
Received on Thu Apr 03 2003 - 00:46:57 BST