
[CF-metadata] CF and multi-forecast system ensemble data

From: Bryan Lawrence <b.n.lawrence>
Date: Fri, 27 Oct 2006 10:22:58 +0100

Hi Folks

It seems that there are two threads to the massive email from
Jamie, Paco, and Jonathan that we've seen (and which is not included
below :-).

The first is how to expedite aggregation, which is the point John has
picked up on, and the second is how to deal with the special case of
ensemble metadata. In this email I'm going to try and boil down what I
think the issues are.

The suggestion proposed, as far as I can see, reduces to:
 
a) we should create some standard names which exactly correspond to
some recommended global variables, and model integrations should use
global variables where they contain exactly one realization to indicate
these.

b) upon aggregation, one should use the realization dimension (called
joinDim by John and realization by the triumvirate) as both a dimension
on the aggregated variable and a key to the metadata about the
realizations which is extracted from global variables and populates some
new variables which are essentially metadata variables. (This automatic
extraction would be based on a correspondence between standard names and
global attribute names)

At this point, if we drop the word "model" from point a) along with the
constraint on realization, then what we have is a rule for how to
construct global metadata in aggregated netCDF files.

Note that CF doesn't *require* any global attributes (2.6.2), so no code
can rely on them being present, nor does it *limit* global attributes to
only those defined. (Which is good, because I think those defined are
not precise enough for me, nor for Paco, judging by the discussion).

I would have thought it makes sense that, in producing any aggregation,
any of the original metadata which distinguishes the original files
(and their contents) should appear in metadata associated with the join
dimension. From a coding point of view the problem will arise in the
join when the individual files don't all have the appropriate global
attributes (trust me, it'll happen). Anyway from a CF point of view this
produces CF issue A: What is the best method of providing metadata on an
aggregation dimension and indicating that's what has been done?
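
To make the join problem concrete, here is a minimal sketch in Python. It models each input file as a plain dict of global attributes rather than a real netCDF file. The function name `aggregate_global_attributes`, the `missing` fill value, and the example attribute values are hypothetical; none of this is defined by CF, and the rule for deciding which attributes stay global is only one possible choice.

```python
def aggregate_global_attributes(files, missing=""):
    """Turn per-file global attributes into per-realization metadata.

    `files` is a list of dicts, one per input file, mapping global
    attribute names to values. Attributes whose value is identical in
    every file stay global; any attribute that differs between files,
    or is absent from some of them, becomes a list indexed by the join
    (realization) dimension, with `missing` filling the gaps.
    """
    # Collect attribute names in first-seen order.
    names = []
    for attrs in files:
        for name in attrs:
            if name not in names:
                names.append(name)

    shared, per_realization = {}, {}
    for name in names:
        values = [attrs.get(name, missing) for attrs in files]
        present_everywhere = all(name in attrs for attrs in files)
        if present_everywhere and all(v == values[0] for v in values):
            shared[name] = values[0]
        else:
            per_realization[name] = values
    return shared, per_realization


# Hypothetical inputs: the third file lacks the "source" attribute.
files = [
    {"institution": "Centre A", "source": "model X v1", "Conventions": "CF-1.0"},
    {"institution": "Centre B", "source": "model Y v2", "Conventions": "CF-1.0"},
    {"institution": "Centre C", "Conventions": "CF-1.0"},
]
shared, per_realization = aggregate_global_attributes(files)
```

Here `Conventions` remains a global attribute, while `institution` and `source` become metadata along the join dimension, with an empty string standing in for the file that had no `source`.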

(Issue B: might be to make sure our solution works when we aggregate two
files that have already been aggregated).
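
One way to think about Issue B, sketched in Python under simplifying assumptions: if every input is described by the length of its realization dimension plus per-realization metadata lists, then a single un-aggregated file is just the size-1 case and the same join applies at every level. The dict layout (`"size"`, `"metadata"`) and the name `join_aggregations` are hypothetical, chosen only for illustration.

```python
def join_aggregations(aggregations, missing=""):
    """Concatenate per-realization metadata from several aggregations.

    Each input is a dict with "size" (length of its realization
    dimension) and "metadata" (name -> list of that length).
    """
    # Collect metadata names in first-seen order.
    names = []
    for agg in aggregations:
        for name in agg["metadata"]:
            if name not in names:
                names.append(name)

    joined = {name: [] for name in names}
    for agg in aggregations:
        for name in names:
            # Pad with `missing` when this input never carried the attribute.
            values = agg["metadata"].get(name, [missing] * agg["size"])
            if len(values) != agg["size"]:
                raise ValueError(f"{name!r} does not match realization size")
            joined[name].extend(values)
    return {"size": sum(a["size"] for a in aggregations), "metadata": joined}


# Hypothetical inputs: an existing two-member aggregation plus one new file.
first = {"size": 2, "metadata": {"source": ["model X", "model Y"]}}
second = {"size": 1, "metadata": {"source": ["model Z"], "institution": ["Centre C"]}}
result = join_aggregations([first, second])
```

The result has a realization dimension of length 3, and `institution` is padded for the two members that never carried it.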

I'm sure John has thought more than most (especially me) about this, so
I'll shut up about aggregation for now ...

At this point there is no necessity for any modifications of standard
names; the discussion thus far is simply about how to use what
attributes we have, but:

Issue C: should we have a standard name modifier associated with a
variable on the aggregation dimension to show that it was produced by
aggregating file attributes?

They also got into what I would call the specifics of variable metadata
needed to distinguish ensemble member identity and characteristics.
These may or may not need standard names, but let's start with the
principle. Are these not simply special cases of variable metadata? With
the possible exception of ensemble_weight, all the others are intended
to be either character strings or url links to information about
variable metadata. In principle we could want exactly the same
facilities for any variables (e.g. a bunch of station data in one file,
we might want to have a lot more characteristics of each station linked
by the station dimension in exactly the same way as the realization
dimension could be used).

So, I think what we have here is a discussion about what are the
variable attributes (metadata) required to distinguish ensemble members
(or model integrations)? This could go way beyond what we have seen in
the email. Jamie has referred to our work on NumSim
(http://proj.badc.rl.ac.uk/ndg/wiki/NumSim), and there is also the work
on Numerical Model Metadata
(http://www.cgam.nerc.ac.uk/pmwiki/NMM/index.php/). Similarly, the
metadata one *might* want to attach to any observation goes way beyond
what one can mandate in the CF file (e.g. SensorML etc).

So there is a continuum of information one could have about a variable
(indexed by station or realization or whatever), ranging from simply
identity (e.g. item number in the sequence) to its source, to the
entire content of one of these external schemas.

CF has a number of ways of actually adding such information: we can use
labels (6.1), ancillary data, etc. So, without going into the
specifics of the triumvirate discussion, the issue really is:

Issue-D: how far should CF go into providing standard names for
describing variable metadata?

Quite clearly there is utility in new standard names, but in doing so
are we not straying into the governance territory of others (e.g.
SensorML)? I've argued quite strongly that we shouldn't do any more of
this, because every time we do, we add to the CF maintenance problem
(e.g. I think we already have problems with gazetteers).

Now I know (because I talked to Jamie on the phone) that the triumvirate
could argue that in the case of model metadata, it would be helpful for
CF to mandate this stuff at least for the numerical model community.

Now we can do that ... but ...

I didn't want to make a proposal in this email, but I think we should
have a standard mechanism of indicating that a specific character string
variable set comes from an external vocabulary indicated by a URI,
something like:

variables:
  float temperature(realization, ...);
    temperature:coordinates = "realization ...";
    temperature:ancillary_variables = "metadata";
  char metadata(realization, len80);
    metadata:external_dictionary = "http://someExternalGovernanceBody";
(i.e. a new modifier).

Then, IPCC, or WMO or whoever, could come up with the dictionaries they
want, and the cf file would still be self-describing, cf-software could
still manipulate it, but communities could write software that made use
of these extra characteristics, without CF having to govern everything
(and add all the components of these external schema to CF, piece by
piece, with detailed arguments for each one to go in the standard name
table).
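
As a rough illustration of how community software might use such a file, here is a minimal Python sketch. Variables are modelled as plain dicts instead of being read from a netCDF file, and the vocabulary is an in-memory dict keyed by URI; how a real dictionary would be fetched or structured is not specified here. The function name `resolve_labels`, the label values, and the vocabulary contents are hypothetical, and `external_dictionary` is the proposed attribute, not an existing CF one.

```python
def resolve_labels(variables, data_name, dictionaries):
    """Look up ancillary labels of `data_name` in external vocabularies.

    `variables` maps variable names to {"attrs": {...}, "values": [...]}.
    `dictionaries` maps a vocabulary URI to {label: description}.
    """
    resolved = {}
    ancillary = variables[data_name]["attrs"].get("ancillary_variables", "")
    for name in ancillary.split():
        var = variables[name]
        uri = var["attrs"].get("external_dictionary")
        if uri is None or uri not in dictionaries:
            # Generic CF software can still carry the labels around as-is.
            resolved[name] = list(var["values"])
            continue
        vocab = dictionaries[uri]
        # Unknown labels fall back to the raw string stored in the file.
        resolved[name] = [vocab.get(label, label) for label in var["values"]]
    return resolved


URI = "http://someExternalGovernanceBody"
variables = {
    "temperature": {"attrs": {"ancillary_variables": "metadata"}, "values": []},
    "metadata": {"attrs": {"external_dictionary": URI}, "values": ["m1", "m2"]},
}
dictionaries = {URI: {"m1": "control run"}}  # hypothetical vocabulary content
labels = resolve_labels(variables, "temperature", dictionaries)
```

Software that knows nothing about the vocabulary takes the fallback branch and still sees valid label strings, which is the point of keeping the file self-describing.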

Obviously my mechanism could be extended for any characteristic of a
variable that we didn't want to put in a standard name, so the real
policy issue is Issue-D, and it comes back to the scope of CF, which
needs some decision.

Thus far we have held a line on standard names, but I feel the
Paco, Jamie, Jonathan proposal takes us into new territory. We will take
on having to discuss and approve a whole new class of standard names
(and all the work that goes with that). It's fine if it's a community
decision to go there, but we need our eyes open :-), and some realistic
appreciation of how we can actually do it.

My opinion: better to push it off onto the communities who care, and
keep CF focused.

Cheers
Bryan
Received on Fri Oct 27 2006 - 03:22:58 BST
