[CF-metadata] CF and multi-forecast system ensemble data from Bryan Lawrence on 2006-10-31 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Bryan Lawrence <b.n.lawrence>
Date: Tue, 31 Oct 2006 15:38:09 +0000

hi Paco

You're not the only one who is trying to remain unconfused :-) :-)

There is a lot to think about. In this email, I'm going to stay on
topic: how should we handle ensembles (while allowing that they aren't
really that special :-) :-) )?

> - My original intention when started this discussion was to find a clear
> way of writing files with multi-forecast system ensemble integrations.

There are obviously two ways this can be done: one integration per file,
and multiple integrations in a file. We had better support both! We had
also better expect folk to want to aggregate them ... but in this email
I'm only going to consider the latter ...

(I use the word integration rather than forecast in an attempt to be
inclusive of both the operational and climate applications)

> - Bryan wonders about the adequacy of defining specific variable
> metadata in multi-forecast system files to distinguish ensemble member,
> model, system and so on, and compares this problem with the interest of
> having additional informative metadata in a file containing station
> data. In my opinion, the essential difference is that a file with
> multi-forecast system ensemble simulations is not a simple gathering of
> predictions from different sources or with different initial conditions,
> but a complete forecast in itself. The additional metadata is not a
> wish, but a need to describe an entity. Without the appropriate metadata
> the file is not self-descriptive and won't be operationally
> disseminated, in the same way that operational centres don't disseminate
> deterministic forecasts in any format without clearly specifying the
> forecast system.

I totally understand the need. I just don't really see how it's any
different from, for example, disseminating BUFR format station data in
one file ... once we accept that we want to use NetCDF for operational
data, then both applications are possible ... (even desirable).

(I accept your point that an ensemble is a specific type of data, but in
the abstract it's not actually different from an aggregation of other
types of data, so in order to keep things simple, I suggest we don't
handle them in a special way ... given we have the same class of problem
for other data types).

> - I understood at the beginning that the main interest of NetCDF is that
> one doesn't need to rely on external tables to identify the data.

Correct. I think the phrases in vogue are a) "... metadata describing
each variable is sufficiently detailed to determine whether variables
from different sources are comparable", and b) "Data should be
self-describing. No external tables are needed to interpret the
file. For instance, CF encodings do not depend upon numeric codes (by
contrast with GRIB)."

> While
> the use of external vocabularies might offer a simple way of avoiding
> the creation of additional metadata, it uses a strategy that I presumed
> made the difference between NetCDF and GRIB: the use of external tables
> and extensions.

I wasn't proposing the use of external metadata, I was proposing the
externalization of the governance of that metadata, and using the url to
point to the authoritative list of definitions for that metadata. I'm
(yet to be) convinced that the "standard_name" mechanism is the right
way to handle the information requirements. I *am* convinced we need to
do it though :-). It's just *how*.

So, to answer Jamie's email, and make that point clear, we would propose
encoding an operational forecast dataset something like

float temperature(realization, time, lat, lon) ;
   temperature:coordinates = "realization time lat lon" ;
   temperature:ancillary_variables = "metadata1 metadata2 metadata3" ;

# Note I've reordered the time dimension
# Note also I'm using ancillary variables not coordinate variables, it's
# only the realization that we're using in this way

char metadata1(realization, len100) ;
   metadata1:external_vocabulary = "http://wmo.foo.int/identifierY" ;

char metadata2(realization, len100) ;
   metadata2:external_vocabulary = "http://ipcc.bar.org/identifierX" ;

char metadata3(realization, len100) ;
   metadata3:cf_standard_metadata = "identifierZ" ;

The actual variable values would be strings which conform to the
definitions of identifiers X, Y, and Z (i.e. exactly as Jamie has it).
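
To make the layout above a bit more tangible, here is a minimal Python
sketch that uses plain dictionaries as stand-ins for the netCDF
variables. It is only an illustration: the function name
describe_realization and all the string values are invented for the
example, and nothing here reads or writes a real file.

```python
# Illustrative only: plain dictionaries standing in for netCDF variables.
# The string values per realization are made up for this example.

# Each ancillary variable holds one string per realization, plus the
# attribute that says who governs the meaning of those strings.
ancillary = {
    "metadata1": {
        "attrs": {"external_vocabulary": "http://wmo.foo.int/identifierY"},
        "values": ["centre-a", "centre-a", "centre-b"],
    },
    "metadata2": {
        "attrs": {"external_vocabulary": "http://ipcc.bar.org/identifierX"},
        "values": ["scenario-1", "scenario-2", "scenario-1"],
    },
    "metadata3": {
        "attrs": {"cf_standard_metadata": "identifierZ"},
        "values": ["model-x", "model-y", "model-x"],
    },
}

# Attributes of the data variable, as in the proposal above.
temperature_attrs = {
    "coordinates": "realization time lat lon",
    "ancillary_variables": "metadata1 metadata2 metadata3",
}


def describe_realization(data_attrs, ancillary_vars, index):
    """Return {ancillary variable name: string} for one realization index."""
    names = data_attrs["ancillary_variables"].split()
    return {name: ancillary_vars[name]["values"][index] for name in names}


if __name__ == "__main__":
    print(describe_realization(temperature_attrs, ancillary, 1))
```

The point of the sketch is that the realization index is the only link
needed between the data variable and its descriptive strings.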

The "governors" of those identifiers may allow free text strings, they
may only allow standard values. It's up to them. We may well strongly
recommend the use of only one particular vocab ...
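
A rough sketch of what that governance choice could mean for software,
assuming a hypothetical Vocabulary class (not part of CF or any
library) in which an empty "allowed" list means free text is
accepted. The term lists are invented; a real tool would presumably
fetch them from the governing URL.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Vocabulary:
    """Hypothetical description of an externally governed vocabulary."""

    url: str
    # None means the governors allow free text; otherwise only these
    # values are acceptable.
    allowed: Optional[FrozenSet[str]] = None

    def accepts(self, value: str) -> bool:
        if self.allowed is None:
            return True
        return value in self.allowed


def invalid_values(vocab: Vocabulary, values: List[str]) -> List[str]:
    """Return the values that the vocabulary's governors would reject."""
    return [v for v in values if not vocab.accepts(v)]


# Invented examples: one free-text vocabulary, one controlled list.
free_text = Vocabulary("http://wmo.foo.int/identifierY")
controlled = Vocabulary(
    "http://ipcc.bar.org/identifierX",
    frozenset({"scenario-1", "scenario-2"}),
)

if __name__ == "__main__":
    print(invalid_values(free_text, ["anything at all"]))
    print(invalid_values(controlled, ["scenario-1", "scenario-9"]))
```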

To be slightly less abstract, concrete examples of ancillary metadata
could be:

char ipcc_scenario(realization, len100) ;
   ipcc_scenario:external_vocabulary = "http://www.ipcc.int/definitions/scenarios" ;

char originating_centre(realization, len100) ;
   originating_centre:external_vocabulary = "http://wmo.maybe/operationalIdentifiers" ;

or we could start building our own such list. You can see that I've
explicitly suggested a way that CF could do this (metadata3) which is
*different* from the standard name mechanism.

I think it would be useful to distinguish between the vocabs we use to
define the scientific characteristics of the variables (which are mostly
about things we observe or predict about the real world), and the vocabs
we use to say things about how those predictions or observations are
made (and their provenance).

Separating these would make it much easier to decide whether new things
were in or out, and to develop procedures to govern (define/manage)
them. It would also make it much easier to construct them methodically,
and to parse them in software.
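
As a sketch of the "parse them in software" point, assuming (this is my
reading, not an agreed convention) that a standard_name attribute marks
the scientific vocabulary while external_vocabulary or
cf_standard_metadata marks the provenance vocabulary. The function
names are invented, and the standard_name on temperature is added here
purely for illustration.

```python
# Attribute names taken from the examples earlier in this email; treating
# them as markers of "provenance" metadata is an assumption of this sketch.
PROVENANCE_ATTRS = ("external_vocabulary", "cf_standard_metadata")


def vocabulary_kind(attrs):
    """Classify a variable by which kind of vocabulary describes it."""
    if "standard_name" in attrs:
        return "scientific"
    if any(name in attrs for name in PROVENANCE_ATTRS):
        return "provenance"
    return "unclassified"


def group_by_kind(variables):
    """Group variable names by vocabulary kind, keeping input order."""
    groups = {}
    for name, attrs in variables.items():
        groups.setdefault(vocabulary_kind(attrs), []).append(name)
    return groups


# Variable attributes as plain dictionaries (stand-ins for a netCDF file).
variables = {
    "temperature": {"standard_name": "air_temperature"},
    "ipcc_scenario": {
        "external_vocabulary": "http://www.ipcc.int/definitions/scenarios"
    },
    "originating_centre": {
        "external_vocabulary": "http://wmo.maybe/operationalIdentifiers"
    },
    "metadata3": {"cf_standard_metadata": "identifierZ"},
}

if __name__ == "__main__":
    print(group_by_kind(variables))
```

With the two kinds kept apart, a tool needs only the attribute name to
decide which governance procedure applies to a given variable.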

Bryan
Received on Tue Oct 31 2006 - 08:38:09 GMT
