[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata? from Charlie Zender on 2013-09-19 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Charlie Zender <zender>
Date: Thu, 19 Sep 2013 12:44:34 -0700

Hello,

Thank you all for input on potential "group-aware" CF extensions.
I will do my best to respond to new points, and I apologize to those I
miss. This is a new top-level thread to discuss use cases of, and
potential solutions to, research questions relevant to "group-aware"
metadata.

Some wish to see a compelling example of a feature that could best be
expressed with hierarchical groups. Let me supply one that scratches
my own itch. Others may have examples that are more persuasive to you,
or reasons why my example is not compelling, or is actually a
counter-example in disguise :)

This will be a fuller example of the case for a CF ensemble feature.
Earlier I posted a "rank 1" ensemble example: an ensemble of datasets
produced by different models. This will be a "rank 2" ensemble, i.e.,
an ensemble of ensembles. It will be two models, named CESM and ECMWF,
each with two realizations of the same "Historical" time-period.
Realistic and possibly useful wrinkles to this would include adding
"Historical" observations of different shape to the mix, adding a
third ensemble axis (Scenario?), explicitly showing that model
datasets do not typically share spatial dimensions (i.e., different
spatial grids) but do often share time dimensions (e.g., daily,
monthly), and noting that ensembles can have different numbers of
realizations. These wrinkles are left to the reader's imagination
because they are not fundamental to motivating ensembles.

An "ensemble" here means a group of datasets identical (perhaps
"isomorphic" would be better?) in structure, i.e., same variables and
dimensions. Normally, though not necessarily (metadata conventions
could be used to specify otherwise) ensemble members are assigned
equal statistical weights, e.g., simulation 1 is not preferred over
simulation 2, and model 1 is not preferred over model 2.
For example the two CESM datasets form an ensemble because they are
slightly different realizations of the same model. Although it could
be done differently, I will define the multi-model ensemble (MME) to
comprise two equally-weighted members, CESM and ECMWF.

This is still simple enough to visualize yet gives a clearer idea of
how a hierarchical group structure compares to a flat structure.
Below are 2*2=4 CDL descriptions of raw datasets as they might appear
in a CF 1.x world of flat files and no "ensemble" metadata.
Then comes a single 2-level deep hierarchical CDL file that contains
all four datasets but that serves purely as a container with no
inherited group metadata or dimensions. It's called mme_cnt.cdl (for
"MME container"). It could easily be created/aggregated (e.g., by
NCO's ncecat) from the four flat files it contains.

I'll refrain from annotating mme_cnt.cdl with suggested "group-aware"
metadata, because it's a bit early for that. Instead I'll frame some
pertinent research questions and let those interested suggest answers
that may or may not imply "group-aware" metadata/storage solutions.
The mme_cnt.cdl file is provided as a bare-bones structure for
visualization purposes. Yes, hidden in its structure are intrinsic
relationships that can be used by the current netCDF4 library to
answer some of the questions posed below. That may be the subject
of a later post. For now let some prototypical research questions
steer our thinking to the best possible solution, and reserve for
later the question of how to store that solution on disk.

Let us say a researcher's goal is to compute statistics on the CESM
ensemble, the ECMWF ensemble, and the multi-model ensemble (MME).
First the researcher needs to obtain the constituent files of the
model ensembles. This is possible through some equivalent of 'ls
cesm_*.nc' and 'ls ecmwf_*.nc' performed on a local directory or
the equivalent specified through a repository gateway's GUI.
Let us say she finds the four datasets, cesm_01.nc, cesm_02.nc,
ecmwf_01.nc, and ecmwf_02.nc.

Point 1: How does the user know she has all the realizations?
She has 2 from CESM and 2 from ECMWF but how does she know she is not
missing cesm_03.nc and ecmwf_03.nc? How does she know how many total
realizations are supposed (by the producer/distributor) to comprise
the ensemble? What if, for whatever reason (a filename misspelling,
a quirky server problem), not all matching datasets are found?
Perhaps some "ensemble" metadata would help: It could standardize how
to indicate the size of the intended ensemble, and enumerate the
current realization, the equivalent of "Page 2 of 5".
Then recipients know when they have all the data. I would argue that
knowing/discovering the size of an intended ensemble via metadata
is more robust than filename grepping or equivalent.
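
As a minimal sketch of how a tool might use "Page 2 of 5" style
metadata, consider the Python below. The attribute names
realization_index and realization_count are hypothetical (nothing like
them exists in CF today), and the dictionaries merely stand in for the
global attributes read from each file that the search turned up.

```python
def missing_realizations(members):
    """Return the sorted realization indices that were not found.

    `members` is a list of dicts standing in for the global attributes of
    each file found. The keys `realization_index` and `realization_count`
    are hypothetical "Page 2 of 5"-style attributes, not existing CF ones.
    """
    if not members:
        raise ValueError("no ensemble members supplied")
    counts = {m["realization_count"] for m in members}
    if len(counts) != 1:
        raise ValueError(f"members disagree on ensemble size: {sorted(counts)}")
    expected = set(range(1, counts.pop() + 1))
    found = {m["realization_index"] for m in members}
    return sorted(expected - found)


# Hypothetical case: the producer intended three CESM realizations,
# but only two files turned up in the search.
found_files = [
    {"Model": "CESM", "realization_index": 1, "realization_count": 3},
    {"Model": "CESM", "realization_index": 2, "realization_count": 3},
]
print(missing_realizations(found_files))  # [3]
```

The point of the sketch is only that the check needs no knowledge of
filenames or directory listings; each file carries enough information
to reveal that a sibling is missing.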

Point 2: Multiple and/or Non-numeric Ensemble axes
The two hierarchies constructable in this example are (top->bottom)
Model->Realization and Realization->Model. Let's call those
orientations the ensemble axes. Our researcher wishes to obtain
statistics along each axis of the ensemble, e.g., the mean temperature
of each Model (average realizations for each model), and the mean
temperature of each realization (average of realization #1 for all
models). How does the researcher know that she has all the models?
The lack of "ensemble" metadata means that she may have to rely, once
again, on results of "ls", GUI-searches, or non-automatically-readable
documentation to be sure she has found all the models. By this I mean
that nothing in the flat-file metadata indicates that there are only
two models (not 3 or 24) along the Model axis of the ensemble. Once
again, perhaps some "ensemble" metadata would help: It could
standardize the indication of which known ensemble axes this
particular representation belongs to. For example, file cesm_01 is a
member of two known ensemble axes, Realization and Model. This file is
Realization 1 of 2 total CESM Realizations, and is (in some scheme)
Model 1 of 2 total Models. The data producer may not know a priori the
size of either ensemble axis, in which case "ensemble" metadata will
often be appended/altered a posteriori by the producer, distributor,
or both.
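
The statistics along each ensemble axis can be sketched in a few lines
of Python. The tas values are the ones in the four example CDL files
in this message; the dictionary keyed by (Model, Realization) and the
helper mean_along are illustrative assumptions, not a proposed storage
layout or API.

```python
from collections import defaultdict
from statistics import fmean

# (Model, Realization) -> tas in kelvin, taken from the four example CDL
# files in this message (each file holds a constant time series).
tas = {
    ("CESM", 1): 272.1,
    ("CESM", 2): 272.2,
    ("ECMWF", 1): 273.1,
    ("ECMWF", 2): 273.2,
}


def mean_along(values, axis):
    """Group by position `axis` of the key and average within each group.

    axis=0 groups by Model (averaging over realizations);
    axis=1 groups by Realization (averaging over models).
    """
    groups = defaultdict(list)
    for key, value in values.items():
        groups[key[axis]].append(value)
    return {name: fmean(vals) for name, vals in groups.items()}


model_means = mean_along(tas, 0)        # keys: "CESM", "ECMWF"
realization_means = mean_along(tas, 1)  # keys: 1, 2
```

Note that the code silently averages whatever keys it is handed. If a
model or realization is absent from the dictionary, nothing here can
detect it, which is the gap that "ensemble" metadata would close.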

Point 3: Weights and intentional reproducibility of MME statistics
The example I gave has two equally sized ensemble axes, 2 models of
2 Realizations each, so that MME statistics would weight each
of the 4 total members equally (with a weight of 0.25). Consider now
the case (call it "sequestration?") where the CESM model has 2
realizations and the ECMWF model grows to 3 realizations so there are
5 total members in the MME. What will the researcher call the MME
average? The researcher faces a choice. She might use a brute force
method that averages all five inputs together and weights them at 20%
apiece. That's often easier. Or she might subscribe to the notion of
weighting each model equally and therefore perform a two-step average
so that the 2 CESM realizations are weighted at 0.25 each and the 3
ECMWF realizations are weighted at 0.5/3=0.167 each. Without
"ensemble" metadata, the flat files do not know what size ensembles
they belong to, and it is difficult for tools to automate the correct
uneven weighting of datasets to create an MME. Unsophisticated users
may take the brute-force approach and inadvertently weight all
datasets equally. "ensemble" metadata could reduce the chances of this
happening, and thus ensure greater reproducibility.
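
The two weighting choices for the uneven case (2 CESM realizations, 3
ECMWF realizations) can be sketched in Python as follows. The function
names are illustrative, and the tas value for the third ECMWF
realization (273.3) is hypothetical since no such file appears in this
message.

```python
from collections import Counter


def brute_force_weights(members):
    """Weight every dataset equally, ignoring which model produced it."""
    return [1.0 / len(members)] * len(members)


def equal_model_weights(members):
    """Weight every model equally, then split each model's share evenly
    among its realizations (the two-step average)."""
    per_model = Counter(model for model, _ in members)
    n_models = len(per_model)
    return [1.0 / (n_models * per_model[model]) for model, _ in members]


def weighted_mean(values, weights):
    """Weighted mean; the weights are assumed to sum to one."""
    return sum(v * w for v, w in zip(values, weights))


# (Model, Realization) pairs for the uneven case: 2 CESM, 3 ECMWF.
members = [("CESM", 1), ("CESM", 2), ("ECMWF", 1), ("ECMWF", 2), ("ECMWF", 3)]
# tas in kelvin; the last value (ECMWF realization 3) is hypothetical.
values = [272.1, 272.2, 273.1, 273.2, 273.3]

flat = brute_force_weights(members)    # 0.2 for each of the five members
nested = equal_model_weights(members)  # 0.25 per CESM, 1/6 per ECMWF

mme_flat = weighted_mean(values, flat)
mme_nested = weighted_mean(values, nested)
```

With these numbers the brute-force MME mean is about 272.78 K and the
equal-model MME mean is about 272.675 K, so the choice of weighting is
visible in the answer even for this tiny ensemble.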

This example illustrates the case where dataset weights depend only
on the number of datasets in an ensemble axis. Some datasets, however,
may have intrinsic weights that normalize to unity along that ensemble
axis but are not equal for each ensemble member. Such datasets could
be created by input parameters or observations that occupy a known
proportion of probability space (i.e., PDF-based not Monte Carlo-based
approach). How might these weights best be distributed with datasets
so that users compute the statistics correctly? "ensemble" metadata
could indicate weights (similar to an area_weight across datasets).
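
A rough Python sketch of how producer-supplied weights might be
consumed is below. The attribute name ensemble_weight, the member
values, and the 50/30/20 split are all hypothetical; the only idea
taken from the text above is that the weights should normalize to
unity along the ensemble axis.

```python
import math


def weighted_ensemble_mean(members, tol=1e-9):
    """Weighted mean of `value` over members along one ensemble axis.

    Each member is a dict with a `value` and a hypothetical
    `ensemble_weight` attribute supplied by the data producer. The weights
    must already sum to one along the axis; if they do not, a member is
    probably missing and the function raises instead of guessing.
    """
    total = math.fsum(m["ensemble_weight"] for m in members)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return math.fsum(m["ensemble_weight"] * m["value"] for m in members)


# Hypothetical members occupying 50%, 30% and 20% of probability space.
members = [
    {"value": 272.1, "ensemble_weight": 0.5},
    {"value": 272.2, "ensemble_weight": 0.3},
    {"value": 273.1, "ensemble_weight": 0.2},
]
mean_tas = weighted_ensemble_mean(members)
```

A side benefit of unit-normalized weights is that they double as a
completeness check: dropping any member makes the sum fall short of
one, so the error surfaces before a wrong statistic is computed.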

This concludes my intended-to-be-compelling example of the utility
of "ensemble" metadata. Sorry if it puts you to sleep :) People can
think about the utility to CF of "group-aware" metadata in multiple
ways. One way is from the data-storage-elegance point of view. Another way,
advanced here, is how researchers could exploit it. Both are valid
and possibly complementary ways of thinking about "group-aware" CF.

Maybe this persuades you that CF could/should support "ensembles".
How? As a new file-level featureType? Or in some other way?
If so then we have a use-case for a convention that could apply
to flat files, and/or be implemented in hierarchical files.
That could be a follow-on discussion.
If not, if there is no consensus use-case, then I agree there's no
point for a CF modification and let's all solve the above problems
by whichever methods we happen to like.

If you're still undecided then maybe you would be persuaded by a
different use-case example. I too would like to see other use-cases.
So if you've got one, please describe it.

Thanks for reading this far!

cz

// ncgen -k netCDF-4 -b -o cesm_01.nc cesm_01.cdl

netcdf cesm_01 {
  :Conventions = "CF-2.x";
  :history = "yada yada yada";
  :Scenario = "Historical";
  :Model = "CESM";
  :Realization = "1";

dimensions:
  time=4;
variables:
  float tas(time);
data:
  tas=272.1,272.1,272.1,272.1;

} // end cesm_01

// ncgen -k netCDF-4 -b -o cesm_02.nc cesm_02.cdl

netcdf cesm_02 {
  :Conventions = "CF-2.x";
  :history = "yada yada yada";
  :Scenario = "Historical";
  :Model = "CESM";
  :Realization = "2";

dimensions:
  time=4;
variables:
  float tas(time);
data:
  tas=272.2,272.2,272.2,272.2;

} // end cesm_02

// ncgen -k netCDF-4 -b -o ecmwf_01.nc ecmwf_01.cdl

netcdf ecmwf_01 {
  :Conventions = "CF-2.x";
  :history = "yada yada yada";
  :Scenario = "Historical";
  :Model = "ECMWF";
  :Realization = "1";

dimensions:
  time=4;
variables:
  float tas(time);
data:
  tas=273.1,273.1,273.1,273.1;

} // end ecmwf_01

// ncgen -k netCDF-4 -b -o ecmwf_02.nc ecmwf_02.cdl

netcdf ecmwf_02 {
  :Conventions = "CF-2.x";
  :history = "yada yada yada";
  :Scenario = "Historical";
  :Model = "ECMWF";
  :Realization = "2";

dimensions:
  time=4;
variables:
  float tas(time);
data:
  tas=273.2,273.2,273.2,273.2;

} // end ecmwf_02

// ncgen -k netCDF-4 -b -o mme_cnt.nc mme_cnt.cdl

netcdf mme_cnt {

group: cesm {

  group: cesm_01 {
      :Conventions = "CF-2.x";
      :history = "yada yada yada";
      :Scenario = "Historical";
      :Model = "CESM";
      :Realization = "1";

    dimensions:
      time=4;
    variables:
      float tas(time);
    data:
      tas=272.1,272.1,272.1,272.1;

    } // cesm_01

  group: cesm_02 {
      :Conventions = "CF-2.x";
      :history = "yada yada yada";
      :Scenario = "Historical";
      :Model = "CESM";
      :Realization = "2";

    dimensions:
      time=4;
    variables:
      float tas(time);
    data:
      tas=272.2,272.2,272.2,272.2;

    } // cesm_02

  } // cesm

group: ecmwf {

  group: ecmwf_01 {
      :Conventions = "CF-2.x";
      :history = "yada yada yada";
      :Scenario = "Historical";
      :Model = "ECMWF";
      :Realization = "1";

    dimensions:
      time=4;
    variables:
      float tas(time);
    data:
      tas=273.1,273.1,273.1,273.1;

    } // ecmwf_01

  group: ecmwf_02 {
      :Conventions = "CF-2.x";
      :history = "yada yada yada";
      :Scenario = "Historical";
      :Model = "ECMWF";
      :Realization = "2";

    dimensions:
      time=4;
    variables:
      float tas(time);
    data:
      tas=273.2,273.2,273.2,273.2;

    } // ecmwf_02

  } // ecmwf

} // root group

-- 
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine 949-891-2429 )'(

Received on Thu Sep 19 2013 - 13:44:34 BST
