[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata? from Russ Rew on 2013-09-19 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Russ Rew <russ>
Date: Thu, 19 Sep 2013 14:16:23 -0600

Charlie,

Great use case, clearly explained ...

It also demonstrates a potential need for "Group dimensions", so instead
of groups cesm_01 and cesm_02, it would be possible to refer to Groups
cesm[1] and cesm[2], supporting loops over contents of closely related
groups without having to invent a convention for their names.

--Russ

> Hello,
>
> Thank you all for your input on potential "group-aware" CF extensions.
> I do my best to respond to new points, and I apologize to those I miss.
> This is a new top-level thread to discuss use cases of, and potential
> solutions to, research questions relevant to "group-aware" metadata.
>
> Some wish to see a compelling example of a feature that could best be
> expressed with hierarchical groups. Let me supply one that scratches
> my own itch. Others may have examples that are more persuasive to you,
> or reasons why my example is not compelling, or is actually a
> counter-example in disguise :)
>
> This will be a fuller example of the case for a CF ensemble feature.
> Earlier I posted a "rank 1" ensemble example: an ensemble of datasets
> produced by different models. This will be a "rank 2" ensemble, i.e.,
> an ensemble of ensembles. It will be two models, named CESM and ECMWF,
> each with two realizations of the same "Historical" time-period.
> Realistic and possibly useful wrinkles to this would include adding
> "Historical" observations of different shape to the mix, adding a
> third ensemble axis (Scenario?), explicitly showing that model
> datasets do not typically share spatial dimensions (i.e., different
> spatial grids) but do often share time dimensions (e.g., daily,
> monthly), and noting that ensembles can have different numbers of
> realizations. These wrinkles are left to the reader's imagination
> because they are not fundamental to motivating ensembles.
>
> An "ensemble" here means a group of datasets identical (perhaps
> "isomorphic" would be better?) in structure, i.e., same variables and
> dimensions. Normally, though not necessarily (metadata conventions
> could be used to specify otherwise), ensemble members are assigned
> equal statistical weights, e.g., simulation 1 is not preferred over
> simulation 2, and model 1 is not preferred over model 2.
> For example, the two CESM datasets form an ensemble because they are
> slightly different realizations of the same model. Although it could
> be done differently, I will define the multi-model ensemble (MME) to
> comprise two equally-weighted members, CESM and ECMWF.
>
> This is still simple enough to visualize yet gives a clearer idea of
> how a hierarchical group structure compares to a flat structure.
> Below are 2*2=4 CDL descriptions of raw datasets as they might appear
> in a CF 1.x world of flat files and no "ensemble" metadata.
> After those comes a single 2-level-deep hierarchical CDL file that
> contains all four datasets but serves purely as a container with no
> inherited group metadata or dimensions. It's called mme_cnt.cdl (for
> "MME container"). It could easily be created/aggregated (e.g., by
> NCO's ncecat) from the four flat files it contains.
>
> I'll refrain from annotating mme_cnt.cdl with suggested "group-aware"
> metadata, because it's a bit early for that. Instead I'll frame some
> pertinent research questions and let those interested suggest answers
> that may or may not imply "group-aware" metadata/storage solutions.
> The mme_cnt.cdl file is provided as a bare-bones structure for
> visualization purposes. Yes, hidden in its structure are intrinsic
> relationships that can be used by the current netCDF4 library to
> answer some of the questions posed below. That may be the subject
> of a later post. For now let some prototypical research questions
> steer our thinking to the best possible solution, and reserve for
> later the question of how to store that solution on disk.
>
> Let us say a researcher's goal is to compute statistics on the CESM
> ensemble, the ECMWF ensemble, and the multi-model ensemble (MME).
> First the researcher needs to obtain the constituent files of the
> model ensembles. This is possible through some equivalent of 'ls
> cesm_*.nc' and 'ls ecmwf_*.nc' performed on a local directory or
> the equivalent specified through a repository gateway's GUI.
> Let us say she finds the four datasets, cesm_01.nc, cesm_02.nc,
> ecmwf_01.nc, and ecmwf_02.nc.
>
> Point 1: How does the user know she has all the realizations?
> She has 2 from CESM and 2 from ECMWF but how does she know she is not
> missing cesm_03.nc and ecmwf_03.nc? How does she know how many total
> realizations are supposed (by the producer/distributor) to comprise
> the ensemble? What if, for whatever reason (a filename misspelling,
> a quirky server problem), not all matching datasets are found?
> Perhaps some "ensemble" metadata would help: It could standardize how
> to indicate the size of the intended ensemble, and enumerate the
> current realization, the equivalent of "Page 2 of 5".
> Then recipients know when they have all the data. I would argue that
> knowing/discovering the size of an intended ensemble via metadata
> is more robust than filename grepping or equivalent.
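
A minimal Python sketch of the "Page 2 of 5" check described in Point 1. The attribute names `realization_index` and `ensemble_size` are hypothetical (no CF convention defines them), and the dicts stand in for global attributes that would be read from each file. The intended size of 3 is an assumption made only to show a gap being detected.

```python
def missing_realizations(members):
    """Return the sorted realization indices that are declared but absent.

    Each member is a dict of (hypothetical) global attributes:
      realization_index: 1-based position of this file in the ensemble
      ensemble_size:     number of realizations the producer intended
    """
    sizes = {m["ensemble_size"] for m in members}
    if len(sizes) != 1:
        raise ValueError(f"members disagree on ensemble size: {sorted(sizes)}")
    expected = set(range(1, sizes.pop() + 1))
    found = {m["realization_index"] for m in members}
    return sorted(expected - found)


# Stand-ins for attributes read from cesm_01.nc and cesm_02.nc, assuming
# the producer declared an intended ensemble of three realizations.
cesm = [
    {"Model": "CESM", "realization_index": 1, "ensemble_size": 3},
    {"Model": "CESM", "realization_index": 2, "ensemble_size": 3},
]
print(missing_realizations(cesm))  # [3]
```

With metadata like this the recipient learns that cesm_03 is missing from the files themselves, not from a directory listing.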
>
> Point 2: Multiple and/or Non-numeric Ensemble axes
> The two hierarchies constructible in this example are (top->bottom)
> Model->Realization and Realization->Model. Let's call those
> orientations the ensemble axes. Our researcher wishes to obtain
> statistics along each axis of the ensemble, e.g., the mean temperature
> of each Model (average realizations for each model), and the mean
> temperature of each realization (average of realization #1 for all
> models). How does the researcher know that she has all the models?
> The lack of "ensemble" metadata means that she may have to rely, once
> again, on results of "ls", GUI-searches, or non-automatically-readable
> documentation to be sure she has found all the models. By this I mean
> that nothing in the flat-file metadata indicates that there are only
> two models (not 3 or 24) along the Model axis of the ensemble. Once
> again, perhaps some "ensemble" metadata would help: It could
> standardize the indication of which known ensemble axes this
> particular representation belongs to. For example, file cesm_01 is a
> member of two known ensemble axes, Realization and Model. This file is
> Realization 1 of 2 total CESM Realizations, and is (in some scheme)
> Model 1 of 2 total Models. The data producer may not know a priori the
> size of either ensemble axis, in which case "ensemble" metadata will
> often be appended/altered a posteriori by the producer, distributor,
> or both.
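
The per-axis statistics in Point 2 can be sketched in Python using the tas values from the four CDL listings in this message. The `(Model, Realization)` key layout and the helper `mean_along` are illustrative assumptions, not part of any CF proposal.

```python
from collections import defaultdict
from statistics import fmean

# (Model, Realization) -> tas time series, copied from the CDL listings.
tas = {
    ("CESM", 1): [272.1] * 4,
    ("CESM", 2): [272.2] * 4,
    ("ECMWF", 1): [273.1] * 4,
    ("ECMWF", 2): [273.2] * 4,
}


def mean_along(axis):
    """Average the time series of all members that share a key on one axis.

    axis=0 groups by Model (averages over realizations);
    axis=1 groups by Realization (averages over models).
    """
    groups = defaultdict(list)
    for key, series in tas.items():
        groups[key[axis]].append(series)
    # zip(*v) walks the grouped series one time step at a time.
    return {k: [fmean(step) for step in zip(*v)] for k, v in groups.items()}


model_means = mean_along(0)        # keys "CESM", "ECMWF"
realization_means = mean_along(1)  # keys 1, 2
```

The sketch silently assumes that the four entries are the whole ensemble, which is exactly the gap that "ensemble" metadata would close.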
>
> Point 3: Weights and intentional reproducibility of MME statistics
> The example I gave has two equally sized ensemble axes, 2 models of
> 2 Realizations each, so that MME statistics would weight each
> of the 4 total members equally (with a weight of 0.25). Consider now
> the case (call it "sequestration?") where the CESM model has 2
> realizations and the ECMWF model grows to 3 realizations so there are
> 5 total members in the MME. What will the researcher call the MME
> average? The researcher faces a choice. She might use a brute force
> method that averages all five inputs together and weights them at 20%
> apiece. That's often easier. Or she might subscribe to the notion of
> weighting each model equally and therefore perform a two-step average
> so that the 2 CESM realizations are weighted at 0.25 each and the 3
> ECMWF realizations are weighted at 0.5/3=0.167 each. Without
> "ensemble" metadata, the flat files do not know what size ensembles
> they belong to, and it is difficult for tools to automate the correct
> uneven weighting of datasets to create an MME. Unsophisticated users
> may take the brute-force approach and inadvertently weight all
> datasets equally. "ensemble" metadata could reduce the chances of this
> happening, and thus ensure greater reproducibility.
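
The two weighting choices in Point 3 reduce to a little arithmetic, sketched here in Python. The function name `member_weights` and its `equal_models` flag are illustrative assumptions. The only inputs are the realization counts from the example (2 for CESM, 3 for ECMWF).

```python
def member_weights(realizations_per_model, equal_models=True):
    """Return the weight given to ONE realization of each model.

    equal_models=True:  two-step average; every model gets 1/n_models,
                        split evenly over that model's realizations.
    equal_models=False: brute force; every member gets 1/n_members.
    """
    n_models = len(realizations_per_model)
    n_members = sum(realizations_per_model.values())
    if equal_models:
        return {m: 1.0 / (n_models * n)
                for m, n in realizations_per_model.items()}
    return {m: 1.0 / n_members for m in realizations_per_model}


counts = {"CESM": 2, "ECMWF": 3}
two_step = member_weights(counts)                   # CESM 0.25, ECMWF 0.5/3
brute = member_weights(counts, equal_models=False)  # 0.2 for every member
```

Either way the weights sum to one over all five members. A tool can only choose between the two schemes automatically if it knows the size of each ensemble axis.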
>
> This example illustrates the case where dataset weights depend only
> on the number of datasets in an ensemble axis. Some datasets, however,
> may have intrinsic weights that normalize to unity along that ensemble
> axis but are not equal for each ensemble member. Such datasets could
> be created by input parameters or observations that occupy a known
> proportion of probability space (i.e., PDF-based not Monte Carlo-based
> approach). How might these weights best be distributed with datasets
> so that users compute the statistics correctly? "ensemble" metadata
> could indicate weights (similar to an area_weight across datasets).
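
A sketch of how a tool might apply intrinsic, unequal weights once they travel with the data. The weights 0.75 and 0.25 are invented purely for illustration, and nothing in this message says how such weights would be stored. The two values are the first tas time step of cesm_01 and cesm_02.

```python
import math


def weighted_mean(values, weights, tol=1e-9):
    """Weighted mean along one ensemble axis.

    Refuses weights that do not normalize to unity, since silently
    renormalizing could hide a missing ensemble member.
    """
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per ensemble member")
    if not math.isclose(sum(weights), 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {sum(weights)}, expected 1")
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical intrinsic weights for the two CESM realizations.
mean_tas = weighted_mean([272.1, 272.2], [0.75, 0.25])
```

Raising an error on non-normalized weights is a design choice for this sketch. A real tool might prefer to warn and renormalize.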
>
> This concludes my intended-to-be-compelling example of the utility
> of "ensemble" metadata. Sorry if it puts you to sleep :) People can
> think about the utility to CF of "group-aware" metadata in multiple
> ways. One way is from the data-storage-elegance point of view. Another way,
> advanced here, is how researchers could exploit it. Both are valid
> and possibly complementary ways of thinking about "group-aware" CF.
>
> Maybe this persuades you that CF could/should support "ensembles".
> How? As a new file-level featureType? Or in some other way?
> If so then we have a use-case for a convention that could apply
> to flat files, and/or be implemented in hierarchical files.
> That could be a follow-on discussion.
> If not, if there is no consensus use-case, then I agree there's no
> point for a CF modification and let's all solve the above problems
> by whichever methods we happen to like.
>
> If you're still undecided then maybe you would be persuaded by a
> different use-case example. I too would like to see other use-cases.
> So if you've got one, please describe it.
>
> Thanks for reading this far!
>
> cz
>
> // ncgen -k netCDF-4 -b -o cesm_01.nc cesm_01.cdl
>
> netcdf cesm_01 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "CESM";
> :Realization = "1";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=272.1,272.1,272.1,272.1;
>
> } // end cesm_01
>
> // ncgen -k netCDF-4 -b -o cesm_02.nc cesm_02.cdl
>
> netcdf cesm_02 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "CESM";
> :Realization = "2";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=272.2,272.2,272.2,272.2;
>
> } // end cesm_02
>
> // ncgen -k netCDF-4 -b -o ecmwf_01.nc ecmwf_01.cdl
>
> netcdf ecmwf_01 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "ECMWF";
> :Realization = "1";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=273.1,273.1,273.1,273.1;
>
> } // end ecmwf_01
>
> // ncgen -k netCDF-4 -b -o ecmwf_02.nc ecmwf_02.cdl
>
> netcdf ecmwf_02 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "ECMWF";
> :Realization = "2";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=273.2,273.2,273.2,273.2;
>
> } // end ecmwf_02
>
> // ncgen -k netCDF-4 -b -o mme_cnt.nc mme_cnt.cdl
>
> netcdf mme_cnt {
>
> group: cesm {
>
> group: cesm_01 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "CESM";
> :Realization = "1";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=272.1,272.1,272.1,272.1;
>
> } // cesm_01
>
> group: cesm_02 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "CESM";
> :Realization = "2";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=272.2,272.2,272.2,272.2;
>
> } // cesm_02
>
> } // cesm
>
> group: ecmwf {
>
> group: ecmwf_01 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "ECMWF";
> :Realization = "1";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=273.1,273.1,273.1,273.1;
>
> } // ecmwf_01
>
> group: ecmwf_02 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "ECMWF";
> :Realization = "2";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=273.2,273.2,273.2,273.2;
>
> } // ecmwf_02
>
> } // ecmwf
>
> } // root group
>
> --
> Charlie Zender, Earth System Sci. & Computer Sci.
> University of California, Irvine 949-891-2429 )'(
Received on Thu Sep 19 2013 - 14:16:23 BST
