
[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata?

From: Bryan Lawrence <bryan.lawrence>
Date: Fri, 20 Sep 2013 08:21:34 +0100

Hi Charlie

Great use case, but not compelling :-) Before I get going, I can see that
folks might want to do this, and fair play, but it's not compelling. Let me
give you a couple of examples of where this approach would cause others
problems.

Let's imagine that the ensemble members take time to produce, and that not
everything can be stored online and some stuff has to go to tape. So now
we have a situation where we are encouraging folks to write into files that
*will* need to be altered, and we'll have to pull all those files off tape
to modify them. Further, we can be sure that the axes of analyses will
change over time as different users analyse the data for different problems.

A priori grouping is the same as deciding a priori on the most efficient
axes of analysis. In general, for climate data analysis (and for us at BADC
with very diverse user groupings) this approach would be very inefficient
...

If you want to avoid relying on ls/grep and friends, then why not have a
short, line-oriented text file, a manifest? Then we only need to alter the
manifest, not the data files.

(Incidentally that's the same solution that NASA should be using.)
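
To make the manifest idea concrete, here is a minimal sketch, not a proposal for a format. It assumes a hypothetical manifest that lists one member file name per line (with `#` comment lines), and it reuses the file names from the quoted example; the manifest name and the function names are illustrative only.

```python
from pathlib import Path
import tempfile


def read_manifest(manifest_path):
    """Return the member file names listed in a manifest, one per line.

    Blank lines and lines starting with '#' are ignored (an assumed format).
    """
    names = []
    for line in Path(manifest_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def missing_members(manifest_path, data_dir):
    """List manifest entries that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in read_manifest(manifest_path)
            if not (data_dir / name).exists()]


def demo():
    """Build a throwaway directory with three of four members present."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        manifest = tmp / "mme_manifest.txt"  # hypothetical manifest name
        manifest.write_text(
            "# members of the multi-model ensemble\n"
            "cesm_01.nc\n"
            "cesm_02.nc\n"
            "ecmwf_01.nc\n"
            "ecmwf_02.nc\n"
        )
        for name in ("cesm_01.nc", "cesm_02.nc", "ecmwf_01.nc"):
            (tmp / name).touch()
        return missing_members(manifest, tmp)


if __name__ == "__main__":
    print(demo())
```

When a member is added, only the manifest changes; the data files, on disk or on tape, are left untouched.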

IMHO it's a category error to push everything into internal file metadata.
The suitcase metaphor is exactly appropriate for this approach, since the
reason for doing it is to transfer stuff you think belongs together.
However, five minutes after getting home, you empty the suitcase .... and
reorganise it. At petascale, we can't afford to do that!

Bryan

On 19 September 2013 20:44, <zender at uci.edu> wrote:

> Hello,
>
> Thank you all for input on potential "group-aware" CF extensions.
> I do my best to respond to new points, and apologize to those I miss.
> New top-level thread to discuss use cases of, and potential solutions
> to, research questions relevant to "group-aware" metadata.
>
> Some wish to see a compelling example of a feature that could best be
> expressed with hierarchical groups. Let me supply one that scratches
> my own itch. Others may have examples that are more persuasive to you,
> or reasons why my example is not compelling, or is actually a
> counter-example in disguise :)
>
> This will be a fuller example of the case for a CF ensemble feature.
> Earlier I posted a "rank 1" ensemble example: an ensemble of datasets
> produced by different models. This will be a "rank 2" ensemble, i.e.,
> an ensemble of ensembles. It will be two models, named CESM and ECMWF,
> each with two realizations of the same "Historical" time-period.
> Realistic and possibly useful wrinkles to this would include adding
> "Historical" observations of different shape to the mix, adding a
> third ensemble axis (Scenario?), explicitly showing that model
> datasets do not typically share spatial dimensions (i.e., different
> spatial grids) but do often share time dimensions (e.g., daily,
> monthly), and noting that ensembles can have different numbers of
> realizations. These wrinkles are left to the reader's imagination
> because they are not fundamental to motivating ensembles.
>
> An "ensemble" here means a group of datasets identical (perhaps
> "isomorphic" would be better?) in structure, i.e., same variables and
> dimensions. Normally, though not necessarily (metadata conventions
> could be used to specify otherwise) ensemble members are assigned
> equal statistical weights, e.g., simulation 1 is not preferred over
> simulation 2, and model 1 is not preferred over model 2.
> For example the two CESM datasets form an ensemble because they are
> slightly different realizations of the same model. Although it could
> be done differently, I will define the multi-model ensemble (MME) to
> comprise two equally-weighted members, CESM and ECMWF.
>
> This is still simple enough to visualize yet gives a clearer idea of
> how a hierarchical group structure compares to a flat structure.
> Below are 2*2=4 CDL descriptions of raw datasets as they might appear
> in a CF 1.x world of flat files and no "ensemble" metadata.
> Then comes a single two-level-deep hierarchical CDL file that contains
> all four datasets but that serves purely as a container with no
> inherited group metadata or dimensions. It's called mme_cnt.cdl (for
> "MME container"). It could easily be created/aggregated (e.g., by
> NCO's ncecat) from the four flat files it contains.
>
> I'll refrain from annotating mme_cnt.cdl with suggested "group-aware"
> metadata, because it's a bit early for that. Instead I'll frame some
> pertinent research questions and let those interested suggest answers
> that may or may not imply "group-aware" metadata/storage solutions.
> The mme_cnt.cdl file is provided as a bare-bones structure for
> visualization purposes. Yes, hidden in its structure are intrinsic
> relationships that can be used by the current netCDF4 library to
> answer some of the questions posed below. That may be the subject
> of a later post. For now let some prototypical research questions
> steer our thinking to the best possible solution, and reserve for
> later the question of how to store that solution on disk.
>
> Let us say a researcher's goal is to compute statistics on the CESM
> ensemble, the ECMWF ensemble, and the multi-model ensemble (MME).
> First the researcher needs to obtain the constituent files of the
> model ensembles. This is possible through some equivalent of 'ls
> cesm_*.nc' and 'ls ecmwf_*.nc' performed on a local directory or
> the equivalent specified through a repository gateway's GUI.
> Let us say she finds the four datasets, cesm_01.nc, cesm_02.nc,
> ecmwf_01.nc, and ecmwf_02.nc.
>
> Point 1: How does the user know she has all the realizations?
> She has 2 from CESM and 2 from ECMWF but how does she know she is not
> missing cesm_03.nc and ecmwf_03.nc? How does she know how many total
> realizations are supposed (by the producer/distributor) to comprise
> the ensemble? What if, for whatever reason (a filename misspelling,
> a quirky server problem), not all matching datasets are found?
> Perhaps some "ensemble" metadata would help: It could standardize how
> to indicate the size of the intended ensemble, and enumerate the
> current realization, the equivalent of "Page 2 of 5".
> Then recipients know when they have all the data. I would argue that
> knowing/discovering the size of an intended ensemble via metadata
> is more robust than filename grepping or equivalent.
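
A minimal sketch of the "Page 2 of 5" check from Point 1. It assumes each dataset's global attributes have already been read into a dict. `Model` and `Realization` appear in the CDL further down; `RealizationCount` is a hypothetical attribute standing in for whatever "ensemble" metadata might record the intended ensemble size.

```python
def missing_realizations(members):
    """Report which realizations are absent for each model.

    members: list of dicts with keys 'Model', 'Realization' (1-based) and
    the hypothetical 'RealizationCount' (intended ensemble size).
    Returns {model: sorted list of missing realization numbers}.
    """
    expected = {}  # model -> intended number of realizations
    found = {}     # model -> set of realization numbers seen
    for attrs in members:
        model = attrs["Model"]
        count = int(attrs["RealizationCount"])
        # Every member of a model should agree on the intended size.
        if expected.setdefault(model, count) != count:
            raise ValueError(f"inconsistent RealizationCount for {model}")
        found.setdefault(model, set()).add(int(attrs["Realization"]))
    return {
        model: sorted(set(range(1, n + 1)) - found[model])
        for model, n in expected.items()
    }


if __name__ == "__main__":
    members = [
        {"Model": "CESM", "Realization": "1", "RealizationCount": "2"},
        {"Model": "CESM", "Realization": "2", "RealizationCount": "2"},
        {"Model": "ECMWF", "Realization": "1", "RealizationCount": "3"},
        {"Model": "ECMWF", "Realization": "2", "RealizationCount": "3"},
    ]
    print(missing_realizations(members))
```

The same check works whether the attributes come from flat files, from groups in a container file, or from an external manifest.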
>
> Point 2: Multiple and/or Non-numeric Ensemble axes
> The two hierarchies constructable in this example are (top->bottom)
> Model->Realization and Realization->Model. Let's call those
> orientations the ensemble axes. Our researcher wishes to obtain
> statistics along each axis of the ensemble, e.g., the mean temperature
> of each Model (average realizations for each model), and the mean
> temperature of each realization (average of realization #1 for all
> models). How does the researcher know that she has all the models?
> The lack of "ensemble" metadata means that she may have to rely, once
> again, on results of "ls", GUI-searches, or non-automatically-readable
> documentation to be sure she has found all the models. By this I mean
> that nothing in the flat-file metadata indicates that there are only
> two models (not 3 or 24) along the Model axis of the ensemble. Once
> again, perhaps some "ensemble" metadata would help: It could
> standardize the indication of which known ensemble axes this
> particular representation belongs to. For example, file cesm_01 is a
> member of two known ensemble axes, Realization and Model. This file is
> Realization 1 of 2 total CESM Realizations, and is (in some scheme)
> Model 1 of 2 total Models. The data producer may not know a priori the
> size of either ensemble axis. In which case "ensemble" metadata will
> often be appended/altered a posteriori by the producer, distributor,
> or both.
>
> Point 3: Weights and intentional reproducibility of MME statistics
> The example I gave has two equally sized ensemble axes, 2 models of
> 2 Realizations each, so that MME statistics would weight each
> of the 4 total members equally (with a weight of 0.25). Consider now
> the case (call it "sequestration?") where the CESM model has 2
> realizations and the ECMWF model grows to 3 realizations so there are
> 5 total members in the MME. What will the researcher call the MME
> average? The researcher faces a choice. She might use a brute force
> method that averages all five inputs together and weights them at 20%
> apiece. That's often easier. Or she might subscribe to the notion of
> weighting each model equally and therefore perform a two-step average
> so that the 2 CESM realizations are weighted at 0.25 each and the 3
> ECMWF realizations are weighted at 0.5/3=0.167 each. Without
> "ensemble" metadata, the flat files do not know what size ensembles
> they belong to, and it is difficult for tools to automate the correct
> uneven weighting of datasets to create an MME. Unsophisticated users
> may take the brute-force approach and inadvertently weight all
> datasets equally. "ensemble" metadata could reduce the chances of this
> happening, and thus ensure greater reproducibility.
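
The two weighting choices in Point 3 can be sketched as follows. This is an illustration under stated assumptions: members are given as hypothetical `(model, realization)` pairs, and the third ECMWF value (273.3) is invented for the example; the other four values are the `tas` values from the CDL below.

```python
from collections import Counter


def brute_force_weights(members):
    """Weight every member equally, regardless of which model produced it."""
    n = len(members)
    return [1.0 / n for _ in members]


def model_equal_weights(members):
    """Weight every model equally, then split that weight among its realizations."""
    counts = Counter(model for model, _ in members)
    n_models = len(counts)
    return [1.0 / (n_models * counts[model]) for model, _ in members]


def weighted_mean(values, weights):
    """Weighted average; weights need not be pre-normalized."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    members = [("CESM", 1), ("CESM", 2), ("ECMWF", 1), ("ECMWF", 2), ("ECMWF", 3)]
    tas = [272.1, 272.2, 273.1, 273.2, 273.3]  # last value is made up

    print(brute_force_weights(members))   # 0.2 apiece
    print(model_equal_weights(members))   # 0.25 for CESM, 0.5/3 for ECMWF
    print(weighted_mean(tas, brute_force_weights(members)))
    print(weighted_mean(tas, model_equal_weights(members)))
```

With these numbers the brute-force mean is about 272.78 and the model-equal mean about 272.675, so the choice of weighting visibly changes the MME average.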
>
> This example illustrates the case where dataset weights depend only
> on the number of datasets in an ensemble axis. Some datasets, however,
> may have intrinsic weights that normalize to unity along that ensemble
> axis but are not equal for each ensemble member. Such datasets could
> be created by input parameters or observations that occupy a known
> proportion of probability space (i.e., PDF-based not Monte Carlo-based
> approach). How might these weights best be distributed with datasets
> so that users compute the statistics correctly? "ensemble" metadata
> could indicate weights (similar to an area_weight across datasets).
>
> This concludes my intended-to-be-compelling example of the utility
> of "ensemble" metadata. Sorry if it puts you to sleep :) People can
> think about the utility to CF of "group-aware" metadata in multiple
> ways. One way is from the data-storage-elegance point of view. Another way,
> advanced here, is how researchers could exploit it. Both are valid
> and possibly complementary ways of thinking about "group-aware" CF.
>
> Maybe this persuades you that CF could/should support "ensembles".
> How? As a new file-level featureType? Or in some other way?
> If so then we have a use-case for a convention that could apply
> to flat files, and/or be implemented in hierarchical files.
> That could be a follow-on discussion.
> If not, if there is no consensus use-case, then I agree there's no
> point for a CF modification and let's all solve the above problems
> by whichever methods we happen to like.
>
> If you're still undecided then maybe you would be persuaded by a
> different use-case example. I too would like to see other use-cases.
> So if you've got one, please describe it.
>
> Thanks for reading this far!
>
> cz
>
> // ncgen -k netCDF-4 -b -o cesm_01.nc cesm_01.cdl
>
> netcdf cesm_01 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "CESM";
> :Realization = "1";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=272.1,272.1,272.1,272.1;
>
> } // end cesm_01
>
> // ncgen -k netCDF-4 -b -o cesm_02.nc cesm_02.cdl
>
> netcdf cesm_02 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "CESM";
> :Realization = "2";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=272.2,272.2,272.2,272.2;
>
> } // end cesm_02
>
> // ncgen -k netCDF-4 -b -o ecmwf_01.nc ecmwf_01.cdl
>
> netcdf ecmwf_01 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "ECMWF";
> :Realization = "1";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=273.1,273.1,273.1,273.1;
>
> } // end ecmwf_01
>
> // ncgen -k netCDF-4 -b -o ecmwf_02.nc ecmwf_02.cdl
>
> netcdf ecmwf_02 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "ECMWF";
> :Realization = "2";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=273.2,273.2,273.2,273.2;
>
> } // end ecmwf_02
>
> // ncgen -k netCDF-4 -b -o mme_cnt.nc mme_cnt.cdl
>
> netcdf mme_cnt {
>
> group: cesm {
>
> group: cesm_01 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "CESM";
> :Realization = "1";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=272.1,272.1,272.1,272.1;
>
> } // cesm_01
>
> group: cesm_02 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "CESM";
> :Realization = "2";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=272.2,272.2,272.2,272.2;
>
> } // cesm_02
>
> } // cesm
>
> group: ecmwf {
>
> group: ecmwf_01 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "ECMWF";
> :Realization = "1";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=273.1,273.1,273.1,273.1;
>
> } // ecmwf_01
>
> group: ecmwf_02 {
> :Conventions = "CF-2.x";
> :history = "yada yada yada";
> :Scenario = "Historical";
> :Model = "ECMWF";
> :Realization = "2";
>
> dimensions:
> time=4;
> variables:
> float tas(time);
> data:
> tas=273.2,273.2,273.2,273.2;
>
> } // ecmwf_02
>
> } // ecmwf
>
> } // root group
>
> --
> Charlie Zender, Earth System Sci. & Computer Sci.
> University of California, Irvine 949-891-2429 )'(



-- 
Bryan Lawrence
University of Reading: Professor of Weather and Climate Computing.
National Centre for Atmospheric Science: Director of Models and Data.
STFC: Director of the Centre for Environmental Data Archival.
Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
Received on Fri Sep 20 2013 - 01:21:34 BST
