
[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata? (CZ)

From: Charlie Zender <zender>
Date: Tue, 24 Sep 2013 21:45:40 -0700

Hi Martin,

Thank you for taking the time to address my 3 points about ensembles,
and for adding some new examples. Response interleaved...

> On 19/09/2013 14:58, Schultz, Martin wrote:
> Hi Charlie,
>
> very good and extensive explanation of the potential use for groups
> and group-aware metadata. Yet, I have a few remarks (which may in
> part reveal that I should probably read the preamble of the CF
> convention again ;-):
>
>> Point 1: How does the user know she has all the realizations?
>
> Is this question best addressed with metadata in a (series of)
> file(s)?

> In a modern, interoperable architecture, I would think that
> this belongs into the realm of data discovery, which would be done
> via web catalogues using metadata facets. File-based metadata IMHO
> may be more prone to failure. Just imagine, ECMWF had first
> generated two ensemble members, and their metadata would say so
> (your "page 2/2" analogy). Now they run another two: do you really
> expect the metadata from the old files to be updated? A web
> catalogue would provide a more robust solution to this question, I
> believe.
>
> This doesn't mean that it may not be useful to have such information
> in a file! However, to come back to the suitcases: this can only be
> a packing list for the current trip and not an inventory of all the
> socks you may possibly own. Of course your young aspiring researcher
> may wish to express her knowledge about other ensemble members she
> found on the web but didn't include in the file (the suitcase). But
> will her supervisor or colleague on the other side of the world
> understand what she is talking about? I think, if you intend to go
> beyond the packing list, you open too many cans with too many
> worms.

Agreed, and that is not what I am suggesting. The ensemble example I
posted was designed to show an interesting use case where flat-file
solutions are currently lacking and where groups would help.

As discussed in my response to Bryan Lawrence, I agree that adding
"temporary" or "out-of-scope" metadata to a dataset is unwise and
unmaintainable. If servers served hierarchical files then ensemble
sizes, for example, could be simply counted by counting groups.
Ensemble metadata, like weights, could be stored as group metadata
rather than within the realization dataset itself. This preserves the
boundary between internal and external. I have yet to hear a
flat-file-based method of accomplishing this that does not violate
the internal/external barrier.
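
To make the idea concrete, here is a minimal sketch in plain Python.
It does not use the netCDF API: the Group class is a toy stand-in for
a hierarchical file, and the names "ensemble_weights", "tas", and
"r1".."r3" are hypothetical, chosen only for illustration. The point
is that ensemble size falls out of counting child groups, and that
weights live on the parent group rather than inside any realization.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Toy stand-in for a group in a hierarchical file (not the netCDF API)."""
    attrs: dict = field(default_factory=dict)
    variables: dict = field(default_factory=dict)
    groups: dict = field(default_factory=dict)


def ensemble_size(ensemble):
    """Ensemble size is simply the number of child groups."""
    return len(ensemble.groups)


def weighted_mean(ensemble, varname):
    """Weighted ensemble mean using weights stored on the parent group."""
    weights = ensemble.attrs["ensemble_weights"]
    members = list(ensemble.groups.items())
    total = sum(weights[name] for name, _ in members)
    length = len(members[0][1].variables[varname])
    return [
        sum(weights[name] * grp.variables[varname][i] for name, grp in members) / total
        for i in range(length)
    ]


# Hypothetical three-member ensemble; the weights are "external" to each member.
root = Group(attrs={"ensemble_weights": {"r1": 0.5, "r2": 0.3, "r3": 0.2}})
root.groups["r1"] = Group(variables={"tas": [280.0, 281.0]})
root.groups["r2"] = Group(variables={"tas": [282.0, 283.0]})
root.groups["r3"] = Group(variables={"tas": [279.0, 280.0]})

print(ensemble_size(root))
print(weighted_mean(root, "tas"))
```

Note that no realization group carries any knowledge of the ensemble;
remove the parent attribute and each member is exactly what it was as
a flat file.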

To get back to the researcher: even if she has a collection of flat
files on her local disk, obtained from a flawless discovery server,
she can still make typos when analyzing those files.
Despite years of trying, the number of papers produced via analysis of
remote files is near zero :) Bookkeeping with lots of files can be
error-prone. If we reduce the number of files by putting datasets in
suitcases, this type of mistake may be less likely.

>> Point 2: Multiple and/or Non-numeric Ensemble axes
>
> Here, you have a valid point, although - again - I would not
> connect this to knowing "that she has all the models". Yet, within
> the packed file (the suitcase) you want to know which hierarchy
> model (packing order) was applied in order to be able to aggregate
> things (for example by computing ensemble averages). See also my
> use-case on aircraft data introduced below. Question: what happens
> to this kind of information when the files are flattened and
> re-packaged? It might well become meaningless, which would indicate
> that these are "temporary" metadata, and thus probably out of scope
> for CF. This actually reminds me a bit of my experiences with the
> history attribute when I use ncks -A. This command will preserve the
> history of one file, but discard the history of the other file,
> which is certainly not the behavior you would like to see in
> ungrouping/re-grouping software.

First, we all agree on the necessity of being able to aggregate and
dismember hierarchical files without loss of information.
A new example in the NCO manual shows this is a solved problem:
http://nco.sf.net/nco.html#dismember
Perhaps this will assuage people. May this zombie not re-animate!
Dismembering a dataset that was aggregated from flat files back into
flat files will not lose any information. And, if suitable conventions
are adopted, the flat-file products need not contain any new information.

Which brings us to Martin's point: what does disaggregation do with
any "external" metadata? This is, by definition, metadata that was
never in the flat files. So I see no alternative: the fate of
"external" metadata must be determined by convention. The options
include deleting it completely or appending it to the flat files.
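
The two options can be sketched as follows. This is a plain-Python
illustration with nested dicts standing in for a hierarchical file; it
is not how NCO dismembers files, and the function name dismember, the
"external" switch, and the attribute names are all hypothetical.

```python
def dismember(ensemble, external="drop"):
    """Split a toy hierarchical dataset into one flat dataset per group.

    external="drop"   discards parent-level ("external") attributes.
    external="append" copies them into every flat product, without
                      overwriting attributes the member already had.
    """
    if external not in ("drop", "append"):
        raise ValueError("external must be 'drop' or 'append'")
    flat = {}
    for name, group in ensemble["groups"].items():
        attrs = dict(group["attrs"])
        if external == "append":
            for key, value in ensemble["attrs"].items():
                attrs.setdefault(key, value)
        flat[name] = {"attrs": attrs, "variables": dict(group["variables"])}
    return flat


# Hypothetical two-member ensemble with one piece of external metadata.
ensemble = {
    "attrs": {"ensemble_weights": "r1:0.6 r2:0.4"},
    "groups": {
        "r1": {"attrs": {"institution": "A"}, "variables": {"tas": [280.0]}},
        "r2": {"attrs": {"institution": "B"}, "variables": {"tas": [282.0]}},
    },
}

dropped = dismember(ensemble, external="drop")
appended = dismember(ensemble, external="append")
print(dropped["r1"]["attrs"])
print(appended["r1"]["attrs"])
```

With "drop", the flat products are identical to the original flat
files. With "append", they gain information they never had, which is
why the choice needs to be a matter of convention.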

Finally, if you want ncks -A to save all the metadata from both files
somehow, e.g., in "history_1" and "history_2" attributes, or some
such, just post your suggestion to our help forum and we'll find a way
forward. Squeaky wheels...
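
For anyone drafting such a request, here is a rough plain-Python
sketch of what the merged global attributes might look like. It
describes a possible behavior, not what ncks -A does today; the
function merge_global_attrs and the rule that the first file wins on
other conflicting attributes are assumptions made for illustration.

```python
def merge_global_attrs(first, second):
    """Merge two sets of global attributes, keeping both histories.

    Conflicting non-history attributes are taken from `first` (an
    assumed rule). Each input's history, if present, is kept under a
    numbered name instead of one silently replacing the other.
    """
    merged = dict(second)
    merged.update(first)
    merged.pop("history", None)
    for index, attrs in enumerate((first, second), start=1):
        if "history" in attrs:
            merged[f"history_{index}"] = attrs["history"]
    return merged


# Hypothetical global attributes of two files being combined.
file_a = {"title": "Run A", "history": "created by model A"}
file_b = {"title": "Run B", "history": "regridded by tool B", "source": "obs"}

merged = merge_global_attrs(file_a, file_b)
print(merged)
```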

>> Point 3: Weights and intentional reproducibility of MME statistics
>
> In my view this is actually just another viewing angle on your point #2.
>
> --
>
> Your use-case does however highlight the "convenience" of grouping
> data which somehow belong to each other into one file. In a world of
> flat files, one must check coordinates each time when you want to
> perform some sort of (ensemble) averaging operation. A hierarchical
> file will tell you that it is OK to average by placing the common
> coordinates on the upper level. IMPORTANT: again, this doesn't mean
> that this is the only or best way to do the grouping - yet, it seems
> a compelling advantage to have this coordinate-consistency problem
> eliminated somewhere along your processing steps. As others said
> already: there are reasons for why people use suitcases.

Yes, the power and convenience of this should not be overlooked. E.g.,

ncks -d lat,-30.,30. cmip5.nc cmip5_tropics.nc

obtains the tropical hyperslab of each model in cmip5.nc, despite
the likely heterogeneity of the underlying coordinate grids.
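
The effect of that one command can be illustrated with a small
plain-Python sketch. It is not NCO's implementation; the dict layout,
the function name tropics, and the model names and values are
hypothetical. Each group is subset against its own latitude
coordinate, so the grids do not need to match.

```python
def tropics(models, lat_min=-30.0, lat_max=30.0):
    """Apply the same latitude bounds to every model, each on its own grid."""
    subset = {}
    for name, model in models.items():
        keep = [i for i, lat in enumerate(model["lat"]) if lat_min <= lat <= lat_max]
        subset[name] = {
            "lat": [model["lat"][i] for i in keep],
            "tas": [model["tas"][i] for i in keep],
        }
    return subset


# Two hypothetical models on different latitude grids.
cmip5 = {
    "coarse": {
        "lat": [-60.0, -30.0, 0.0, 30.0, 60.0],
        "tas": [260.0, 290.0, 300.0, 291.0, 262.0],
    },
    "fine": {
        "lat": [-45.0, -15.0, 15.0, 45.0],
        "tas": [275.0, 297.0, 298.0, 276.0],
    },
}

cmip5_tropics = tropics(cmip5)
for name, model in cmip5_tropics.items():
    print(name, model["lat"])
```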

> Now, here is another use case, which we haven't implemented yet -
> partly because we didn't see how it can be done in a CF consistent
> way:
> While there has been a definition of a standard file layout for data
> from multiple stations (a contribution from Ben Domenico and Stefano
> Nativi if I am not mistaken), this concept cannot be applied to
> multiple aircraft flight data. The station data can be packaged
> together with help of a non-geophysical "station" coordinate,
> because all stations share the same time axis. With aircraft
> flights, the time axes often don't overlap, and forcing all data
> onto the superset of time would be a tremendous waste of
> space. Groups would seem as the natural solution to this problem!
> Why not flat files? Because you might wish to retrieve all the
> aircraft data which were sampled in a given region during a specific
> period (a natural use case for a catalogue query it seems) in one
> entity, and not in N entities, where you cannot even predict N.
>
> I would think the same applies to "granules" of satellite data which
> share a common calibration, for example.

Indeed, archiving multiple satellite granules is a main reason NASA
uses groups.
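
Martin's point about wasted space in the aircraft case is easy to
quantify with a back-of-the-envelope sketch. The flight names and time
ranges below are made up, and the function storage_cells is
hypothetical; it simply compares the number of cells needed when every
flight is padded onto the superset time axis with the number needed
when each flight keeps its own time axis in its own group.

```python
def storage_cells(flights):
    """Return (padded, grouped) cell counts for one variable.

    padded:  every flight stored on the union of all time axes
    grouped: every flight stored on its own time axis
    """
    union = set()
    for times in flights.values():
        union.update(times)
    padded = len(union) * len(flights)
    grouped = sum(len(times) for times in flights.values())
    return padded, grouped


# Three hypothetical flights whose sampling times do not overlap.
flights = {
    "flight_1": range(0, 100),
    "flight_2": range(500, 650),
    "flight_3": range(2000, 2050),
}

padded, grouped = storage_cells(flights)
print(padded, grouped)
```

With non-overlapping flights the padded layout grows with the square
of the number of flights, while the grouped layout grows only with
the number of samples actually taken.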

> As Nan said, we should try to come back to define what is really at
> stake for CF and what exactly shall be proposed. Now this is where
> my failure to re-read the convention preamble may show ;-). The main
> question is: is CF about files or about interoperability?

Another rhetorical dichotomy: is CF about data providers or users?
There is both tension and alignment between the two.
Just ask tech support :)

> Unfortunately, my view on this is not entirely clear, because it
> seems to be a bit of both. The standard_names clearly have a bearing
> in the interoperable world, and this shows through various links to
> the CF standard_names in web catalogues or controlled vocabulary
> collections (e.g. SeaDataNet). The conventions themselves seem to be
> more file-oriented - even though the discussions about the data
> model always make a strong point to go beyond representation in a
> (single) file. [If someone disagrees and wishes to see the CF
> convention play a more important role in interoperability, then I
> would ask why it is not cast into an XML schema extending ISO19115
> then. ] If CF is indeed "file-oriented", then I do think that it
> makes a lot of sense to support "modern" file structures, which
> include groups and hierarchies, whether we like them or
> not. Therefore, I would advocate that we focus the discussion on two
> major points with a couple of sub-issues:
>
> 1. which parts of CF might fail when we have a hierarchical file?
> (and let's stick to the simple inheritance model of netcdf4 for
> now!)
> 1a. what would the current CF checker say if it is fed a
> hierarchical file?

http://puma.nerc.ac.uk/cgi-bin/cf-checker.pl
This compliance checker sometimes chokes on complex hierarchical files,
and sometimes (as far as I can tell) does a standard check of the
root group only.

> 1b. what happens to global attributes when flat files are grouped
> together?
> 1c. do we need to re-phrase some aspects of the convention to make
> them "group-aware"? (this does not include defining new rules - that
> is covered in point 2)
> 1d. anything else?
>
> 2. where do we need to extend the current CF concept?
> 2a. introduction of a new attribute "level" (equate "global" with
> "root"? What happens when hierarchical files are flattened? [please
> see the 3 varieties of flattening operations mentioned in an earlier
> post])
> 2b. specification of "ensemble_..." attributes? "ensemble_axis" may
> not be needed if these axes are defined on the group level (?)
> Something like "ensemble_history" or "ensemble_structure" to inform
> the user about the grouping principle?
> 2c. what other "relations" need to be expressed within a
> hierarchical file? The guiding principle here should be that
> additional rules are only needed if they avoid ambiguity and
> misinterpretation of the data. And here we get onto interoperability
> territory again (see my use case about aircraft data above).

Thank you for your suggestion of starting points.
If and when there is consensus on the desirability of including groups
and/or "ensemble" features into CF, numbers 1 and 2 above seem like
reasonable starting points. It is not my place to determine whether
there is a consensus, or how close we are, but it's clear to me there
is no consensus yet. Bryan Lawrence, Steve Hankin, Jonathan Gregory,
Karl Taylor, and Philip Cameron-Smith are not "on board". I hope they
will speak up and say whether they concur that maintaining the status
quo (flat files) is best (period), whether they do wish to extend CF to
hierarchies (starting now), or what additional information they would
need to decide.

> Sorry for this long post -- this just somehow seems to be quite relevant!

I think you've hit many of the most pressing points.
Look forward to hearing whether this discussion has persuaded anyone
that reasons to include hierarchies in CF outweigh reasons not to.

cz

> Best regards,
>
> Martin

-- 
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine 949-891-2429 )'(
Received on Tue Sep 24 2013 - 22:45:40 BST
