[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata? (CZ) from Schultz, Martin on 2013-09-19 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Schultz, Martin <m.schultz>
Date: Thu, 19 Sep 2013 21:58:18 +0000

Hi Charlie,

very good and extensive explanation of the potential use for groups and group-aware metadata. Yet, I have a few remarks (which may in part reveal that I should probably read the preamble of the CF convention again ;-):

> Point 1: How does the user know she has all the realizations?

Is this question best addressed with metadata in a (series of) file(s)? In a modern, interoperable architecture, I would think that this belongs in the realm of data discovery, which would be done via web catalogues using metadata facets. File-based metadata IMHO may be more prone to failure. Just imagine that ECMWF had first generated two ensemble members, and that their metadata said so (your "page 2/2" analogy). Now they run another two: do you really expect the metadata in the old files to be updated? A web catalogue would provide a more robust solution to this question, I believe.

This doesn't mean that it may not be useful to have such information in a file! However, to come back to the suitcases: this can only be a packing list for the current trip and not an inventory of all the socks you may possibly own. Of course your young aspiring researcher may wish to express her knowledge about other ensemble members she found on the web but didn't include in the file (the suitcase). But will her supervisor or colleague on the other side of the world understand what she is talking about? I think, if you intend to go beyond the packing list, you open too many cans with too many worms.

> Point 2: Multiple and/or Non-numeric Ensemble axes

Here, you have a valid point, although - again - I would not connect this to knowing "that she has all the models". Yet, within the packed file (the suitcase) you want to know which hierarchy model (packing order) was applied in order to be able to aggregate things (for example by computing ensemble averages). See also my use case on aircraft data introduced below. Question: what happens to this kind of information when the files are flattened and re-packaged? It might well become meaningless, which would indicate that these are "temporary" metadata, and thus probably out of scope for CF. This actually reminds me a bit of my experiences with the history attribute when I use ncks -A. This command will preserve the history of one file, but discard the history of the other file, which is certainly not the behavior you would like to see in ungrouping/re-grouping software.
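As an illustration of the history concern, here is a minimal sketch of how re-grouping software might keep both histories instead of discarding one. It works on plain strings rather than real netCDF files; the function name merge_history, the file names, and the history lines are all hypothetical, and this is not how ncks itself behaves.

```python
def merge_history(histories):
    """Combine several history attributes into one string.

    `histories` maps a (hypothetical) source file name to its history text.
    Every non-empty line is kept and tagged with the file it came from.
    """
    lines = []
    for source, history in histories.items():
        for line in history.splitlines():
            if line.strip():
                lines.append(f"[{source}] {line.strip()}")
    return "\n".join(lines)


# Made-up history entries for two flat files that are being packed together.
merged = merge_history({
    "a.nc": "2013-09-01 ncks -v tas in.nc a.nc",
    "b.nc": "2013-09-02 ncks -v pr in.nc b.nc",
})
print(merged)
```

Tagging each line with its origin is only one possible choice; keeping a separate history attribute per group would serve the same purpose.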

> Point 3: Weights and intentional reproducibility of MME statistics

In my view this is actually just another viewing angle on your point #2.

--
Your use case does, however, highlight the "convenience" of grouping data which somehow belong to each other into one file. In a world of flat files, you must check the coordinates every time you want to perform some sort of (ensemble) averaging operation. A hierarchical file will tell you that it is OK to average by placing the common coordinates on the upper level. IMPORTANT: again, this doesn't mean that this is the only or best way to do the grouping - yet, it seems a compelling advantage to have this coordinate-consistency problem eliminated somewhere along your processing steps. As others have said already: there are reasons why people use suitcases.
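To make the flat-file burden concrete, the following sketch shows the kind of coordinate check that has to precede every ensemble average when each member carries its own copy of the coordinates. The members are plain dictionaries standing in for files; the names coords_match and ensemble_mean, the variable name "tas", and the tolerance are assumptions made for illustration only.

```python
def coords_match(members, names=("lat", "lon"), tol=1e-6):
    """Return True if all members hold the same values for the named coordinates."""
    reference = members[0]
    for member in members[1:]:
        for name in names:
            ref_axis, axis = reference[name], member[name]
            if len(ref_axis) != len(axis):
                return False
            if any(abs(a - b) > tol for a, b in zip(ref_axis, axis)):
                return False
    return True


def ensemble_mean(members, variable="tas"):
    """Average one variable across members, refusing to mix inconsistent grids."""
    if not coords_match(members):
        raise ValueError("members do not share the same coordinates")
    n = len(members)
    return [sum(values) / n for values in zip(*(m[variable] for m in members))]


# Two made-up ensemble members on the same tiny grid.
member_a = {"lat": [0.0, 10.0], "lon": [0.0], "tas": [280.0, 290.0]}
member_b = {"lat": [0.0, 10.0], "lon": [0.0], "tas": [282.0, 288.0]}
mean_tas = ensemble_mean([member_a, member_b])
```

In a hierarchical file with the shared coordinates stored once at the upper level, the coords_match step would be unnecessary by construction.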
--
Now, here is another use case, which we haven't implemented yet - partly because we didn't see how it can be done in a CF consistent way:
While there has been a definition of a standard file layout for data from multiple stations (a contribution from Ben Domenico and Stefano Nativi if I am not mistaken), this concept cannot be applied to multiple aircraft flight data. The station data can be packaged together with the help of a non-geophysical "station" coordinate, because all stations share the same time axis. With aircraft flights, the time axes often don't overlap, and forcing all data onto the superset of time would be a tremendous waste of space. Groups would seem to be the natural solution to this problem! Why not flat files? Because you might wish to retrieve all the aircraft data that were sampled in a given region during a specific period (a natural use case for a catalogue query it seems) in one entity, and not in N entities, where you cannot even predict N.
I would think the same applies to "granules" of satellite data which share a common calibration, for example.
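A rough sketch of the space argument: it counts how many cells are needed when every flight is forced onto the union of all time stamps, compared with storing each flight on its own time axis (for example, one group per flight). The flight names and time values are invented, and the count ignores real-world effects such as compression.

```python
def padded_cells(flights):
    """Cells needed when every flight is stored on the union of all time stamps."""
    all_times = set()
    for times in flights.values():
        all_times.update(times)
    return len(flights) * len(all_times)


def grouped_cells(flights):
    """Cells needed when each flight keeps its own time axis."""
    return sum(len(times) for times in flights.values())


# Three hypothetical flights whose time axes do not overlap.
flights = {
    "flight_001": [0, 1, 2, 3],
    "flight_002": [100, 101, 102],
    "flight_003": [500, 501],
}
padded = padded_cells(flights)
grouped = grouped_cells(flights)
fill_fraction = 1 - grouped / padded  # share of padded cells that hold only fill values
```

With fully disjoint time axes the padded layout grows with the square of the number of flights, while the per-flight layout grows linearly.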
--
As Nan said, we should try to come back to define what is really at stake for CF and what exactly shall be proposed. Now this is where my failure to re-read the convention preamble may show ;-). The main question is: is CF about files or about interoperability? Unfortunately, my view on this is not entirely clear, because it seems to be a bit of both. The standard_names clearly have a bearing in the interoperable world, and this shows through various links to the CF standard_names in web catalogues or controlled vocabulary collections (e.g. SeaDataNet). The conventions themselves seem to be more file-oriented - even though the discussions about the data model always make a strong point of going beyond representation in a (single) file. [If someone disagrees and wishes to see the CF convention play a more important role in interoperability, then I would ask why it is not cast into an XML schema extending ISO 19115.] If CF is indeed "file-oriented", then I do think that it makes a lot of sense to support "modern" file structures, which include groups and hierarchies, whether we like them or not. Therefore, I would advocate that we focus the discussion on two major points with a couple of sub-issues:
1. which parts of CF might fail when we have a hierarchical file? (and let's stick to the simple inheritance model of netcdf4 for now!)
1a. what would the current CF checker say if it is fed a hierarchical file?
1b. what happens to global attributes when flat files are grouped together?
1c. do we need to re-phrase some aspects of the convention to make them "group-aware"? (this does not include defining new rules - that is covered in point 2)
1d. anything else?
2. where do we need to extend the current CF concept?
2a. introduction of a new attribute "level" (equate "global" with "root"? What happens when hierarchical files are flattened? [please see the 3 varieties of flattening operations mentioned in an earlier post])
2b. specification of "ensemble_..." attributes? "ensemble_axis" may not be needed if these axes are defined on the group level (?). Something like "ensemble_history" or "ensemble_structure" to inform the user about the grouping principle?
2c. what other "relations" need to be expressed within a hierarchical file? The guiding principle here should be that additional rules are only needed if they avoid ambiguity and misinterpretation of the data. And here we get onto interoperability territory again (see my use case about aircraft data above).
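To illustrate question 1b, here is a sketch of one conceivable policy for global attributes when flat files are packed into groups: an attribute moves to the root only if every file carries it with an identical value, and everything else stays at the group level. This policy is an assumption for discussion, not a CF rule; the function name group_attributes and the attribute values are hypothetical.

```python
def group_attributes(files):
    """Split per-file global attributes into root-level and group-level sets.

    Assumed policy (not a CF rule): an attribute is promoted to the root
    only if every file has it with exactly the same value.
    """
    attr_sets = list(files.values())
    shared = {
        key: value
        for key, value in attr_sets[0].items()
        if all(key in attrs and attrs[key] == value for attrs in attr_sets[1:])
    }
    per_group = {
        name: {k: v for k, v in attrs.items() if k not in shared}
        for name, attrs in files.items()
    }
    return shared, per_group


# Made-up global attributes of two flat files, one per ensemble member.
files = {
    "member_1": {"Conventions": "CF-1.6", "institution": "ECMWF", "history": "created run 1"},
    "member_2": {"Conventions": "CF-1.6", "institution": "ECMWF", "history": "created run 2"},
}
root_attrs, group_attrs = group_attributes(files)
```

Flattening would then need the reverse step, copying the root attributes back into each file, which is one way to see why the meaning of "global" has to be pinned down.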
Sorry for this long post -- this just somehow seems to be quite relevant!
Best regards,
Martin
--------------------------------------------------------------------------------
PD Dr. Martin G. Schultz
IEK-8, Forschungszentrum Jülich
D-52425 Jülich
Ph: +49 2461 61 2831

Received on Thu Sep 19 2013 - 15:58:18 BST
