
[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata?

From: Charlie Zender <zender>
Date: Fri, 20 Sep 2013 20:33:03 -0700

Hi Bryan,

Your last point is crucial so let me address it first.
I agree that placing metadata that is "external" to the dataset into
the dataset is a mistake! A "category error" as you put it. As the
interleaved response explains, our proposal neither recommends nor
requires this.

On 20/09/2013 00:21, Bryan Lawrence wrote:

> Great use case, but not compelling :-) Before I get going, I can see
> that folks might want to do this, and fair play, but it's not
> compelling. Let me give you a couple of examples of where this approach
> would cause others problems.
>
> Let's imagine that the ensemble members take time to produce, and that
> not everything can be stored online and some stuff has to go to tape.
> So now we have a situation where we are encouraging folks to write into
> files that *will* need to be altered, and we'll have to pull all those
> files off tape to modify them. Further, we can be sure that the axes of
> analyses will change over time as different users analyse the data for
> different problems.

Back-end storage (e.g., your tape farm) need not change at all.
Servers can continue to store flat datasets as primary sources.
Servers could aggregate into hierarchies as user or provider desires.
Datasets matching a query ("all ECMWF Historical simulations")
could be aggregated into a hierarchy by the server and returned.
No one advocates needing to know how many ensemble members will be
performed and including that number in a model dataset at run time.
It would be unwise and practically impossible to do.
Let the query retrieve what it will and aggregate the data if and how
the query specifies. Then the user/provider could obtain/serve ONE
or a few hierarchical files or possibly flat files with equivalent
metadata (subject to size limitations, of course).

In fact, the group aggregation that CF could consider would solve many
problems without appending _any_ metadata, for group hierarchies can
be automatically traversed and counted (thus determining ensemble axis
size), and/or extra group metadata can be added outside the "ensemble"
without touching the original data. A logical place for "external"
data for members of the CESM ensemble could be group metadata in
the supergroup (/cesm) that holds the members (/cesm/cesm_??).
Do flat files offer a cleaner method for handling "external" data?
And one that generalizes to ensembles of any order?
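
As a rough illustration of the traversal idea above, here is a minimal
sketch. It is not NCO or netCDF library code: the Group class is a
hypothetical plain-Python stand-in for a netCDF-4 group tree, and the
"member_weights" attribute and its values are made up for the example.
It only shows that the ensemble size can be found by counting member
groups, and that "external" information can sit on the supergroup
without touching the members.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Group:
    """Hypothetical stand-in for a netCDF-4 group: name, attributes, subgroups."""
    name: str
    attrs: dict = field(default_factory=dict)
    groups: dict = field(default_factory=dict)

    def add(self, child):
        """Attach a subgroup and return it."""
        self.groups[child.name] = child
        return child


def ensemble_members(supergroup, pattern="*_??"):
    """Return child groups whose names match the member pattern, sorted by name."""
    return [g for name, g in sorted(supergroup.groups.items())
            if fnmatch(name, pattern)]


# Build /cesm/cesm_01 .. /cesm/cesm_03 (three members, chosen arbitrarily).
root = Group("/")
cesm = root.add(Group("cesm"))
for i in range(1, 4):
    cesm.add(Group(f"cesm_{i:02d}"))

# "External" information (made-up weights) is attached to the supergroup only.
cesm.attrs["member_weights"] = {"cesm_01": 0.5, "cesm_02": 0.25, "cesm_03": 0.25}

members = ensemble_members(cesm)
ensemble_size = len(members)  # found by traversal, never stored in a member
```

Adding a fourth member would change ensemble_size on the next traversal
without rewriting any existing member group.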

As mentioned in a previous thread, some users prefer all their data in
one place, where they can, e.g., average it with one command:

ncwa mme_cnt.nc. Done. No loops or wildcards necessary.

Some users will prefer their current methods. Some will shift
gradually as tools, producers, distributors, and conventions improve
support.

Many objections to hierarchies understandably come from data
distributors with an investment in an existing codebase.
That's a fact---not a criticism---because who else reads this list?
Once users have their data, they don't spend so much time looking at
catalogs and web services. They massage the data they have obtained.
As files. Lots and lots of flat files. Many users, and some
distributors, would appreciate a tidier solution.

We could start a whole thread on the use case of "return all the data
sets relevant to this bounding box". Practically impossible with
flat files because of all the grids that could be involved. Yet quite
plausible with groups. However, this case does not interest me as much
as MME. Please be aware of the general issue of file proliferation,
though.

> A priori grouping is the same as deciding a priori on the most efficient
> axes of analysis. In general, for climate data analysis (and for us at
> BADC with very diverse user groupings) this approach would be very
> inefficient ...

In other words, the ability to aggregate, dismember, flatten, invert, and
reorder hierarchies easily without loss of information is desirable.
Fully agree.

> If you want to avoid relying on ls/grep and friends, then why not have a
> short line text file, a manifest? So we only need to alter the manifest.
>
> (Incidentally that's the same solution that NASA should be using.)

No argument from me that that can work well for providers.
However, many users rely on typing commands to analyze datasets on
their local machine which have been previously downloaded or produced
locally. I was thinking more of them. When information needed for an
analysis is not in the file, then the file is incomplete.
Self-describing datasets are great.
Self-describing and complete datasets are even better.

> IMHO it's a category error to push everything into internal file
> metadata. The suitcase metaphor is exactly appropriate for this
> approach, since the reason for doing it is to transfer stuff you think
> belongs together. However, five minutes after getting home, you empty
> the suitcase .... and reorganise it. At petascale, we can't afford to do
> that!

Pack a suitcase when it makes sense.
Users will let you know when you've packed well by their silence.
Or poorly by their complaints :)

Finally, your response does not indicate how/if you would solve Point 3
of my "compelling ensemble" example, e.g., weights. How would you
attach data that is "external" to a dataset to the dataset without
running into all the problems of a priori writing?

cz

-- 
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine 949-891-2429 )'(
Received on Fri Sep 20 2013 - 21:33:03 BST
