[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata? from Charlie Zender on 2013-09-25 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Charlie Zender <zender>
Date: Tue, 24 Sep 2013 22:12:22 -0700

Hello Stephen,

Thanks for your comments. Response interleaved.

cz

On 21/09/2013 01:50, stephen.pascoe at stfc.ac.uk wrote:
> Hi Charlie,
>
> Your perspective is as an end user who has autonomy over how they
> want to view the data. The data provider's perspective is
> different. We try to avoid transforming the data wherever
> possible. Archived data shouldn't be changed at all and
> transforming a copy adds a lot of overhead in both disk and software
> complexity. We don't like having to continually pack and unpack
> suitcases.

That seems like an accurate portrait of many providers' perspectives.
Remember though: users are providers' reason for being.
Users can build suitcases themselves if they have to.
Though returning data in the most usable form is the providers' job.

> My impression from this excellent discussion is that hierarchical
> NetCDF doesn't make sense for data providers (except in specific
> cases) but does for some users. A way round this would be for CF to
> start by saying any hierarchical CF-NetCDF file must be equivalent to
> a set of non-hierarchical CF-NetCDF files, or ideally both would have
> definitions in terms of a CF data model (but that might take some
> time). I.e. we need a standard way of packing and unpacking the
> suitcase. If we did this first (and stuck to it!) I think I'd be
> happy.

That seems like a sensible place to start to me, and some others, too.

> I've made a couple more comments inline.
>
> Cheers,
> Stephen.
>
> On 21 Sep 2013, at 04:33, Charlie Zender wrote:
>
>> Hi Bryan,
>>
>> Your last point is crucial so let me address it first.
>> I agree that placing metadata that is "external" to the dataset into
>> the dataset is a mistake! A "category error" as you put it. As the
>> interleaved response explains, our proposal neither recommends nor
>> requires this.
>>
>> On 20/09/2013 00:21, Bryan Lawrence wrote:
>>
>>> Great use case, but not compelling :-) Before I get going, I can see
>>> that folks might want to do this, and fair play, but it's not
>>> compelling. Let me give you a couple of examples of where this approach
>>> would cause others problems.
>>>
>>> Let's imagine that the ensemble members take time to produce, and that
>>> not everything can be stored online and some stuff has to go to tape.
>>> So now we have a situation where we are encouraging folks to write into
>>> files that *will* need to be altered, and we'll have to pull all those
>>> files off tape to modify them. Further, we can be sure that the axes of
>>> analyses will change over time as different users analyse the data for
>>> different problems.
>>
>> Back-end storage (e.g., your tape farm) need not change at all.
>> Servers can continue to store flat datasets as primary sources.
>> Servers could aggregate into hierarchies as user or provider desires.
>> Datasets matching a query ("all ECMWF Historical simulations")
>> could be aggregated into a hierarchy by the server and returned.
>> No one advocates needing to know how many ensemble members will be
>> performed and including that number in a model dataset at run time.
>> It would be unwise and practically impossible to do.
>> Let the query retrieve what it will and aggregate the data if and how
>> the query specifies. Then the user/provider could obtain/serve ONE
>> or a few hierarchical files or possibly flat files with equivalent
>> metadata (subject to size limitations, of course).
>>
>
> As a convenient package for delivering collections of data to the
> user this would be useful. However, to maintain the interoperability
> of CF we would need watertight rules for transforming a grouped
> CF-NetCDF file into a set of equivalent NetCDF files, and vice
> versa.

Agreed.
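
A minimal sketch of what such a round-trip rule could look like. It
assumes, purely for illustration, that each flat file is identified by
a name whose components are joined by a separator, and that a grouped
file can be modeled as nested dicts. The names `pack`, `unpack`, and
the `__` separator are hypothetical; none of them is part of CF or
netCDF.

```python
# Hypothetical model: a flat "file set" maps names like "cesm__cesm_01" to
# payloads; a grouped "file" is a nested dict keyed by group name.
SEP = "__"  # illustrative separator, not a CF convention


def pack(flat):
    """Build a nested group hierarchy from a flat name -> payload mapping."""
    root = {}
    for name, payload in flat.items():
        *groups, leaf = name.split(SEP)
        node = root
        for g in groups:
            node = node.setdefault(g, {})
        node[leaf] = payload
    return root


def unpack(tree, prefix=()):
    """Flatten a nested group hierarchy back to a name -> payload mapping."""
    flat = {}
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # A dict is a group: recurse into it.
            flat.update(unpack(value, path))
        else:
            # Anything else is a payload: record it under its full name.
            flat[SEP.join(path)] = value
    return flat


flat_files = {
    "cesm__cesm_01": [280.1, 281.4],
    "cesm__cesm_02": [279.8, 281.0],
    "ecmwf__ecmwf_01": [280.5, 281.2],
}
grouped = pack(flat_files)
assert unpack(grouped) == flat_files  # the round trip loses nothing
```

In this toy model an empty group would not survive unpacking, which is
the kind of corner case a real packing rule would have to settle.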

>> In fact the group aggregation CF could consider would solve many
>> problems without appending _any_ metadata, for group hierarchies can
>> be automatically traversed and counted (thus determining ensemble axis
>> size), and/or extra group metadata can be added outside the "ensemble"
>> without touching the original data. A logical place for "external"
>> data for members of the CESM ensemble could be group metadata in
>> the supergroup (/cesm) that holds the members (/cesm/cesm_??).
>> Do flat files offer a cleaner method for handling "external" data?
>> And one that generalizes to ensembles of any order?
>>
>> As mentioned in a previous thread, some users prefer all their data in
>> one place. Where they can, e.g., average it with one command:
>>
>> ncwa mme_cnt.nc. Done. No loops or wildcards necessary.
>>
>> Some users will prefer their current methods. Some will shift
>> gradually as tools, producers, distributors, and conventions improve
>> support.
>>
>> Many objections to hierarchies understandably come from data
>> distributors with an investment in an existing codebase.
>
> Not entirely. Sure, we don't want to be given "CF-NetCDF" that
> breaks all our tools because of hierarchies but that sort of
> problem is a fact of life for us. My objection is from experience
> of building codebases which use hierarchical concepts and how they
> invariably are more difficult to work with at scale than
> non-hierarchical ones. That's not to say they don't use tree
> algorithms underneath (e.g. btrees in a database) but at the
> interface level hierarchies normally get in the way.
>
> Take your idea that some "external" metadata will be inside the
> files in a supergroup. I predict that some of the same class of
> metadata will end up in some other part of the system: e.g. in a
> database or a metadata file. The code now needs to deal with
> multiple heterogeneous sources of the same metadata. It needs
> rules for deciding which source takes precedence, etc.

Keeping "external" metadata (like ensemble weights) as metadata in the
supergroup is only one approach to extending CF to handle powerful new
"ensemble" use-cases. Conventions could minimize this problem.
But no one has yet suggested a "flat-file" method of handling this
use-case (Point 3). The supergroup solution is relatively clean
compared to any flat-file solution I can imagine. Proliferation of
"external" metadata is a valid concern though, in either case,
"flat" or "hierarchical".
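
A minimal sketch of the supergroup idea, using an in-memory stand-in
for a grouped file rather than any real netCDF API. The `group` helper,
the `ensemble_weights` attribute name, and the equal weights are all
assumptions made for illustration; only the /cesm and /cesm/cesm_??
layout comes from the discussion above.

```python
from fnmatch import fnmatchcase


# Hypothetical stand-in for a grouped file: each group carries its own
# attributes ("attrs") and its child groups ("groups").
def group(attrs=None, groups=None):
    return {"attrs": dict(attrs or {}), "groups": dict(groups or {})}


root = group(groups={
    "cesm": group(groups={
        "cesm_01": group({"realization": 1}),
        "cesm_02": group({"realization": 2}),
        "cesm_03": group({"realization": 3}),
    }),
})


def members(supergroup, pattern):
    """Sorted names of child groups matching a shell-style pattern."""
    return sorted(n for n in supergroup["groups"] if fnmatchcase(n, pattern))


cesm = root["groups"]["cesm"]
names = members(cesm, "cesm_??")
ensemble_size = len(names)  # ensemble axis size found by traversal alone

# "External" metadata is stored on the supergroup; the member groups are
# left untouched. Equal weights are used here only as a placeholder.
cesm["attrs"]["ensemble_weights"] = {n: 1.0 / ensemble_size for n in names}
```

The member groups keep exactly the attributes they started with, so
adding or revising the weights never requires rewriting the original
data.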

>> That's a fact---not a criticism---because who else reads this list?
>> Once users have their data, they don't spend so much time looking at
>> catalogs and web services. They massage the data they have obtained.
>> As files. Lots and lots of flat files. Many users, and some
>> distributors, would appreciate a tidier solution.
>>
>> We could start a whole thread on the use case of "return all the data
>> sets relevant to this bounding box". Practically impossible with
>> flat files because of all the grids that could be involved. Yet quite
>> plausible with groups. However, this case does not interest me as much
>> as MME. Please be aware of the general issue of file proliferation,
>> though.
>
> For the end user, who has complete autonomy over how the data is
> grouped this makes sense but not for those of us holding data
> where we have to serve multiple views of the data. On the server
> side we would use spatial indexing to solve the bounding box
> problem. I think that solution is far superior to putting the
> actual data into groups. I would hope in time those sort of tools
> will be available to users.

Again, I think providers should serve the data in the most useful form.
Give the user some options when "most useful" is debatable.
Choose a default and let the user override it.

> As a passing jibe: you also seem influenced by your current set of
> tools that happen to be group-aware (nco). Everyone's tools must
> keep moving forward. Maybe we could work on a spatial index
> format?

Anyone with a superior solution should step forward and explain why
their solution to "ensembles" or "multi-grid" datasets beats what's
on the table.

>>> A priori grouping is the same as deciding a priori on the most efficient
>>> axes of analysis. In general, for climate data analysis (and for us at
>>> BADC with very diverse user groupings) this approach would be very
>>> inefficient ...
>>
>> In other words, ability to aggregate, dismember, flatten, invert, and
>> reorder hierarchies easily without loss of information is desirable.
>> Fully agree.
>>
>>> If you want to avoid relying on ls/grep and friends, then why not have a
>>> short line text file, a manifest? So we only need to alter the manifest.
>>>
>>> (Incidentally that's the same solution that NASA should be using.)
>>
>> No argument from me that that can work well for providers.
>> However, many users rely on typing commands to analyze datasets on
>> their local machine which have been previously downloaded or produced
>> locally. I was thinking more of them. When information needed for an
>> analysis is not in the file, then the file is incomplete.
>> Self-describing datasets are great.
>> Self-describing and complete datasets are even better.
>>
>>> IMHO it's a category error to push everything into internal file
>>> metadata. The suitcase metaphor is exactly appropriate for this
>>> approach, since the reason for doing it is to transfer stuff you think
>>> belongs together. However, five minutes after getting home, you empty
>>> the suitcase .... and reorganise it. At petascale, we can't afford to do
>>> that!
>>
>> Pack a suitcase when it makes sense.
>> Users will let you know when you've packed well by their silence.
>> Or poorly by their complaints :)
>>
>> Finally, your response does not indicate how/if you would solve Point 3
>> of my "compelling ensemble" example, e.g., weights. How would you
>> attach data that is "external" to a dataset to the dataset without
>> running into all the problems of a priori writing?

-- 
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine 949-891-2429 )'(

Received on Tue Sep 24 2013 - 23:12:22 BST
