
[CF-metadata] some concerns about the "ensemble axis" proposal

From: Jon Blower <jdb>
Date: Wed, 7 Mar 2007 19:03:44 +0000

Hi all,

As one who previously expressed reservations about the "ensemble axis"
proposal I thought I should chip in again to say that I've become
persuaded by Bryan and Jonathan's arguments and I now think the
ensemble axis is a good (and very probably necessary) idea. It does
of course have to work both within single files and across multiple
files (which can be aggregated).

Files that don't have an ensemble axis already specified can no doubt
be marked up using NcML (or similar) to generate an ensemble axis that
will be read transparently by tools (e.g. the Java NetCDF libraries).
However, I still haven't quite got my head around what happens if you
want the ensemble axis to be "unlimited", when another axis (commonly
time) might also be unlimited. Here I am thinking aloud about a
possible solution:

What we normally call an "axis" or "dimension" roughly maps onto the
concept of an array in a programming language. Arrays have a natural
order of their elements and each element is referenced by a numeric
index, starting from 0 (or 1) and going up to n-1 (or n), with no
gaps. In order to model an ensemble axis in this way we would need to
artificially designate a numeric index to each member, which has no
real meaning other than to identify it in the array of ensemble
members. The apparent order of members in this array is meaningless.

Since ensemble members have no natural order could they be modelled
more like a Hashtable (in Java terminology) or dictionary (Python)?
Each ensemble member could be referenced by a unique key (which may be
numeric or may be a string) but there is no natural ordering. I can
see that this could work both within a file and across multiple files.
The use of hashes rather than arrays would make it clear that it's
meaningless to subset the ensemble "axis" in the same way that one
might subset longitude or time - this is more semantically accurate I
think.
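
To make the distinction concrete, here is a minimal sketch in Python. The member names and values are invented for illustration, and nothing here is part of NetCDF or CF; it only contrasts addressing members by an artificial index with addressing them by key:

```python
# Hypothetical ensemble of three members, each a short temperature series.
# Member names and values are invented for illustration only.

# Array model: members are addressed by an artificial index (0, 1, 2).
# The index has no meaning, yet a slice such as members_array[0:2]
# looks as legitimate as a slice of longitude or time.
members_array = [
    [280.1, 280.4],  # index 0
    [279.8, 280.0],  # index 1
    [281.2, 281.5],  # index 2
]

# Hash model: members are addressed by a unique key (string or number).
members_hash = {
    "run_a": [280.1, 280.4],
    "run_b": [279.8, 280.0],
    "run_c": [281.2, 281.5],
}


def select_members(ensemble, keys):
    """Pick members by explicit key; there is no notion of a key range."""
    return {key: ensemble[key] for key in keys}


subset = select_members(members_hash, ["run_a", "run_c"])
```

With the hash model the only way to take a subset is to name the members wanted, which matches the idea that a "range" of ensemble members has no meaning.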

I have no idea whether this breaks something fundamental in the NetCDF
data model (or even in CF), but this sounds fairly natural to me.
Certainly I imagine that if I were writing a tool to deal with
ensembles I would model them as hashes, not arrays.
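
As a rough sketch of how such a tool might combine members held in several files (Python again; `merge_ensembles`, the member keys and the values are all hypothetical, and reading the files themselves is left out), each file contributes a dictionary of members and the keys are simply merged:

```python
def merge_ensembles(*per_file_members):
    """Combine member dictionaries (one per file) into a single ensemble.

    Raises ValueError if the same member key appears in more than one file,
    since keys are assumed to identify members uniquely.
    """
    merged = {}
    for members in per_file_members:
        for key, data in members.items():
            if key in merged:
                raise ValueError(f"duplicate ensemble member key: {key!r}")
            merged[key] = data
    return merged


# Hypothetical contents of two files, already read into dictionaries.
file_1 = {"run_a": [280.1, 280.4], "run_b": [279.8, 280.0]}
file_2 = {"run_c": [281.2, 281.5]}

ensemble = merge_ensembles(file_1, file_2)

# Adding a member later is just another key in this in-memory model.
ensemble["run_d"] = [280.6, 280.9]
```

In this in-memory picture the question of an "unlimited" ensemble axis does not arise, although how that would map onto the NetCDF data model is exactly the open question.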

Jon

On 3/7/07, Bryan Lawrence <b.n.lawrence at rl.ac.uk> wrote:
>
> Steve
>
> I'm obviously not getting my main point across. Fine: build an
> aggregation server, it serves up what? A file? A sequence of files? I
> think the former? So regardless of what you did server side, as a client
> I'm going to get a file, and it may well have aggregated a number of
> ensemble members. I want that file to be CF compliant!
>
> So imagine Thredds serves me a temperature field for timestep 0 from ten
> ensemble members, which are from a multi-model ensemble - but
> fortunately they're on the same grid. How do we represent that? CF ought
> to be able to do that!
>
> Bryan
>
> On Wed, 2007-03-07 at 09:31 -0800, Steve Hankin wrote:
> >
> >
> > Bryan Lawrence wrote:
> > > Hi Folks, especially Balaji and Steve
> > >
> > > I'll make some general comments, and then take Balaji's questions.
> > >
> > > Firstly, Thredds has nothing to do with this issue, and that's my point
> > > from the November email, and which I was restating in reply to Steve's
> > > point. If we have to appeal to *any* external *software* package to
> > > define our metadata, then our convention is broken. (However, I have no
> > > problem with appealing to external *definitions* of internal
> > > identifiers.)
> > >
> > Hi Bryan et al.,
> > [Please accept my bowing and scraping in advance here both for
> > being long and wordy and for any appearance of being preachy.
> > Assertions like "THREDDS has nothing to do with this issue"
> > and "Aggregation servers are a red herring" illustrate such a
> > fundamental divergence in perceptions that I feel compelled to
> > go back to the history of CF and our fundamentals.]
> > It is wise that we have chosen to schedule a full day on CF for our
> > GO-ESSP meeting this summer, because we have some fundamental (and
> > probably difficult) issues to reach agreement on. For a very large
> > fraction of the CF community "files" were supplanted by aggregations
> > as the foundation of netCDF data management a very long time ago. In
> > Ferret this happened in 1995. GrADS I think was a little earlier.
> > CDAT and NCL probably around this time, too. I'm sure there have
> > been plenty of other application-specific aggregation solutions
> > developed, too ... also probably a long time ago.
> >
> > The fact that so many applications separately developed solutions is a
> > clear indication that aggregation is a fundamental need. I have to
> > assume, Bryan, that when you referred to aggregation servers as a red
> > herring your emphasis is on the "server" part (which I will get to).
> > Because the need for aggregation with CF data seems inescapable.
> >
> > The fact that so many applications separately developed solutions is
> > also a clear indication of inefficiency. Group after group developing
> > the same capability .... That's why it made so much sense when
> > Unidata released a fully integrated aggregation solution in 2001 ...
> > and many enhancements to it since then (the entire NcML framework).
> > For those lucky enough to be programming in Java, this was a
> > transparent solution for aggregating local files. For those using C
> > or C-dependent code the aggregations were available only through
> > OPeNDAP connections. Here we see a hole and the beginning of a
> > parting of the ways, since some parts of the community found OPeNDAP
> > to be excellent. Other parts found it unacceptable. Arguably it is
> > that parting of the ways that lies at the heart of our differing
> > outlooks on ensembles today. I would note, however, that at the last
> > GO-ESSP meeting we made a group commitment to the development of a C
> > library that will capture the richness of what Java access to netCDF
> > offers -- hopefully including the entire NcML framework for
> > aggregation.
> >
> > These dates -- 1995, 2001 -- are ancient history. Since then "service
> > oriented" concepts have come to dominate our discussions. (E.g.
> > server-side transformation capabilities on CF files are now bread and
> > butter discussion topics.) THREDDS is one example of a
> > service-oriented approach. Just as aggregation allowed us to replace
> > files with a higher order abstraction -- the dataset -- the service
> > oriented abstractions allow us to handle collections of datasets as
> > single entities. Some may like THREDDS. Others may not. Discussion
> > of the trade-offs ought to be happening. But it is an extreme
> > rhetorical stand to say that it has "nothing to do with the problem".
> >
> > The power of the service-oriented approach has everything to do with
> > the options available for handling ensembles. If we accept that
> > THREDDS-like metadata catalogs are within the scope of CF discussions,
> > then we open a huge and fertile domain for solving problems in ways
> > that are 100% harmonious with the current CF usage. If we reject the
> > discussion of catalogs, then we are forced to pile greater and greater
> > complexity into the body of our CF files, muddying fundamental CF
> > concepts and ultimately providing only a partial solution to the
> > ensemble problem.
> >
> > Conclusion: I reject the assertion that discussion of catalog level
> > metadata is out of bounds. One could as well argue that the ensemble
> > problem itself is out of bounds. When we confront fundamental new
> > problems in CF, we may need to introduce fundamental new tools. The
> > catalog is the natural place in the CF data model for the concept of
> > an ensemble to exist. Currently there has been no proposal placed on
> > the table to handle multi-grid ensembles. (Clearly THREDDS offers
> > some obvious directions). We're treating the multi-grid problem like
> > the elephant in the living room. Let's address it and see where that
> > leads our thinking.
> >
> > - Steve
> > > Secondly, netcdf4 is also a red herring, because folk have to use
> > > netcdf3 now, and will have to do so for a while to come. (Further, I
> > > can't seriously believe that on the one hand we have an argument that
> > > adding another axis is an engineering problem, but using a different API
> > > to the persistence format is not ... both ways, software will need
> > > adjustment, but in the former case we are working on top of a known and
> > > reliable persistence format. I can tell you for a fact that we won't be
> > > accepting netcdf4 data in 2007 for the BADC ... not because I don't
> > > like it, but because it has not yet got a track record!).
> > >
> > > So where does that leave us?
> > > * It leaves us with certain classes of ensemble data, that we have
> > > available a priori (i.e. at file writing time), and that can be stored
> > > in files in a certain way, and these are the ones that we are proposing
> > > a solution for. These work fine with the proposed solution.
> > > * there are also classes of ensemble data that we might want to
> > > aggregate a posteriori (i.e. we don't have them at file write time, or that
> > > cannot be stored into an array which has the same underlying coordinate
> > > system). Well frankly, how is that different from *any* other existing
> > > situation? (The Unified Model 4.5 had P and UV on different grids, but
> > > we can still put them in the same file, I could even put them in the
> > > same file with an ensemble axis for each). I can always find an example
> > > where I want to add more data to a file later which isn't in the time
> > > dimension (so I rewrite the file). Ensembles are simply not special in
> > > this regard!
> > >
> > > (Aggregation servers are a red herring, in the final analysis, what I
> > > get from aggregation servers are files, so let's care about the
> > > persistence format, not the interface definition).
> > >
> > >
> > > > Is the ensemble axis static? (i.e. not UNLIMITED)? What happens if I want
> > > > to increase the size of an ensemble later? (We recently added 2 members
> > > > to a 3-member initial-condition ensemble we've submitted to IPCC AR4).
> > > >
> > >
> > > So rewrite the data, but nobody is saying you *have* to have ensembles
> > > all in one file, just as you don't rely on having all the variables in
> > > one file. In the latter case, for sure one needs external information to
> > > make the links, but let's not appeal to any *specific* software to do
> > > it.
> > >
> > >
> > > > For the kinds of ensembles we have in mind, can we stay within the file
> > > > size limits?
> > > >
> > >
> > > No one is arguing that 1 file = 1 dataset.
> > >
> > >
> > > > I certainly wasn't meaning to suggest, or even imply, any software
> > > > choices or aggregation methods to go along with this. This is a comment
> > > > about metadata only.
> > > >
> > >
> > > Fair enough, I agree with your perspective.
> > >
> > > Cheers
> > > Bryan
> > >
> > > _______________________________________________
> > > CF-metadata mailing list
> > > CF-metadata at cgd.ucar.edu
> > > http://www.cgd.ucar.edu/mailman/listinfo/cf-metadata
> > >
> >
> > --
> >
> > Steve Hankin, NOAA/PMEL -- Steven.C.Hankin at noaa.gov
> > 7600 Sand Point Way NE, Seattle, WA 98115-0070
> > ph. (206) 526-6080, FAX (206) 526-6744
>


-- 
--------------------------------------------------------------
Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
Technical Director         Tel: +44 118 378 8741 (ESSC)
Reading e-Science Centre   Fax: +44 118 378 6413
ESSC                       Email: jdb at mail.nerc-essc.ac.uk
University of Reading
3 Earley Gate
Reading RG6 6AL, UK
--------------------------------------------------------------
Received on Wed Mar 07 2007 - 12:03:44 GMT

