[CF-metadata] some concerns about the "ensemble axis" proposal

From: V. Balaji <V.Balaji>
Date: Wed, 28 Feb 2007 16:41:41 -0500 (EST)

Francisco Doblas-Reyes writes:

> Dear Steve and Balaji,
>
> Thanks for bringing back the issue of CF and the ensemble axis.

Paco, Thank you for your patience. I'm sorry this is taking a while to
settle, but I think it's a key new feature and we needed some time to
digest the implications. I really want to come up with a solution to
this problem, as GFDL is ready with datasets to submit to ENSEMBLES,
and we are mainly waiting to finalize the issue of what metadata to
place in the netCDF files.

> The original proposal concerned multi-forecast system ensembles. This
> includes initial-condition ensembles, perturbed-parameter ensembles and
> multi-models. It is likely that the first two systems have the same grid in
> all the forecasts, because they would be generated by the same model version.
> I wouldn't call these examples a limited sub-case.
> Solving the question of how to handle multiple grids in the same file before
> introducing the ensemble dimension would be ideal, but in the meantime the
> dissemination of standard NetCDF files with all sorts of ensembles forecasts
> is limited.

I understand that in the first two cases, initial-condition and
perturbed-parameter, it is very likely that all the models will be on
the same grid, and have the same provenance (though that "very likely"
might be changing as we do more and more cross-model forecast
ensembles...). In the last case, multi-model or heterogeneous
ensembles, the models might be quite different. I'd like to think of
the ENSEMBLES project as an example of the first,
climateprediction.net as an example of the second, and IPCC AR4 as an
example of the third, to see how we might want to represent ensembles
in these three instances.

For each of these three examples, let's take a look at something you
might want to do by way of analyzing the ensemble data.

- For the first case, the question is something like, "Is the ensemble
   mean forecast of some quantity better (i.e., lower global RMS error)
   than any individual forecast?" Such a question can clearly be well
   answered using an array t(x,y,z,t,n) where n is the ensemble ID, if
   all the ensemble members are to be included, they're all on the same
   grid, and so on. But suppose the question becomes more complicated,
   e.g., "can we further improve the skill by omitting from the ensemble
   the 5 members whose _individual_ RMS error is the worst?" At this
   point you are eliminating some random selection of points along the
   ensemble axis. This operation is no longer "hyperslabbing" but
   "subsetting" in some more general sense.

   There are two other technical issues with this approach that also
   worry me. One is that netCDF-3 only supports one UNLIMITED
   dimension. If that's the time axis, is the number of ensemble
   members going to be fixed at initialization? Is that practical?

   The second, related issue is that of file size. One record, i.e., a
   single slice along the UNLIMITED axis of a single variable, is stored
   as one contiguous hypercube of all the other dimensions. As of
   netCDF 3.6, this record can be no larger than 4GB. Is it practical to
   work within this limit for the ensemble experiments we are
   considering, if there were an ensemble axis?

- In the second case (climateprediction.net) the question is something
   like, "what range of parameter P among the 100,000-member ensemble
   delivers the maximum climate sensitivity"? Again, the answer is
   achieved not by hyperslabbing, but by subsetting based on some
   outcome in an analysis.

- The third case is like the second, only more so...: a standard query
   we've been using in our thought experiments is "give me the surface
   temperature field from all 20C3M scenarios in the IPCC AR4 archive
   where volcanoes are turned OFF." This involves some fairly
   sophisticated SQL-type processing (and in fact requires 'human
   reasoning agents' at present...:-), which I honestly believe is
   beyond the scope of what can be stored in simple netCDF file
   headers.
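To put rough numbers on the record-size worry raised under the first
case, here is a back-of-envelope sketch. The grid dimensions are
hypothetical (a 1-degree grid with 50 levels), the values are assumed
to be 4-byte floats, and the limit is taken as 2**32 bytes as a
stand-in for "4GB"; none of these figures come from an actual
ENSEMBLES dataset.

```python
# Back-of-envelope check of one record's size when an ensemble axis is added.
# All grid sizes below are hypothetical; 4 bytes assumes a float (NC_FLOAT) variable.

BYTES_PER_VALUE = 4
RECORD_LIMIT = 2**32  # assumption: "4GB" taken as 2**32 bytes


def record_bytes(nx, ny, nz, n_members):
    """Size of one time slice of t(x,y,z,t,n): every dimension except time."""
    return nx * ny * nz * n_members * BYTES_PER_VALUE


def max_members(nx, ny, nz):
    """Largest ensemble size whose single record still fits under the limit."""
    return RECORD_LIMIT // record_bytes(nx, ny, nz, 1)


def fits(nx, ny, nz, n_members):
    """True if one record stays within the assumed limit."""
    return record_bytes(nx, ny, nz, n_members) <= RECORD_LIMIT


per_member = record_bytes(360, 180, 50, 1)   # bytes for one member, one time step
ceiling = max_members(360, 180, 50)          # members per record before overflow
print(per_member, ceiling, fits(360, 180, 50, 100_000))
```

Under these assumptions a few hundred members fit in one record, while
a climateprediction.net-sized ensemble would not come close.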

Paco, I'm worried about the current proposal for the reasons outlined.
What I think I've learnt is that a 'dimension' is useful for
'hyperslabbing'. Ensemble processing is usually accompanied by
'subsetting', which is much more general than 'hyperslabbing' and may
involve processing of many layers of metadata. That metadata very
likely lives in some higher-level relational model whose 'records' may
not map very well onto netCDF's current model of a 'record'.
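
As an illustration of the distinction, here is a small sketch in plain
Python. The RMS error values are made up, and the helper names
(hyperslab, drop_worst, is_hyperslab) are hypothetical, not part of
any netCDF API. A hyperslab is described by a (start, stop, stride)
triple; dropping the worst members by RMS error generally leaves an
index set that no such triple can describe.

```python
# Hypothetical per-member global RMS errors, indexed by ensemble ID n.
rms_error = [0.9, 1.4, 0.7, 2.1, 1.1, 0.8, 1.9, 1.0]


def hyperslab(start, stop, stride=1):
    """Indices picked by a (start, stop, stride) triple along the ensemble axis."""
    return list(range(start, stop, stride))


def drop_worst(errors, k):
    """Indices of the members kept after discarding the k largest-error members."""
    ranked = sorted(range(len(errors)), key=lambda n: errors[n], reverse=True)
    worst = set(ranked[:k])
    return [n for n in range(len(errors)) if n not in worst]


def is_hyperslab(indices):
    """True if the indices form one regularly strided run."""
    if len(indices) < 3:
        return True
    step = indices[1] - indices[0]
    return all(b - a == step for a, b in zip(indices, indices[1:]))


first_five = hyperslab(0, 5)       # a contiguous range along the ensemble axis
kept = drop_worst(rms_error, 3)    # selection driven by an analysis result
print(first_five, is_hyperslab(first_five))
print(kept, is_hyperslab(kept))
```

The second selection depends on the data themselves, which is the kind
of operation I am calling 'subsetting'.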

I propose, instead, something more minimalist. IPCC AR4 requires two
global attributes, 'experiment' and 'realization'. Could I suggest
that for the moment we use something like this: two levels of
nomenclature, first a name that identifies the collection or ensemble
to which the dataset belongs, and second, a unique ID for each member
within that ensemble. (The 'realization' or 'ensemble_member' string
itself does not have to be unique, as we can combine this with other
provenance information to generate a unique ensemble member ID).
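
A minimal sketch of how that could work in practice, assuming the
global attributes have already been read from each file header into a
dictionary. The file names, the 'institution' and 'source' values, and
the way the member ID is composed are all hypothetical; only
'experiment' and 'realization' come from the IPCC AR4 requirements
mentioned above.

```python
from collections import defaultdict

# Hypothetical global attributes, one dict per netCDF file header.
datasets = [
    {"file": "a.nc", "institution": "centre_X", "source": "model_A",
     "experiment": "20C3M", "realization": 1},
    {"file": "b.nc", "institution": "centre_X", "source": "model_A",
     "experiment": "20C3M", "realization": 2},
    {"file": "c.nc", "institution": "centre_Y", "source": "model_B",
     "experiment": "20C3M", "realization": 1},
    {"file": "d.nc", "institution": "centre_Y", "source": "model_B",
     "experiment": "control", "realization": 1},
]


def member_id(attrs):
    """Combine 'realization' with provenance attributes into a unique ID (assumed scheme)."""
    keys = ("institution", "source", "experiment", "realization")
    return "/".join(str(attrs[k]) for k in keys)


def group_by_experiment(records):
    """Map each 'experiment' value to the member IDs that belong to it."""
    ensembles = defaultdict(list)
    for attrs in records:
        ensembles[attrs["experiment"]].append(member_id(attrs))
    return dict(ensembles)


ensembles = group_by_experiment(datasets)
for name, members in ensembles.items():
    print(name, members)
```

Note that 'realization' 1 appears three times here, yet the combined
IDs remain distinct.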

> My message is that there is a demand for standardized NetCDF ensemble
> forecast files. Shouldn't the adoption of a CF standard to write those files
> depend on when an alternative, more adequate solution is available?

I really believe that at this point, this is as far as we can go. In a
few years, with more experience of routinely working with multi-model
ensembles we can invent a better way of expressing ourselves. Would
the proposal above satisfy in some fashion the demand for an 'ensemble
forecast dataset (note: not file)'? I believe a THREDDS catalogue can
be configured to permit the kinds of aggregation you have implemented,
working off combinations of global attribute settings to generate an
ensemble dataset.

Thanks,
-- 
V. Balaji                               Office:  +1-609-452-6516
Head, Modeling Systems Group, GFDL      Home:    +1-212-253-6662
Princeton University                    Email: v.balaji at noaa.gov
Received on Wed Feb 28 2007 - 14:41:41 GMT
