Hi Jamie et al.,
Continuing this long thread ...
(aside: This thread may be a good guinea pig to use as a first "issue"
to track with the CF Trac system at PCMDI.)
Bryan Lawrence and I had a chance to talk this over at some length at a
meeting earlier this week. Below I'll use the term "ensemble" in the
usual way and the term "forecast collection" to refer to the series of
outputs generated as successive forecast model runs occur. The high
points of what Bryan and I agreed on were:
1. Use a 5th dimension for the ensemble elements:
The benefits of using an extra netCDF dimension to capture the
semantics of the "ensemble" are probably sufficient to justify
the cost (to software developers and interoperability). This
argues for the use of a fifth dimension -- (realization,t,z,y,x) --
as outlined in earlier emails. (See the sketch just after this list.)
2. Do not use an additional dimension for the model_run_time sequence:
The benefits of using an extra netCDF dimension to capture the
semantics of the "forecast collection" do not seem to justify the
cost (to software developers and interoperability), and a good
alternative is available. This argues against the use of a sixth
dimension: (realization,model_run_time,forecasted_offset,z,y,x).
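As a concrete illustration of point 1, here is a minimal sketch of what
such a file could look like, written with the netCDF4-python library.
This is just a sketch under my own assumptions -- the file name, the
dimension sizes, and the details of the "realization" coordinate are
hypothetical, not anything agreed on this list:

    from netCDF4 import Dataset
    import numpy as np

    # Minimal sketch of the proposed 5-D layout (names hypothetical).
    nc = Dataset("ensemble_forecast.nc", "w")
    for name, size in [("realization", 5), ("time", 4),
                       ("lev", 3), ("lat", 10), ("lon", 10)]:
        nc.createDimension(name, size)

    # A coordinate variable labels the ensemble members; any richer
    # per-member metadata would hang off this same dimension.
    # (time/lev/lat/lon coordinate variables omitted for brevity.)
    realization = nc.createVariable("realization", "i4", ("realization",))
    realization[:] = np.arange(5)
    realization.long_name = "ensemble member number"

    temp = nc.createVariable("air_temperature", "f4",
                             ("realization", "time", "lev", "lat", "lon"))
    temp.standard_name = "air_temperature"
    temp.units = "K"
    temp[:] = np.zeros(temp.shape, dtype="f4")
    nc.close()

Note that a reader slicing at any fixed realization index recovers
exactly the 4D (t,z,y,x) object that today's CF applications
understand; the interoperability cost is in getting them to do that
slice at all.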
Bryan's reasons for rejecting the use of an additional dimension to
capture the forecast collection may differ from mine. Here are three
arguments I would give for rejecting it. Individually, any one of them
might or might not carry the day, but collectively I think they make a
compelling argument:
1. Forecast aggregations are not good candidates to put into a "file"
The forecast collection has a slippery character that doesn't map
onto the semantics of a netCDF dimension in the normal way. The
proposed 6th dimension is really a stream, rather than a
fixed-length array. Every (say) 6 hours a new forecast series is
produced. Typically a forecast collection that is served on-line
is limited in duration by the amount of spinning disk storage; a
fixed number of forecasts is kept on line, with new forecasts
continually added and the oldest dropped. None of this
maps well onto a "file" as a unit of storage. The
forecast collection is more naturally some form of multi-file
aggregation where external metadata are needed to glue it together.
Another very significant benefit of using an aggregation technique
is that files of other formats, say GRIB, may be similarly
aggregated and served through OPeNDAP (typically not fully
transparent wrt netCDF CF files, but often good enough).
2. There is no articulated requirement for a true two-dimensional
approach to the dual time axis -- model_run_time X
forecasted_date. Instead there are three known one-dimensional
paths of interest through this 2D space:
1. time series of forecasted_dates at a fixed model_run_time
2. time series of model_run_time at a fixed offset on the
forecasted_dates axis
3. time series of successively closer forecasts of a fixed time
endpoint (the Z48,Z36,Z24,Z12,Z0 sequence of increasingly
realistic forecasted fields)
* Granted, there might be something interesting to be
learned by looking at a 2D contour plot of forecasted_offset
vs model_run_time. But is this a requirement that drives us?
(A sketch of these three 1D paths follows this list.)
3. There is already (BETA) software that creates the three types of
aggregations described in point 2 just above from netCDF files
and serves them through OPeNDAP. Maybe through the Java netCDF
API, too ... not sure. It should be tested and evaluated for the
completeness (or not) of the solution that it provides before
modifications to CF are considered. I've attached John Caron's
announcement of this from Sept. 6. There's also a spec for the
forecast aggregation capabilities; my link to that document is
dead, but I imagine John Caron can supply it.
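To make point 2 above concrete, here is a minimal sketch of the three
1D paths, using plain numpy indexing on an in-memory stand-in for the
forecast collection. The array shape and the assumption that runs and
forecast offsets share a 12-hour spacing are mine, purely for
illustration:

    import numpy as np

    # Stand-in for a forecast collection: 8 model runs issued every
    # 12 hours, each with forecasts at offsets 0,12,24,36,48 hours,
    # on a tiny grid. Axes: (model_run_time, forecasted_offset, y, x).
    n_runs, n_offsets, ny, nx = 8, 5, 10, 10
    temp = np.random.rand(n_runs, n_offsets, ny, nx)

    # Path 1: forecasted_dates at a fixed model_run_time (one run).
    latest_run = temp[-1, :, :, :]

    # Path 2: model_run_time series at a fixed forecast offset
    # (e.g. every 24-hour forecast ever issued).
    all_24h = temp[:, 2, :, :]

    # Path 3: successively closer forecasts of one fixed endpoint --
    # the Z48,Z36,Z24,Z12,Z0 sequence. With runs and offsets on the
    # same 12-hour spacing, valid_time = run_time + offset is constant
    # along an anti-diagonal of the (run, offset) plane.
    k = 4  # (hypothetical) index of the target valid time
    zeroing_in = np.stack([temp[k - j, j, :, :]
                           for j in range(n_offsets) if 0 <= k - j < n_runs])

The point is that each path is a simple slice (or anti-diagonal walk)
through the 2D time space, so an aggregation server can expose all
three without the files themselves ever carrying a 6th dimension.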
- Steve
========================
Kettleborough, Jamie wrote:
> Hello Steve,
>
> I appreciate your concerns on stability and the required complexity for
> CF interpreting code - but I think there is a danger in this case that
> if we don't accommodate ensembles in the right way then we are limiting
> CF's applicability to forecasts (operational or climate).
>
> Yes I agree ensemble is not a coordinate in the same sense as time or
> space - there is no obvious unique label or metric. But I think it
> makes a lot of sense to treat it as a coordinate/dimension - it
> facilitates using the data to make probabilistic forecasts.
>
> Could we live with the grouping approach - literally we could - but I
> don't think it's the best way forward. I think if you simply want to
> compare output from different models then the grouping approach is
> probably OK, but I think we are moving beyond that kind of analysis (and
> operational forecasting is probably ahead of climate from this point of
> view) to an analysis where _the ensemble is the forecast entity_,
> not the individual model.
>
> So I think if you want your analysis tool to be useful in the
> forecasting context then there is development effort needed here however
> we represent ensembles on disk (and from what Jennifer has said this
> work is already underway in GRADS). (Of course there are many
> applications where forecast isn't the be all and end all and you can say
> your analysis tool is made for this set of applications, not
> forecasting). As Bryan pointed out, CF is not intrinsically 4D - it is
> just that it currently makes the space and time dims special. Shouldn't CF-
> aware applications be able to deal with arbitrary dimensionality?
>
> I also suspect for large ensembles (100+ members?) the grouping approach
> could be inefficient I/O-wise. But I haven't had a chance to play with
> NetCDF4 yet so I could be wrong.
>
> I think the current situation with this is
>
> 1) should we use grouping or dimensions for ensembles (I think dimension
> is what we currently have)
>
> 2) How do we label ensemble meta data?
> a) standard_names, or
> b) external_dictionary
>
> I think either could work - though external_dictionary seems to give
> more scope for coping with different levels of meta data you might
> choose to associate with ensembles. And it preserves the general
> applicability of the vague names like 'source' as global attributes etc.
> (sorry this statement assumes context knowledge of the rest of this
> thread). (are there other implications of accepting
> external_dictionary?) Single model files would need to include the meta
> data coordinates as singleton dimensions - but I think that is OK. I
> _think_ this also makes aggregation along realization more like
> aggregation along time or level.
>
> 3) Is there any common ground for ensemble meta data (across
> forecast etc. timescales) that we can standardise? i.e. return to Paco's
> list.
>
> 4) Would CF act as the standard body for these - or should we be
> lobbying for someone else to do it?
>
> (Have I missed anything in my eagerness to try and keep this thread
> moving?)
>
> Jamie
>
> On Wed, 2006-11-01 at 09:26 -0800, Steve Hankin wrote:
>
>> Bryan Lawrence wrote:
>>
>>> On Tue, 2006-10-31 at 09:32 -0800, Steve Hankin wrote:
>>>
>>>
>>>
>>>> Like others I've watched this email thread grow like Topsy, wondering
>>>> if I would find the time to read it ... But the discussion topic seems
>>>> big enough to warrant a fair amount of rough and tumble.
>>>>
>>>>
>>> Absolutely. Bring it on :-) :-)
>>>
>>> With regard to the stability of CF and long term evolution, that's
>>> what's behind my wanting to separate maintaining descriptions of the
>>> quantities measured/predicted from the descriptions of how/why it was
>>> done. It's also why I want just one way of doing it, not one that is
>>> specially optimized for numerical models per se but not of wider
>>> applicability.
>>>
>>>
>>>
>>>
>>>> OK. Enuf preamble. In a nutshell, the new proposed structures seem
>>>> to capture the semantics of the various collections of model outputs
>>>> -- ensembles and forecast collections -- through the addition of new
>>>> dimensions. In the most extreme case this dimension list might become
>>>> (realization,forecast_reference_time,forecast_period,lev,lat,lon).
>>>>
>>>> I'd pose two questions:
>>>> 1. Will this approach break existing CF applications? If yes, is
>>>> that a red flag to consider other options?
>>>>
>>>>
>>> I don't believe there is any restriction on number of dimensions in CF,
>>> so while it may not be pretty, it seems ok to me.
>>>
>>>
>>>> 1. Is this the same approach that we would take if we already had
>>>> netCDF 4? If no, is that a red flag that we should give more
>>>> thought to the long-term stability of the standard?
>>>>
>>>>
>>> Let's not think about netCDF4 immediately :-) One of the other things we
>>> discussed was trying to divorce the content standard from the
>>> implementation standard ...
>>>
>>>
>>>
>> Bryan,
>>
>> Let me start by saying with full sincerity that I am not arguing for
>> any particular conclusion. I see major trade-offs to both options --
>> adding new dimensions and juggling external metadata. But ...
>>
>> With very little discussion (just above -- granting there are elements
>> added below) you have waved away the type of questions that I'd argue
>> we need to ask ourselves continuously. Standards are always about
>> making compromises. It is right to begin with the approach you are
>> insisting on -- divorce the abstract understanding of the problem from
>> the messy technology. But the next step has to be to ask what the
>> undesirable impacts of your choices might be. And then to make
>> changes to your thinking and accept compromises. Without stability CF
>> is not a standard at all.
>>
>> Regarding no. 1 -- the choice to utilize a 5 or 6 D encoding -- the
>> consequence will be that the great majority of existing applications
>> will be unable to read the new files at all without significant
>> modifications. Inside of those new files will be 4D subsets that
>> represent the current "CF nut" -- objects that the applications can
>> currently read. So we have a large net loss of interoperability until
>> additional investments are made in software across our community.
>>
>> Regarding no. 2 -- the CF community is not more than 2 years from
>> where discussions of netCDF 4 will occupy a significant part of our
>> time and energy (off the cuff estimate). So in proposing concrete
>> changes today without fully considering how you would handle them in
>> netCDF 4, you are inviting two major sets of changes in as many years.
>> That's a very low standard of stability.
>>
>> So what are the compromises to be looked at? How would your data
>> modeling of ensembles and forecasts look different if you think in
>> terms of dimensions (an array of identical objects) versus
>> "groups" (unordered lists which may have heterogeneous contents) ?
>> The former is a more perfect fit to the concepts -- the latter is a
>> more general structure that fully embraces the needed concepts.
>>
>> The question is how great are the negative impacts in stability and
>> interoperability that you are willing to accept in order to work with
>> the more perfect data model? This is a balancing question where the
>> "abstract data modeler" view represents one of the polarized
>> positions. The entire substance of your arguments so far seems to
>> come from this viewpoint. We acknowledge that CF discussions have not
>> found a balance between provider and user viewpoints in the past.
>> How do we improve upon that?
>>
>> - Steve
>>
>>
>>> But your substantive point is fair enough, is there a cleaner way to
>>> think about this ...?
>>>
>>> Particularly from the point of view that a set of simulations comprising
>>> a forecast IS THE forecast (singular), then I think at least
>>> (realization,t,z,y,x) is inescapable. I don't like the other example
>>> (realization,ref_time,period,z,y,x), but accept that it may well be
>>> the natural output of a set of aggregations. Is it not what Thredds
>>> would deliver anyway from an aggregation? Which brings me to my last
>>> point:
>>>
>>> As far as Thredds as a stop-gap solution goes: I think we need to
>>> divorce how we interact with the data from how we manage it
>>> (particularly for posterity). There is no way that I'm going to rely on
>>> ANY *interface* to preserve information content, so what you're really
>>> saying is that you want to rely on metadata held externally from the
>>> files (whether Thredds or not). To some extent that's inescapable (which
>>> was one of my points in an earlier email), but we want to stick to the
>>> requirement that CF can differentiate data (if not fully describe all
>>> the ancillary information) ... and for forecast ensembles (and station
>>> data) there is a need for some extra information over and above the
>>> index value in a "special" dimension. It's that information we need to
>>> get into the CF content standard.
>>>
>>> Also, from an operational perspective, ok, so maybe I can use Thredds or
>>> any interface to get some data, but then I've got it. What then? The
>>> content standard has to tell me what's in it, so we're back to CF and
>>> possibly pointers to external metadata.
>>>
>>> (Wrt netcdf4: I see groups as being more useful for aggregating things
>>> that don't share common dimensionality)
>>>
>>> Cheers
>>> Bryan
>>>
>>>
>>>
>> --
>>
>> Steve Hankin, NOAA/PMEL -- Steven.C.Hankin at noaa.gov
>> 7600 Sand Point Way NE, Seattle, WA 98115-0070
>> ph. (206) 526-6080, FAX (206) 526-6744
--
Steve Hankin, NOAA/PMEL -- Steven.C.Hankin at noaa.gov
7600 Sand Point Way NE, Seattle, WA 98115-0070
ph. (206) 526-6080, FAX (206) 526-6744
-------------- next part --------------
An embedded message was scrubbed...
From: John Caron <caron at unidata.ucar.edu>
Subject: announce experimental TDS feature: "Forecast Model Run Collection"
Date: Wed, 06 Sep 2006 17:25:50 -0600
Size: 3464
URL: <http://mailman.cgd.ucar.edu/pipermail/cf-metadata/attachments/20061109/93a09c8d/attachment.mht>
Received on Thu Nov 09 2006 - 21:33:37 GMT