Hi CF geeks (consider this a term of endearment from a fellow CF'er),

It looks like this one might come down to a vote. To prepare for
such a contingency, I've just reviewed the recent emails and arguments
about the various approaches. I really do think that either the most
recent proposal of Jonathan's, or Brian's alternative of Feb. 11 (stemming
from Russ and John's suggestion) would be o.k. I'm leaning
toward favoring Brian's approach, but I'm a little concerned about
the proliferation of attributes because this may make it more difficult
for new users to get started in learning CF (e.g., the length of
Appendix A could grow considerably if enough different projections
were defined).

I do think we should hear Jonathan out, so I don't think we should vote
yet.

Note to Brian: I noticed that "grid_mapping" has yet to be included in
Appendix 1 of the documentation.

cheers,
Karl

Russ Rew wrote:
>
> Hi Jonathan,
>
> Sorry, but I'm not convinced.
>
> Readability of a CDL file should be secondary to its accuracy and
> maintainability. And CDL is not really the issue, since the resulting
> netCDF file could be translated into an XML form such as NcML or any
> other lossless representation that would still have a redundant copy
> of the shared grid_mapping information for each variable defined on
> the same grid. Duplication of information can be beneficial when it
> prevents errors, as in declaring variable types even when the type of
> the variable may be inferred from its usage. But the minor
> convenience of keeping a copy of the information in text form close to
> each variable is of dubious value compared to the redundancy, identity
> issues, potential inconsistency (update anomalies), insertion
> anomalies, and deletion anomalies this design permits.
>
> Identity issues: Is the mapping for T given by
>
> T:grid_mapping="rotated_latitude_longitude ",
>   "grid_north_pole_latitude: nplat grid_north_pole_longitude: nplon";
>
> the same as the mapping for S specified with
>
> S:grid_mapping="rotated_latitude_longitude ",
>   "grid_north_pole_longitude: nplon grid_north_pole_latitude: nplat";
>
> ? The only difference is in the order of the (keyword: value) pairs,
> which shouldn't be significant, but it's difficult for a human reader
> to tell these are the same, and the difficulty of comparing them for a
> human increases with the square of the number of keywords. You could
> define a canonical form for each grid_mapping, perhaps requiring that
> the keywords appear in alphabetical order, but this added complexity
> is not needed when grid_mappings are identified just by having the
> same name for the attribute.
>
> This also occurs in merging information from two files, where the
> variables may use the same grid mapping represented by keywords in
> different orders. Using a single variable to represent a grid mapping
> doesn't have this problem, since it is easy to compare two variables
> to see if they have the same attributes.
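
The identity issue can be made concrete with a minimal Python sketch. It assumes the attribute value is a mapping name followed by whitespace-separated "keyword: value" pairs, as in the T and S examples; the function names `parse_grid_mapping` and `same_mapping` are hypothetical and not part of any CF library. Two strings that differ only in keyword order are unequal as text but equal once parsed:

```python
def parse_grid_mapping(text):
    """Split 'mapping_name key1: val1 key2: val2' into (name, {key: val}).

    Assumes whitespace-separated tokens and a trailing colon on each keyword.
    """
    tokens = text.split()
    name, rest = tokens[0], tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected 'keyword: value' pairs after the mapping name")
    params = {}
    for key, value in zip(rest[0::2], rest[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"keyword {key!r} lacks a trailing colon")
        params[key[:-1]] = value
    return name, params


def same_mapping(a, b):
    """Compare two attribute strings, ignoring the order of the pairs."""
    return parse_grid_mapping(a) == parse_grid_mapping(b)


t_attr = ("rotated_latitude_longitude "
          "grid_north_pole_latitude: nplat grid_north_pole_longitude: nplon")
s_attr = ("rotated_latitude_longitude "
          "grid_north_pole_longitude: nplon grid_north_pole_latitude: nplat")

strings_differ = t_attr != s_attr
mappings_match = same_mapping(t_attr, s_attr)
```

Every program that compares or merges mappings would need this parse step, whereas comparing the attributes of two variables needs no string handling.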
>
> Potential inconsistency: If you update a value in some grid_mappings but
> not in others, an inconsistency results which is not possible if the
> information is only stored in one place.
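
A small sketch of the update anomaly, using plain Python dictionaries as a stand-in for attribute storage (an assumption for illustration only; the value 40.0 is arbitrary). With one copy per variable, an update to T leaves S stale; with a single shared record, both variables always see the same values:

```python
# Design 1: each variable carries its own copy of the mapping parameters.
per_variable = {
    "T": {"grid_north_pole_latitude": 32.5},
    "S": {"grid_north_pole_latitude": 32.5},
}
per_variable["T"]["grid_north_pole_latitude"] = 40.0  # S is not updated
copies_disagree = per_variable["T"] != per_variable["S"]

# Design 2: both variables refer to one shared mapping record.
shared = {"grid_north_pole_latitude": 32.5}
by_reference = {"T": shared, "S": shared}
shared["grid_north_pole_latitude"] = 40.0  # one update, seen by both
references_agree = by_reference["T"] == by_reference["S"]
```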
>
> Insertion anomalies: There is no place to store a grid_mapping if it
> does not currently apply to at least one variable in the dataset. But
> it might be useful to insert a grid_mapping even when there are no
> variables that use it, either for later use or to record alternate
> grid_mappings. Providing a template dataset with a set of
> grid_mappings but no variables is also not possible if grid_mappings
> cannot exist independently of variables. The alternative of using a
> variable for a grid_mapping doesn't have this problem.
>
> Deletion anomalies: The inverse of insertion anomalies is that if you
> delete all the variables that use a particular grid_mapping, you may
> lose information about the grid_mapping as well. This may not be a
> concern, but I can imagine it might be easier to create some new
> variables on a grid by starting with an existing dataset, copying all
> information except the existing variables, and then adding the new
> variables. This is more difficult if the grid_mappings are only
> associated with the variables.
>
> Finally, the design with a single variable representing a grid_mapping
> is more flexible, because you can later attach more information to the
> grid_mapping without affecting working programs. For example, if you
> wanted to later associate a "french_name" with each grid_mapping,
> adding it to the string would require changing the string parsers.
> But adding an additional attribute to the variable would require no
> changes to existing programs. Since they don't know about the extra
> information, they aren't affected by it.
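
The attribute side of this argument can be sketched as follows. The dictionary stands in for the attributes of a grid-mapping variable, and `read_north_pole` is a hypothetical "existing program"; neither is a real netCDF API, and the french_name value is a placeholder. The reader looks up only the attributes it knows, so an extra one passes through unnoticed:

```python
def read_north_pole(attrs):
    """An 'existing program': reads only the two attributes it knows about."""
    return (attrs["grid_north_pole_latitude"],
            attrs["grid_north_pole_longitude"])


# Attributes of a grid-mapping variable as first written.
original = {
    "grid_north_pole_latitude": 32.5,
    "grid_north_pole_longitude": 170.0,
}

# Later, someone attaches extra information; the reader is not changed.
extended = dict(original, french_name="(placeholder)")

before = read_north_pole(original)
after = read_north_pole(extended)
```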
>
> These arguments are similar to the reasons database designers use
> "normal forms" in designing relational database schema.
>
> If you have similarly redundant designs for cell_measures,
> cell_methods, and formula_terms, I would argue that these should
> likewise be factored out into a separate named variable, with
> indirection used to eliminate redundancy, for the same reasons as in
> the grid_mapping case.
>
> --Russ
>
> On Thu, 20 Feb 2003 18:55:41, Jonathan Gregory wrote:
>
> > Here's a hopefully clearer restatement of why I prefer a single attribute for
> > the grid_mapping, as currently proposed in CF-beta. If I can't convince you,
> > I'll go along with the majority, of course - perhaps kicking and screaming. :-)
> > But as a result of thinking about this, I have a new proposal - see below.
> >
> > (1) Readability of the CDL file. I want to have the projection information in
> > the same place as the variable definition, rather than somewhere else. A single
> > attribute is a string of a few lines at most, and I think that is easy to
> > understand and much more informative than a name for the mapping, which then
> > has to be looked up somewhere else. Brian is right that the names themselves
> > might be meaningful, but if they are generated by models according to their own
> > diverse conventions it is quite likely they won't be. It can't be guaranteed.
> >
> > (2) Simplicity and self-consistency of the standard. We have used "keyword:
> > value" type attributes for cell_measures, cell_methods and formula_terms, so it
> > would be consistent to do so for this purpose. We have to be able to parse such
> > attributes already. We can and should provide parsers in, say, Fortran, C and
> > Python at least. I agree there is a problem with parsing numerical values. (See
> > my suggestion below.) I do not like the idea of introducing a new kind of
> > variable and defining a large number of new kinds of attribute that could only
> > be used for this kind of variable. It is a lot of new machinery.
> >
> > (3) I prefer per-variable information to global information because it's easier
> > for programs to manipulate when modifying individual variables and combining
> > files. This is also generally the approach we have taken in CF. If you have
> > global information, you have to break the association before you can change an
> > individual variable, and you have to test for identical information on
> > variables from different sources when you combine files. On the other hand,
> > doing a global change to lots of identical per-variable information is
> > generally easy. Duplication per se is not a problem in this case. Duplication
> > is a problem if you require data in different places to remain consistent. We
> > do not require this for variables with grid mappings. You can't modify such a
> > variable just by changing its grid mapping info - you have to produce a
> > completely new variable anyway.
> >
> > (4) Telling differences by inspection. I agree with Brian that it would not be
> > easy to tell from inspection of a CDL file whether the coordinates were the
> > same if they were all per-variable, so I admit that space-saving is not the
> > only reason for factoring out coordinates. But coordinates are usually *much*
> > more voluminous than the grid mapping attribute. It *is* easy to compare grid
> > mappings by inspection, just like units, standard_name and all the other
> > metadata we don't factor out.
> >
> > Here's my suggestion to avoid parsing numbers. This actually amounts to a
> > half-way house between our proposals so far.
> >
> > variables:
> > float nplat;
> > float nplon;
> > float T(lev,rlat,rlon);
> > T:units="K";
> > T:grid_mapping="rotated_latitude_longitude ",
> >   "grid_north_pole_latitude: nplat grid_north_pole_longitude: nplon";
> > data:
> > nplat=32.5;
> > nplon=170.0;
> >
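
A minimal Python sketch of how a reader might resolve this half-way form, assuming the attribute holds a mapping name followed by "keyword: variable_name" pairs and that variable values are available in a dictionary. Both the function name `resolve_grid_mapping` and the dictionary are hypothetical stand-ins, not an actual netCDF interface. The point is that the numbers are never parsed from text; they are looked up in variables that already have a numeric type:

```python
def resolve_grid_mapping(attr, variables):
    """Return (mapping_name, {keyword: numeric value}).

    Each keyword in the attribute names a variable; its value is taken
    from `variables`, so no string-to-number conversion is needed.
    """
    tokens = attr.split()
    name, rest = tokens[0], tokens[1:]
    params = {}
    for key, var_name in zip(rest[0::2], rest[1::2]):
        params[key.rstrip(":")] = variables[var_name]
    return name, params


# Stand-in for the 'data:' section of the CDL example.
variables = {"nplat": 32.5, "nplon": 170.0}

attr = ("rotated_latitude_longitude "
        "grid_north_pole_latitude: nplat grid_north_pole_longitude: nplon")

name, params = resolve_grid_mapping(attr, variables)
```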
> > Referring to my points above:
> >
> > (1) Readability. Partially satisfied. The data variable states what kind the
> > mapping is and what parameters it has. Unfortunately you do have to look
> > elsewhere for their values. These values could be grouped together by giving
> > the variables similar names (like nplat and nplon).
> >
> > (2) Simplicity and self-consistency. On these grounds it is better even than
> > CF-beta. The grid_mapping attribute is now text-only and points to other
> > variables, which is exactly like cell_methods, formula_terms, etc. It does not
> > introduce any new attributes. The numerical values are in variables of the
> > right numerical type so there is no problem with conversion.
> >
> > (3) Per-variable information. Satisfied. It is easy to modify an individual
> > variable's mapping, but it is also easy to modify a mapping's parameter globally,
> > so it seems to have the best of both worlds.
> >
> > (4) Differences by inspection. Partially satisfied - as much as it is in the
> > projection-variable approach. You can't say for sure whether the parameters are
> > different until you look at their values, but you can have a good guess.
> _______________________________________________
> CF-metadata mailing list
> CF-metadata at cgd.ucar.edu
> http://www.cgd.ucar.edu/mailman/listinfo/cf-metadata
Received on Thu Feb 20 2003 - 17:07:52 GMT