
[CF-metadata] Re: projections in CF

From: Russ Rew <russ>
Date: Thu, 20 Feb 2003 14:35:36 -0700

Hi Jonathan,

Sorry, but I'm not convinced.

Readability of a CDL file should be secondary to its accuracy and
maintainability. And CDL is not really the issue, since the resulting
netCDF file could be translated into an XML form such as NcML or any
other lossless representation that would still have a redundant copy
of the shared grid_mapping information for each variable defined on
the same grid. Duplication of information can be beneficial when it
prevents errors, as in declaring variable types even when the type of
the variable may be inferred from its usage. But the minor
convenience of keeping a copy of the information in text form close to
each variable is of dubious value compared to the redundancy, identity
issues, potential inconsistency (update anomalies), insertion
anomalies, and deletion anomalies this design permits.

Identity issues: Is the mapping for T given by

  T:grid_mapping="rotated_latitude_longitude ",
          "grid_north_pole_latitude: nplat grid_north_pole_longitude: nplon";

the same as the mapping for S specified with

  S:grid_mapping="rotated_latitude_longitude ",
           "grid_north_pole_longitude: nplon grid_north_pole_latitude: nplat";

? The only difference is in the order of the (keyword: value) pairs,
which shouldn't be significant, but it's difficult for a human reader
to tell these are the same, and the difficulty of comparing them for a
human increases with the square of the number of keywords. You could
define a canonical form for each grid_mapping, perhaps requiring that
the keywords appear in alphabetical order, but this added complexity
is not needed when two variables are known to share a grid mapping
simply because their grid_mapping attributes name the same variable.

The same problem arises when merging information from two files, where
the variables may use the same grid mapping represented by keywords in
different orders. Using a single variable to represent a grid mapping
doesn't have this problem, since it is easy to compare two variables
to see if they have the same attributes.
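
To make the identity issue concrete, here is a minimal Python sketch. It
assumes the string form is a mapping name followed by whitespace-separated
"keyword: value" pairs; the helper name parse_grid_mapping is hypothetical
and not part of CF or of any netCDF library.

```python
def parse_grid_mapping(text):
    """Split 'name kw1: v1 kw2: v2' into (name, {kw: value}).

    Hypothetical helper: assumes whitespace-separated tokens and that
    every keyword token ends with a colon.
    """
    tokens = text.split()
    name, rest = tokens[0], tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected 'keyword: value' pairs")
    params = {}
    for key, value in zip(rest[0::2], rest[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"malformed keyword {key!r}")
        params[key[:-1]] = value
    return name, params


# The two attribute values from the T and S examples, as single strings.
t_mapping = ("rotated_latitude_longitude "
             "grid_north_pole_latitude: nplat grid_north_pole_longitude: nplon")
s_mapping = ("rotated_latitude_longitude "
             "grid_north_pole_longitude: nplon grid_north_pole_latitude: nplat")

# A plain string comparison treats keyword order as significant.
same_as_strings = t_mapping == s_mapping

# Every consumer has to parse first to get an order-independent answer.
same_as_mappings = parse_grid_mapping(t_mapping) == parse_grid_mapping(s_mapping)
```

With attributes stored on a separate variable, the comparison is already
a comparison of name/value sets, so no parsing or canonical ordering is
needed.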

Potential inconsistency: If you update a value in some grid_mappings but
not in others, an inconsistency results which is not possible if the
information is only stored in one place.
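
A small Python sketch of this update anomaly, using plain dictionaries
to stand in for a dataset. The mapping variable name "rotpole" and the
dictionary layout are assumptions made for illustration only.

```python
# Design A: each variable carries its own copy of the mapping string.
per_variable = {
    "T": "rotated_latitude_longitude grid_north_pole_latitude: 32.5",
    "S": "rotated_latitude_longitude grid_north_pole_latitude: 32.5",
}
# Updating one copy and forgetting the other leaves the dataset inconsistent.
per_variable["T"] = per_variable["T"].replace("32.5", "35.0")
copies_agree = per_variable["T"] == per_variable["S"]

# Design B: variables hold only the name of a shared mapping variable
# ("rotpole" is a hypothetical name).
mappings = {"rotpole": {"grid_north_pole_latitude": 32.5}}
by_reference = {"T": "rotpole", "S": "rotpole"}
# There is exactly one place to update, so T and S cannot disagree.
mappings["rotpole"]["grid_north_pole_latitude"] = 35.0
references_agree = (mappings[by_reference["T"]]
                    == mappings[by_reference["S"]])
```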

Insertion anomalies: There is no place to store a grid_mapping if it
does not currently apply to at least one variable in the dataset. But
it might be useful to insert a grid_mapping even when there are no
variables that use it, either for later use or to record alternate
grid_mappings. Providing a template dataset with a set of
grid_mappings but no variables is also not possible if grid_mappings
cannot exist independently of the variables that use them. The
alternative of using a variable for a grid_mapping doesn't have this
problem.

Deletion anomalies: The inverse of insertion anomalies is that if you
delete all the variables that use a particular grid_mapping, you may
lose information about the grid_mapping as well. This may not be a
concern, but I can imagine it might be easier to create some new
variables on a grid by starting with an existing dataset, copying all
information except the existing variables, and then adding the new
variables. This is more difficult if the grid_mappings are only
associated with the variables.
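
The insertion and deletion anomalies can be sketched the same way. This
is an illustrative Python model only; the function drop_variables and
the variable name "rotpole" are hypothetical.

```python
def drop_variables(variables, names):
    """Return a copy of the variable table without the named variables."""
    return {k: v for k, v in variables.items() if k not in names}


# Design A: the mapping lives inside the data variable's attribute.
inline = {
    "T": {
        "units": "K",
        "grid_mapping": ("rotated_latitude_longitude "
                         "grid_north_pole_latitude: 32.5 "
                         "grid_north_pole_longitude: 170.0"),
    },
}
inline_after = drop_variables(inline, {"T"})
# Deleting T deletes the only record of the mapping.
mapping_survives_inline = any(
    "grid_mapping" in attrs for attrs in inline_after.values()
)

# Design B: the mapping is its own variable; T only refers to it by name.
shared = {
    "rotpole": {
        "grid_north_pole_latitude": 32.5,
        "grid_north_pole_longitude": 170.0,
    },
    "T": {"units": "K", "grid_mapping": "rotpole"},
}
shared_after = drop_variables(shared, {"T"})
# The mapping remains and can be reused by new variables, or shipped
# alone as a template dataset.
mapping_survives_shared = "rotpole" in shared_after
```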

Finally, the design with a single variable representing a grid_mapping
is more flexible, because you can later attach more information to the
grid_mapping without affecting working programs. For example, if you
wanted to later associate a "french_name" with each grid_mapping,
adding it to the string would require changing the string parsers.
But adding an additional attribute to the variable would require no
changes to existing programs. Since they don't know about the extra
information, they aren't affected by it.
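
As a rough Python illustration of this point: the reader and parser
below are hypothetical, and the "french_name" value is made up. Whether
a real string parser would break depends on how it was written; this
one assumes whitespace-separated "keyword: value" pairs.

```python
def read_north_pole(attrs):
    """Existing reader: looks up only the attributes it knows about."""
    return (attrs["grid_north_pole_latitude"],
            attrs["grid_north_pole_longitude"])


def parse_pairs(text):
    """Existing string parser: whitespace-separated 'keyword: value' pairs."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected 'keyword: value' pairs")
    pairs = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"malformed keyword {key!r}")
        pairs[key[:-1]] = value
    return pairs


# Variable design: an extra attribute is invisible to the old reader.
mapping_var = {"grid_north_pole_latitude": 32.5,
               "grid_north_pole_longitude": 170.0}
before = read_north_pole(mapping_var)
mapping_var["french_name"] = "grille tournee"  # hypothetical value
unaffected = read_north_pole(mapping_var) == before

# String design: a value containing a space defeats the old tokenizer.
extended = ("grid_north_pole_latitude: nplat "
            "grid_north_pole_longitude: nplon "
            "french_name: grille tournee")
try:
    parse_pairs(extended)
    parser_broke = False
except ValueError:
    parser_broke = True
```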

These arguments are similar to the reasons database designers use
"normal forms" in designing relational database schemas.

If you have similarly redundant designs for cell_measures,
cell_methods, and formula_terms, I would argue that these should
likewise be factored out into a separate named variable, with
indirection used to eliminate redundancy, for the same reasons as in
the grid_mapping case.

--Russ

On Thu, 20 Feb 2003 18:55:41, Jonathan Gregory wrote:

> Here's a hopefully clearer restatement of why I prefer a single attribute for
> the grid_mapping, as currently proposed in CF-beta. If I can't convince you,
> I'll go along with the majority, of course - perhaps kicking and screaming. :-)
> But as a result of thinking about this, I have a new proposal - see below.
>
> (1) Readability of the CDL file. I want to have the projection information in
> the same place as the variable definition, rather than somewhere else. A single
> attribute is a string of a few lines at most, and I think that is easy to
> understand and much more informative than a name for the mapping, which then
> has to be looked up somewhere else. Brian is right that the names themselves
> might be meaningful, but if they are generated by models according to their own
> diverse conventions it is quite likely they won't be. It can't be guaranteed.
>
> (2) Simplicity and self-consistency of the standard. We have used "keyword:
> value" type attributes for cell_measures, cell_methods and formula_terms, so it
> would be consistent to do so for this purpose. We have to be able to parse such
> attributes already. We can and should provide parsers in, say, Fortran, C and
> Python at least. I agree there is a problem with parsing numerical values. (See
> my suggestion below.) I do not like the idea of introducing a new kind of
> variable and defining a large number of new kinds of attribute that could only
> be used for this kind of variable. It is a lot of new machinery.
>
> (3) I prefer per-variable information to global information because it's easier
> for programs to manipulate when modifying individual variables and combining
> files. This is also generally the approach we have taken in CF. If you have
> global information, you have to break the association before you can change an
> individual variable, and you have to test for identical information on
> variables from different sources when you combine files. On the other hand,
> doing a global change to lots of identical per-variable information is
> generally easy. Duplication per se is not a problem in this case. Duplication
> is a problem if you require data in different places to remain consistent. We
> do not require this for variables with grid mappings. You can't modify such a
> variable just by changing its grid mapping info - you have to produce a
> completely new variable anyway.
>
> (4) Telling differences by inspection. I agree with Brian that it would not be
> easy to tell from inspection of a CDL file whether the coordinates were the
> same if they were all per-variable, so I admit that space-saving is not the
> only reason for factoring out coordinates. But coordinates are usually *much*
> more voluminous than the grid mapping attribute. It *is* easy to compare grid
> mappings by inspection, just like units, standard_name and all the other
> metadata we don't factor out.
>
> Here's my suggestion to avoid parsing numbers. This actually amounts to a
> half-way house between our proposals so far.
>
> variables:
>   float nplat;
>   float nplon;
>   float T(lev,rlat,rlon);
>     T:units="K";
>     T:grid_mapping="rotated_latitude_longitude ",
>       "grid_north_pole_latitude: nplat grid_north_pole_longitude: nplon";
> data:
>   nplat=32.5;
>   nplon=170.0;
>
> Referring to my points above:
>
> (1) Readability. Partially satisfied. The data variable states what kind the
> mapping is and what parameters it has. Unfortunately you do have to look
> elsewhere for their values. These values could be grouped together by giving
> the variables similar names (like nplat and nplon).
>
> (2) Simplicity and self-consistency. On these grounds it is better even than
> CF-beta. The grid_mapping attribute is now text-only and points to other
> variables, which is exactly like cell_methods, formula_terms, etc. It does not
> introduce any new attributes. The numerical values are in variables of the
> right numerical type so there is no problem with conversion.
>
> (3) Per-variable information. Satisfied. It is easy to modify an individual
> variable's mapping, but it is also easy to modify a mapping's parameter
> globally, so it seems to have the best of both worlds.
>
> (4) Differences by inspection. Partially satisfied - as much as it is in the
> projection-variable approach. You can't say for sure whether the parameters are
> different until you look at their values, but you can have a good guess.
Received on Thu Feb 20 2003 - 14:35:36 GMT
