[CF-metadata] Feedback requested on proposed CF Simple Geometries from Jonathan Gregory on 2016-09-22 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Jonathan Gregory <j.m.gregory>
Date: Thu, 22 Sep 2016 11:40:07 +0100

Dear Ben

Thank you for your thoughtful and interesting proposal. I have quite a lot of
questions and comments about it.

* You explain that the need is to specify spatial coordinates with a simple
geometry for a timeSeries variable. For example, this could be for the
discharge as a function of time across some line in a river (your example), or
I suppose it could be an average temperature as a function of time for the
Atlantic Ocean, where you wanted to supply the polygon which drew the outline
of the basin. Have I got the idea? Timeseries like this can be stored in CF,
but their geographical extent is usually described only in words e.g. a region
name of atlantic_ocean, and this is fine for applications like CMIP where you
want to compare data from different data sources in which the Atlantic Ocean
may have different exact shapes (different AOGCMs, in particular). An array of
region names is also possible, so I don't think we need a new convention to
contain your dwarf planet example.

* Sect 9.1 on discrete sampling geometries says it cannot yet be used for cases
"where geo-positioning cannot be described as a discrete point location.
Problematic examples include time series that refer to a geographical region
(e.g. the northern hemisphere) ...". Actually I think that's not quite right.
The existing convention *can* describe regions which are contiguous, and
rectangular or polygonal, using its usual bounds convention (Sect 7.1). I think
we should consider changing this text, because it seems unnecessarily
restrictive. For example, a timeSeries for the average temperature in the
Northern Hemisphere can be stored like this:

  dimensions:
    region=1;
    nv=2;
    time=UNLIMITED;
  variables:
    float temperature(region,time);
      temperature:standard_name="surface_temperature";
      temperature:units="K";
      temperature:coordinates="lat lon";
      temperature:cell_methods="time: mean area: mean";
    float lat(region);
      lat:standard_name="latitude";
      lat:units="degrees_north";
      lat:bounds="lat_bounds";
    float lat_bounds(region,nv);
    float lon(region);
      lon:standard_name="longitude";
      lon:units="degrees_east";
      lon:bounds="lon_bounds";
    float lon_bounds(region,nv);
  data:
    lat_bounds=0,90;
    lon_bounds=0,360;

which means the region is 0-90N and 0-360E. If the regions were irregular
polygons in latitude and longitude, nv would be the number of vertices and the
lat and lon bounds would trace the outline of the polygon e.g. nv=3,
lat_bounds=0,90,0 and lon_bounds=0,0,90 describes the eighth of the sphere which
is bounded by the meridians at 0E and 90E and the Equator. I think, therefore,
we do not need an
additional convention for points or polygonal regions. However, we would need
new conventions for a timeseries where each value applies to a set of
discontiguous regions or regions with holes in them, a set of points, a line or
a set of lines. I guess that these are included in the geometry types you list
(LineString, Multipoint, MultiLineString, and MultiPolygon). Do you have
definite use-cases for all of these? (I ask this because we don't add new
functionality to CF until there is a definite and common need for it in
practice.)
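
To make the two bounds examples above concrete, here is a small Python sketch
that turns one region's lat and lon bounds into a list of (lon, lat) vertices.
It is only an illustration: the function name region_outline is invented here,
and treating nv=2 as a latitude-longitude rectangle is my reading of the usual
bounds convention, not text taken from CF.

```python
def region_outline(lat_bounds, lon_bounds):
    """Return the (lon, lat) vertices implied by one region's bounds.

    Assumption: nv=2 means a rectangle given by two intervals; nv>=3 means
    the bounds list the polygon vertices one by one.
    """
    if len(lat_bounds) != len(lon_bounds):
        raise ValueError("lat and lon bounds must have the same nv")
    if len(lat_bounds) == 2:
        south, north = lat_bounds
        west, east = lon_bounds
        # Expand the two intervals into the four corners of the rectangle.
        return [(west, south), (east, south), (east, north), (west, north)]
    # Polygon case: pair the vertices up in the order given.
    return list(zip(lon_bounds, lat_bounds))


# Northern Hemisphere, as in the CDL example (nv=2).
northern_hemisphere = region_outline([0, 90], [0, 360])

# One eighth of the sphere, as in the nv=3 example.
octant = region_outline([0, 90, 0], [0, 0, 90])
```

The octant comes out as three vertices: the Equator at 0E, the North Pole, and
the Equator at 90E.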

* I suspect that geometries of this kind can be described by the ugrid
convention http://ugrid-conventions.github.io/ugrid-conventions, which is
compliant with CF. Their purpose is to describe a set of connected points,
edges or faces at which values are given, whereas in your case you'd give a
single value for the whole set, but the description of the geometry itself
might be similar. Have you had a look at whether ugrid could meet your needs?
If it almost does so, perhaps a better thing to do would be to propose
additions to ugrid. We would like to avoid having more than one way to describe
such geometries.

If you decide to make use of ugrid instead, the rest of my comments may
not be relevant!

* So far CF does not say anything about the use of netCDF-4 features (i.e. not
the classic model). We have often discussed allowing them but the general
argument is also made that there has to be a compelling case for providing a
new way to do something which can already be done. (Steve Hankin often made
this argument, but since he's mostly retired I'll make it now in his name :-)
If there are two ways to do something, software has to support both of them. We
already have ways to encode ragged arrays, so is there a compelling case for
needing the netCDF-4 vlen array as well? We already have a way to encode
strings too, as character arrays. I think this is probably a discussion we
should have again in a different thread, so I'll just talk about your classic
encoding. The same points apply to both encodings.

* Your approach uses a coordinate_index variable to identify indices of
geometry coordinates e.g.

  dimensions:
    indices = 30;
    node = 25 ;
    geom = 1 ;
  variables:
    int coordinate_index(indices) ;
      coordinate_index:coordinates = "x y" ;
    double x(node) ;
    double y(node) ;
  data:
    coordinate_index = 0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2,
      13, 14, 15, 16, -1, 17, 18, 19, 20, -1, 21, 22, 23, 24 ;
    x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
      11, 15, 13, 11 ;
    y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
      29, 25, 25, 25, 29, 25 ;

where the -1 and -2 indices indicate where exterior and interior polygons
begin, and the first polygon has an implied -1 at the start. Is that right?
Given this example, I wonder why you need the index array, because none of the
coordinate indices (values >=0) is repeated, so no space is saved in the x
and y arrays. I guess this would be the usual case. If polygons did touch or
lines crossed, a few points would be in common, but not so many as to
need the complication of the index array. A simpler way to do it would be

    int outside_inside(node); // -1 for exterior, -2 for interior
    double x(node) ;
    double y(node) ;
    outside_inside=-1,-1,-1,-1,-1, -2,-2,-2,-2, -2,-2,-2,-2, -2,-2,-2,-2,
      -1,-1,-1,-1, -1,-1,-1,-1;
    x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
      11, 15, 13, 11 ;
    y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
      29, 25, 25, 25, 29, 25 ;

which needs only one dimension, or you could use the CF ragged array
convention (Sect 9.3.3):

    segment=6;
    node=25;
    int count(segment);
      count:sample_dimension="node";
    int outside_inside(segment); // -1 for exterior, -2 for interior
    double x(node) ;
    double y(node) ;
    outside_inside=-1,-2,-2,-2,-1,-1;
    count=5,4,4,4,4,4;
    x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
      11, 15, 13, 11 ;
    y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
      29, 25, 25, 25, 29, 25 ;
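
As a check on how the index encoding relates to a count-based ragged array,
here is a rough Python sketch that walks the coordinate_index values from the
example and derives one flag and one node count per ring. The function name
split_rings and its keyword arguments are made up for illustration. The break
values -1 and -2 and the implied -1 at the start follow my reading of the
proposal above.

```python
# Values copied from the coordinate_index example.
coordinate_index = [0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2,
                    13, 14, 15, 16, -1, 17, 18, 19, 20, -1, 21, 22, 23, 24]


def split_rings(index, multipart_break=-1, hole_break=-2):
    """Return (flags, counts, nodes) for a break-value encoded index array."""
    flags = []    # one flag per ring: exterior (-1) or interior (-2)
    counts = []   # number of nodes in each ring
    nodes = []    # node indices with the break values removed
    current_flag = multipart_break  # the first ring has an implied -1
    current_count = 0
    for value in index:
        if value in (multipart_break, hole_break):
            # A break value closes the current ring and types the next one.
            flags.append(current_flag)
            counts.append(current_count)
            current_flag = value
            current_count = 0
        else:
            nodes.append(value)
            current_count += 1
    # Close the final ring, which has no trailing break value.
    flags.append(current_flag)
    counts.append(current_count)
    return flags, counts, nodes


flags, counts, nodes = split_rings(coordinate_index)
```

For this data the result is six rings: flags -1,-2,-2,-2,-1,-1 with counts
5,4,4,4,4,4, and the node indices are simply 0 to 24 in order, which is why the
index array saves no space here.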

* You provide the attributes multipart_break_value and hole_break_value to
specify the values (-1 and -2 above) for the outside vs inside distinction. Do
you need the generality of being able to choose these values? It would seem
simpler to use a character array and specify in the convention which letters
should be used e.g.

    char outside_inside(segment);
    outside_inside="OIIIOO";

That makes it more readable, perhaps.
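
To show how a reader might use such letters, here is a hedged Python sketch
that groups rings into polygons: each "O" starts a new polygon and each
following "I" is a hole in it. The name group_polygons and the dictionary
layout are invented for this example. The flags and counts are the ones implied
by the six rings of the coordinate_index data above.

```python
def group_polygons(outside_inside, count):
    """Group rings into polygons using 'O' (exterior) and 'I' (interior).

    Each ring is returned as a (start, stop) slice into the node arrays.
    """
    polygons = []
    start = 0
    for letter, n in zip(outside_inside, count):
        ring = (start, start + n)
        if letter == "O":
            # An exterior ring opens a new polygon.
            polygons.append({"exterior": ring, "holes": []})
        elif letter == "I":
            if not polygons:
                raise ValueError("interior ring before any exterior ring")
            # An interior ring belongs to the most recent exterior ring.
            polygons[-1]["holes"].append(ring)
        else:
            raise ValueError("unexpected flag: %r" % letter)
        start += n
    return polygons


polygons = group_polygons("OIIIOO", [5, 4, 4, 4, 4, 4])
```

This gives three polygons, the first with three holes and the other two with
none.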

* Similarly, you propose attributes for clockwise/anticlockwise node order and
for the polygon closure convention. Do these need to be freely choosable? You
could specify anticlockwise, like the existing CF bounds convention (Sect 7.1),
and that the polygons are closed. In the latter case, you could omit the last
vertex of each polygon since it must be the same as the first, and that would
save a bit of space. If you specify these choices, the attributes aren't needed.
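
Both properties can be recovered from the coordinates themselves, which is
partly why I doubt the attributes are needed. A minimal Python sketch follows,
with invented function names. It assumes it is good enough to treat x and y as
planar coordinates (the shoelace formula ignores the curvature of the sphere).

```python
def open_ring(x, y):
    """Drop the last vertex if it repeats the first (a closed ring)."""
    if len(x) > 1 and x[0] == x[-1] and y[0] == y[-1]:
        return x[:-1], y[:-1]
    return x, y


def signed_area(x, y):
    """Shoelace formula: positive for anticlockwise order in the x-y plane,
    negative for clockwise. Works whether or not the ring is closed."""
    n = len(x)
    total = 0.0
    for i in range(n):
        j = (i + 1) % n  # wrap round to the first vertex
        total += x[i] * y[j] - x[j] * y[i]
    return total / 2.0


# First exterior ring and first hole from the example data above.
outer_x, outer_y = [0, 20, 20, 0, 0], [0, 0, 20, 20, 0]
hole_x, hole_y = [1, 10, 19, 1], [1, 5, 1, 1]

outer_area = signed_area(outer_x, outer_y)
hole_area = signed_area(hole_x, hole_y)
open_x, open_y = open_ring(outer_x, outer_y)
```

In the example data the outer ring comes out anticlockwise (area +400) and the
hole clockwise (area -36), and dropping the repeated closing vertex does not
change either area.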

* If this convention is going to be used for discrete sampling geometries,
an additional dimension is needed, because in a single data variable you
might have data for several of these geometries. That is, you need an array
of ragged arrays. Again, I wonder whether this suggests trying to use ugrid.
It might be you could name each one as a mesh, and specify the geometry
for the set of timeSeries as an array of mesh names. That would be a very
easy change to the existing Sect 9.
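
For what an "array of ragged arrays" might look like without ugrid, here is a
speculative Python sketch using two levels of counts: rings per geometry, and
nodes per ring. None of these names come from CF, ugrid or the proposal. The
split of the six example rings into two geometries is also hypothetical, made
up only to have more than one timeSeries.

```python
from itertools import accumulate


def nested_slices(rings_per_geometry, nodes_per_ring):
    """Return, for each geometry, the (start, stop) node slice of each ring."""
    if sum(rings_per_geometry) != len(nodes_per_ring):
        raise ValueError("ring counts do not add up to the number of rings")
    # Cumulative sums give where each ring stops; shifting them gives starts.
    stops = list(accumulate(nodes_per_ring))
    starts = [0] + stops[:-1]
    geometries = []
    first_ring = 0
    for n_rings in rings_per_geometry:
        rings = range(first_ring, first_ring + n_rings)
        geometries.append([(starts[r], stops[r]) for r in rings])
        first_ring += n_rings
    return geometries


# Hypothetical: geometry 0 is the polygon with three holes (4 rings),
# geometry 1 is the pair of small polygons (2 rings).
slices = nested_slices([4, 2], [5, 4, 4, 4, 4, 4])
```

Each timeSeries would then pick up its own list of rings by geometry index, at
the cost of one extra count variable.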

Best wishes

Jonathan
Received on Thu Sep 22 2016 - 04:40:07 BST
