[CF-metadata] Feedback requested on proposed CF Simple Geometries from Jonathan Gregory on 2017-01-31 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Jonathan Gregory <j.m.gregory>
Date: Tue, 31 Jan 2017 09:22:52 +0000

Dear Ben

Thanks for your new thoughts. I find this intriguing but still puzzling, and I
think this means we are talking at cross-purposes. Perhaps we ought to speak on
the phone? However, here are some replies.

Maybe this is a clue to our differences:

> We intend for this proposal to fit in the Discrete Sampling Geometry
> timeSeries featureType. So this proposal does not contain any new mechanism
> to link a time-varying data variable with a network composed of polygons,
> points, and lines (a whole hydrologic system for example).
...
> We would never associate time-varying data with nodes or
> the edges between them.

So far, CF describes data, and provides coordinates to locate the data in space
and time (and other dimensions). I'm not really familiar with the terminology,
but I understand that this is called a "coverage" - that is, a data which is a
function of a domain. Your "new mechanism" sentence suggests that your aim is
to describe just a domain, with no data. Maybe that's why you're agnostic about
whether and how the 2.7 million stream segments are grouped. Your aim is to
describe the network alone.

But if you want to link it to CF timeseries, as you say you do, this question
must has a definite answer, because a collection of timeseries is stored as a
data variable with a single dimension of time and a single dimension of space.
The latter is an index to information which locates the data e.g. a simplified
version of CF example H.2:

  dimensions:
    station = 10 ; // measurement locations
    time = UNLIMITED ;
  variables:
    float humidity(station,time) ;
      humidity:standard_name = "specific humidity" ;
      humidity:coordinates = "lat lon" ;
    double time(time) ;
      time:standard_name = "time";
      time:units = "days since 1970-01-01 00:00:00" ;
    float lon(station) ;
      lon:standard_name = "longitude";
      lon:units = "degrees_east";
    float lat(station) ;
      lat:standard_name = "latitude";
      lat:units = "degrees_north" ;

Here the the data is located at 10 (lon,lat) points. In the streamflow example
I guess that each of the 2.7M stream segments has a timeseries of flow rates -
is that right? That means we have to replace the points with linestrings (which
I think is essentially the same as polylines, isn't it?), one for each stream
segment. There must be exactly the same number of linestrings as there are
timeseries. We need something like:

  dimensions:
    station = 2700000; // stream segments
    time = UNLIMITED;
  variables:
    float flow(station,time) ;
      flow:units="m3 s-1";
    double time(time) ;
      time:standard_name = "time";
      time:units = "days since 1970-01-01 00:00:00" ;
    SOMETHING(station) // to describe the geometry of each stream segment

Your proposal is about the SOMETHING, but not how it links to the data. Is
that right? You would like to have SOMETHING alone in the file, just to
describe the network itself. CF doesn't do this at present (domain without
data), but it's been discussed before, and if we agree a CF convention for
SOMETHING, it could also be linked to the timeseries data variables.

Taking your previous comments into account (I'll come back to them below), as
a modified version of what I suggested before, here's a possible way to handle
this case, for a small number (3) of linestrings:

  dimensions:
    station = 3; // stream segments
    time = UNLIMITED;
    node = 9; // = 2 + 4 + 3
  variables:
    float flow(station,time) ;
      flow:units="m3 s-1";
      flow:topology="SOMETHING";
    double time(time) ;
      time:standard_name = "time";
      time:units = "days since 1970-01-01 00:00:00" ;
    int SOMETHING(station); // number of nodes for each linestring
      SOMETHING:node_coordinates="lon lat";
      SOMETHING:geometry_type="linestring";
    float lon(node) ;
      lon:standard_name = "longitude";
      lon:units = "degrees_east";
    float lat(node) ;
      lat:standard_name = "latitude";
      lat:units = "degrees_north" ;
  data:
    SOMETHING=2, 4, 3;
    lon=0, 1, 0, -1, -2, -3, 2, 3, 4;
    lat=51, 52, 51, 50, 50, 49, 55, 55, 56;

The timeseries flow(0,*) is for the 2-point line from (0,51) to (1,52), and
flow(1,*) is for the 4-point line (0,51) -> (-1,50) -> (-2,50) -> (-3,49).
Timeseries of data on polygons (one timeseries for each polygon) would be done
in the same way, with geometry_type="polygon". The topology attribute of the
data variable provides a link to the SOMETHING variable, which specifies how
the nodes are connected to make the linestring or polygon for each timeseries.
I use the attribute names "topology" and "node_coordinates" to be reminiscent
of ugrid. The SOMETHING variable could exist in a file without data variables
to describe the linestrings alone.

With linestrings and polygons, the geometry of each timeseries has a single
part (one linestring or one polygon), so the SOMETHING variable is used to
specify the number of nodes for each geometry. The example we have been
discussing in previous emails is the more complicated one of multipolygons. We
need to be able to do this in CF if there are real use-cases where each data
value applies to a *set* of polygons. I can imagine this is the case. For
example, some of the cantons of Switzerland are composed of several
non-contiguous territories, and the US state of Michigan is made of two
non-contiguous territories, and you might have one number that applied to
Michigan as a whole, or Canton Fribourg as a whole. So, for example, we could
store three timeseries, each applying to a collection of polygons, like this:

  dimensions:
    station = 3; // collections of polygons
    time = UNLIMITED;
    node = 24; // = 4 + 3 + 3 + 3 + 5 + 3 + 3
  variables:
    float flow(station,time) ;
      flow:units="m3 s-1";
      flow:topology="SOMETHING";
    double time(time) ;
      time:standard_name = "time";
      time:units = "days since 1970-01-01 00:00:00" ;
    int SOMETHING(station); // number of polygons in each collection
      SOMETHING:node_coordinates="lon lat";
      SOMETHING:geometry_type="multipolygon";
      SOMETHING:nodes=4, 3, 3, 3, 5, 3, 3; // number of nodes in each polygon
    float lon(node) ;
      lon:standard_name = "longitude";
      lon:units = "degrees_east";
    float lat(node) ;
      lat:standard_name = "latitude";
      lat:units = "degrees_north" ;
  data:
    SOMETHING=3, 2, 2; // number of polygons in each collection
    lon=0, 20, 20, 0, ... // first polygon, etc. ...
    lat=0, 0, 20, 20, ...

In this arrangement, the SOMETHING variable says how many parts each geometry
has, and its nodes attribute says how many nodes there in each part. Thus I
have combined the two variables I suggested last time (number_of_parts and
number_of_nodes) into SOMETHING.

These simple geometries can be regarded as a more complex alternative to cells
bounds - each timeseries has a complicated geometry of nodes and lines, but
logically it's still a single "cell". For the sake of applications which can
read CF but don't understand simple geometries, it might be a good idea in
addition to provide a "representative" location for each timeseries, as
representive_lat(station) and representative_lon(station), which could for
instance be the mean of the node coordinates for each geometry. That would
allow a map to be drawn by which showed where the data is approximately
located, without its shape.

I wrote
> You propose the index variable in order for the convention to be like
> ugrid. However this still seems to me to be an unnecessary complexity and
> use of space if you aren?t going to have many shared nodes.
and you replied
> Sharing nodes should be possible within the spec in our opinion. There may
> be overhead for some dataset encodings, but if one is willing to sacrifice
> computational time when writing complex geometric datasets with
> shared-node-topology, considerable disk space and memory may be saved.
and Chris said
> There is another reason to do it ? it makes it definitive that two
> (or more) geometries share the exact same node, rather than them being
> distinct points that happened to be at the same location (Or worse, with FP
> error and all, two points that are very close)

To be frank, I'm not convinced by either argument. Regarding the first, in your
example you don't reuse any points at all. Can you give an example where there
is a lot of reuse? Regarding the second, I agree that it is a nuisance and
unreliable to have to make comparisons with tolerance between floating-point
numbers to determine equality. However, when you write a file, I suppose you
can and would write exactly the same numbers for the coordinates of a node if
it appears several times, wouldn't you? Thus the coincidence of nodes can be
tested by *exact* equality of coordinates - no tolerance needed.

In my example above, I assumed the polygons have no holes in them, so I've
omitted the inside/outside information. If needed, this information could also
be an attribute e.g. SOMETHING:inout="OIIIOOOOIOO", with as many elements as
there are polygons in total. Thinking again about it, I wonder whether this
information is really needed. If you draw all the polygons, isn't it apparent
which ones are inside anyway? When would you use this information?

My scheme avoids the use of break values, which you're not very keen on your-
selves, it sounds like. A disadvantage that break values have is that they
make the number of nodes and the length of the vectors different, so you need
extra netCDF dimensions which cannot be logically related (since netCDF doesn't
support arithmetic with dimensions).

You wrote
> - The nice thing about the coordinate index approach is that one
> approach basically works for all geometry types. In your example,
> number_of_parts and inout are only present for multi-geometries
> (multipolygons).
I hope my new version looks better to you. Because it's got only one layer
now, it's always the same structure. Yes, there is extra information for the
multi-geometries, because they're more complex, but that can be stored in the
same structure using optional attribute.

You wrote > - It is more difficult to extract a single geometry using this
approach. It's not hard, though, and the same comment would apply to the CF
contiguous ragged array representation. To find the first part of multipolygon
i, you sum the number of parts for multipolygons 1 to i-1 - call this sum j
(j=0 if i=0). The first part of multipolygon i is polygon j+1. You then sum the
number of nodes for polygons 1 to j - call this sum k (k=0 if i=0). The first
node of multipolygon i is k+1.

I look forward to hearing your and others' thoughts. As I said, we could
arrange to talk if that would be more efficient.

Best wishes

Jonathan
Received on Tue Jan 31 2017 - 02:22:52 GMT

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:42 BST