[CF-metadata] Feedback requested on proposed CF Simple Geometries from Ben Koziol - NOAA Affiliate on 2016-12-21 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Ben Koziol - NOAA Affiliate <ben.koziol>
Date: Wed, 21 Dec 2016 10:41:04 -0700

Jonathan, Chris, and CF-Metadata,

Thank you kindly for the replies and apologies for the delay on the
response. Please see responses below. The comments were very useful in
introducing some conceptual hangups and will help us moving forward. As
usual, looking forward to continued discussion.

A quick note. We?ve revised our thinking a bit and recorded our current
design in an AGU poster: http://goo.gl/0NI4Sd. Based on the feedback in
this thread and what we?ve learned in the process of preparing the poster
and software, we need to discuss a bit more and will prepare a proposal for
the community to review soon.

I was asking whether this means that for each *collection* (of points,
lines or polygons) there is a *single* timeseries. For instance, in your
example of a single geometry composed of several polygons, there is a
single number for each time. But that is not the case for weather stations;
for each weather station there is a timeseries, and at each time there is a
different number (value of temperature, precipitation or whatever) for each
weather station. You also write, ?The US National Weather Service?s
National Water Model (NWM) ? forecasts streamflow rates in about 2.7
million stream segments averaging 2km.? The stream network is a
MultiLineString geometry, but I don?t think there is just one value of
streamflow applying to the entire network at any given time; I guess there
is a different timeseries for each stream segment. But in my example above,
the Atlantic Ocean is a single polygon with a single timeseries for its
average temperature, not a different timeseries for each node. Thus I am
unclear about the dimensions of the data. In terms of your original
example, does the data have dimensions (time,geometry, where geometry=1) or
(time,node)?

Before diving in, it?s critical to define some terminology. A geometry is
meant to refer to a potentially multipart geometric entity that might
otherwise be called a feature. A geometry is made up of one or more points,
lines, or polygons.

That said, we are thinking the dimensions of time-varying data would be (time,
geometry) where time and geometry may have arbitrary lengths. Hence,
multiple time-varying variables could be associated with each geometry.
Chris addressed this in his response.

How geometry data is ?exploded? is up to the client-software. The 2.7
million stream segments would likely not be a single MultiLineString
geometry. The geometry counts could be 2.7 million Linestrings and 2.7
million Polygons. One could collapse all this geometry data into single
multi-geometries, but this would prove unwieldy. Some of the LineStrings
could be discontinuous multilinestring geometries (only requiring one index
on the geometry dimension but consisting of two physical LineStrings).

This seems to me to be a crucial difference. In the former case the simple
geometry can be regarded as a more complex alternative to cells bounds -
the cell has a complicated geometry of nodes and lines, but it?s still a
single cell. In the latter case you?re providing many timeseries in an
unstructured geometry, which is what ugrid describes. Which do you have in
mind?

We intend for this proposal to fit in the Discrete Sampling Geometry
timeSeries featureType. So this proposal does not contain any new mechanism
to link a time-varying data variable with a network composed of polygons,
points, and lines (a whole hydrologic system for example). UGRID provides
some mechanisms for this similar to other CF conventions (data is
associated with a grid center point and its bounds for example - or a
?face? has a center point in UGRID).

It?s possible your question is still not being addressed. ?Nodes? are used
in all geometries. We would never associate time-varying data with nodes or
the edges between them. Data would always be associated with the geometry
(a feature comprised of nodes).

You propose the index variable in order for the convention to be like
ugrid. However this still seems to me to be an unnecessary complexity and
use of space if you aren?t going to have many shared nodes. I think the
case for having another convention, distinct from ugrid, is stronger if it
is *unlike* ugrid in this respect, and therefore simpler as well.

Sharing nodes should be possible within the spec in our opinion. There may
be overhead for some dataset encodings, but if one is willing to sacrifice
computational time when writing complex geometric datasets with
shared-node-topology, considerable disk space and memory may be saved.

Reusing coordinate indexing indirection does not seem like a duplication of
UGRID. In fact, it makes sense to align with UGRID as much as possible to
facilitate data exchange.

I agree that repeating the inside/outside flag many times is wasteful.
That, coupled with your clarification that you may have several geometries,
each consisting of several elements (points, lines, polygons), means that
you need, in effect, a ragged array of ragged arrays
(geometry,element,node). This is more complicated than DSGs, but it seems
to me it would be reasonably easy to understand if your multi-geometry
example
https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example
was stored something like this:

geom=3;
part=11;
node=36;
int number_of_parts(geom);
  number_of_parts:parts="number_of_nodes";
int number_of_nodes(part);
  number_of_nodes:inout="inout";
char inout(part);
float x(node);
float y(node);
number_of_parts=6, 3, 2;
number_of_nodes=4, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3;
inout="OIIIOOOOIOO";
x=0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13, -40,
-20, -45, -20, -10, -10, -30, -45, -30, -20, -20, 30, 45, 10, 25, 50, 30;
y = 0, 0, 20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29,
-40, -45, -30, -35, -30, -10, -5, -20, -20, -15, -25, 20, 40, 40, 5, 10, 15;

where I assume that all polygons are closed. What do you think?

This is not a bad approach and, with some modifications, it could be used.
A few thoughts:

   - In regards to ?ragged array of ragged arrays?, this is solved in our
   proposal by using multiple coordinate index / geometry variables indicating
   a ?geom/instance? dimension to identify associated data variables. These
   variables may use their own coordinate vectors or index into shared
   coordinate vectors. Some more discussion on this when responding to Chris?s
   comments.
   - This approach is more complex, in our opinion, than the indexing
   approach. In the case of multiple geometries in a single file, it is not
   clear how these variables are all linked together. The coordinate index
   variable hosts the attributes necessary for self-description.
   - We had considered adopting something along these lines to avoid the
   use of ?break values?. We wanted to avoid variables where possible. The
   contiguous ragged approach requires only stop indexing (one additional
   variable). For NetCDF4 ragged arrays, no ragged indexing is required
   keeping the schema clean. Geometries lend themselves naturally to NetCDF4
   ragged arrays. We agree break values are confusing. Break values present a
   reasonable solution without adding considerable instrumentation for
   multi-part geometries which will generally be an exception.
   - The nice thing about the coordinate index approach is that one
   approach basically works for all geometry types. In your example,
   number_of_parts and inout are only present for multi-geometries
   (multipolygons).
   - It is more difficult to extract a single geometry using this approach.
   Not that big of deal necessarily, but the contiguous ragged and NetCDF4
   variable length make the process relatively simple. With geometries, we
   view single element / random access as an important schema characteristic.

For comparison, your CDL would look something like the following when
transformed into our proposal. Note, the ragged index and nodes have been
broken onto one line per geometry:

geom=3;
index=44;
node=36;
int start_of_geom(geom);
  start_of_geom:contiguous_ragged_dimension = "index"
int coordinate_index(index);
  coordinate_index:geom_type="multipolygon";
  coordinate_index:geom_coordinates="x y";
  coordinate_index:geom_dimension="geom";
  coordinate_index:stop_encoding="cra";
  coordinate_index:multipart_break_value=-1;
  coordinate_index:hole_break_value=-2;
  coordinate_index:outer_ring_order="anticlockwise";
  coordinate_index:closure_convention="last_node_equals_first";
float x(node);
float y(node);

start_of_geom=0, 24, 37 ;
coordinate_index=0,1,2,3,-2,4,5,6,-2,7,8,9,-2,10,11,12,-1,13,14,15,-1,16,17,18,
19,20,21,-1,22,23,24,25,26,-2,27,28,29,
30,31,32,-1,33,34,35 ;
x=0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13,
-40, -20, -45, -20, -10, -10, -30, -45, -30, -20, -20,
30, 45, 10, 25, 50, 30;
y = 0, 0, 20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29,
-40, -45, -30, -35, -30, -10, -5, -20, -20, -15, -25,
20, 40, 40, 5, 10, 15;

On to Chris?s comments.

I think it may be helpful to borrow terminology (and the data model) from
the GIS world here. In this case, I am referencing the geoJSON spec, as I
happen to be working with that at the moment, but the basic data model is
pretty consistent.

Agreed. We want to, at the very least, maintain consistency with geometry
types (not case sensitive). The ?break values? are a poor man?s parentheses
corresponding to their uses in arrays used by GeoJSON and WKT.

Note that they have ?geometries? which can be things like points, polygons,
polyllines. IIUC (and I?m no osgeo mavin) geometries represent a ?single?
entity. Then there are ?Features?: a Feature is essentially data associated
with a particular geometry. But note: there are ?Collections? ? both
Geometry and Feature Collections ? that is what you use to ?bundle? various
data together.

I think we may be well served by thinking in terms of mapping the GIS data
model to CF/netcdf ? for instance it would be great to be able to write a
netcdf<->geoJSON converter that was lossless, AND would be fairly ?native?
in both cases.

Agree in principle. In practice, this proves difficult of course. :-) A
FeatureCollection may contain features with different geometry types. We
would need to add an additional dimension variable describing the geometry
type. A GeometryCollection itself may nest inside a FeatureCollection. We
think of simple geometry variables as FeatureCollections with a static
geometry type. NetCDF groups may help provide a crosswalk, but I am not
sure we are ready to go there.

GeoJSON also tends to break down with time series - repeating geometries
coordinates ad nauseum for each time coordinate. TopoJSON offers some
solutions in this regard. TopoJSON also incorporates node sharing. Chris
did link to this JSON spec already.

Also, how do NetCDF attributes, dimensions, and data types fit into
GeoJSON? No good answer for this one.

What?s important is that we establish a way to encode simple/basic geometry
types. Collections can be created using indexing or hierarchies. The basic
static geometry-typed FeatureCollection should be sufficient for most
applications.

(though I?m still confused, maybe you can have an ?array? of data
associated with a GeometryCollection?)

as for MultiLineString ? you could associate an array of data with the
Multilinestring ? so one value per segment. But I think that violates the
intent of the data model ? you should have a GeometryCollection of
linestrings instead. and then each segment has its own geometry and you can
associate an array of data with that. (or it should be a FeatureCollection?
I?m getting confused now!

Yup, makes your head hurt. No good answer for this. This leads back to the
?ragged array of ragged arrays? comment by Jonathan earlier. It is likely
beyond the scope of our proposal. The solution may lie in turning geometry
variable attributes and names into arrays themselves adding another layer
of indexing in the process. We hope that the simple geometry encoding
methods could be reused in an effort like this.

Of course, CF doesn?t need to follow this data model, but it?s a good idea
to be informed by it.

Yes, absolutely. Your descriptions of geometry types and collections is
correct. GeometryCollections can also be used for the *same* types of
geometries - a minor point.

In the GIS data model, nodes are not shared between geometries, and you are
quite right that keeping nodes separate with geometries indexing into it is
an added complication and would not be space-efficient.

However, there is another reason to do it ? it makes it definitive that two
(or more) geometries share the exact same node, rather than them being
distinct points that happened to be at the same location (Or worse, with FP
error and all, two points that are very close)

Yes, this is why we are for the coordinate indexing approach used in UGRID.

This is actually a major limitation in the standard GIS model.

Also yes!
?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.cgd.ucar.edu/pipermail/cf-metadata/attachments/20161221/28c0b6e9/attachment-0001.html>
Received on Wed Dec 21 2016 - 10:41:04 GMT

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:42 BST