[CF-metadata] CF-metadata Digest, Vol 164, Issue 12 from Little, Chris on 2016-12-21 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Little, Chris <chris.little>
Date: Wed, 21 Dec 2016 18:14:04 +0000

Dear CF Community

Thanks for sight of this interesting debate and approach to something practical in a complicated domain.

Over the years, I have seen various patterns (and anti-patterns) develop, whether in computer graphics or GIS systems.

Can I recommend the mathematical concept of 'completion' or 'closure' to help your decisions? Apologies if anyone is being taught to suck eggs.

Multi-lines or multi-polylines are, in some sense, more fundamental entities than lines or polylines, because operations such as cut or intersect may convert lines into multi-lines. The same operation on multi-lines always ends up with a multi-line.

Similarly for polygons. Multi-polygons are more fundamental, as operating on a single polygon may result in several polygons, possibly with holes. For example, a lake with falling water levels may change from a single polygon into several disjoint puddles, or into lakes with islands.
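The closure argument can be sketched with a deliberately simplified 1-D analogue. This is an illustration only, not something from the thread: a "line" is assumed to be an interval (start, end), a "multi-line" is a list of disjoint intervals, and `cut` is a hypothetical helper that removes a gap.

```python
# Simplified 1-D analogue (an assumption for illustration):
# a "line" is an interval (start, end); a "multi-line" is a list of them.

def cut(multiline, gap):
    """Remove the interval `gap` from every part of `multiline`."""
    g0, g1 = gap
    result = []
    for a, b in multiline:
        if g1 <= a or g0 >= b:   # the gap misses this part entirely
            result.append((a, b))
            continue
        if a < g0:               # the piece left of the gap survives
            result.append((a, g0))
        if g1 < b:               # the piece right of the gap survives
            result.append((g1, b))
    return result

single_line = [(0, 10)]          # one line, written as a one-part multi-line
once = cut(single_line, (4, 6))  # cutting a line can give two parts
twice = cut(once, (1, 2))        # cutting a multi-line gives a multi-line
```

The return type of `cut` never has to change, however many times it is applied, which is the practical benefit of choosing the closed type as the fundamental one.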

Similar patterns occur in 3D graphics, where various splines are not fundamental, but Non-Uniform Rational B-Splines (NURBS) are. Special cases of anti-patterns are adding circles but not ellipses, or rectangles and squares but not quadrilaterals, to graphics libraries.

If I understand the threads below correctly, you seem to have converged on a sensible set of closures.

HTH, Chris

-----Original Message-----
From: CF-metadata [mailto:cf-metadata-bounces at cgd.ucar.edu] On Behalf Of cf-metadata-request at cgd.ucar.edu
Sent: Wednesday, December 21, 2016 5:42 PM
To: cf-metadata at cgd.ucar.edu
Subject: CF-metadata Digest, Vol 164, Issue 12

Today's Topics:

   1. Re: Feedback requested on proposed CF Simple Geometries
      (Ben Koziol - NOAA Affiliate)

----------------------------------------------------------------------

Message: 1
Date: Wed, 21 Dec 2016 10:41:04 -0700
From: Ben Koziol - NOAA Affiliate <ben.koziol at noaa.gov>
To: Chris Barker <chris.barker at noaa.gov>
Cc: Jonathan Gregory <j.m.gregory at reading.ac.uk>,
        "cf-metadata at cgd.ucar.edu" <cf-metadata at cgd.ucar.edu>
Subject: Re: [CF-metadata] Feedback requested on proposed CF Simple
        Geometries
Message-ID:
        <CADckZ8dwjwOt+em_z46y_kx1jvA5MnsMs=qdj3P4nX+Rth4HMA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Jonathan, Chris, and CF-Metadata,

Thank you kindly for the replies and apologies for the delay on the response. Please see responses below. The comments were very useful in introducing some conceptual hangups and will help us moving forward. As usual, looking forward to continued discussion.

A quick note. We've revised our thinking a bit and recorded our current design in an AGU poster: http://goo.gl/0NI4Sd. Based on the feedback in this thread and what we've learned in the process of preparing the poster and software, we need to discuss a bit more and will prepare a proposal for the community to review soon.

I was asking whether this means that for each *collection* (of points, lines or polygons) there is a *single* timeseries. For instance, in your example of a single geometry composed of several polygons, there is a single number for each time. But that is not the case for weather stations; for each weather station there is a timeseries, and at each time there is a different number (value of temperature, precipitation or whatever) for each weather station. You also write, "The US National Weather Service's National Water Model (NWM) ... forecasts streamflow rates in about 2.7 million stream segments averaging 2 km." The stream network is a MultiLineString geometry, but I don't think there is just one value of streamflow applying to the entire network at any given time; I guess there is a different timeseries for each stream segment. But in my example above, the Atlantic Ocean is a single polygon with a single timeseries for its average temperature, not a different timeseries for each node. Thus I am unclear about the dimensions of the data. In terms of your original example, does the data have dimensions (time, geometry), where geometry=1, or (time, node)?

Before diving in, it's critical to define some terminology. A geometry is meant to refer to a potentially multipart geometric entity that might otherwise be called a feature. A geometry is made up of one or more points, lines, or polygons.

That said, we are thinking the dimensions of time-varying data would be (time, geometry), where time and geometry may have arbitrary lengths. Hence, multiple time-varying variables could be associated with each geometry. Chris addressed this in his response.
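As a rough sketch of what a (time, geometry) layout means in practice, the snippet below uses a plain nested list in place of a netCDF variable. The name `streamflow`, the sizes, and the values are all made up for illustration.

```python
# Hypothetical (time, geometry) data variable held as a nested list:
# streamflow[t][g] is the value at time step t for geometry g.
n_time, n_geom = 4, 3
streamflow = [[10.0 * t + g for g in range(n_geom)] for t in range(n_time)]

def timeseries(data, geom_index):
    """Return the full time series belonging to one geometry."""
    return [row[geom_index] for row in data]

series = timeseries(streamflow, 1)  # one series per geometry, none per node
```

Each index along the geometry dimension owns one complete time series, regardless of how many nodes or parts make up that geometry.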

How geometry data is "exploded" is up to the client software. The 2.7 million stream segments would likely not be a single MultiLineString geometry. The geometry counts could be 2.7 million LineStrings and 2.7 million Polygons. One could collapse all this geometry data into single multi-geometries, but this would prove unwieldy. Some of the LineStrings could be discontinuous MultiLineString geometries (only requiring one index on the geometry dimension but consisting of two physical LineStrings).

This seems to me to be a crucial difference. In the former case the simple geometry can be regarded as a more complex alternative to cell bounds - the cell has a complicated geometry of nodes and lines, but it's still a single cell. In the latter case you're providing many timeseries in an unstructured geometry, which is what ugrid describes. Which do you have in mind?

We intend for this proposal to fit in the Discrete Sampling Geometry timeSeries featureType. So this proposal does not contain any new mechanism to link a time-varying data variable with a network composed of polygons, points, and lines (a whole hydrologic system for example). UGRID provides some mechanisms for this similar to other CF conventions (data is associated with a grid center point and its bounds for example - or a "face" has a center point in UGRID).

It's possible your question is still not being addressed. "Nodes" are used in all geometries. We would never associate time-varying data with nodes or the edges between them. Data would always be associated with the geometry (a feature composed of nodes).

You propose the index variable in order for the convention to be like ugrid. However this still seems to me to be an unnecessary complexity and use of space if you aren't going to have many shared nodes. I think the case for having another convention, distinct from ugrid, is stronger if it is *unlike* ugrid in this respect, and therefore simpler as well.

Sharing nodes should be possible within the spec in our opinion. There may be overhead for some dataset encodings, but if one is willing to sacrifice computational time when writing complex geometric datasets with shared-node-topology, considerable disk space and memory may be saved.

Reusing coordinate indexing indirection does not seem like a duplication of UGRID. In fact, it makes sense to align with UGRID as much as possible to facilitate data exchange.

I agree that repeating the inside/outside flag many times is wasteful. That, coupled with your clarification that you may have several geometries, each consisting of several elements (points, lines, polygons), means that you need, in effect, a ragged array of ragged arrays (geometry,element,node). This is more complicated than DSGs, but it seems to me it would be reasonably easy to understand if your multi-geometry example https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example was stored something like this:

geom=3;
part=11;
node=36;
int number_of_parts(geom);
  number_of_parts:parts="number_of_nodes";
int number_of_nodes(part);
  number_of_nodes:inout="inout";
char inout(part);
float x(node);
float y(node);
number_of_parts=6, 3, 2;
number_of_nodes=4, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3;
inout="OIIIOOOOIOO";
x=0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13, -40, -20, -45, -20, -10, -10, -30, -45, -30, -20, -20, 30, 45, 10, 25, 50, 30;
y=0, 0, 20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29, -40, -45, -30, -35, -30, -10, -5, -20, -20, -15, -25, 20, 40, 40, 5, 10, 15;

where I assume that all polygons are closed. What do you think?
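To make the "ragged array of ragged arrays" layout concrete, here is a minimal Python sketch of how a reader might unpack the count variables above into geometries, parts, and nodes. The function name `decode_nested_ragged` and the returned structure are assumptions for illustration; the data values are copied from the example.

```python
number_of_parts = [6, 3, 2]
number_of_nodes = [4, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3]
inout = "OIIIOOOOIOO"
x = [0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13,
     -40, -20, -45, -20, -10, -10, -30, -45, -30, -20, -20,
     30, 45, 10, 25, 50, 30]
y = [0, 0, 20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29,
     -40, -45, -30, -35, -30, -10, -5, -20, -20, -15, -25,
     20, 40, 40, 5, 10, 15]

def decode_nested_ragged(number_of_parts, number_of_nodes, inout, x, y):
    """Return one list per geometry; each holds (flag, ring) tuples."""
    geoms = []
    part = 0   # running position along the part dimension
    node = 0   # running position along the node dimension
    for n_parts in number_of_parts:
        parts = []
        for _ in range(n_parts):
            n = number_of_nodes[part]
            ring = list(zip(x[node:node + n], y[node:node + n]))
            parts.append((inout[part], ring))
            part += 1
            node += n
        geoms.append(parts)
    return geoms

geoms = decode_nested_ragged(number_of_parts, number_of_nodes, inout, x, y)
```

Because only counts are stored, reaching geometry k means accumulating the counts of everything before it, which is relevant to the random-access point raised in the reply.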

This is not a bad approach and, with some modifications, it could be used.
A few thoughts:

   - In regards to "ragged array of ragged arrays", this is solved in our
   proposal by using multiple coordinate index / geometry variables indicating
   a "geom/instance" dimension to identify associated data variables. These
   variables may use their own coordinate vectors or index into shared
   coordinate vectors. Some more discussion on this when responding to Chris's
   comments.
   - This approach is more complex, in our opinion, than the indexing
   approach. In the case of multiple geometries in a single file, it is not
   clear how these variables are all linked together. The coordinate index
   variable hosts the attributes necessary for self-description.
   - We had considered adopting something along these lines to avoid the
   use of "break values". We wanted to avoid additional variables where
   possible. The contiguous ragged approach requires only stop indexing (one
   additional variable). For NetCDF4 ragged arrays, no ragged indexing is
   required, keeping the schema clean. Geometries lend themselves naturally to
   NetCDF4 ragged arrays. We agree break values are confusing. Break values
   present a reasonable solution without adding considerable instrumentation
   for multi-part geometries, which will generally be an exception.
   - The nice thing about the coordinate index approach is that one
   approach basically works for all geometry types. In your example,
   number_of_parts and inout are only present for multi-geometries
   (multipolygons).
   - It is more difficult to extract a single geometry using this approach.
   Not that big of a deal necessarily, but the contiguous ragged and NetCDF4
   variable-length approaches make the process relatively simple. With
   geometries, we view single element / random access as an important schema
   characteristic.

For comparison, your CDL would look something like the following when transformed into our proposal. Note, the ragged index and nodes have been broken onto one line per geometry:

geom=3;
index=44;
node=36;
int start_of_geom(geom);
  start_of_geom:contiguous_ragged_dimension="index";
int coordinate_index(index);
  coordinate_index:geom_type="multipolygon";
  coordinate_index:geom_coordinates="x y";
  coordinate_index:geom_dimension="geom";
  coordinate_index:stop_encoding="cra";
  coordinate_index:multipart_break_value=-1;
  coordinate_index:hole_break_value=-2;
  coordinate_index:outer_ring_order="anticlockwise";
  coordinate_index:closure_convention="last_node_equals_first";
float x(node);
float y(node);

start_of_geom=0, 24, 37 ;
coordinate_index=0,1,2,3,-2,4,5,6,-2,7,8,9,-2,10,11,12,-1,13,14,15,-1,16,17,18,
19,20,21,-1,22,23,24,25,26,-2,27,28,29,
30,31,32,-1,33,34,35 ;
x=0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13, -40, -20, -45, -20, -10, -10, -30, -45, -30, -20, -20, 30, 45, 10, 25, 50, 30;
y=0, 0, 20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29, -40, -45, -30, -35, -30, -10, -5, -20, -20, -15, -25, 20, 40, 40, 5, 10, 15;
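A reader for this encoding might look roughly like the Python sketch below. It assumes that the start_of_geom values are offsets into coordinate_index at which each geometry begins (consistent with the values 0, 24, 37 and an index length of 44), that -1 starts a new polygon, and that -2 starts a hole in the current polygon. The function names are hypothetical.

```python
MULTIPART_BREAK = -1   # multipart_break_value: the next polygon begins
HOLE_BREAK = -2        # hole_break_value: the next ring is a hole

start_of_geom = [0, 24, 37]
coordinate_index = [
    0, 1, 2, 3, -2, 4, 5, 6, -2, 7, 8, 9, -2, 10, 11, 12,
    -1, 13, 14, 15, -1, 16, 17, 18,
    19, 20, 21, -1, 22, 23, 24, 25, 26, -2, 27, 28, 29,
    30, 31, 32, -1, 33, 34, 35,
]

def split_geometries(coordinate_index, start_of_geom):
    """Slice the flat index array into one run per geometry."""
    bounds = list(start_of_geom) + [len(coordinate_index)]
    return [coordinate_index[a:b] for a, b in zip(bounds, bounds[1:])]

def decode_multipolygon(indices):
    """Return polygons as lists of rings (exterior first, then holes)."""
    polygons = [[[]]]
    for i in indices:
        if i == MULTIPART_BREAK:
            polygons.append([[]])        # open a new polygon
        elif i == HOLE_BREAK:
            polygons[-1].append([])      # open a new hole ring
        else:
            polygons[-1][-1].append(i)   # node index into x and y
    return polygons

runs = split_geometries(coordinate_index, start_of_geom)
decoded = [decode_multipolygon(run) for run in runs]
```

Extracting one geometry needs only its slice of coordinate_index, so geometry k can be read without scanning the geometries before it.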

On to Chris's comments.

I think it may be helpful to borrow terminology (and the data model) from the GIS world here. In this case, I am referencing the geoJSON spec, as I happen to be working with that at the moment, but the basic data model is pretty consistent.

Agreed. We want to, at the very least, maintain consistency with geometry types (not case sensitive). The "break values" are a poor man's parentheses corresponding to their uses in arrays used by GeoJSON and WKT.
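The correspondence between break values and parentheses can be seen by writing a decoded multipolygon back out as WKT. This is a minimal sketch: `ring_wkt` and `multipolygon_wkt` are hypothetical helpers, the coordinates are a subset of the example above, and each ring is closed by repeating its first node because WKT expects closed rings.

```python
def ring_wkt(ring):
    """Format one ring, repeating the first node if it is not closed."""
    closed = ring if ring[0] == ring[-1] else ring + [ring[0]]
    return "(" + ", ".join(f"{x} {y}" for x, y in closed) + ")"

def multipolygon_wkt(polygons):
    """polygons: list of polygons, each a list of rings of (x, y) tuples."""
    body = ", ".join(
        "(" + ", ".join(ring_wkt(ring) for ring in rings) + ")"
        for rings in polygons
    )
    return f"MULTIPOLYGON ({body})"

polygons = [
    [[(0, 0), (20, 0), (20, 20), (0, 20)],   # exterior ring
     [(1, 1), (10, 5), (19, 1)]],            # hole (follows a -2 break)
    [[(5, 25), (9, 25), (7, 29)]],           # next polygon (follows a -1 break)
]
wkt = multipolygon_wkt(polygons)
```

Each hole break maps to the comma between rings inside one polygon, and each multipart break maps to the comma between polygons.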

Note that they have "geometries" which can be things like points, polygons, polylines. IIUC (and I'm no osgeo maven) geometries represent a "single" entity. Then there are "Features": a Feature is essentially data associated with a particular geometry. But note: there are "Collections" - both Geometry and Feature Collections - that is what you use to "bundle" various data together.

I think we may be well served by thinking in terms of mapping the GIS data model to CF/netcdf - for instance it would be great to be able to write a netcdf<->geoJSON converter that was lossless, AND would be fairly "native" in both cases.

Agree in principle. In practice, this proves difficult of course. :-) A FeatureCollection may contain features with different geometry types. We would need to add an additional dimension variable describing the geometry type. A GeometryCollection itself may nest inside a FeatureCollection. We think of simple geometry variables as FeatureCollections with a static geometry type. NetCDF groups may help provide a crosswalk, but I am not sure we are ready to go there.
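As a rough sketch of the "FeatureCollection with a static geometry type" idea, the snippet below builds a GeoJSON-style dictionary in which every feature is a MultiPolygon carrying one data value. The helper `to_feature_collection`, the property name `streamflow`, and the value 12.5 are made up for illustration; the coordinates are the third geometry of the example above with each ring closed.

```python
import json

def to_feature_collection(geometries, values, name):
    """Pair each MultiPolygon coordinate array with one data value."""
    features = []
    for coords, value in zip(geometries, values):
        features.append({
            "type": "Feature",
            "geometry": {"type": "MultiPolygon", "coordinates": coords},
            "properties": {name: value},
        })
    return {"type": "FeatureCollection", "features": features}

# One geometry made of two polygons, each with a single closed ring.
geometries = [
    [
        [[[30, 20], [45, 40], [10, 40], [30, 20]]],
        [[[25, 5], [50, 10], [30, 15], [25, 5]]],
    ],
]
fc = to_feature_collection(geometries, [12.5], "streamflow")
text = json.dumps(fc)
```

A time-varying variable would need one such properties value per time step, which is where the repetition discussed in the next paragraph comes from.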

GeoJSON also tends to break down with time series - repeating geometry coordinates ad nauseam for each time coordinate. TopoJSON offers some solutions in this regard. TopoJSON also incorporates node sharing. Chris did link to this JSON spec already.

Also, how do NetCDF attributes, dimensions, and data types fit into GeoJSON? No good answer for this one.

What's important is that we establish a way to encode simple/basic geometry types. Collections can be created using indexing or hierarchies. The basic static geometry-typed FeatureCollection should be sufficient for most applications.

(though I'm still confused, maybe you can have an "array" of data associated with a GeometryCollection?)

as for MultiLineString - you could associate an array of data with the MultiLineString - so one value per segment. But I think that violates the intent of the data model - you should have a GeometryCollection of LineStrings instead, and then each segment has its own geometry and you can associate an array of data with that. (Or should it be a FeatureCollection? I'm getting confused now!)

Yup, makes your head hurt. No good answer for this. This leads back to the "ragged array of ragged arrays" comment by Jonathan earlier. It is likely beyond the scope of our proposal. The solution may lie in turning geometry variable attributes and names into arrays themselves, adding another layer of indexing in the process. We hope that the simple geometry encoding methods could be reused in an effort like this.

Of course, CF doesn't need to follow this data model, but it's a good idea to be informed by it.

Yes, absolutely. Your descriptions of geometry types and collections are correct. GeometryCollections can also be used for the *same* types of geometries - a minor point.

In the GIS data model, nodes are not shared between geometries, and you are quite right that keeping nodes separate with geometries indexing into it is an added complication and would not be space-efficient.

However, there is another reason to do it - it makes it definitive that two (or more) geometries share the exact same node, rather than them being distinct points that happened to be at the same location (or worse, with FP error and all, two points that are very close).

Yes, this is why we are for the coordinate indexing approach used in UGRID.
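The difference between sharing a node by index and merely coinciding in space can be sketched as follows. The coordinates and the two squares are made up for illustration.

```python
# Shared coordinate vectors; geometries refer to nodes by index.
x = [0.0, 1.0, 1.0, 0.0, 2.0, 2.0]
y = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]

square_a = [0, 1, 2, 3]   # left square
square_b = [1, 4, 5, 2]   # right square, reusing nodes 1 and 2

# With indexing, shared nodes are an exact set intersection.
shared = sorted(set(square_a) & set(square_b))

# Without indexing, sharing must be inferred by comparing coordinates,
# which floating-point rounding can defeat.
p = (0.1 + 0.2, 0.0)
q = (0.3, 0.0)
same_location = p == q    # False: 0.1 + 0.2 is not exactly 0.3
```

With an index, the shared edge between the two squares is a fact recorded in the file rather than something a reader has to guess with a tolerance.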

This is actually a major limitation in the standard GIS model.

Also yes!

Received on Wed Dec 21 2016 - 11:14:04 GMT

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:42 BST