
[CF-metadata] Feedback requested on proposed CF Simple Geometries

From: Ben Koziol - NOAA Affiliate <ben.koziol>
Date: Tue, 27 Sep 2016 11:28:36 -0600

Jonathan and CF-Metadata List,

Thanks for the suggestions and discussion. We've attempted to respond to
the major questions and concerns using Jonathan's mail as a template.
Apologies in advance if we missed anything outstanding or did not
appropriately acknowledge contributions in this thread.

> You explain that the need is to specify spatial coordinates with a simple
> geometry for a timeSeries variable. For example, this could be for the
> discharge as a function of time across some line in a river (your example),
> or I suppose it could be an average temperature as a function of time for
> the Atlantic Ocean, where you wanted to supply the polygon which drew the
> outline of the basin. Have I got the idea?


Yes, you have this mostly right. It's common to have a collection of points
(weather stations), lines (stream reaches), or polygons (hydrologic
catchments) with an associated time series.

> Timeseries like this can be stored in CF, but their geographical extent is
> usually described only in words e.g. a region name of atlantic_ocean, and
> this is fine for applications like CMIP where you want to compare data from
> different data sources in which the Atlantic Ocean may have different exact
> shapes (different AOGCMs, in particular). An array of region names is also
> possible, so I don't think we need a new convention to contain your dwarf
> planet example.


The dwarf planet example is intended to describe our generalized approach
to continuous ragged arrays that may be used for arbitrarily-sized data
arrays. For some (including me), using a string instead of a numeric
example helps illustrate the concept. It is an idiosyncratic example in
many ways. Sorry for the confusion.

> Sect 9.1 on discrete sampling geometries says it cannot yet be used for
> cases "where geo-positioning cannot be described as a discrete point
> location. Problematic examples include time series that refer to a
> geographical region (e.g. the northern hemisphere) ...". Actually I think
> that's not quite right. The existing convention *can* describe regions
> which are contiguous, and rectangular or polygonal, using its usual bounds
> convention (Sect 7.1). I think we should consider changing this text,
> because it seems unnecessarily restrictive.


Your explanation makes sense, and this should be captured in the DSG
convention text.


> If the regions were irregular polygons in latitude and longitude, nv would
> be the number of vertices and the lat and lon bounds would trace the
> outline of the polygon e.g. nv=3, lat=0,90,0 and lon=0,0,90 describes the
> eighth of the sphere which is bounded by the meridians at 0E and 90E and
> the Equator. I think, therefore, we do not need an additional convention
> for points or polygonal regions.


Many earth science datasets (excluding triangular, hexagonal, etc. meshes)
representable as polygons and lines have differing node counts. "nv" could
not efficiently capture watershed A with 5 nodes and watershed B with 100.
Additionally, the cell bounds concept does not include the structure and
semantics needed to support MultiLines, MultiPolygons, or polygons with
holes/interiors.
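
To put rough numbers on the node-count point, here is a minimal sketch. The
two node counts come from the watershed example above; the variable names
and the "one count or stop value per geometry" layout are only assumptions
for illustration, not part of the proposal text.

```python
# Node counts for the two watersheds mentioned above (A: 5 nodes, B: 100 nodes).
node_counts = [5, 100]

# Fixed "nv" bounds: every cell is padded out to the largest node count.
nv = max(node_counts)
padded_slots = nv * len(node_counts)

# Ragged storage (assumed layout): only the real nodes are stored, plus one
# count (or stop) value per geometry.
ragged_slots = sum(node_counts) + len(node_counts)

# Slots in the padded array that hold no real node.
wasted_slots = padded_slots - sum(node_counts)

print(padded_slots, ragged_slots, wasted_slots)
```

The gap grows with the spread in node counts, and it applies to each
coordinate variable (x and y, or lon and lat) separately.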

> However, we would need new conventions for a timeseries where each value
> applies to a set of discontiguous regions or regions with holes in them, a
> set of points, a line or a set of lines. I guess that these are included in
> the geometry types you list (LineString, Multipoint, MultiLineString, and
> MultiPolygon).


Yes.

> Do you have definite use-cases for all of these? (I ask this because we
> don't add new functionality to CF until there is a definite and common need
> for it in practice.)


David Arctur described the primary motivation for developing the simple
geometries approach: "Among other applications, NetCDF-CF is now being used
as an intermediate & output data format in the US National Weather
Service's National Water Model (NWM). This forecasts streamflow rates in
about 2.7 million stream segments averaging 2km, throughout the continental
US, at multiple time horizons (3 hr, 18 hr, 10 days) every hour, and an
ensemble for 30-day forecast less frequently." These data also contain
multi-geometries primarily in the form of MultiLineStrings and
MultiPolygons.

To this we would add that working with GIS datasets of this magnitude is
difficult with current NetCDF metadata conventions, often yielding an
unwieldy hybrid of NetCDF data and other software such as ESRI ArcGIS and
PostGIS. Neither ESRI ArcGIS nor PostGIS is usable on many of the HPC
platforms where models like the NWM reside.

> I suspect that geometries of this kind can be described by the ugrid
> convention http://ugrid-conventions.github.io/ugrid-conventions, which is
> compliant with CF. Their purpose is to describe a set of connected points,
> edges or faces at which values are given, whereas in your case you'd give a
> single value for the whole set, but the description of the geometry itself
> might be similar. Have you had a look at whether ugrid could meet your
> needs? If it almost does so, perhaps a better thing to do would be to
> propose additions to ugrid. We would like to avoid having more than one way
> to describe such geometries.


Bert Jagers and Chris Barker have already commented on this. It is
important to note that UGRID is the *primary* inspiration behind this
proposed approach. That should have been mentioned in the original mail.
The genesis of this work was with full knowledge of UGRID.

This proposed CF addition is meant to align more closely with the community
standards behind the GIS feature types used by the OGC community. To
accommodate the feature types described by this proposal, UGRID would need
to incorporate:

   1. Ragged arrays for coordinate index vectors.
   2. Encoding method for multi-geometries.
   3. Support for point geometries.

The simple features proposal does not expect node sharing amongst
adjacent/contiguous elements and, in all fairness, this is not a
requirement of UGRID but rather a recommendation. The simple features
approach does inherit from UGRID as Bert indicated in that it is possible
to implement node sharing via coordinate index indirection.
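
As a rough illustration of node sharing through index indirection, here is a
small sketch. The two adjacent squares, the array names, and the per-polygon
index lists are all made up for this mail and are not the proposed encoding
itself.

```python
# Two unit squares that share an edge. Each distinct node is stored once.
x = [0.0, 1.0, 1.0, 0.0, 2.0, 2.0]
y = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]

# Each polygon is a list of indices into x/y (hypothetical layout).
polygons = [
    [0, 1, 2, 3],  # left square
    [1, 4, 5, 2],  # right square, reusing nodes 1 and 2
]


def resolve(index_list):
    """Turn a list of node indices into (x, y) coordinate pairs."""
    return [(x[i], y[i]) for i in index_list]


left = resolve(polygons[0])
right = resolve(polygons[1])

# Nodes that both polygons reference through the index arrays.
shared = set(left) & set(right)
print(sorted(shared))
```

Without the indirection the two squares would need eight stored nodes; with
it they need six. Whether that saving justifies the extra index array depends
on how much adjacency the dataset actually has.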

We agree with David Arctur that the simple features approach is easier to
implement than UGRID. No offense intended to UGRID which is a powerful
convention indeed.

It really is up to the community if they would rather see simple features
represented in an amended CF-compatible UGRID or an addition to CF. We are
of the opinion that a simple features specification would be very useful.

> So far CF does not say anything about the use of netCDF-4 features (i.e.
> not the classic model). We have often discussed allowing them but the
> general argument is also made that there has to be a compelling case for
> providing a new way to do something which can already be done. (Steve
> Hankin often made this argument, but since he's mostly retired I'll make it
> now in his name :-) If there are two ways to do something, software has to
> support both of them. We already have ways to encode ragged arrays, so is
> there a compelling case for needing the netCDF-4 vlen array as well? We
> already have a way to encode strings too, as character arrays. I think this
> is probably a discussion we should have again in a different thread, so
> I'll just talk about your classic encoding. The same points apply to both
> encodings.


Yes, let's leave that conversation for another time. We mostly want to be
forward compatible, on the understanding that vlen provides a simpler and,
some would say, more elegant way of handling ragged array data.

> Your approach uses a coordinate_index variable to identify indices of
> geometry coordinates where the -1 and -2 indices indicates where exterior
> and interior polygons begin, and the first polygon has an implied -1 at the
> start. Is that right? Given this example, I wonder why you need the index
> array, because none of the coordinates indices (values >=0) is repeated, so
> no space is saved in the x and y arrays. I guess this would be the usual
> case. If polygons did touch or lines crossed, a few points would be in
> common, but not so many that seems to need the complication of the index
> array. A simpler way to do it would be ... which needs only one dimension,
> or you could use the CF ragged array convention (Sect 9.3.3)...


Our example may not be complete enough to fully demonstrate the use case we
are trying to describe. The example given, inspired by the DSG Continuous
Ragged Array encoding, uses a 'stop' variable rather than a 'count'
variable. It may not be apparent that each 'simple feature' may actually be
multiple polygons (with or without hole polygons) or lines. Regarding the
'outside_inside' example you provided, we should show an example where the
geometry count (dimension) is more than 1 and a geometry has multiple
polygons prior to the 'stop' coordinate. The word encoding example was
meant to convey this, but may not have been sufficient. Here is an example
with three geometries:
https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example
.
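
For readers who prefer code, the following is a minimal sketch of how a
reader might unpack such an encoding. It is not the wiki example. It assumes
that 'stop' holds the exclusive end position of each geometry in the flat
coordinate_index array, that -1 starts a new polygon within a geometry (with
an implied -1 at the start), and that -2 starts a hole in the current
polygon. The function name and the sample values are hypothetical.

```python
def decode_geometries(coordinate_index, stop, multipart_break=-1, hole_break=-2):
    """Split a flat index array into geometries, polygons, and rings.

    Assumed layout: stop[i] is the exclusive end position of geometry i.
    """
    geometries = []
    start = 0
    for end in stop:
        # The first polygon of each geometry has an implied multipart break.
        polygons = [{"exterior": [], "holes": []}]
        ring = polygons[-1]["exterior"]
        for value in coordinate_index[start:end]:
            if value == multipart_break:
                # A new polygon begins inside the same geometry.
                polygons.append({"exterior": [], "holes": []})
                ring = polygons[-1]["exterior"]
            elif value == hole_break:
                # A hole (interior ring) begins in the current polygon.
                polygons[-1]["holes"].append([])
                ring = polygons[-1]["holes"][-1]
            else:
                ring.append(value)
        geometries.append(polygons)
        start = end
    return geometries


# Geometry 0: a polygon with one hole, plus a second polygon.
# Geometry 1: a single polygon.
coordinate_index = [0, 1, 2, -2, 3, 4, 5, -1, 6, 7, 8, 9, 10, 11]
stop = [11, 14]
geoms = decode_geometries(coordinate_index, stop)
print(geoms)
```

The point of the sketch is that one flat array plus one 'stop' value per
geometry is enough to recover multi-part polygons and their holes.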

We are hesitant to add an additional integer variable to indicate
'inside_outside' as it will introduce (in our minds) extraneous, duplicated
value variables. Why repeat -1 5,000 times when introducing a -1 at
multi-geometry breaks accomplishes the same task? One could also argue for
additional variables containing multi-geometry breaks, but again, this is
extraneous. As an example, using break values and ragged arrays similar to
what we describe, the 2.7 million catchment dataset mentioned by Dave
Arctur (which contains MultiPolygons) results in a ~10 GB uncompressed,
netCDF-4 file. Adding 'inside_outside' variables to describe the breaks
and/or holes will make this file larger. We could reduce the file size by
removing repeated nodes via the coordinate indexing method.
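
A back-of-the-envelope sketch of the overhead argument, assuming one flag
value per node for an 'inside_outside' style variable versus one break value
per extra part or hole. The function names and the part and hole counts are
hypothetical; only the 5,000 figure comes from the paragraph above.

```python
def flag_variable_slots(node_count):
    """Extra integers needed when every node carries its own flag value."""
    return node_count


def break_value_slots(part_count, hole_count):
    """Extra integers needed with break values.

    Assumes the first part has an implied break, so only the remaining
    parts and each hole add one value.
    """
    return (part_count - 1) + hole_count


# A hypothetical geometry: 5,000 nodes, 3 parts, 1 hole.
per_node = flag_variable_slots(5000)
per_break = break_value_slots(3, 1)
print(per_node, per_break)
```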

> You provide the attributes multipart_break_value and hole_break_value to
> specify the values (-1 and -2 above) for the outside vs inside
> distinction. Do you need the generality of being able to choose these
> values? It would seem simpler to use a character array and specify in the
> convention which letters should be used e.g. ... That makes it more
> readable, perhaps.


Those values could be fixed. However, we would recommend they always be
attached to the variable as attributes. We also tend to think of them as
fill values, which are customizable in CF. As for the character array, it
again seems like a lot of repetition.

> Similarly, you propose attributes for clockwise/anticlockwise node order
> and for the polygon closure convention. Do these need to be freely
> choosable? You could specify clockwise, like the existing CF bounds
> convention, and that the polygons are closed. In the latter case, you could
> omit the last vertex of each polygon since it must be the same as the
> first, and that would save a bit of space. If you specify these choices,
> the attributes aren't needed.


These attributes would be considered optional. If node ordering is
controlled in a dataset, that may be stated on the polygon variables. Many
GIS software packages don't care about ordering and repeated nodes, but
modelling and regridding codes tend to be more picky.
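
For codes that do care, node order and closure are cheap to check on read.
The sketch below uses the standard shoelace formula and assumes planar x/y
coordinates (it is not meant for rings on a sphere). The function names are
hypothetical and nothing here is part of the proposal.

```python
def signed_area(ring):
    """Shoelace formula: positive for anticlockwise rings, negative for clockwise."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def is_clockwise(ring):
    """True when the ring's nodes run clockwise in a right-handed x/y plane."""
    return signed_area(ring) < 0


def is_closed(ring):
    """True when the last node repeats the first one."""
    return len(ring) > 1 and ring[0] == ring[-1]


# A closed, anticlockwise unit square and its clockwise reversal.
square_ccw = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
square_cw = square_ccw[::-1]
print(signed_area(square_ccw), is_clockwise(square_cw), is_closed(square_ccw))
```

A reader can use checks like these to normalize rings when the optional
attributes are absent.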

> If this convention is going to be used for discrete sampling geometries, an
> additional dimension is needed, because in a single data variable you might
> have data for several of these geometries. That is, you need an array of
> ragged arrays. Again, I wonder whether this suggests trying to use ugrid.
> It might be you could name each one as a mesh, and specify the geometry of
> for the set of timeSeries as an array of mesh names. That would be a very
> easy change to the existing Sect 9.


The multiple geometry example may help:
https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example
.

Again, thanks all for the feedback and looking forward to continued
discussion.

-- 
Ben Koziol
NESII/CIRES/NOAA Earth System Research Laboratory
ben.koziol at noaa.gov
802.392.4522
http://www.esrl.noaa.gov/nesii/
Received on Tue Sep 27 2016 - 11:28:36 BST

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:42 BST
