#68 new enhancement

CF data model and reference implementation in Python

Reported by:	jonathan	Owned by:	cf-conventions@…
Priority:	medium	Milestone:
Component:	cf-conventions	Version:
Keywords:		Cc:	d.c.hassell@…, markh

Description

In this ticket we do not propose any change to the CF standard. This ticket concerns an abstract model for data and metadata corresponding to the existing standard (version 1.5). As a netCDF convention, up to now the CF standard has not included a data model. However, the design of CF implies a data model to some extent. Following the discussion at the GO-ESSP meeting, we now propose that the data model should be made explicit, as an independent element of CF, separate from the CF standards and conformance documents, to be updated for successive CF versions in line with those documents. We consider that defining an explicit data model will contribute to the CF goal to help in "building applications with powerful extraction, regridding, and display capabilities."

We have drafted a document that describes the data model, with an associated UML diagram to illustrate it. The description follows from the one discussed earlier this year on the CF email list, which pointed out the need for a diagram. The proposed data model avoids prescribing more than is needed for interpreting CF as it stands, in order to avoid inconsistency with future developments of CF.

The document describes both the proposed CF data model and how it is implemented in netCDF. These are distinct purposes. The same data model could be implemented in other non-netCDF-like file formats, and that would require the description of the model and implementation to be separated. We have not done that in this version of the document because we think that it would make it harder to understand at this stage.

Following discussions on the email list and at GO-ESSP, we are aware that this attempt to describe the CF data model overlaps with other work on data models, especially the Unidata CDM. It will be useful to discuss the relationship between these. The proposed CF data model corresponds less closely to netCDF storage concepts than the CDM does, and in that sense it is more abstract.

We have also developed a minimal implementation of the data model in Python, including documentation. (Note: since putting this up, we've discovered that there is an existing package called cfpython, which is not related to CF. That is confusing, so we might have to change the name of ours.) The software reads and writes CF-netCDF files, and contains the data and metadata in memory in objects called spaces in a way which is consistent with the data model. It is possible to select a subset of the spaces according to their properties, to extract subspaces by specifying ranges of coordinates or indices along the dimensions, and to modify the metadata. We describe this implementation as "minimal" because it doesn't provide any processing or graphical functions, and it doesn't extend to the level of the scientific feature types of the CDM, for instance. This software might be useful:

To illustrate the data model.

As a reference implementation of CF. In all versions of CF so far published, all changes have been introduced and are marked as provisional, because of the requirement of two demonstrated implementations of new features before changes are accepted as permanent. This software could provide one implementation. The CF checker could be another. (The cf-python software attempts to interpret netCDF files by reference to the CF convention, but does not require or check complete compliance.)

As a basis for data processing and graphical software in Python based on CF concepts. For this purpose, the API is the essence. Other code could be written which offered the same API to the CF data model. All Python code using the same API to the minimal CF data model would be interoperable at that level.
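As a rough illustration of the capabilities described above (selecting spaces by their properties, extracting a subspace by a coordinate range, and modifying metadata), here is a minimal sketch in plain Python. It is not the cf-python API: the `Space` class, the `select` function and the `subspace` method are hypothetical names invented for this example, the data values are made up, and only one dimension is handled.

```python
from dataclasses import dataclass, field


@dataclass
class Space:
    """Hypothetical stand-in for a 'space': 1-D data plus metadata."""
    properties: dict                 # e.g. {"standard_name": "air_temperature"}
    coord_name: str                  # name of the single dimension coordinate
    coords: list                     # coordinate values along that dimension
    data: list = field(default_factory=list)

    def subspace(self, lo, hi):
        """Return a new Space keeping only points with lo <= coordinate <= hi."""
        keep = [i for i, c in enumerate(self.coords) if lo <= c <= hi]
        return Space(
            properties=dict(self.properties),   # copy, so edits stay local
            coord_name=self.coord_name,
            coords=[self.coords[i] for i in keep],
            data=[self.data[i] for i in keep],
        )


def select(spaces, **wanted):
    """Return the spaces whose properties match every keyword given."""
    return [
        s for s in spaces
        if all(s.properties.get(k) == v for k, v in wanted.items())
    ]


# Made-up example data on a shared latitude axis.
spaces = [
    Space({"standard_name": "air_temperature", "units": "K"},
          "latitude", [-30.0, 0.0, 30.0, 60.0],
          [295.0, 300.0, 290.0, 275.0]),
    Space({"standard_name": "air_pressure", "units": "Pa"},
          "latitude", [-30.0, 0.0, 30.0, 60.0],
          [101000.0, 101300.0, 101500.0, 100900.0]),
]

temperature = select(spaces, standard_name="air_temperature")[0]
tropics = temperature.subspace(-30.0, 30.0)                    # by coordinate range
tropics.properties["long_name"] = "tropical air temperature"  # modify metadata
```

Because `subspace` copies the properties, changing the metadata of the extracted subspace leaves the original space untouched; whether a real implementation should copy or share metadata is a design choice this sketch does not settle.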

To be clear, we are not proposing this Python code as an element of CF. It could be useful to people dealing with CF-netCDF data, but this proposal is really about the CF data model, which we are proposing as an element of CF. The first of the above points is therefore the most important to this proposal.

We hope that people will consider this proposal. We will welcome comments on this ticket on both the data model and the Python API. (However, comments on the code itself would probably be better made by email, unless they are matters of principle.)

We are grateful to Bryan Lawrence and Dominic Lowe for very thoughtful discussions.

Jonathan Gregory (j.m.gregory at reading.ac.uk)
David Hassell (d.c.hassell at reading.ac.uk)

Attachments (3)

cfdm_KET.pdf (146.3 KB) - added by taylor13 9 years ago.
newCF_0.7.pdf (13.8 KB) - added by davidhassell 8 years ago.: UML diagram of version 0.7 of the proposed CF data model
cfdm_0.7.html (19.8 KB) - added by davidhassell 8 years ago.: version 0.7 of the proposed CF data model

Change History (164)

comment:1 follow-up: ↓ 2 Changed 10 years ago by ngalbraith

One quick comment about the description of attributes:

"Other properties, which are metadata that do not refer to the dimensions, and serve to describe the data the space contains. Properties may be of any data type (numeric, character or string) and can be scalars or arrays. They are attributes in the netCDF file, but we use the term "property" instead because not all CF-netCDF attributes are properties in this sense."

The spec recommends that information that's multidimensional should be stored as a variable, not an attribute; I think the statement "not all CF-netCDF attributes are properties in this sense" could be amended to say "not all CF-netCDF attributes are properties in this sense, and not all such properties are stored as attributes in CF-netCDF."

comment:2 in reply to: ↑ 1 ; follow-up: ↓ 4 Changed 10 years ago by jonathan

Dear Nan

"Other properties, which are metadata that do not refer to the dimensions, and serve to describe the data the space contains. Properties may be of any data type (numeric, character or string) and can be scalars or arrays. They are attributes in the netCDF file, but we use the term "property" instead because not all CF-netCDF attributes are properties in this sense."

The spec recommends that information that's multidimensional should be stored as a variable, not an attribute; I think the statement "not all CF-netCDF attributes are properties in this sense" could be amended to say "not all CF-netCDF attributes are properties in this sense, and not all such properties are stored as attributes in CF-netCDF."

I understand your point, but I can't think of any CF property to which it would apply. I think that all the CF metadata which is not stored in netCDF attributes is described under some other heading of the data model e.g. coordinates and transforms. The "properties" are just the ones which are left, and I think they're all CF-netCDF attributes, unless we've missed something?

Cheers

Jonathan

comment:3 follow-up: ↓ 6 Changed 10 years ago by pbentley

Jonathan, David,

I wonder if there is an opportunity here to unify, where appropriate, some of the terminology with that used in similar geospatial conceptual models (e.g. CDM, OGC). The draft document uses a handful of new terms which do not seem to be part of the CF vernacular - at least not the one with which I'm familiar from e.g. the CF mailing list. (To be fair, you do state in the opening paragraph that you are describing an abstract data model for CF because one currently does not exist. So to that extent the use of some new terms is perhaps unavoidable.)

An example is the concept that you call dimension coordinate. This would seem to equate roughly with the well-known CF-netCDF concept (or construct) called coordinate variable, and indeed the latter term is used in the definition of dimension coordinates. Would using the existing concept name (or else CoordinateAxis, the CDM analogue) make more sense here, or are they envisaged as distinct concepts/entities?

Similarly, under the description of a space construct you say: "The data array would be missing if the space construct serves only to define a coordinate system, which we call a grid". In which case, why not refer to this concept as just that, a coordinate system rather than as a grid? IMO, coordinate system is the correct term here. It's also the term used in the CDM. As a sub-species of coordinate system, the term 'grid' has, I think, more specific connotations in the eyes/minds of the majority of spatial data users.

Lastly, I wonder if it would be helpful to readers if each concept/construct was illustrated with CDL examples, in much the same way as is done for the current CF specification. If you felt that this would interrupt the flow of the document then one might consider i) inserting hyperlinks to examples, or ii) producing a separate "Illustrated CF Data Model" which included inline examples.

Regards, Phil

comment:4 in reply to: ↑ 2 ; follow-up: ↓ 5 Changed 10 years ago by ngalbraith

Replying to jonathan:

The spec recommends that information that's multidimensional should be stored as a variable, not an attribute; I think the statement "not all CF-netCDF attributes are properties in this sense" could be amended to say "not all CF-netCDF attributes are properties in this sense, and not all such properties are stored as attributes in CF-netCDF."

I understand your point, but I can't think of any CF property to which it would apply. I think that all the CF metadata which is not stored in netCDF attributes is described under some other heading of the data model e.g. coordinates and transforms. The "properties" are just the ones which are left, and I think they're all CF-netCDF attributes, unless we've missed something?

We use ancillary variables to describe properties like accuracy & provenance; these can't be stored as attributes because they're two dimensional. Maybe this kind of property doesn't belong in the description of the data model, but it should also not be ruled out by the document.

Maybe I'm not interpreting the phrase "properties, which are metadata that do not refer to the dimensions" correctly?

comment:5 in reply to: ↑ 4 Changed 10 years ago by jonathan

Replying to ngalbraith:

Thanks for the example:

We use ancillary variables to describe properties like accuracy & provenance; these can't be stored as attributes because they're two dimensional. Maybe this kind of property doesn't belong in the description of the data model, but it should also not be ruled out by the document.

Yes, this is allowed for in the data model, but they're not "properties" in the terms of this document. They are mentioned in the section about the space construct. One of the things it can contain is

A list of ancillary spaces. This corresponds to the CF-netCDF ancillary_variables attribute, which identifies other spaces that provide metadata.

That is, they are different spaces (data variables) altogether, but the relationship between the spaces is noted in this way. Is that OK?
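The relationship described above can be sketched as follows. This is only an illustration under assumptions: the `Space` class and its `ancillary_spaces` field are hypothetical names chosen for this example (they are not taken from the draft document or from cf-python), and the numbers are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Space:
    """Hypothetical space construct: data, properties, links to other spaces."""
    name: str
    data: list
    properties: dict = field(default_factory=dict)
    ancillary_spaces: list = field(default_factory=list)  # other Space objects


# A two-dimensional accuracy estimate cannot be held as an attribute,
# so it is a space (data variable) in its own right.
accuracy = Space(
    name="temp_accuracy",
    data=[[0.1, 0.2], [0.1, 0.3]],
    properties={"long_name": "accuracy of temperature"},
)

temperature = Space(
    name="temp",
    data=[[280.0, 281.5], [279.0, 282.0]],
    properties={"standard_name": "sea_water_temperature", "units": "K"},
    ancillary_spaces=[accuracy],   # the relationship is recorded here
)
```

The accuracy values are reached through `temperature.ancillary_spaces`, not through `temperature.properties`, which is the distinction drawn in the reply.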

Cheers

Jonathan

comment:6 in reply to: ↑ 3 Changed 10 years ago by jonathan

Replying to pbentley:

Thanks for your comment. I agree, we should avoid unnecessary differences of terminology! In some cases we deliberately did not use the same terminology as CF-netCDF, in order to avoid confusion, but it might be that this is not a sufficient reason. Specifically:

An example is the concept that you call dimension coordinate. This would seem to equate roughly with the well-known CF-netCDF concept (or construct) called coordinate variable, and indeed the latter term is used in the definition of dimension coordinates. Would using the existing concept name (or else CoordinateAxis, the CDM analogue) make more sense here, or are they envisaged as distinct concepts/entities?

We did not call it just a coordinate construct because of the perennial difficulty in CF-netCDF that coordinate variable (a Unidata term) and auxiliary coordinate variable (a CF term) are distinct concepts. An auxiliary coordinate variable is not a special kind of coordinate variable, in CF-netCDF terms. That is confusing, I think. To avoid it, we contrast the dimension coordinate construct with the auxiliary coordinate construct. In this CF data model, they are two different kinds of coordinate construct. They have a lot in common - more than they do in CF-netCDF in fact. Instead of the word dimension we could have used axis, but that is confusing too, now that we have decided that the CF-netCDF axis attribute can be applied to auxiliary coordinate variables!
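One way to picture "two different kinds of coordinate construct" with a lot in common is a shared base type with two siblings. The sketch below is only an illustration: the class and field names are hypothetical, not taken from the draft document, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass
class Coordinate:
    """Hypothetical base: what both kinds of coordinate construct share."""
    name: str
    values: list
    units: str = ""


@dataclass
class DimensionCoordinate(Coordinate):
    """Tied to exactly one dimension of the space."""
    dimension: str = ""


@dataclass
class AuxiliaryCoordinate(Coordinate):
    """May span one or more dimensions of the space."""
    dimensions: tuple = ()


# A 1-D coordinate along dimension "y" ...
y = DimensionCoordinate("y", [0.0, 1000.0], "m", dimension="y")

# ... and a 2-D latitude that depends on both "y" and "x".
lat = AuxiliaryCoordinate(
    "latitude",
    [[50.0, 50.1], [51.0, 51.1]],
    "degrees_north",
    dimensions=("y", "x"),
)
```

Here neither kind is a special case of the other, which mirrors the point that an auxiliary coordinate is not a special kind of dimension coordinate.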

under the description of a space construct you say: "The data array would be missing if the space construct serves only to define a coordinate system, which we call a grid". In which case, why not refer to this concept as just that, a coordinate system rather than as a grid? IMO, coordinate system is the correct term here. It's also the term used in the CDM. As a sub-species of coordinate system, the term 'grid' has, I think, more specific connotations in the eyes/minds of the majority of spatial data users.

I think a coordinate system refers to, for instance, lat-lon, or a Mercator projection, or vertical hybrid-pressure. In the space, there might be several such coordinate systems. We need a term that refers to a collection of them, along with transforms and cell measures, that is all the things which describe the space itself, but not the data in it. I agree, grid is not a great term for this. Suggestions welcomed.
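To make the idea concrete, the collection in question could be pictured as a container that a space refers to, with the data array being optional. This is a sketch under assumptions: `Grid` is used only because it is the draft's provisional term, the field names are hypothetical, and the coordinate, radius and area values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Grid:
    """Hypothetical container for everything that describes the space itself."""
    coordinates: dict = field(default_factory=dict)    # name -> values
    transforms: dict = field(default_factory=dict)     # name -> parameters
    cell_measures: dict = field(default_factory=dict)  # e.g. "area" -> values


@dataclass
class Space:
    grid: Grid
    data: Optional[list] = None        # None when only the grid is defined
    properties: dict = field(default_factory=dict)


grid = Grid(
    coordinates={"lat": [0.0, 10.0], "lon": [0.0, 20.0, 40.0]},
    transforms={"latitude_longitude": {"earth_radius": 6371229.0}},
    cell_measures={"area": [[1.0, 1.0, 1.0], [0.9, 0.9, 0.9]]},
)

grid_only = Space(grid=grid)           # describes the space, holds no data
temperature = Space(
    grid=grid,
    data=[[280.0, 281.0, 282.0], [279.0, 280.0, 281.0]],
    properties={"standard_name": "air_temperature", "units": "K"},
)
```

Several coordinate systems can sit side by side in `coordinates` and `transforms`, which is why a single "coordinate system" does not quite name the whole collection.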

Lastly, I wonder if it would be helpful to readers if each concept/construct was illustrated with CDL examples, in much the same way as is done for the current CF specification. If you felt that this would interrupt the flow of the document then one might consider i) inserting hyperlinks to examples, or ii) producing a separate "Illustrated CF Data Model" which included inline examples.

The implementation in netCDF is very complicated. That might make the data model description approach the length of the current CF-netCDF standard. But maybe it would help if we linked the relevant sections of the standard? We did not want to put much about netCDF specifically in this document, since the idea is to try to describe the model independently of its implementation. The reason for that is to help thinking about how to hold it in memory, or in other file formats.

Cheers

Jonathan

comment:7 follow-up: ↓ 8 Changed 10 years ago by caron

Hi Jonathan, David:

I think it is very important to have another independent implementation of CF, and Python is an obvious choice. So I'm hoping that the Python programmers will look at your API carefully and start using it etc. We need to try out our ideas in code before we standardize them, and this seems like a great advance in that regard.

I will try to understand and comment on the data model component, and add some comparisons to the CDM.

Thanks!

John

comment:8 in reply to: ↑ 7 Changed 10 years ago by jonathan

Dear John

Replying to caron:

I will try to understand and comment on the data model component, and add some comparisons to the CDM.

That would be great! Thanks very much

Jonathan

comment:9 in reply to: ↑ description Changed 9 years ago by davidhassell

Replying to jonathan:

We thought that it may be useful to compare the proposed CF data model with the OGC CF-netCDF data model being written up by Ben Domenico and Stefano Nativi. This is definitely not meant to be a beauty contest (the two models have different purposes, anyway), rather to help a discussion on the usefulness of an abstract, core CF data model and whether or not what we have proposed is correct.

The draft CF data model is as described in Trac ticket #68, except we have an amended UML diagram with a more correct positioning of the Transform construct. The text description has not needed to be changed.

The OGC CF-netCDF data model is described in OGC's CF-netCDF Data Model extension specification (version 2.1.1). The relevant UML diagrams are figures 3 and 4 on pages 13 and 14.

The OGC model includes CF version 1.6 extensions, whilst the proposed CF data model is describing strictly CF version 1.5, so items such as discrete sampling geometries are not considered here.

All the best,

David Hassell (d.c.hassell at reading.ac.uk)

A comparison of the draft CF model and the OGC CF-netCDF model

Overall, and unsurprisingly, the draft CF model and the OGC CF-netCDF model are largely similar.

An apparent difference is that the OGC CF-netCDF model specifically includes references to the common spatiotemporal coordinate systems (i.e. those whose axes are a subset of T, Z, Y and X), whereas the draft CF model is minimal in that it makes no reference to the nature of its components. However, the existence of spatiotemporal axes is not mandatory in either model so there is no real difference here.

The only fundamental areas in which the two models diverge are their treatment of cell measures, grid mapping and formula terms. These differences may be characterised by the OGC CF-netCDF model staying close to the netCDF representation and the draft CF model aiming to be an abstraction of the netCDF representation. These features are dealt with as follows:

Cell measure

In the draft CF model this belongs to the grid in exactly the same way as an auxiliary coordinate does. It is a property of the grid in the same manner as an auxiliary coordinate is.

In the OGC CF-netCDF model, a cell measure variable has, and depends on, a coordinate's boundary object and does not appear to be as strongly connected to the coordinate system.

Grid mapping and formula terms

In the draft CF model, these are grouped together as transforms which either specify parameters which define the coordinate system (such as the assumed radius of the earth); or which specify a function of existing coordinates from which an auxiliary coordinate can be computed (like a formula terms object). In both cases, the transform belongs to the space but also belongs to the relevant coordinates. In the latter case, the resulting 'virtual' auxiliary coordinate may, therefore, be brought into existence with a transform object in place of its data array. The expectation is that the data array would be created from the transform's formula on demand.
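The "virtual" auxiliary coordinate described above, whose data array is created from the transform's formula on demand, might be sketched as below. The class and method names are hypothetical and are not taken from the draft model or from cf-python. The formula used is the one given in Appendix D of the CF conventions for `atmosphere_ln_pressure_coordinate`, p(k) = p0 * exp(-lev(k)); the parameter values are invented.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transform:
    """Hypothetical transform: a named formula plus the terms it needs."""
    name: str
    terms: dict
    formula: Callable[[dict], list]


@dataclass
class AuxiliaryCoordinate:
    name: str
    array: Optional[list] = None            # stored values, if any
    transform: Optional[Transform] = None   # used when no array is stored

    def values(self):
        """Return stored values, or compute them from the transform on demand."""
        if self.array is None:
            if self.transform is None:
                raise ValueError("coordinate has neither data nor a transform")
            self.array = self.transform.formula(self.transform.terms)
        return self.array


def ln_pressure(terms):
    # CF Appendix D: p(k) = p0 * exp(-lev(k))
    return [terms["p0"] * math.exp(-lev) for lev in terms["lev"]]


transform = Transform(
    name="atmosphere_ln_pressure_coordinate",
    terms={"p0": 100000.0, "lev": [0.0, 0.5, 1.0]},   # made-up values
    formula=ln_pressure,
)

# No data array is supplied; the coordinate exists by virtue of its transform.
pressure = AuxiliaryCoordinate(name="air_pressure", transform=transform)
```

Calling `pressure.values()` computes the array the first time and reuses it afterwards. Caching the result is a choice made for this sketch; an implementation could equally recompute it on every call.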

The OGC CF-netCDF model retains the CF separation of horizontal and vertical transformations (grid mapping and formula terms respectively); and also their CF-netCDF place in the data model (contained respectively by the coordinate system and the vertical dimensionless coordinate).

Neither is incorrect, of course, but noting that grid mappings and formula terms are essentially the same type of object, and that they may belong to both the coordinate system and to a coordinate (or to coordinates) is an inference from CF which conforms with how these constructs are used in real life and hopefully adds to our understanding.

comment:10 Changed 9 years ago by davidhassell

Hello,

I'd like to announce that there is a new version of the CF data model's python reference implementation software - version 0.9.3 of cf-python.

There has been no change to the data model, rather this new version of the software is a more logical and correct view of the data model. All details are on the cf-python web site (linked above).

All the best,

David

comment:11 follow-up: ↓ 12 Changed 9 years ago by markh

I would like to see this ticket taken on to a formal conclusion: to facilitate this I offer myself as moderator for the ticket.

I think that it would be helpful to break down the ticket and consider some parts of what is being presented here in isolation. Firstly I would like to look at the terms of reference for the data model, what its scope is and how it relates to the CF specification.

To this end I have a proposal, which I have outlined on ticket #88

I feel it is important to agree these details before we try to address all of the fine detail of the model, as the scope and implications are likely to affect the nature of the feedback.

I suggest that this ticket is kept for discussing the detail of the model.

mark

comment:12 in reply to: ↑ 11 Changed 9 years ago by davidhassell

Cc d.c.hassell@… added

Replying to markh:

Hello Mark,

I would like to see this ticket taken on to a formal conclusion: to facilitate this I offer myself as moderator for the ticket.

Thank you - that will be very useful.

I think that it would be helpful to break down the ticket and consider some parts of what is being presented here in isolation. Firstly I would like to look at the terms of reference for the data model, what its scope is and how it relates to the CF specification.

To this end I have a proposal, which I have outlined on ticket #88

I feel it is important to agree these details before we try to address all of the fine detail of the model, as the scope and implications are likely to affect the nature of the feedback.

I think that this split is a good idea - if we can agree that there is a need for a CF data model, then we ought to find it easier (and be more motivated) on deciding what it should be, using the model proposed in this ticket as a starting point.

I suggest that this ticket is kept for discussing the detail of the model.

I agree.

All the best,

David

comment:13 follow-up: ↓ 14 Changed 9 years ago by ngalbraith

One more quick question! In the document, the statement "The data array would be missing if the field construct serves only to define a coordinate system, which we call a space" seems to me to limit the use of empty variables to the special case of defining a coordinate system.

Following the NODC NetCDF templates (more or less) I've been using an empty "container variable" to describe instruments; my instrument variable has ancillary variables (each with the depth dimension) containing serial numbers, manufacturers, model names etc.

Does this data model limit the use of empty variables, or am I reading the statement too literally?

Thanks - Nan

comment:14 in reply to: ↑ 13 ; follow-up: ↓ 23 Changed 9 years ago by davidhassell

Replying to ngalbraith:

Hello Nan,

Thanks for the question. I don't believe that it was our intention to limit the use of variables without data arrays.

It would, however, be very useful to see the CDL of your example, as I fear that I don't yet understand it - is the instrument variable 'data' or 'metadata'? Are the ancillary variables (of serial numbers, manufacturer, etc.) in some sense auxiliary coordinates? (not independent questions!). I'm sure the full CDL would clear it up for me.

All the best,

David

Changed 9 years ago by taylor13

Attachment cfdm_KET.pdf added

comment:15 Changed 9 years ago by markh

Cc markh added

comment:16 follow-ups: ↓ 17 ↓ 19 Changed 9 years ago by davidhassell

Added on behalf of Karl (referring to the attachment cfdm_KET.pdf):

Dear Jonathan, David, and all

Attached is an edited version of the data model document with detailed comments/questions to trac, but here are some general thoughts.

I think documenting the data model will set the ground work to expand CF beyond netCDF files.

I'm concerned about having multiple data models under development. It appears from the discussion that I shouldn't be concerned, but are we sure this won't cause future problems.

The UML supporting diagram didn't help me much, perhaps because of my ignorance of UML. Could a more accessible figure be developed for and others?

Is the data model most likely to be used to define an object or structure that can be manipulated by code, and that storing these things in files requires additional specifications (tailored to particular API's or data formats) in addition to the data model?

Hope that these comments coming from someone who has only heard mention of data models a few times and never seen one are somehow helpful.

Karl

comment:17 in reply to: ↑ 16 Changed 9 years ago by davidhassell

Replying to davidhassell:

Added on behalf of Karl (referring to the attachment cfdm_KET.pdf):

Dear Jonathan, David, and all

Attached is an edited version of the data model document with detailed comments/questions to trac, but here are some general thoughts.

I think documenting the data model will set the ground work to expand CF beyond netCDF files.

Yes, and could facilitate translations between file formats.

I'm concerned about having multiple data models under development. It appears from the discussion that I shouldn't be concerned, but are we sure this won't cause future problems.

This is clearly something to be mindful of. In my mind, if the various (three?) models were hierarchical in that they were genuine subsets of each other, then that would seem quite safe. I don't know if this is currently the case.

The UML supporting diagram didn't help me much, perhaps because of my ignorance of UML. Could a more accessible figure be developed for and others?

I know what you mean. I still have problems with the different arrow heads! However, it was a good tool for distilling our ideas whilst composing the model, and I think it is certainly at the simpler end of such diagrams. A couple of things spring to mind, one simple, one more profound: i) a key describing the connection types and ii) omitting the abstract 'Variable' box at the bottom of the diagram (including instead its features (such as a data array) within each relevant construct's box (dimension coordinate, etc.)). The 'Variable' box is a natural element to include when you've got your head in OO programming mode, but perhaps adds unnecessary complexity to the minimal model?

Is the data model most likely to be used to define an object or structure that can be manipulated by code, and that storing these things in files requires additional specifications (tailored to particular API's or data formats) in addition to the data model?

To me, the aim of the data model is that it should be necessary and sufficient, so that any extensions in an API are conveniences rather than truly novel relationships, and so that I/O in arbitrary data formats is possible. If this were not the case, I would argue that the data model is missing a meaningful and useful connection which should be considered for inclusion. Perhaps this view is too idealistic?

Hope that these comments coming from someone who has only heard mention of data models a few times and never seen one are somehow helpful.

Very useful - thanks for finding the time to think about it in depth. I'll respond to some of the tracked comments in the attached pdf in a separate post.

All the best,

David

comment:18 Changed 9 years ago by davidhassell

Dear Karl,

A few first-thought responses to most of your detailed data model comments as given in the attached pdf. I hope that I have interpreted your points correctly.

Controlled vocabularies (comments 7, 8, 9)

I think that these are already covered in the conventions document: section 7.2 Cell measures, and appendices D Dimensionless Vertical Coordinates and F Grid Mappings. In the last two cases, both the standard names (e.g. atmosphere_ln_pressure_coordinate) and the grid mapping names (e.g. albers_conical_equal_area) are controlled Transform names, and their associated parameters (e.g. lev and longitude_of_central_meridian respectively) are similarly specified.

Grid mappings (comment 10)

Yes, the usual case is surely to transform to, rather than from, unrotated lat/lon coordinates. Now that multiple grid mappings will be allowed in a CF-netCDF data variable (ticket #70), I don't think that we want to restrict the data model in this way, either.

Missing data indicator (comment 12)

I disagree that a named attribute should be part of the model, if I understand the comment correctly. Rephrasing/strengthening the original wording ('The data model supports the idea of missing data') would probably help.

Global attributes (comment 11)

Perhaps any rules of precedence between netCDF global and netCDF data variables should be omitted from the model altogether, rather being the domain of an API?

The existence of coordinate arrays (comments 4 and 5)

I think that the reference to the existing netCDF implementation is confusing things here (and elsewhere?). The intention was to highlight the correspondence between 'a netCDF dimension which has no identically named netCDF coordinate variable' and 'a CF data model dimension construct which has no coordinate array', and to note that such a dimension construct is required in the data model as the dimension of an auxiliary coordinate construct which accommodates a CF-netCDF string-valued scalar coordinate variable. I reckon that the original text was technically correct - but obfuscated (like this reply!), perhaps?
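The case described in the paragraph above may be easier to see in a small sketch: a size-one dimension construct with no coordinate array of its own, serving as the dimension of an auxiliary coordinate that holds a string value. All class and field names here are hypothetical, and the region value is only an example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dimension:
    """Hypothetical dimension construct; the coordinate array is optional."""
    name: str
    size: int
    coordinate_array: Optional[list] = None   # None: no coordinate values


@dataclass
class AuxiliaryCoordinate:
    """Hypothetical auxiliary coordinate spanning a single dimension."""
    name: str
    dimension: Dimension
    values: list


# The dimension exists only so that the auxiliary coordinate has
# something to span; it carries no coordinate array itself.
region_dim = Dimension(name="region", size=1)

# A string-valued scalar coordinate is held as a one-element auxiliary
# coordinate on that dimension.
region = AuxiliaryCoordinate(
    name="region",
    dimension=region_dim,
    values=["atlantic_ocean"],
)
```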

Many thanks, again, and all the best,

David

comment:19 in reply to: ↑ 16 Changed 9 years ago by markh

Replying to davidhassell:

Added on behalf of Karl (referring to the attachment cfdm_KET.pdf):

Hello Karl

I think documenting the data model will set the ground work to expand CF beyond netCDF files.

I agree, this is the objective. I feel that it is important to state this explicitly. I am attempting to capture the terms of reference for this work and get them agreed, to provide a mandate for the data model.

#88

I will add this to the ticket.

I'm concerned about having multiple data models under development. It appears from the discussion that I shouldn't be concerned, but are we sure this won't cause future problems.

I think that liaising with other initiatives could also be part of the terms of reference. Responsibility for the relationship between the CF data model and the CF specification is already in the suggested terms.

This could be extended, but I think it is worth considering whether it is the responsibility of other projects wanting to benefit from CF to interact with it to maintain the standardisation and guard against future issues arising.

Comments on this on #88 would be very useful.

The UML supporting diagram didn't help me much, perhaps because of my ignorance of UML. Could a more accessible figure be developed for and others?

I think UML is a pretty widely used modelling language and I'd be slightly concerned about maintaining multiple different types of figure whilst keeping consistency between them.

I think that UML is a good option for this. I am happy to consider other options, but I would counsel against multiple versions in different forms, as that would raise more issues than it solves.

Is the data model most likely to be used to define an object or structure that can be manipulated by code, and that storing these things in files requires additional specifications (tailored to particular API's or data formats) in addition to the data model?

I think this is a pretty fine description. The object model defines semantics and relationships, which can be implemented in many ways, depending on the constraints of the implementation approach.

comment:20 Changed 9 years ago by Oehmke

I consider this data model a good step forward in describing the structure of the data within CF. It is relevant to our ESMF team because working with Tech-X we have developed an initial python interface to a set of powerful parallel regridding capabilities and see potential for contributing to both data model development and a reference python implementation.

The ESMF team has developed parallel regridding capabilities which work with the CF GridSpec convention for describing logically rectangular grids and with the upcoming UGrid convention for describing unstructured grids. For example, see here for a stand-alone application for generating regrid weights from CF (and non-CF) grid files. We also have the capability to read in grids from files in these formats and allow the user to interpolate between data on the grids within a user application in Fortran (or more limited, in C). We have a working prototype of a python interface to these regrid capabilities (ESMP) which we are currently extending to support the GridSpec and UGrid grid formats.

We are interested in how this work could be used as or support a reference implementation of the data model for regridding problems. A powerful feature of the ESMF strategy, relevant for the CF community, is that it has a general underlying representation of grids that enables regridding among unstructured meshes, logically rectangular grids, and potentially observational data structures. We are interested in understanding and helping to inform development of more complex representations and implementations of grids as the data model progresses.

comment:21 Changed 9 years ago by jonathan

Dear Bob

Thanks for your support. There is a discussion in ticket 88 about whether to adopt a data model in principle, which I think you are agreeing with. I hope that ticket will be concluded soon in the affirmative, and if so we will return to discussing what the data model should be, in this ticket. You are right that it will have to be developed further to support ugrid and Gridspec-M, and it may need some extensions for the discrete sampling geometry feature types of CF 1.6. At the moment, the draft is for CF 1.5.

Best wishes

Jonathan

comment:22 Changed 9 years ago by jonathan

Dear all

Ticket 88 has been accepted; that is, we've decided to adopt a data model for CF. Therefore we can resume discussing what it should be. I have updated the draft data model which David and I prepared as part of this ticket, so that it now includes the motivation for the data model, from ticket 88, as a preamble. This document is a proposal for the data model of CF 1.5, which is the first one we have to agree.

Cheers

Jonathan

comment:23 in reply to: ↑ 14 ; follow-up: ↓ 24 Changed 9 years ago by ngalbraith

Replying to davidhassell:

Thanks for the question. I don't believe that it was our intention to limit the use of variables without data arrays.

It would, however, be very useful to see the CDL of your example, as I fear that I don't yet understand it - is the instrument variable 'data' or 'metadata'? Are the ancillary variables (of serial numbers, manufacturer, etc.) in some sense auxiliary coordinates? (not independent questions!). I'm sure the full CDL would clear it up for me.

Sorry this took me so long. These vary by file types at the moment, so I'll include 2 examples.

Here's a snippet from a data set where temperatures at different depths are recorded by different instrument types. The inst variable is empty, the components are not:
double TEMP(TIME, DEPTH) ;

    TEMP:standard_name = "sea_water_temperature" ;
    TEMP:coordinates = "TIME DEPTH LATITUDE LONGITUDE" ;
    TEMP:instrument = "INST" ;

int INST ;

    INST:long_name = "instruments" ;
    INST:ancillary_variables = "INST_MFGR INST_MODEL INST_SN INST_URL" ;

char INST_MFGR(DEPTH, strlen1) ;

    INST_MFGR:long_name = "instrument manufacturer" ;

char INST_MODEL(DEPTH, strlen2) ;

    INST_MODEL:long_name = "instrument model name" ;

int INST_SN(DEPTH) ;

    INST_SN:long_name = "instrument serial number" ;
    INST_SN:units = "1" ;

char INST_URL(DEPTH, strlen3) ;

    INST_URL:long_name = "instrument reference URL" ;

Here's one where one instrument records 2 data variables (air temp and relative humidity). There are 5 more instrument descriptions in this file, but I'll just include 1.

double AIRT(TIME) ;

    AIRT:standard_name = "air_temperature" ;
    AIRT:coordinates = "TIME HEIGHT_RHAT LATITUDE LONGITUDE" ;
    AIRT:instrument = "INST_ATRH" ;

double RELH(TIME) ;

    RELH:instrument = "INST_ATRH" ;

int INST_ATRH ;

    INST_ATRH:long_name = "instrument" ;
    INST_ATRH:manufacturer = "Rotronic" ;
    INST_ATRH:model = "MP-101A " ;
    INST_ATRH:SN = "230" ;
    INST_ATRH:sensor_mount = "mounted_on_surface_buoy" ;
    INST_ATRH:reference = "http://frodo.whoi.edu/specs.html#hrh_mod" ;
    INST_ATRH:range_airt = "-40 to +60 degC" ;
    INST_ATRH:range_relh = "0 to 100 %" ;
    INST_ATRH:resolution_airt = "0.02 degC" ;
    INST_ATRH:resolution_relh = "0.01 %" ;
    INST_ATRH:accuracy_airt = "UOP lab calibration, 0.05 degC" ;
    INST_ATRH:accuracy_relh = "UOP lab calibration, 1 %" ;
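As a rough illustration of the indirection in the first snippet (data variable, then instrument variable, then its components), here is a minimal Python sketch. It is not a netCDF reader: each variable is reduced to a plain dict of its attributes, the function name is made up for the example, and it assumes the ancillary list is meant to name the declared INST_MODEL variable. The instrument attribute itself is a local convention in these files, not part of CF.

```python
# Hypothetical in-memory picture of the first CDL snippet: each variable is
# reduced to a dict of its attributes (this is not a real netCDF API).
variables = {
    "TEMP": {
        "standard_name": "sea_water_temperature",
        "coordinates": "TIME DEPTH LATITUDE LONGITUDE",
        "instrument": "INST",  # non-CF attribute used in these files
    },
    "INST": {
        "long_name": "instruments",
        "ancillary_variables": "INST_MFGR INST_MODEL INST_SN INST_URL",
    },
    "INST_MFGR": {"long_name": "instrument manufacturer"},
    "INST_MODEL": {"long_name": "instrument model name"},
    "INST_SN": {"long_name": "instrument serial number", "units": "1"},
    "INST_URL": {"long_name": "instrument reference URL"},
}


def instrument_components(variables, data_name):
    """Follow data variable -> instrument variable -> ancillary variables."""
    inst_name = variables[data_name].get("instrument")
    if inst_name is None:
        return []  # this variable does not point at an instrument
    names = variables[inst_name].get("ancillary_variables", "").split()
    missing = [name for name in names if name not in variables]
    if missing:
        raise KeyError(f"undeclared ancillary variables: {missing}")
    return names
```

Calling instrument_components(variables, "TEMP") returns the four component names in the order listed by the attribute, and a misspelt name in the list shows up as a KeyError rather than being silently ignored.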

comment:24 in reply to: ↑ 23 Changed 9 years ago by davidhassell

Replying to ngalbraith:

Dear Nan,

Thanks for the examples - I get it. Your examples are entirely consistent with the conceptual model, we just need to tighten up the wording.

The text "The data array would be missing if the field construct serves only to define a coordinate system, which we call a space" could be replaced with something very roughly along the lines of:

"A field construct may have no coordinate system (i.e. a 0-d coordinate system) if it has no dimension constructs; or may serve only to define a coordinate system if it has no data array nor metadata usually associated with a data array (e.g. ancillary variables)"

This text is clunky, I know, but I hope the intent is clear.
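To make the two cases in that wording concrete, here is a minimal Python sketch under stated assumptions. The Field class and its attribute names are hypothetical stand-ins invented for this example; they are not the cf-python classes and not the wording of the draft document.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Field:
    """Hypothetical stand-in for a field construct (not the cf-python class)."""

    dimensions: dict = field(default_factory=dict)  # identity -> size
    data: Optional[list] = None                     # None means "no data array"
    ancillary: list = field(default_factory=list)   # names of ancillary variables

    def has_coordinate_system(self) -> bool:
        # No dimension constructs means a 0-d coordinate system.
        return bool(self.dimensions)

    def only_defines_space(self) -> bool:
        # No data array and no metadata normally attached to a data array.
        return (
            self.has_coordinate_system()
            and self.data is None
            and not self.ancillary
        )


# An instrument-like variable: no dimensions, no data, but ancillary metadata.
inst = Field(ancillary=["INST_MFGR", "INST_SN"])
# A field that exists only to define a space.
space = Field(dimensions={"TIME": 10, "DEPTH": 5})
```

Here inst.has_coordinate_system() is False and inst.only_defines_space() is also False, while space.only_defines_space() is True, which is the distinction the reworded text is trying to draw.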

All the best,

David

comment:25 Changed 9 years ago by jonathan

Dear Nan

I too agree that we don't want the data model to exclude constructions which are not part of CF but not inconsistent with CF, like your instrument variable. Actually I would say there isn't a problem, because I don't think you intend the instrument variable to be a CF field at all. It would look like a data variable to the CF-checker, though, I guess, because the checker doesn't know about your instrument attribute.

I wonder why you do it with this extra pointing via instrument? Couldn't you make the instrument variables themselves ancillaries of the data variable?

double TEMP(TIME, DEPTH) ;

    TEMP:standard_name = "sea_water_temperature" ;
    TEMP:coordinates = "TIME DEPTH LATITUDE LONGITUDE" ;
    TEMP:ancillary_variables = "INST_MFGR INST_MODEL INST_SN INST_URL" ;

char INST_MFGR(DEPTH, strlen1) ;

    INST_MFGR:long_name = "instrument manufacturer" ;

etc.?

Cheers

Jonathan

comment:26 Changed 9 years ago by markh

Referencing the PotentialDataModelTypes wiki, I would like to open discussion on some issues:

Firstly: the definition of an abstract Coordinate type.

This has been suggested as a container for common properties of DimensionCoordinate and AuxiliaryCoordinate types.

Is it a sensible construct to have?

Does it add value?

I think it does, which is why I propose its inclusion. There appear to be common aspects shared by DimensionCoordinate and AuxiliaryCoordinate which can be defined in one and only one place, e.g.:

Their indexing relationship to their containing ScalarField/Field

The constraint on size/shape from their reference to the ScalarField/Field
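As an illustration only, the shared aspects listed above could be written once in an abstract base class. The sketch below is an assumption about how that might look in Python; the class and method names are invented here, and the rule that an auxiliary coordinate spans at least one axis is a simplification made for the example, not something taken from the wiki page.

```python
from abc import ABC, abstractmethod


class Coordinate(ABC):
    """Hypothetical abstract type: what both coordinate kinds would share."""

    def __init__(self, name, shape, field_axes):
        self.name = name
        self.shape = tuple(shape)            # shape of the coordinate array
        self.field_axes = tuple(field_axes)  # Field axes it is indexed by
        self._check_rank()

    @abstractmethod
    def _check_rank(self):
        """Each concrete kind states how many Field axes it may span."""

    def check_against(self, field_shape):
        """Shared constraint: sizes must match the Field axes referred to."""
        expected = tuple(field_shape[axis] for axis in self.field_axes)
        if self.shape != expected:
            raise ValueError(f"{self.name}: shape {self.shape} != {expected}")


class DimensionCoordinate(Coordinate):
    def _check_rank(self):
        if len(self.field_axes) != 1:
            raise ValueError("a dimension coordinate spans exactly one axis")


class AuxiliaryCoordinate(Coordinate):
    def _check_rank(self):
        # Assumption for this sketch: at least one axis is spanned.
        if not self.field_axes:
            raise ValueError("an auxiliary coordinate spans at least one axis")


field_shape = (100, 145, 192)  # hypothetical (time, y, x) data array
time = DimensionCoordinate("time", (100,), (0,))
lat2d = AuxiliaryCoordinate("latitude", (145, 192), (1, 2))
time.check_against(field_shape)
lat2d.check_against(field_shape)
```

The indexing relationship and the size constraint live only in Coordinate; the two concrete types differ solely in how many axes they may span.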

Secondly: the definitions of ScalarField and Field:

The proposed Field definition includes explicit description of the other types this type may contain. The ScalarField makes no comment on types it may contain, merely stating "and metadata describing the phenomenon values".

Should type definitions be explicit about what they may reference or is this the role of the 'Relation' section (as yet not defined)?

comment:27 Changed 9 years ago by markh

Other points I would like to open discussion on:

CellMeasures:

The type is defined with the name CellMeasures, but the definition and the example suggest the type should be called CellMeasure. The description suggests this is a singular entity, like Dimension, which a Field may have multiple different instances of. Does this seem valid?

How does a CellMeasure differ from an AuxiliaryCoordinate? The way they link to a Field/ScalarField appears very similar; the main difference appears to be in how they might be used. Is this sufficient justification for an explicit type?

CellMethod/CellMethods

One of the proposed definitions, CellMethods, defines a singular type which is an ordered list of similar entries. The other, CellMethod, defines a type which implements a single qualification; a ScalarField might have multiple instances of this type.

Is one of these approaches more helpful than the other?

comment:28 follow-up: ↓ 32 Changed 9 years ago by jonathan

Dear Mark

Thanks for promoting discussion of this. All the issues you raise relate to differences between what David and I proposed in this ticket and in our accompanying description of the model on the one hand, and on the other hand the similar but slightly different constructs that you've described in https://cf-pcmdi.llnl.gov/trac/wiki/PotentialDataModelTypes. I would say that there is the same general answer in all cases, that in our proposal David and I attempt to interpret what the CF standard says, with the minimum of constructs. That's not the same thing as defining the objects which might be used in a software implementation. You might have more classes in that case. cf-python and our UML diagram do have more. However, the ones listed in the document are what we propose as implementation-neutral essentials.

abstract Coordinate type, of which dimension coordinates and auxiliary coordinates are particular cases. I don't think the CF standard implies this abstraction. It always talks of (Unidata) coordinate variables or auxiliary coordinate variables; when conventions apply to both, it mentions both. I agree of course that there is a similarity between these, and it's very likely it would be convenient to have an abstract class for it in software due to shared functionality, but you don't need this type to describe the CF standard or CF-netCDF data.

ScalarField and Field. The difference is small between these, and I may have missed your point. One difference might be that you see ScalarField as containing the numerical sizes of the dimensions of the field, whereas David and I suggest that the Field contains Dimensions, and the Dimensions know what their sizes are. We wrote it that way because it resembles netCDF. For instance, tas(lon,lat) knows that its dimensions are lon and lat, but the numerical sizes of lon and lat are in the definitions of the dimensions, not the definition of tas. Another difference you might be pointing to is whether the field contains the other things (coordinates, metadata) or not. I think it does, and your example appears to indicate that the ScalarField does contain metadata such as units. Yes, I think the Field does contain all the other things. This is because in CF-netCDF the data variable is the central point of reference, to which all the other information is related. That other information would not have a purpose in a CF-netCDF variable if no data variable referred to it, and this feels like a relationship of containment to us.

CellMeasure. I had put CellMeasures by mistake and have corrected it now; in the data model description we have CellMeasure, as you say - thanks. The purposes of cell measures and auxiliary coordinates are distinct in the CF standard. Coordinates locate the data in the space, and cell measures provide metrics for the space. However their form is similar, and you might treat them together in software for some purposes.

CellMethods or CellMethod. We adopted CellMethods because the order of the methods may be significant, as recorded in the CF-netCDF cell_methods attribute. If CellMethods was not a construct, the order of methods would have to be remembered in some other component of the Field, but that doesn't resemble the CF convention so closely, we thought.
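A small Python sketch of why order matters: the function below turns a cell_methods string into an ordered list, so that two attributes with the same entries in a different order compare as different. The function name and the example strings are made up for illustration, and the parser is deliberately partial. It handles only plain "name: method" entries (including several names sharing one method), drops anything in parentheses, and ignores qualifiers such as "where" clauses.

```python
import re

# One entry is one or more "name:" tokens followed by a method word.
_ENTRY = re.compile(r"((?:\w+:\s+)+)(\w+)")
# Parenthesised extra information is dropped in this simplified sketch.
_COMMENT = re.compile(r"\([^)]*\)")


def parse_cell_methods(text):
    """Return [(names, method), ...] in the order written in the attribute."""
    text = _COMMENT.sub(" ", text)
    methods = []
    for match in _ENTRY.finditer(text):
        names = tuple(token.rstrip(":") for token in match.group(1).split())
        methods.append((names, match.group(2)))
    return methods


first = parse_cell_methods("area: mean time: maximum")
second = parse_cell_methods("time: maximum area: mean")
```

Because the result is a list, first and second hold the same two entries but are not equal, which is the property a single ordered CellMethods construct preserves.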

We could also discuss the Transform construct of our data model. This is the biggest departure in our proposal from CF-netCDF. We are not proposing to change the CF standard, but we suggest that logically you can see the mechanisms for dimensionless vertical coordinates and grid mappings fulfilling the same function, the former for vertical coordinates, the latter (so far) for horizontal coordinates. We propose Transform as a single construct which serves both purposes; it's not an abstract layer which has two constructs as particular instances, but it does the job of both of them.

Our argument is that it's simpler to view it like this in the data model, and the distinction that is made in CF-netCDF files is just a matter of the file format. If we design a standard to encode CF data in some other file format, perhaps we would not make the distinction between dimensionless vertical coordinates and grid mappings. An analogous situation is that in our proposed data model we say that 1D coordinate variables with dimension of size 1 and scalar coordinate variables are logically the same in the data model. The distinction is just a matter of convenience in writing the CF-netCDF file.
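The last point can be shown with a very small Python sketch. The helper name and the 2.0 value are hypothetical; the only claim illustrated is that a scalar coordinate and a 1D coordinate of size 1 can be normalised to the same thing once the file encoding is set aside.

```python
def coordinate_values(value):
    """Return coordinate values as a tuple, treating a scalar as size 1."""
    if isinstance(value, (list, tuple)):
        return tuple(value)
    return (value,)


# The same height coordinate written two ways in a CF-netCDF file:
as_scalar = coordinate_values(2.0)          # scalar coordinate variable
as_size_one_axis = coordinate_values([2.0])  # 1D coordinate, dimension of size 1
```

Both calls give the same one-element tuple, so code working at the data model level need not know which encoding the file used.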

Best wishes

Jonathan

comment:29 follow-up: ↓ 30 Changed 9 years ago by markh

Replying to jonathan:

Thanks for promoting discussion of this. All the issues you raise relate to differences between what David and I proposed in this ticket and in our accompanying description of the model on the one hand, and on the other hand the similar but slightly different constructs that you've described in https://cf-pcmdi.llnl.gov/trac/wiki/PotentialDataModelTypes. I would say that there is the same general answer in all cases, that in our proposal David and I attempt to interpret what the CF standard says, with the minimum of constructs. That's not the same thing as defining the objects which might be used in a software implementation. You might have more classes in that case. cf-python and our UML diagram do have more.

However, the ones listed in the document are what we propose as implementation-neutral essentials.

I agree with this statement of objective. I do not think that the ideas I am posting about are focussed on defining object/class definitions for a particular implementation; they are absolutely aimed at the implementation neutral model.

I feel these are differences in the interpretation of the core concepts of CF.

I consider all of the notes I have posted to be absolutely focussed on the implementation neutral model.

CellMeasure. I had put CellMeasures by mistake and have corrected it now; in the data model description we have CellMeasure, as you say - thanks. The purposes of cell measures and auxiliary coordinates are distinct in the CF standard. Coordinates locate the data in the space, and cell measures provide metrics for the space. However their form is similar, and you might treat them together in software for some purposes.

I like the use of language:

Coordinates locate the data in the space.

Cell measures provide metrics for the space.

I think these are helpful clarification statements and may well be an important part of why CellMeasure and AuxiliaryCoordinate are considered separate types.

To further investigate this particular question, we could ask the general question:

If form is similar but purpose is distinct, should two types be distinct, or should they be related to each other?

ScalarField and Field. The difference is small between these, and I may have missed your point. One difference might be that you see ScalarField as containing the numerical sizes of the dimensions of the field, whereas David and I suggest that the Field contains Dimensions, and the Dimensions know what their sizes are. We wrote it that way because it resembles netCDF. For instance, tas(lon,lat) knows that its dimensions are lon and lat, but the numerical sizes of lon and lat are in the definitions of the dimensions, not the definition of tas.

This is an interesting point. I think that the ScalarField/Field knows the numerical sizes of all the dimensions, as an array of data values with a particular size is part of the Field.

As such I think that in the model, this information is held by the field and provided, in relevant subsets, to the types of coordinate and cell measure as necessary. This is why I consider the Dimension not to be the most useful type; I would rather have DimensionCoordinate which is constrained to match its size to a dimension of the Field.

You raise numerous other interesting points which I will have to come back to.

mark

comment:30 in reply to: ↑ 29 ; follow-up: ↓ 31 Changed 9 years ago by davidhassell

Replying to markh:

Dear Mark,

Interesting points - thanks - but on the issue of dimensions I still take the other view and I'll try to explain why.

CellMeasure.

I agree - it should be CellMeasure, in much the same way as we have AuxiliaryCoordinate (without an 's').

To further investigate this particular question, we could ask the general question:

If form is similar but purpose is distinct, should two types be distinct, or should they be related to each other?

As far as a data model goes, I think purpose wins; that is, I don't think constructs should be in some way related solely because their implemented forms might be similar.

This is an interesting point. I think that the ScalarField/Field knows the numerical sizes of all the dimensions, as an array of data values with a particular size is part of the Field.

As such I think that in the model, this information is held by the field and provided, in relevant subsets, to the types of coordinate and cell measure as necessary. This is why I consider the Dimension not to be the most useful type; I would rather have DimensionCoordinate which is constrained to match its size to a dimension of the Field.

My take on this is that I think that you need a Dimension construct rather than a DimensionCoordinate construct, because a dimension needs an identity, as well as a size. You can't rely on the Field's data array to facilitate links between dimensions and coordinates since i) the dimension order of the array is arbitrary, ii) a Field's data array may not span all of the dimensions of its space (in CF-netCDF terms this would be the case if a data variable has an associated scalar coordinate variable) and iii) a Field may not even have a data array (in which case it serves just to define a space).

Also, a dimension may not even have an associated coordinate array, in which case I feel that DimensionCoordinate is not an appropriate name. For a dimension which does have a coordinate array, the dimension and the array have a 1-1 correspondence, and so need not be of different types.
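A minimal Python sketch of points i) to iii), under stated assumptions: the Dimension and Field classes below are hypothetical illustrations with invented attribute names, not the constructs as defined in the draft document or in cf-python. Dimensions carry an identity and a size; the data array refers to them by identity, in any order, and need not span them all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dimension:
    identity: str
    size: int


@dataclass
class Field:
    dimensions: dict                   # identity -> Dimension
    data_axes: Optional[tuple] = None  # identities in array order; None = no data array

    def data_shape(self):
        """Shape of the data array, looked up from the dimensions."""
        if self.data_axes is None:
            return None
        return tuple(self.dimensions[i].size for i in self.data_axes)

    def unspanned(self):
        """Dimensions of the space that the data array does not span."""
        spanned = self.data_axes or ()
        return [i for i in self.dimensions if i not in spanned]


dims = {
    d.identity: d
    for d in (Dimension("height", 1), Dimension("lat", 145), Dimension("lon", 192))
}
surface_temp = Field(dimensions=dims, data_axes=("lon", "lat"))  # arbitrary order
space_only = Field(dimensions=dims)                              # no data array
```

With these definitions surface_temp.data_shape() is (192, 145), surface_temp.unspanned() is ["height"], and space_only.data_shape() is None, so in each case the identities rather than the array carry the link to the space.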

I must admit to being confused by the name "ScalarField", as it can hold multidimensional data. I presume (?) that the name distinguishes it from a collection of one or more Fields, but I think that this is an implementation issue, and the atomic "Field" is more appropriate to the data model.

All the best,

David

comment:31 in reply to: ↑ 30 ; follow-up: ↓ 33 Changed 9 years ago by markh

Replying to davidhassell:

Replying to markh:

Hello David

This is an interesting point. I think that the ScalarField/Field knows the numerical sizes of all the dimensions, as an array of data values with a particular size is part of the Field.

As such I think that in the model, this information is held by the field and provided, in relevant subsets, to the types of coordinate and cell measure as necessary. This is why I consider the Dimension not to be the most useful type; I would rather have DimensionCoordinate which is constrained to match its size to a dimension of the Field.

My take on this is that I think that you need a Dimension construct rather than a DimensionCoordinate construct, because a dimension needs an identity, as well as a size. You can't rely on the Field's data array to facilitate links between dimensions and coordinates since i) the dimension order of the array is arbitrary,

I think that 'i' is not a major issue for a DimensionCoordinate; it merely puts the responsibility for managing the links between the Field and the DimensionCoordinates firmly on the relevant relation. This feels manageable to me.

ii) a Field's data array may not span all of the dimensions of its space (in CF-netCDF terms this would be the case if a data variable has an associated scalar coordinate variable)

I am not convinced by this. I think the dimensionality of a field is the dimensionality of its data array dimensions, some of which may be length 1. A scalar coordinate variable in a NetCDF file is either a DimensionCoordinate or an AuxiliaryCoordinate; I do not think we have a ScalarCoordinate type.

A Field may increase its dimensionality by merging with another Field; length 1 DimensionCoordinates may increase in size and length 1 AuxiliaryCoordinates may be transformed to DimensionCoordinates by this process.

iii) a Field may not even have a data array (in which case it serves just to define a space).

Point 'iii' is particularly interesting, I am considering this one more. Is it really a valid Field if it has 'no data' as opposed to an array of missing data?

Also, a dimension may not even have an associated coordinate array, in which case I feel that DimensionCoordinate is not an appropriate name. For a dimension which does have a coordinate array, the dimension and the array have a 1-1 correspondence, and so need not be of different types.

In this case, as in all others, I would suggest the dimension is an emergent property of the Field, not an entity in its own right.

I feel dimensions are an artifact of the NetCDF encoding and not inherent to the CF data model. Their names are arbitrary and used to provide relations in the file; I do not think they have any conceptual meaning, so I would not include them in the model.

I must admit to being confused by the name "ScalarField", as it can hold multidimensional data. I presume (?) that the name distinguishes it from a collection of one or more Fields, but I think that this is an implementation issue, and the atomic "Field" is more appropriate to the data model.

I introduced the name ScalarField with two thoughts in mind:

to differentiate it from Field, which is in use with a different definition; I think it helps to have different names, and once we have sorted the definition the most appropriate name will survive, and Field may well be the better name.
the 'scalar' term refers to the data values stored in the Field, which are representative of scalar quantities; I think this is all CF supports at version 1.5.

I don't think it's a particularly important factor, it's mainly to avoid the 'my Field', 'your Field', 'her Field' in conversations.

mark

comment:32 in reply to: ↑ 28 ; follow-up: ↓ 34 Changed 9 years ago by markh

Replying to jonathan:

Hello Jonathan

abstract Coordinate type, of which dimension coordinates and auxiliary coordinates are particular cases. I don't think the CF standard implies this abstraction. It always talks of (Unidata) coordinate variables or auxiliary coordinate variables; when conventions apply to both, it mentions both. I agree of course that there is a similarity between these, and it's very likely it would be convenient to have an abstract class for it in software due to shared functionality, but you don't need this type to describe the CF standard or CF-netCDF data.

I agree that the CF standard always talks of (Unidata) coordinate variables or auxiliary coordinate variables but my view is that this is entirely consistent with the presence of an abstract type within the model. I also view this abstract type as fulfilling a particular and useful function.

The fact that the CF Convention does not mention an abstract type is because it is an implementation and, as such, does not have a tangible encoding for this abstract construct.

I do not view this construct as 'convenient ... in software' but inherent to the concepts in CF, so I would like to see it in the data model. I believe it fulfils the key purpose:

The abstract Coordinate type's primary purpose is to maintain the consistency between DimensionCoordinates and AuxiliaryCoordinates.

Implementations may choose whether or not to use this concept in some explicit but uninstantiated form or whether to bypass it and simply provide implementations of the tangible types; the latter being the approach taken by the CF conventions for NetCDF files.

I think it is important that the model maintains this consistency, in part, to facilitate the type transformation of instances of DimensionCoordinates and AuxiliaryCoordinates; transformations in both directions are common for many datasets, I believe. For example, #78 aims to explicitly describe such processes.

comment:33 in reply to: ↑ 31 ; follow-up: ↓ 35 Changed 9 years ago by davidhassell

Replying to markh:

Dear Mark,

My take on this is that I think that you need a Dimension construct rather than a DimensionCoordinate construct, because a dimension needs an identity, as well as a size. You can't rely on the Field's data array to facilitate links between dimensions and coordinates since i) the dimension order of the array is arbitrary,

I think that 'i' is not a major issue for a DimensionCoordinate; it merely puts the responsibility for managing the links between the Field and the DimensionCoordinates firmly on the relevant relation. This feels manageable to me.

Surely the management entails attaching some sort of identity to each dimension?

ii) a Field's data array may not span all of the dimensions of its space (in CF-netCDF terms this would be the case if a data variable has an associated scalar coordinate variable)

I am not convinced by this. I think the dimensionality of a field is the dimensionality of its data array dimensions, some of which may be length 1. A scalar coordinate variable in a NetCDF file is either a DimensionCoordinate or an AuxiliaryCoordinate; I do not think we have a ScalarCoordinate type.

I think that it's needlessly prescriptive to insist that a Field's data array spans all of its size 1 dimensions. Whether or not the data array spans them is arbitrary and so shouldn't be insisted upon. Conceptually, the idea of, say, a 2-d slice from a 3-d space (e.g. a surface air temperature field) is very natural and so should, I think, be expressible in the data model. At the same time, it is just as reasonable to want to view this field in 3 dimensions, so that should, of course, be allowed, too.

I didn't mean to suggest that we need ScalarCoordinate type (I don't think we do) - I hoped merely to create a CF-netCDF analogy to a particular sort of Field so people would know what I was on about!

I feel dimensions are an artifact of the NetCDF encoding and not inherent to the CF data model. Their names are arbitrary and used to provide relations in the file; I do not think they have any conceptual meaning, so I would not include them in the model.

Their names may be arbitrary, but the fact that they have identities and sizes is important and meaningful. The statement "This box on the table has 3 dimensions: height(size 10, going up) x width(size 20, going to my left) x length(size 30, going away from me)" sets the scene, I think, even though we don't yet know anything more about the dimensions than how big they are and which is which. If we added "The height is in centimetres and is marked into ten equal sections" then we have merely added some metadata (i.e. an array and its units) to the fundamentals already laid down. Maybe the box contains one 20x30 flat piece of paper ... :)

I introduced the name ScalarField with two thoughts in mind:

to differentiate it from Field, which is in use with a different definition; I think it helps to have different names, and once we have sorted the definition the most appropriate name will survive, and Field may well be the better name.
the 'scalar' term refers to the data values stored in the Field, which are representative of scalar quantities; I think this is all CF supports at version 1.5.
I don't think it's a particularly important factor, it's mainly to avoid the 'my Field', 'your Field', 'her Field' in conversations.

I get it - thanks.

All the best,

David

comment:34 in reply to: ↑ 32 Changed 9 years ago by davidhassell

Replying to markh:

Dear Mark,

I do not view this construct as 'convenient ... in software' but inherent to the concepts in CF, so I would like to see it in the data model. I believe it fulfils the key purpose:

I think it is important that the model maintains this consistency, in part, to facilitate the type transformation of instances of DimensionCoordinates and AuxiliaryCoordinates; transformations in both directions are common for many datasets, I believe. For example, #78 aims to explicitly describe such processes.

Whilst Jonathan and I originally considered it, the aggregation rules of #78 explicitly prohibit transforming dimension coordinates into auxiliary coordinates, and vice versa:

"A pair of matching coordinate constructs is a coordinate construct from each field of the same type (dimension or auxiliary) ..."

One reason for this being, if I recall, that these things have been defined as they are for a reason, and so transforming them is unsafe.

If we agree that dimensions and auxiliary coordinates are different enough to need their own type, then in the spirit of a minimalist data model, a construct which has their shared behaviours but is otherwise unused is, perhaps, not required.

All the best,

David

comment:35 in reply to: ↑ 33 Changed 9 years ago by davidhassell

Replying to davidhassell:

Maybe the box contains one 20x30 flat piece of paper ... :)

Sorry - scratch that bit. Dimensions unused by the data array must have size 1 (as opposed to size 10 in my example). Maybe that'll teach me to take an analogy too far!

David

comment:36 follow-up: ↓ 37 Changed 9 years ago by spascoe

I'd like to join the discussion by probing a couple of points that are arising from this thread, not necessarily in order.

I feel dimensions are an artifact of the NetCDF encoding and not inherent to the CF data model. Their names are arbitrary and used to provide relations in the file; I do not think they have any conceptual meaning, so I would not include them in the model.

Their names may be arbitrary, but the fact that they have identities and sizes is important and meaningful. The statement "This box on the table has 3 dimensions: height(size 10, going up) x width(size 20, going to my left) x length(size 30, going away from me)" sets the scene, I think, even though we don't yet know anything more about the dimensions than how big they are and which is which. If we added "The height is in centimetres and is marked into ten equal sections" then we have merely added some metadata (i.e. an array and its units) to the fundamentals already laid down. Maybe the box contains one 20x30 flat piece of paper ... :)

I am with Mark on this. Dimensions are not required as something distinct from their DimensionCoordinate. Consider an alternative world where each NetCDF variable was defined on its own set of dimensions (i.e. there was no concept of shared dimensions). We could then relate a variable to its coordinate values using attributes. E.g.

variables:
  double var1(time=100, lat=145, lon=192) ;
    var1:time_variable = 'time' ;
    var1:lat_variable = 'latitude' ;
    var1:lon_variable = 'longitude' ;
  double time(t=100) ;
  double latitude(y=145) ;
  double longitude(x=192) ;

In this case var1 is still related to coordinate axes time, latitude and longitude although no distinct dimension concept exists within the format.
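The same idea as a Python sketch, for illustration only. The dict layout and the function name are invented for this example, and it makes one assumption that the pseudo-CDL leaves implicit: the attributes are matched to the axes of var1 by position (time_variable to axis 0, and so on).

```python
# Hypothetical in-memory form of the pseudo-CDL above: every variable owns
# its shape, and no dimension is shared between variables.
variables = {
    "var1": {
        "shape": (100, 145, 192),
        "attrs": {
            "time_variable": "time",
            "lat_variable": "latitude",
            "lon_variable": "longitude",
        },
    },
    "time": {"shape": (100,), "attrs": {}},
    "latitude": {"shape": (145,), "attrs": {}},
    "longitude": {"shape": (192,), "attrs": {}},
}

# Assumption of this sketch: attribute order fixes which axis each one describes.
AXIS_ATTRS = ("time_variable", "lat_variable", "lon_variable")


def coordinate_axes(variables, name):
    """Resolve each axis of a variable to its coordinate via attributes."""
    var = variables[name]
    coords = []
    for axis, attr in enumerate(AXIS_ATTRS):
        coord_name = var["attrs"][attr]
        coord_size = variables[coord_name]["shape"][0]
        if coord_size != var["shape"][axis]:
            raise ValueError(
                f"{coord_name} has size {coord_size}, "
                f"axis {axis} of {name} has size {var['shape'][axis]}"
            )
        coords.append(coord_name)
    return coords
```

coordinate_axes(variables, "var1") gives ["time", "latitude", "longitude"]. Note that the positional mapping in AXIS_ATTRS is itself a way of identifying each axis.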

ii) a Field's data array may not span all of the dimensions of its space (in CF-netCDF terms this would be the case if a data variable has an associated scalar coordinate variable)

I am not convinced by this. I think the dimensionality of a field is the dimensionality of its data array dimensions, some of which may be length 1. A scalar coordinate variable in a NetCDF file is either a DimensionCoordinate or an AuxiliaryCoordinate; I do not think we have a ScalarCoordinate type.

I think that it's needlessly prescriptive to insist that a Field's data array spans all of its size 1 dimensions. Whether or not the data array spans them is arbitrary and so shouldn't be insisted upon. Conceptually, the idea of, say, a 2-d slice from a 3-d space (e.g. a surface air temperature field) is very natural and so should, I think, be expressible in the data model. At the same time, it is just as reasonable to want to view this field in 3 dimensions, so that should, of course, be allowed, too.

This sounds like a fundamental disagreement over what is the space of a Field. Surely a 2-d slice is a Field in 2-d space. If we want to describe how that 2-d space is embedded in a 3-d space, that requires extra constructs. Maybe a CellMeasure or something similar? If this is already represented in CF it isn't apparent to me.

To further investigate this particular question, we could ask the general question:

If form is similar but purpose is distinct, should two types be distinct, or should they be related to each other?

As far as a data model goes, I think purpose wins; that is, I don't think constructs should be in some way related solely because their implemented forms might be similar.

In my view form needs to be considered relative to the meta-model we are using (either informally or formally). In the end a model can be reduced to some underlying meta-model, e.g. UML or set logic or the NetCDF abstract data model. Maybe we need to reach consensus on what that meta-model is first? I get the impression that John and David are working on a NetCDF meta-model -- that is not to say the actual encoding, but the concepts underneath: dimension, variable, attribute.

I'm not sure what Mark's meta-model assumptions are yet. I think the definitions at the top of the PotentialDataModelTypes wiki page could be clarified in that regard but this is probably best left to another thread.

comment:37 in reply to: ↑ 36 ; follow-up: ↓ 44 Changed 9 years ago by Oehmke

Replying to spascoe:

In this case var1 is still related to coordinate axes time, latitude and longitude although there exists no distinct dimension concepts within the format.

ii) a Field's data array may not span all of the dimensions of its space (in CF-netCDF terms this would be the case if a data variable has an associated scalar coordinate variable)

I am not convinced by this. I think the dimensionality of a field is the dimensionality of its data array dimensions, some of which may be length 1. A scalar coordinate variable in a NetCDF file is either a DimensionCoordinate or an AuxiliaryCoordinate; I do not think we have a ScalarCoordinate type.

I think that it's needlessly prescriptive to insist that a Field's data array spans all of its size 1 dimensions. Whether or not the data array spans them is arbitrary and so shouldn't be insisted upon. Conceptually, the idea of, say, a 2-d slice from a 3-d space (e.g. a surface air temperature field) is very natural and so should, I think, be expressible in the data model. At the same time, it is just as reasonable to want to view this field in 3 dimensions, so that should, of course, be allowed, too.

This sounds like a fundamental disagreement over what is the space of a Field. Surely a 2-d slice is a Field in 2-d space. If we want to describe how that 2-d space is embedded in a 3-d space that requires extra constructs. Maybe a CellMeasure or something similar? If this is already represented in CF it isn't apparent to me.

I wanted to chime in here about the idea of allowing a Field to be a slice of the total space. In ESMF we allow an ESMF Field to be a subset of the total space defined by the associated ESMF Grid which defines its coordinates, dimension and distribution. We do this by allowing the user to specify a gridToFieldMap when the Field is created. The gridToFieldMap specifies which dimension of the Field each dimension of the Grid is associated with. Maybe something similar could be used here. For example, if you wanted to specify that a Field was a slice of another Field you could have an attribute (e.g. slice_map) which points to the Field and specifies which dimensions it maps to.

For example:

variables:
  double fieldA(size1, size2, size3) ;

  double slice_of_fieldA(size1, size3) ;
    slice_of_fieldA:slice_map = "field:1,3" ;

However, you might want to know the index values in the other dimensions at which you're making the slice. For example, in the above, at what level of dimension 2 were you taking the 2D slice to give you slice_of_fieldA? In that case you might want to make the map the length of all the dimensions of the field, using a symbol (say '*') to indicate which dimensions are retained from the original field and numbers to indicate the levels of the others.

For example, if slice_of_fieldA is at level 10 of dimension 2 it would look like:

variables:
  double fieldA(size1, size2, size3) ;

  double slice_of_fieldA(size1, size3) ;
    slice_of_fieldA:slice_map = "field:*,10,*" ;

It seems to me that an attribute which specifies a relationship between Fields like slice_map should be in the other properties components of the Field, since those are per Field, whereas I believe the CellMeasures are per data value in the Field.
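To make the suggestion concrete, here is a rough Python sketch of how an application might read such an attribute. `slice_map` is only the attribute proposed above, not an existing CF attribute, and the `parse_slice_map` helper and its return format are assumptions of this sketch:

```python
def parse_slice_map(value):
    """Parse a hypothetical slice_map string such as 'field:*,10,*'.

    Returns the name of the source Field and one entry per dimension of that
    Field: None where the dimension is retained ('*'), or the integer level
    at which the slice was taken.
    """
    source, _, spec = value.partition(":")
    entries = [None if item.strip() == "*" else int(item) for item in spec.split(",")]
    return source, entries


source, entries = parse_slice_map("field:*,10,*")
# Dimensions of the source Field that survive in the slice (1-based, as in the example).
retained = [i + 1 for i, entry in enumerate(entries) if entry is None]
print(source, entries, retained)
```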

comment:38 Changed 9 years ago by jonathan

Dear all

What we are doing. Stephen correctly says that the data model that David and I are proposing is not a description of the actual encoding in CF-netCDF, but tries to identify underlying concepts. It's an interpretation of CF-netCDF, and in principle it could be implemented in other file formats.

Abstract coordinate type. I agree with Mark that an abstract coordinate type (of which Dimensions and AuxiliaryCoordinates are special cases) is consistent with the CF convention, but I agree with David that it is not required. As David says, our proposed model is a minimalist one, and since this construct is not necessary to describe CF metadata, it is not included. This minimalist approach is the same one as we use in the development of the CF-netCDF standard; we don't introduce ideas and mechanisms unless we have a definite need for them. Similarity of form is not a reason for introducing an abstract type. We have already discussed the point that the CellMeasure and AuxiliaryCoordinate constructs are similar in form, but they are distinct in function, and it would not be helpful to regard them as special cases of a more abstract construct in the data model, although it might well be sensible to implement them as subclasses of a general class.

Relation of Dimensions to Fields. CF-netCDF follows the Unidata convention of associating coordinate variables and data variables implicitly via the identity of dimensions:

  double var1(time,lat,lon);
  double time(time) ;
  double lat(lat) ;
  double lon(lon) ;

Stephen describes an alternative convention in which each netCDF variable is defined with its own set of dimensions:

  double var1(time, lat, lon) ;
    var1:time_variable = 'time' ;
    var1:lat_variable = 'latitude' ;
    var1:lon_variable = 'longitude' ;
  double time(t) ;
  double latitude(y) ;
  double longitude(x) ;

In this convention, time and t are numerically equal, and likewise lat and y, lon and x. This convention looks more complicated and involves redundancy, but I agree that it could be done this way, in which dimensions belong to their variables and there are explicit statements of how they are related. However, it's not CF! Our proposed data model describes a Dimension as being part of a Field because that seemed to us to be a good description of the CF-netCDF convention. The Dimension does not need to exist unless the Field exists. Similarly, the coordinate array does not need to exist unless the Dimension exists, so the coordinate array is part of our Dimension construct. It is an optional part, however, because it is possible to have a dimension of a data variable which does not have a 1D coordinate variable.

2D slice of 3D Field. I would say that if a slice is taken of a 3D Field, we still have a 3D Field, but one of the Dimensions has been reduced to size 1. Like David, I would say that the Field still has this Dimension, even if we drop it from the data variable, which might now be 2D. This is consistent with the CF-netCDF standard, because the size-1 dimension might be deleted, and the size-1 coordinate variable replaced with a scalar coordinate variable (listed in the coordinates attribute). Our data model follows the CF-netCDF standard in regarding size-1 coordinate variables and scalar coordinate variables as equivalent. We think this implies that the data variable does not have to include the size-1 dimensions. An important reason why size-1 dimensions have this special status is that it makes no difference to the contents of the data array whether and where they are included in the list of dimensions; you can always change the shape of it by inserting or deleting size-1 dimensions at any position.
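The last point can be checked with a small Python sketch, assuming row-major (C) storage order and nested lists as a stand-in for the data array. The `flatten` and `reshape` helpers are written for this illustration only:

```python
def flatten(nested):
    """Yield the elements of a nested list in row-major order."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def reshape(flat, shape):
    """Build a nested list of the given shape from a flat sequence (row-major)."""
    flat = list(flat)
    if not shape:
        return flat[0]
    step = len(flat) // shape[0]
    return [reshape(flat[i * step:(i + 1) * step], shape[1:]) for i in range(shape[0])]


contents = list(range(6))
# The same six values with a size-1 dimension absent, or inserted at each position.
for shape in [(2, 3), (1, 2, 3), (2, 1, 3), (2, 3, 1)]:
    array = reshape(contents, shape)
    print(shape, list(flatten(array)) == contents)
```

Each shape holds the same values in the same order; only the nesting changes.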

Field without data. We recognise this possibility in our proposed data model even though it is actually not possible in CF-netCDF, since it's the data variable which identifies the dimensions and scalar coordinate variables. However a Field without data, which we call a space (but we don't recognise this as a construct in the model), could be useful in applications, for instance you might use it to specify the grid you want to a regridding application.

Thanks for this interesting discussion. Cheers

Jonathan

comment:39 Changed 9 years ago by spascoe

Relation of Dimensions to Fields. My example of the alternative dimension convention was intended to illustrate that it is possible to relate a variable to its coordinate variables without the use of shared dimensions, thus supporting my case that dimension constructs are not inherent to the CF data model. I agree the convention isn't as convenient as the way CF actually does it. (As an aside, it is quite close to the way it's represented in NetCDF4's HDF5 encoding, where there aren't really shared dimensions.)

2D slice of 3D Field. If bounds are defined I can see how a slice would retain the same dimensionality as its source Field. However, once you squash the unit dimension I don't see how this can be true. I read Jonathan's point as saying a slice F1 of shape [1,80,160] taken from a Field F2 of shape [200,80,160] is equivalent to the same slice with the first dimension squashed, say slice F3 of shape [80,160], and both F1 and F3 are 3D Fields (i.e. rank=3). It appears F3 somehow remembers the rank of the original field F2.

comment:40 Changed 9 years ago by jonathan

Dear Stephen

Relation of Dimensions to Fields. This is intriguing but I suspect I haven't grasped your point yet. I agree that it is possible to relate a data variable to its coordinate variables without the use of shared dimensions. However the way it is done in HDF and the netCDF method you suggested are not how it is done in CF, as you say. Therefore I don't think these implementations tell us about the CF data model. The CF data model is an abstraction of how CF-netCDF describes data. In my interpretation, the data variable is central to CF-netCDF, and correspondingly we have the Field as the central concept in the data model. In CF-netCDF, the data variable has netCDF dimensions; thus in the data model the Field construct contains Dimension constructs.

The question we were discussing with Mark was where the information about the size of the dimensions resides. David and I think that the size of a dimension is the property of a Dimension, not the property of a Field. Thus, a Field has certain Dimensions, and each of these Dimensions has a size. Take your array of shape (200,80,160). I would say that this array has three dimensions, its first dimension has a size of 200, and this dimension could also have a coordinate array (which must have size 200 because it belongs to this dimension). To me this seems a more natural model of the data than the alternative of saying that there exists a data array whose first dimension has size 200, and there exists a coordinate array of size 200, and there is a mechanism which explicitly records that these are related. This is less natural because the coordinate array would not be required at all if the data array did not exist or did not have this dimension.

2D slice of 3D field. When you take a slice of shape (1,80,160) from a Field of shape (200,80,160) you do not delete the first dimension of the space, but you can delete the first dimension of the data array. The data array can be squashed to (80,160) without changing its contents. For example, suppose the 3D field is (pressure,lat,lon), and suppose we take the single level at pressure=500 mbar. The pressure dimension now has size one. There is a size-1 coordinate variable of 500 mbar. In CF-netCDF there are two alternative and equivalent ways to encode this:

  pressure=1; // dimension
  float temperature(pressure,lat,lon);
  float pressure(pressure);
  float lat(lat);
  float lon(lon);

  float temperature(lat,lon);
    temperature:coordinates="pressure";
  float pressure; // scalar coordinate variable
  float lat(lat);
  float lon(lon);

I argue that, in the data model, the Field has a pressure Dimension in both cases, with size 1, and with a coordinate value. However the data array does not need this size-1 dimension. As a further step, you could explicitly delete the size-1 pressure dimension. Then the Field is reduced to 2D.
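As a rough sketch of this argument in Python: the `Dimension` and `Field` classes below are hypothetical stand-ins for the constructs of the proposed model, not the reference implementation, and the lat and lon coordinate arrays are omitted for brevity. The sizes 80 and 160 are taken from the example shape above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Dimension:
    size: int
    coordinate: Optional[List[float]] = None  # optional 1-D coordinate array


@dataclass
class Field:
    dimensions: Dict[str, Dimension]  # the Field's space
    data_dimensions: Tuple[str, ...]  # dimensions actually spanned by the data array

    def data_shape(self):
        return tuple(self.dimensions[name].size for name in self.data_dimensions)


def make_space():
    """Dimensions of the single-level temperature example."""
    return {
        "pressure": Dimension(1, [500.0]),  # size 1, coordinate value 500 mbar
        "lat": Dimension(80),
        "lon": Dimension(160),
    }


# First encoding: the data array keeps the size-1 pressure dimension.
first = Field(make_space(), ("pressure", "lat", "lon"))
# Second encoding: pressure is a scalar coordinate, absent from the data array.
second = Field(make_space(), ("lat", "lon"))

print(first.data_shape(), second.data_shape())
print(first.dimensions == second.dimensions)
```

Under these assumptions the two encodings differ only in the shape of the data array; the set of Dimensions, including the pressure value, is identical.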

What we are doing. Constructing the data model is a sort of reverse engineering. The object code is the CF convention. The source code, which we are trying to reconstruct, is the abstract ideas that motivated the convention. However, this process is not really a reconstruction, but a construction, since the data model was not previously defined. In fact, the authors of the convention may have had slightly different ideas in mind. So long as those different ideas didn't lead to different agreed conventions, the data model was truly agnostic about the choices involved. We could decide now on a particular interpretation of CF in unclear cases, but we don't have to. If we really can't decide what CF means on a particular issue, it may be wiser to leave it undefined until a new proposal for changing the conventions makes a decision more necessary.

Best wishes

Jonathan

comment:41 follow-up: ↓ 42 Changed 9 years ago by spascoe

Thanks for the interesting discussion. I'll have to split this up into smaller pieces.

Relation of Dimensions to Fields. I assume that we are aiming at a minimum set of concepts which describes CF. If so I argue a dimension construct is unnecessary as an entity separate from its coordinate variable. I simply believe we don't need both to represent CF data. To illustrate this I showed an alternative NetCDF encoding of a Field which didn't share any dimensions between variables.

The question we were discussing with Mark was where the information about the size of the dimensions resides. David and I think that the size of a dimension is the property of a Dimension, not the property of a Field. Thus, a Field has certain Dimensions, and each of these Dimensions has a size. Take your array of shape (200,80,160). I would say that this array has three dimensions, its first dimension has a size of 200, and this dimension could also have a coordinate array (which must have size 200 because it belongs to this dimension). To me this seems a more natural model of the data than the alternative of saying that there exists a data array whose first dimension has size 200, and there exists a coordinate array of size 200, and there is a mechanism which explicitly records that these are related. This is less natural because the coordinate array would not be required at all if the data array did not exist or did not have this dimension.

In my view the way NetCDF dimensions are shared between variables is "a mechanism which explicitly records that [variable dimensions] are related". I think the disagreement over where dimension-size resides indicates there is redundancy in the proposed model. E.g. we can say

The size of a Field's dimension is the length of its associated coordinate variable array
Two Fields share a dimension if they are associated with the same coordinate variable

Neither of these statements relies on the existence of a separate dimension construct.
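A minimal Python sketch of these two statements, assuming hypothetical `Coordinate` and `Field` classes and made-up field names (`tas`, `orog`). Note that the sketch has no way to give a size to an axis that lacks a coordinate array.

```python
class Coordinate:
    """A coordinate variable: a name plus its 1-D array of values."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)


class Field:
    """A Field that refers only to coordinates, one per axis of its data array."""

    def __init__(self, name, coordinates):
        self.name = name
        self.coordinates = list(coordinates)

    def shape(self):
        # Statement 1: the size of each dimension is the length of its coordinate array.
        return tuple(len(c.values) for c in self.coordinates)


def shared_axes(a, b):
    # Statement 2: two Fields share a dimension if they refer to the same coordinate.
    return [c.name for c in a.coordinates if any(c is d for d in b.coordinates)]


time = Coordinate("time", range(100))
lat = Coordinate("latitude", range(145))
lon = Coordinate("longitude", range(192))

tas = Field("tas", [time, lat, lon])
orog = Field("orog", [lat, lon])

print(tas.shape())
print(shared_axes(tas, orog))
```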

comment:42 in reply to: ↑ 41 ; follow-up: ↓ 43 Changed 9 years ago by davidhassell

Replying to spascoe:

It's really good that this discussion is opening up - thanks.

Relation of Dimensions to Fields. I assume that we are aiming at a minimum set of concepts which describes CF.

I think so, yes.

I argue a dimension construct is unnecessary as an entity separate from its coordinate variable. I simply believe we don't need both to represent CF data. To illustrate this I showed an alternative NetCDF encoding of a Field which didn't share any dimensions between variables.

I agree. Jonathan and I do not have them as separate entities - our Dimension construct fulfils both roles in that a Dimension's metadata includes its size and, optionally, a coordinate array.

In my view the way NetCDF dimensions are shared between variables is "a mechanism which explicitly records that [variable dimensions] are related". I think the disagreement over where dimension-size resides indicates there is redundancy in the proposed model. E.g. we can say

The size of a Field's dimension is the length of its associated coordinate variable array

Two Fields share a dimension if they are associated with the same coordinate variable

Neither of these statements relies on the existence of a separate dimension construct.

Perhaps we should try to answer this question first:

Is a coordinate array mandatory for every dimension of the field's space?

I think that the answer is 'no', i.e. a dimension of the field might be defined purely in index-space. In this case I don't think that the dimension is a coordinate (because index-space has no physical meaning) yet the field still needs to know the size of the dimension.

As I understand it, the current CF conventions do not force every dimension to have a coordinate array. Section 6.1.1 Geographic Regions is an example of this. In this example the dimension does, as it happens, have a 1-d auxiliary region coordinate associated with it, but the auxiliary coordinate does not fulfil the 'coordinate array for the dimension' role.

All the best,

David

comment:43 in reply to: ↑ 42 Changed 9 years ago by jonathan

Replying to davidhassell:

As I understand it, the current CF conventions do not force every dimension to have a coordinate array.

That's quite right, and in CF 1.6 (our data model is for 1.5) this possibility has been made explicit in the new section 4.5 Discrete Axis:

The spatiotemporal coordinates described in sections 4.1-4.4 are continuous variables, and other geophysical quantities may likewise serve as continuous coordinate variables, for instance density, temperature or radiation wavelength. By contrast, for some purposes there is a need for an axis of a data variable which indicates either an ordered list or an unordered collection, and does not correspond to any continuous coordinate variable. Consequently such an axis may be called “discrete”. A discrete axis has a dimension but might not have a coordinate variable. Instead, there might be one or more auxiliary coordinate variables with this dimension (see preamble to section 5). Following sections define various applications of discrete axes, for instance section 6.1.1 “Geographical regions”, section 7.3.3 “Statistics applying to portions of cells”, section 9.3 “Representation of collections of features in data variables”.

Jonathan

comment:44 in reply to: ↑ 37 ; follow-up: ↓ 46 Changed 9 years ago by markh

Replying to Oehmke:

I wanted to chime in here about the idea of allowing a Field to be a slice of the total space. In ESMF we allow an ESMF Field to be a subset of the total space defined by the associated ESMF Grid which defines its coordinates, dimension and distribution. We do this by allowing the user to specify a gridToFieldMap when the Field is created. The gridToFieldMap specifies which dimension of the Field each dimension of the Grid is associated with. Maybe something similar could be used here. For example, if you wanted to specify that a Field was a slice of another Field you could have an attribute (e.g. slice_map) which points to the Field and specifies which dimensions it maps to.

This is an interesting concept, which seems to me to be providing a mechanism for many datasets to share the same collection of coordinates. This additional capability enables this sharing to occur in cases where one dataset is a coherent subset of another.

However, the wording of the CF Conventions strongly implies that each data variable is an independent entity, and that sharing of coordinates is an encoding artifact, with no semantic meaning. The description of how global variables may be overridden individually by data variables is one example of this perspective.

As such, I don't think this concept is consistent with the CF model. I like the advantages that come of mandating that each Field must be a self contained, self describing instance.

It is then the metadata which must be used to compare and relate variables, not interrelated characteristics, like subsetted dimensions; this feels more interoperable to me.

Whilst I can see its use in a number of implementation cases, I think it is implementation specific; I don't support this approach for the CF data model.

comment:45 Changed 9 years ago by markh

On the Relation of Dimensions to Fields:

I think that dimensions are properties of Fields, I do not think that they have an identity or any explicit independent characteristics. As such, I do not think that the 'Dimension' construct is a useful concept for the data model, in fact I think it is somewhat confusing.

I would rather see a Field with emergent properties of 'dimensionality' and 'shape' derived from the data of the Field.

There is no controlled terminology for dimension naming in the CF Conventions and no semantics implied by dimension constructs in CF NetCDF files. I suggest this supports my perspective.

Where a dimension of a Field has a one dimensional numerical monotonic coordinate related to that dimension, this coordinate may have a special type, indicating it as the primary characterisation mechanism for the dimension, with a Field constraint that each dimension may only have one or zero of these DimensionCoordinate instances per dimension.

I believe that this perspective minimally captures the semantics of the CF conventions for NetCDF files whilst preserving the implementation agnostic nature of the data model.

I also do not believe the data model should support data free Fields, unless there is a strong justification to do so; as pointed out, these are not able to be encoded as CF NetCDF files (although a 'missing data Field' is).

comment:46 in reply to: ↑ 44 ; follow-up: ↓ 53 Changed 9 years ago by Oehmke

Replying to markh:

Replying to Oehmke:

I wanted to chime in here about the idea of allowing a Field to be a slice of the total space. In ESMF we allow an ESMF Field to be a subset of the total space defined by the associated ESMF Grid which defines its coordinates, dimension and distribution. We do this by allowing the user to specify a gridToFieldMap when the Field is created. The gridToFieldMap specifies which dimension of the Field each dimension of the Grid is associated with. Maybe something similar could be used here. For example, if you wanted to specify that a Field was a slice of another Field you could have an attribute (e.g. slice_map) which points to the Field and specifies which dimensions it maps to.

This is an interesting concept, which seems to me to be providing a mechanism for many datasets to share the same collection of coordinates. This additional capability enables this sharing to occur in cases where one dataset is a coherent subset of another.

However, the wording of the CF Conventions strongly implies that each data variable is an independent entity, and that sharing of coordinates is an encoding artifact, with no semantic meaning. The description of how global variables may be overridden individually by data variables is one example of this perspective.

As such, I don't think this concept is consistent with the CF model. I like the advantages that come of mandating that each Field must be a self contained, self describing instance.

It is then the metadata which must be used to compare and relate variables, not interrelated characteristics, like subsetted dimensions; this feels more interoperable to me.

Whilst I can see its use in a number of implementation cases, I think it is implementation specific; I don't support this approach for the CF data model.

You could use it to share coordinates as you describe, but I was thinking of it more as useful metadata for describing the relationship between two Fields when one is a slice of the other. From the previous discussion it looks like when you take a slice of a Field you lose the information about what level you took the slice at. Maybe there's an existing CF way of indicating this that I haven't seen? Though, I guess this is probably drifting away from the topic at hand.

I agree that dimensions in CF are probably best represented as properties of Fields, rather than independent entities. The 'dimension' construct is useful, but since the goal is to capture CF conventions as closely as possible without added entities it seems unnecessary.

comment:47 follow-up: ↓ 51 Changed 9 years ago by caron

I have to strongly object to the idea that dimensions are not independent objects in the CF data model. My apology for not having closely read all the posts, but here I go:

1. Variables/Fields are finite samplings of a continuous function

1a. Variables represent functions Rn -> R (R= real numbers, Rn = R x R x .. x R, the product space of reals). The canonical use case is that Rn represents earth space and time. Variables can in fact be vector valued, but it's simpler to just consider scalar functions, e.g. the range is R.

1b. We represent this function by sampling it at distinct points, and recording those values in a multidimensional array.

2. The Coordinate System defines the domain of the Field, that is, Rn.

2a. Each coordinate is a sampled function on Rm -> R, where Rm is a subset of Rn.

2b. The set of Coordinates for the field is the field's coordinate system.

3. The set of Dimensions for the Field represent the sampling.

3a. The Field must share the sampling with its Coordinates, so that each point in the sampling can be located in the domain.

3b. It's typical that there is more than one physical value at each sampled point, which we represent as multiple Fields with the same coordinate system and therefore shared dimensions.

This interplay between the Fields/Variables, their coordinate system and their shared dimensions is simple and elegant. One can and sometimes must do more complicated things, but you really don't want to mess with the simple and common case.

AFAIU, this is the essence of the model that Jonathan and David are proposing, but I will try to catch up on the nuances of what you all are discussing. Apologies if my terminology is not quite right.

comment:48 Changed 9 years ago by caron

What is a slice of a 3D Field?

When the cell bounds have extent in the reduced dimension, it's probably more natural to consider it a reduced 3D field. If there is no extent (e.g. a point sample) then 2D might be more natural.

If you think of it as a 2D field, in the language of topology, it's a "2D manifold embedded in a 3D space". This is especially useful when you think of it as a sampling of a continuous function, and/or you visualize it in coordinate space.

comment:49 Changed 9 years ago by caron

2D Coordinates

Consider the canonical swath dataset:

dim:
    scan = 1234; xscan = 2345;
var:
    float data(xscan,scan);
        data:coordinates = "lat lon";
    float lat(xscan, scan); float lon(xscan, scan);

Note that there is not a 1-1 correspondence of coordinate and dimension. A good example to keep in mind to make sure you are not being misled by the special characteristics of gridded data.
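The lack of a 1-1 correspondence can be seen by listing, for each dimension of `data`, the coordinates that span it. This is only a sketch: the dictionary layout and the `coordinates_per_dimension` helper are assumptions made for illustration.

```python
# Metadata of the swath example, held in plain dictionaries.
dims = {"scan": 1234, "xscan": 2345}
variables = {
    "data": {"dims": ("xscan", "scan"), "coordinates": ("lat", "lon")},
    "lat": {"dims": ("xscan", "scan")},
    "lon": {"dims": ("xscan", "scan")},
}


def coordinates_per_dimension(name):
    """Map each dimension of variable `name` to the coordinates that span it."""
    result = {dim: [] for dim in variables[name]["dims"]}
    for coord in variables[name].get("coordinates", ()):
        for dim in variables[coord]["dims"]:
            result[dim].append(coord)
    return result


print(coordinates_per_dimension("data"))
```

Both `xscan` and `scan` are associated with both `lat` and `lon`, so neither dimension has a coordinate of its own.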

comment:50 follow-up: ↓ 58 Changed 9 years ago by spascoe

I think John's approach of expressing the CF-model in mathematical terms is the only way we are going to get a clear data model. I'm definitely not quibbling with the existence of dimensions when expressed in these terms. The advantage of mathematical terminology is that it avoids potentially ambiguous terms such as "contains", "is related to". Thanks John!

I may have misunderstood but I think John introduces several concepts that were thus far absent from Jonathan's and/or Mark's approaches:

A variable as a continuous function that is sampled
A Coordinate System as the domain of a Field (maybe this is Jonathan's Transform?)

comment:51 in reply to: ↑ 47 ; follow-up: ↓ 60 Changed 9 years ago by markh

Replying to caron:

This is an interesting approach, which I would like to understand further.

I have to strongly object to the idea that dimensions are not independent objects in the CF data model. My apology for not having closely read all the posts, but here I go:

1. Variables/Fields are finite samplings of a continuous function

I agree

1a. Variables represent functions Rn -> R (R= real numbers, Rn = R x R x .. x R, the product space of reals). The canonical use case is that Rn represents earth space and time. Variables can in fact be vector valued, but it's simpler to just consider scalar functions, e.g. the range is R.

I think for CF 1.5 it is better that we stick to scalar variables.

1b. We represent this function by sampling it at distinct points, and recording those values in a multidimensional array.

I agree

2. The Coordinate System defines the domain of the Field, that is, Rn.

Not sure about this. The basis for a subset of the domain can be specified by a Coordinate System referenced by DimensionCoordinates, but this is all optional. Many of the dimensions of the Field are not in any way related to a Coordinate System, such as ones representing time, ensemble etc.

Unless your use of the term coordinate system and mine are significantly different. Do you mean 'a collection of Coordinates'? (as in 2b)

2a. Each coordinate is a sampled function on Rm -> R, where Rm is a subset of Rn.

2b. The set of Coordinates for the field is the field's coordinate system.

I wonder, as above, if we could use a different, unambiguous term for a set of Coordinates?

Also, are you defining the degrees of freedom for a particular Field here? I don't think that is defined by the Coordinates.

3. The set of Dimensions for the Field represent the sampling.

I wonder, could I equally say

the dimensionality and extent of the Field's data array represents the sampling

3a. The Field must share the sampling with its Coordinates, so that each point in the sampling can be located in the domain.

Consistency is key, so a mechanism is required to match the coordinates to the field, but I don't think this yet tells us which mechanism must be used.

This also raises the question:

Is a Field valid if aspects of its sampling are not described by any Coordinates?

3b. It's typical that there is more than one physical value at each sampled point, which we represent as multiple Fields with the same coordinate system and therefore shared dimensions.

I don't think this is consistent with the CF conventions. I think that a Field defines a set of sampling points for 1 physical value. I don't think any Fields share coordinates or dimensions.

Two fields can be compared to see if their coordinates and dimensions are the same, but they are not shared.

This interplay between the Fields/Variables, their coordinate system and their shared dimensions is simple and elegant. One can and sometimes must do more complicated things, but you really don't want to mess with the simple and common case.

AFAIU, this is the essence of the model that Jonathan and David are proposing, but I will try to catch up on the nuances of what you all are discussing.

Apologies if my terminology is not quite right.

We'll happily work on that; the ideas are of great interest.

In your example:

2D Coordinates

Consider the canonical swath dataset:

dim:

    scan = 1234; xscan = 2345;

var:

    float data(xscan,scan);

        data:coordinates = "lat lon";

    float lat(xscan, scan); float lon(xscan, scan);

Are 'scan' and 'xscan' crucial pieces of information?

thank you

mark

comment:52 Changed 9 years ago by markh

On the Relation of Dimensions to Fields:

Having stated my current view on this I would like to further explore the alternative presented, that dimensions are independent constructs. A few questions arise in my mind:

Do Dimension instances have an identity?
1. If so, how does this manifest itself?
Do a Field's Dimensions define the dimensionality and shape of the Field?
1. Is the Field's data array constrained to match the dimensionality and shape as defined?
What does a Dimension instance with no Coordinate look like?

comment:53 in reply to: ↑ 46 Changed 9 years ago by markh

Replying to Oehmke:

You could use it to share coordinates as you describe, but I was thinking of it more as useful metadata for describing the relationship between two Fields when one is a slice of the other. From the previous discussion it looks like when you take a slice of a Field you lose the information about what level you took the slice at. Maybe there's an existing CF way of indicating this that I haven't seen? Though, I guess this is probably drifting away from the topic at hand.

I think that this case is handled in CF by the 2D slice Field still having a coordinate of some type, with a single value, but all the same coordinate metadata as the equivalent coordinate on the 3D Field.

Thus I can access the information that the 2D slice Field is a subset of the 3D Field by comparing the coordinates, finding the ones which match metadata and identifying that the value in the 2D Field's coordinate is one of the values existing in the 3D Field's coordinate. If this is a DimensionCoordinate or equivalent then this also provides me with the index value the slice was taken across.
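
To illustrate that comparison, here is a minimal Python sketch. The `Coord` class and `find_slice_index` function are hypothetical names used only for this illustration; they are not part of CF or of any proposed implementation. Treating matching `standard_name` and `units` as "the same coordinate metadata", and comparing values exactly, are both simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Coord:
    """Hypothetical stand-in for a coordinate: metadata plus 1-D values."""
    standard_name: str
    units: str
    values: Sequence[float]


def find_slice_index(slice_coord: Coord, full_coord: Coord) -> Optional[int]:
    """Return the index in full_coord at which the slice was taken, or None."""
    same_metadata = (slice_coord.standard_name == full_coord.standard_name
                     and slice_coord.units == full_coord.units)
    # The slice Field is assumed to carry a single-valued coordinate.
    if not same_metadata or len(slice_coord.values) != 1:
        return None
    value = slice_coord.values[0]
    # Exact equality is assumed; real software would likely use a tolerance.
    for index, candidate in enumerate(full_coord.values):
        if candidate == value:
            return index
    return None


# Illustrative values only: a 3D Field's level coordinate and a 2D slice's coordinate.
levels_3d = Coord("air_pressure", "hPa", [1000.0, 850.0, 500.0, 250.0])
level_2d = Coord("air_pressure", "hPa", [500.0])
slice_index = find_slice_index(level_2d, levels_3d)
```

If the metadata differ, or the value is not present in the 3D Field's coordinate, the function returns `None` and no subset relationship is inferred.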

comment:54 follow-ups: ↓ 55 ↓ 56 Changed 9 years ago by davidhassell

Hello,

There's a lot going on here, so I'd like to summarize what I think is being debated. As far as I can recall, these are the only points where there is a lack of consensus, but please forgive and correct me if I've overlooked some previous points, or indeed if you don't agree with my reduction. Three items are of interest:

1. Should there be a Dimension construct?

2. Should data-free Fields which just describe a space be allowed?

3. The independence of Fields

Item 1. is revolving around the questions "is a coordinate array mandatory for every dimension of the field's space?" and "what is the nature of a dimension if it has no coordinate array?"

Item 2. is focusing on whether or not such an entity is allowable in CF, and whether or not such an entity should be possible within the confines of the data model. The second question links back to item 1. in that such a possibility is highly dependent on how the dimensions are constructed.

I propose that item 3. is a diversion, as Bob suggests. The sharing of Field components is an application dependent optimization which should not be part of the data model.

Hope this helps, all the best,

David

comment:55 in reply to: ↑ 54 ; follow-ups: ↓ 57 ↓ 68 Changed 9 years ago by davidhassell

Replying to davidhassell:

Hello,

I'd like to follow up on these points with some ideas and examples that Jonathan and I have been discussing.

Given that a coordinate array is not mandatory for a dimension of the field's space (are we all happy with this?), and presuming that "emergent property" means "inferable from other aspects of the field", then I ask:

How can a Dimension be an emergent property of the field in the acceptable case that there is a size one dimension with no coordinate array which is not spanned by the data array?

I say that the answer is that it can't. Therefore the Dimension has to be a component of the field in its own right, and the dimension size is the only thing which must exist about the Dimension.

Another example is inspired by John's swath case:

How can a Dimension be an emergent property of the field in the acceptable case that the data array has two equally sized dimensions, both of which do not have coordinate arrays, but are spanned by a 2-d auxiliary coordinate?

This is more likely than the first example, as it is quite possible you might not bother to create eastings and northings 1-d coords, but still wish (indeed, you are required) to supply 2D lat and lon auxiliary coords. John comments in the swath example "Note that there is not a 1-1 correspondence of coordinate and dimension", which is true as well as being a perfectly reasonable use case.

We think that the idea of a data-free field which just describes a space is a natural one given our view on item 1. Note that a space is not a construct of our data model. It is natural because everything you need to define it is not dependent on the field's data array. So, whilst such a thing is not explicit in CF1.5, we feel it does no harm! Also a new need for it can be foreseen in CF1.6 section 9. In the discussions (though not yet in sect 9) the possibility was raised of a kind of feature which described location only e.g. the position of a front on a meteorological chart, with no data.

On a terminology point, Mark's coordinate system is not, as I understand it, the same as the field's space. Mark's kind of coordinate system is a selection of dimensions and coordinates that one might make from the complete set that define the space, for a particular purpose like a grid mapping.

All the best,

David

comment:56 in reply to: ↑ 54 Changed 9 years ago by edavis

Hi all,

Replying to davidhassell:

3. The independence of Fields

[snip]

I propose that item 3. is a diversion, as Bob suggests. The sharing of Field components is an application dependent optimization which should not be part of the data model.

I disagree. Capturing which fields are on a shared domain seems an important role for the CF data model. While comparing the coordinates of two fields to see if they are the same (both coordinate values and other information/attributes about the coordinate) is possible, finding some methods to capture information about shared domains seems preferable. Also, that comparison isn't possible for a coordinate in index space.

I'm not sure how to capture this in the data model. But in the netCDF-CF encoding, using shared dimensions to indicate a shared domain is a very simple and, it turns out, quite powerful construct. For simple cases, shared dimensions and netCDF coordinate variables are enough to map index space into coordinate space. In more complex situations (e.g., the swath case that John mentions above) you need more explicit information, (e.g., "data:coordinates='lat lon'", again from John's example). But even there the shared dimensions can be seen as a shared index space.

Anyway, that's my $0.02.

Cheers,

Ethan

comment:57 in reply to: ↑ 55 ; follow-up: ↓ 62 Changed 9 years ago by markh

Replying to davidhassell:

Given that a coordinate array is not mandatory for a dimension of the field's space (are we all happy with this?), and presuming that "emergent property" means "inferable from other aspects of the field", then I ask:

How can a Dimension be an emergent property of the field in the acceptable case that there is a size one dimension with no coordinate array which is not spanned by the data array?

This seems to be pushing at a question about arrays. Are array dimensions of length 1 mere artifacts or are they real properties?

Is an array

shape = (1,1,13,1,1,17,1)

the same as an array

shape = (13,17)

comment:58 in reply to: ↑ 50 Changed 9 years ago by caron

Replying to spascoe:

I think John's approach of expressing the CF-model in mathematical terms is the only way we are going to get a clear data model.

Long ago I wrote up these ideas, which might be useful to review as background, if you like that sort of thing. It's a bit more elaborate than is needed, and unfortunately the figures got lost. The summary I gave above I think covers the important points.

Anyway, here it is; I'm sure a real mathematician could improve it, so feel free to send me comments:

http://www.unidata.ucar.edu/staff/caron/papers/CoordMath.htm

comment:59 Changed 9 years ago by caron

These are general observations on Dimensions, which may shed some light on "dimensions without coordinates".

1) In the netcdf data model, all we have are multidimensional arrays, all dimensions are global objects, and one can only define variables using global dimensions. These syntactic elements have to be used for everything. The obvious example of the limitation of this is having to use global dimensions for char variables, since netcdf-3 doesn't have strings:

  dimensions:
    var_strlen = 56;
  variables:
    char var(var_strlen);

really could be an unnamed "anonymous" dimension:

  variables:
    char var(56);

although better is to have a variable length object, as in netcdf-4:

  variables:
    string var;

2) In particular, dimensions are sometimes used as part of the domain, and sometimes as part of the range, when thinking about variables as sampled functions. For example:

 dimensions:
  time = UNLIMITED;
  vec  = 3;
 variables:
  float wind(time, vec);

probably represents a function whose domain is time and range is a 3-vector, but one can't tell that off-hand. Sometimes you can tell by examining the coordinates, but really it's a "you just have to know" kind of thing.

(However, in this example, it's not a bad idea to add a nominal coordinate variable for the vec dimension, eg {"x","y","z"} or whatever. Also, we might advise to split it into 3 separate variables).

The point is, different intentions for how a dimension is used to define the multidimensional variable can be confusing; you may have to disambiguate them for a precise data model.

Coordinates are completely general for all types of scientific data. But what CF has done over the years is to define a very elegant set of Conventions that both define general coordinate systems, and specialize netcdf for earth science data. Besides standard names, most of the specialization comes from handling space and time coordinates. So, IMO the important use cases are defining georeferencing coordinate systems for actual scientific data.

comment:60 in reply to: ↑ 51 Changed 9 years ago by caron

Replying to markh:

Replying to caron:

...

The Coordinate System defines the domain of the Field, that is, Rn.

Not sure about this. The basis for a subset of the domain can be specified by a Coordinate System referenced by DimensionCoordinates, but this is all optional. Many of the dimensions of the Field are not in any way related to a Coordinate System, such as ones representing time, ensemble etc.

Unless your use of the term coordinate system and mine are significantly different. Do you mean 'a collection of Coordinates'? (as in 2b)

yes.

2a. Each coordinate is a sampled function on Rm -> R, where Rm is a subset of Rn.

2b. The set of Coordinates for the field is the field's coordinate system.

I wonder, as above, if we could use a different, unambiguous term for a set of Coordinates?

What is your definition of "Coordinate System"?

Also, are you defining the degrees of freedom for a particular Field here, I don't think that is defined by the Coordinates.

The set of Dimensions for the Field represent the sampling.

I wonder, could I equally say

the dimensionality and extent of the Field's data array represents the sampling

In my understanding, the number of Dimensions is the dimensionality of the Field's domain, and the number of coordinates is the dimensionality of the space that the field is embedded in.

Eg a satellite image might have 2 Dimensions (and so be a 2D image), but it might have 3 coordinates, say, x,y, and z, which describes the 3 dimensional space the image is embedded in.

Very important to distinguish those two dimensionalities!
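
As a small illustration of the two counts, here is a Python sketch. The dictionary layout and the dimension names `scan` and `xscan` are hypothetical and chosen only for this example.

```python
# Hypothetical description of a satellite image; names are illustrative only.
image = {
    "dimensions": ("scan", "xscan"),   # index space of the data array
    "coordinates": {                   # each coordinate spans both dimensions
        "x": ("scan", "xscan"),
        "y": ("scan", "xscan"),
        "z": ("scan", "xscan"),
    },
}

# Dimensionality of the Field's domain: the image itself is 2D.
domain_dimensionality = len(image["dimensions"])

# Dimensionality of the space the image is embedded in: three coordinates.
embedding_dimensionality = len(image["coordinates"])
```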

?

3a. The Field must share the sampling with its Coordinates, so that each point in the sampling can be located in the domain.

Consistency is key, so a mechanism is required to match the coordinates to the field, but I don't think this yet tells us which mechanism must be used.

This also raises the question:

Is a Field valid if aspects of its sampling are not described by any Coordinates?

not sure what valid means, but that's certainly one possible meaning.

3b. It's typical that there is more than one physical value at each sampled point, which we represent as multiple Fields with the same coordinate system and therefore shared dimensions.

I don't think this is consistent with the CF conventions. I think that a Field defines a set of sampling points for 1 physical value. I don't think any Fields share coordinates or dimensions.

Two fields can be compared to see if their coordinates and dimensions are the same, but they are not shared.

I guess if they are the same, then they are shared. In a UML diagram, you would/could show these as shared.

This interplay between the Fields/Variables, their coordinate system and their shared dimensions, is simple and elegant. One can and sometimes must do more complicated things, but you really don't want to mess with the simple and common case.

AFAIU, this is the essence of the model that Jonathan and David are proposing, but I will try to catch up on the nuances of what you all are discussing.

Apologies if my terminology is not quite right.

we'll happily work on that, the ideas are of great interest

in your example:
2D Coordinates

Consider the canonical swath dataset:

dim:
    scan = 1234;
    xscan = 2345;

var:
    float data(xscan, scan);
        data:coordinates = "lat lon";
    float lat(xscan, scan);
    float lon(xscan, scan);

are 'scan' and 'xscan' crucial pieces of information?

yes, very crucial. let me give an example if you didn't have them:

suppose some idiot flipped dimensions:

float data(lat,lon);
  data:coordinates = "lat lon";
float lat(lon,lat);
float lon(lon,lat);

now look at it without dimensions:

float data(1000,1000);
  data:coordinates = "lat lon";
float lat(1000,1000);
float lon(1000,1000);

so, the shared dimensions are the way we correctly match indices between fields and coordinates, and between fields and fields.
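
The same point can be sketched in Python. The helpers `coord_index` and `candidate_orderings` are hypothetical names invented for this illustration; they show how named dimensions resolve the index mapping, and how sizes alone leave it ambiguous.

```python
from itertools import permutations


def coord_index(data_dims, data_index, coord_dims):
    """Map an index into the data array onto an index into a coordinate array
    by matching dimension names (hypothetical helper, for illustration)."""
    position = dict(zip(data_dims, data_index))
    return tuple(position[name] for name in coord_dims)


def candidate_orderings(data_shape, coord_shape):
    """With sizes only, list every axis permutation that is shape-compatible."""
    return [perm for perm in permutations(range(len(data_shape)))
            if tuple(data_shape[axis] for axis in perm) == tuple(coord_shape)]


# Swath case: the coordinate lists the same named dimensions as the data.
swath_index = coord_index(("xscan", "scan"), (10, 20), ("xscan", "scan"))

# Flipped case: data(lat, lon) but lat(lon, lat); the names still resolve it.
flipped_index = coord_index(("lat", "lon"), (10, 20), ("lon", "lat"))

# Without names, a 1000 x 1000 array admits both axis orderings.
ambiguous = candidate_orderings((1000, 1000), (1000, 1000))
```

With named dimensions the flipped case maps data index (10, 20) onto coordinate index (20, 10); with anonymous sizes both orderings are shape-compatible and nothing in the file says which one is intended.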

comment:61 follow-up: ↓ 65 Changed 9 years ago by caron

I disagree with this part of the proposed CF data model:

"Data variables stored in CF-netCDF files are often not independent, because they share coordinate variables. However, we view this solely as a means of saving disk space, and we assume that software will be able to alter any field construct in memory without affecting other field constructs. For instance, if the coordinates of one field construct are modified, it will not affect any other field construct. Explicit tests of equality will be required to establish whether two data variables have the same coordinates. Such tests are necessary in general if CF is applied to a dataset comprising more than one file, because different variables may then reside in different files, with their own coordinate variables."

As I interpret it, the current situation is that if two variables have the same list of coordinates, then they have the same coordinate system. This is not a convenience to save space, it's a fundamental property of the dataset. Not sure why you'd throw that away and require software to figure it out again?

We do need special mechanisms to handle datasets that span files, but that's not a good reason to restrict the data model to independent fields. The data model is the abstraction we want to create encodings for. So what is the best abstraction?

comment:62 in reply to: ↑ 57 ; follow-up: ↓ 63 Changed 9 years ago by caron

Replying to markh:

Replying to davidhassell:

Is an array

shape = (1,1,13,1,1,17,1)

the same as an array

shape = (13,17)

One can't figure out everything from the syntax, so the answer is maybe.

If all the dimensions of length 1 have coordinate variables, and are removed and replaced by scalar coordinate variables, then those are equivalent in their semantics.

comment:63 in reply to: ↑ 62 ; follow-up: ↓ 64 Changed 9 years ago by markh

Replying to caron:

One can't figure out everything from the syntax, so the answer is maybe.

If all the dimensions of length 1 have coordinate variables, and are removed and replaced by scalar coordinate variables, then those are equivalent in their semantics.

To clarify, I do not intend this question to have any reference to coordinates of any sort. My question regards the properties of arrays of data, which our model is dependent on.

I would like to be able to make a clear statement, within the terms of reference of the CF data model, about an array, such as a Field's data array or a Coordinate's array of values.

I would like to know whether the dimensionality and shape of such an array are inherent properties, including the definition of array dimensions of size 1.

Is an array

shape = (1,1,13,1,1,17,1)

the same as an array

shape = (13,17)

if the 221 values in the two arrays are respectively identical?

The NetCDF implementation explicitly distinguishes between these two, as do many others. Does the CF data model?

comment:64 in reply to: ↑ 63 ; follow-up: ↓ 70 Changed 9 years ago by jonathan

Dear Mark

I would like to know whether the dimensionality and shape of such an array are inherent properties, including the definition of array dimensions of size 1.

Is an array

shape = (1,1,13,1,1,17,1)

the same as an array

shape = (13,17)

if the 221 values in the two arrays are respectively identical?

The NetCDF implementation explicitly distinguishes between these two, as do many others. Does the CF data model?

The answer to this question must be No, they do not have the same shape. However, CF-netCDF has two mechanisms for recording size-one coordinates, as you know. One of them uses a size-one netCDF dimension of the data array, with a coordinate variable having that dimension; the other is a CF-netCDF scalar coordinate variable, which is dimensionless and is named by the coordinates attribute of the data variable but does not appear in the list of dimensions. CF section 5.7 says "Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable."

In our proposed CF data model, David and I interpret this as meaning that size-one coordinate variables and scalar coordinate variables are logically equivalent. We therefore say "In this data model, a CF-netCDF scalar coordinate variable is regarded as a dimension construct with a size of unity." That means there can be Dimensions of the Field which do not appear as dimensions of the data array. I think this is the same as John said in an earlier entry; the dimensionality of the Field is not necessarily the same as the dimensionality of the data array. However, it is only size-one dimensions which are optional for the data array. They can have this special status, of course, because it does not make any difference to the order of the elements of the array whether and where size-one dimensions are included in the list of dimensions.
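
A short Python sketch may make the point about element order concrete. The helper `flat_offset` is a hypothetical name, and row-major (C-style) storage order is assumed here purely for illustration.

```python
def flat_offset(shape, index):
    """Row-major (C-order) offset of a multi-index within an array of this shape."""
    offset = 0
    for size, i in zip(shape, index):
        if not 0 <= i < size:
            raise IndexError("index out of range")
        offset = offset * size + i
    return offset


full_shape = (1, 1, 13, 1, 1, 17, 1)
squeezed_shape = (13, 17)

# The shapes differ, so the two arrays are not the same array ...
shapes_equal = full_shape == squeezed_shape

# ... but every element sits at the same position in storage order.
same_order = all(
    flat_offset(full_shape, (0, 0, i, 0, 0, j, 0)) == flat_offset(squeezed_shape, (i, j))
    for i in range(13)
    for j in range(17)
)
```

Each size-one dimension multiplies the running offset by one and adds zero, which is why it can be inserted or removed anywhere in the list without reordering the 221 values.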

I think this is a different question from the one about whether there can be Dimensions without coordinate values, isn't it? David and I think the answer to this other question is Yes. The CF standard explicitly provides such constructs (in CF 1.6 section 4.5), and David gave above the example of 2D latitude and longitude aux coord variables whose 1D dimensions do not have coordinate variables.

Best wishes

Jonathan

comment:65 in reply to: ↑ 61 ; follow-up: ↓ 66 Changed 9 years ago by jonathan

Dear John

Replying to caron:

I disagree with this part of the proposed CF data model:

"Data variables stored in CF-netCDF files are often not independent, because they share coordinate variables. However, we view this solely as a means of saving disk space, and we assume that software will be able to alter any field construct in memory without affecting other field constructs. For instance, if the coordinates of one field construct are modified, it will not affect any other field construct. Explicit tests of equality will be required to establish whether two data variables have the same coordinates. Such tests are necessary in general if CF is applied to a dataset comprising more than one file, because different variables may then reside in different files, with their own coordinate variables."

As I interpret it, the current situation is that if two variables have the same list of coordinates, then they have the same coordinate system. This is not a convenience to save space, it's a fundamental property of the dataset. Not sure why you'd throw that away and require software to figure it out again?

We do need special mechanisms to handle datasets that span files, but that's not a good reason to restrict the data model to independent fields. The data model is the abstraction we want to create encodings for. So what is the best abstraction?

It seems that David and I disagree with you and Ethan about this, as regards the CF data model. We agree with you, of course, that in a CF-netCDF dataset if two data variables have the same list of dimensions they must have the same coordinate space. But we do not think that sharing the dimensions should give them a special status.

Perhaps we were not correct to say that this is solely a means of saving disk space. We would agree that it is a natural thing to do in a CF-netCDF file. However, what is done in a CF-netCDF file isn't necessarily the best abstraction. I would say there are several reasons why we should regard Fields as logically independent, even if they share dimensions:

The CF standard is focussed on the data variable, which is the central part of a Field construct. Everything relates to the data variable. There are few parts of the CF standard which talk about relationships between data variables. I don't think there is any statement in the CF standard which recognises a special relationship between data variables which share their coordinate spaces.

As you say above, we need mechanisms to handle the relationship between data variables in different files, when they cannot share coordinates. Although CF-netCDF is a convention for single netCDF files at the moment, I think we all recognise this isn't really adequate, and in the data model we should allow for the requirement for datasets to span many files. A good example is the CMIP5 datasets. In these there are many Fields with dimensions (time,lat,lon) spanning many files covering different ranges of times. Software that processes these datasets should regard them as being in the same (lat,lon) space even though they are not in the same netCDF files.

Moreover, even within a single CF-netCDF file you might have, for instance, two latitude coordinate variables that were actually equal e.g. dimensions lat1=180 and lat2=180 with the same 1D coordinate variable of 180 values. That's not the best way to write a file, but it could happen; one way it might happen is if the contents of the file had been assembled from two different sources of data. I don't want to regard data variables which use lat1 as sharing the same grid while data variables that use lat2 are on a different grid. I think that software which followed that model would be rather inconvenient. I would like my analysis software to test whether two Fields have the same grid when I ask it to do something with them, like compare them or add them. The analysis software can save itself time in this comparison, as a shortcut, by noticing if they use the same coordinate variables in a CF-netCDF file, but if they don't, it should compare the dimensions and coordinate variables, which will inevitably take a bit longer.
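
As a sketch of the kind of test meant here (not a description of any existing software), the following Python compares grids by content and uses object identity only as a shortcut. The function names and the representation of a coordinate as a `(metadata, values)` pair are assumptions made for this example; exact equality of values is also assumed, where real software would probably allow a tolerance.

```python
def same_coordinate(a, b):
    """Compare two coordinates given as (metadata dict, list of values) pairs."""
    if a is b:
        # Shortcut: the very same object, e.g. one shared coordinate variable.
        return True
    (meta_a, values_a), (meta_b, values_b) = a, b
    return meta_a == meta_b and list(values_a) == list(values_b)


def same_grid(field_a, field_b):
    """Fields here are just dicts mapping an axis label to a coordinate."""
    if field_a.keys() != field_b.keys():
        return False
    return all(same_coordinate(field_a[key], field_b[key]) for key in field_a)


lat_meta = {"standard_name": "latitude", "units": "degrees_north"}
lat1 = (lat_meta, [-89.5 + i for i in range(180)])
lat2 = (dict(lat_meta), [-89.5 + i for i in range(180)])  # separate but equal

field_a = {"lat": lat1}
field_b = {"lat": lat1}   # shares the coordinate: the identity shortcut applies
field_c = {"lat": lat2}   # distinct variable, equal content: full comparison
```

Both `same_grid(field_a, field_b)` and `same_grid(field_a, field_c)` report the same grid; only the cost of reaching that answer differs.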

We would like the CF data model also to apply to data which is not CF-netCDF. It should apply, in particular, to data which is being served as though it were netCDF but is actually being obtained from data in some other format. If it's not coming from netCDF files, the convention of shared coordinate variables may not apply to it.

Best wishes

Jonathan

comment:66 in reply to: ↑ 65 ; follow-ups: ↓ 67 ↓ 73 Changed 9 years ago by jonathan

Replying to jonathan:

My last posting (late last night!) ended rather abruptly! To summarise: I agree that in a CF-netCDF file there is a natural association and a special status for data variables which share coordinate variables. However, I feel that in the CF data model we should take a more general approach which doesn't rely on the variables concerned being stored in a single netCDF file. This requires that the relationship, of sharing the coordinate space, has to be recognised just as readily between data variables that do not share files or coordinate variables in the data source that they come from. The most economical way of doing this, I think, is to regard Fields as independent, and to test whether their coordinates, or a subset of their coordinates, are equal, when it is relevant to do so in processing or analysing the data. When they do share coordinate variables, it is particularly easy to recognise that the coordinates are equal. I also feel that the idea of independent data variables is a reasonable interpretation of the CF standard as a whole, in which data variables are generally seen as independently self-describing.

I hope that's better ... what do you think?

Cheers

Jonathan

comment:67 in reply to: ↑ 66 Changed 9 years ago by markh

Replying to jonathan:

My last posting (late last night!) ended rather abruptly! To summarise: I agree that in a CF-netCDF file there is a natural association and a special status for data variables which share coordinate variables. However, I feel that in the CF data model we should take a more general approach which doesn't rely on the variables concerned being stored in a single netCDF file. This requires that the relationship, of sharing the coordinate space, has to be recognised just as readily between data variables that do not share files or coordinate variables in the data source that they come from. The most economical way of doing this, I think, is to regard Fields as independent, and to test whether their coordinates, or a subset of their coordinates, are equal, when it is relevant to do so in processing or analysing the data. When they do share coordinate variables, it is particularly easy to recognise that the coordinates are equal. I also feel that the idea of independent data variables is a reasonable interpretation of the CF standard as a whole, in which data variables are generally seen as independently self-describing.

I hope that's better ... what do you think?

I agree with this perspective.

comment:68 in reply to: ↑ 55 Changed 9 years ago by Oehmke

Replying to davidhassell:

We think that the idea of a data-free field which just describes a space is a natural one given our view on item 1. Note that a space is not a construct of our data model. It is natural because everything you need to define it is not dependent on the field's data array. So, whilst such a thing is not explicit in CF1.5, we feel it does no harm! Also a new need for it can be foreseen in CF1.6 section 9. In the discussions (though not yet in sect 9) the possibility was raised of a kind of feature which described location only e.g. the position of a front on a meteorological chart, with no data.

Sorry to respond to this older item, but I just wanted to say that I think it would definitely be a good idea to allow data free Fields which serve to describe a space. Our (ESMF) regrid application allows the user to generate regridding weights between grids defined in files. If it is possible to define these grids without associated data fields in CF it would certainly be more efficient than always having to have an associated data variable. We currently support the version 1.6 GRIDSPEC and upcoming UGrid formats, so this is probably more applicable for the data model for that version of CF. However, if it's allowable within CF 1.5 it would be good to have it in that data model as well both for regridding and also for consistency's sake.

comment:69 Changed 9 years ago by jonathan

I have changed the data model proposal to say that dimensions of size one may be omitted from the data array. It previously said they must be omitted, which doesn't agree with the CF standard. Jonathan

comment:70 in reply to: ↑ 64 ; follow-up: ↓ 72 Changed 9 years ago by markh

Replying to jonathan:

The answer to this question must be No, [array(1,1,13,1,1,17,1) and array(13,17)] do not have the same shape.

In our proposed CF data model, David and I interpret this as meaning that size-one coordinate variables and scalar coordinate variables are logically equivalent. We therefore say "In this data model, a CF-netCDF scalar coordinate variable is regarded as a dimension construct with a size of unity." That means there can be Dimensions of the Field which do not appear as dimensions of the data array. I think this is the same as John said in an earlier entry; the dimensionality of the Field is not necessarily the same as the dimensionality of the data array. However, it is only size-one dimensions which are optional for the data array. They can have this special status, of course, because it does not make any difference to the order of the elements of the array whether and where size-one dimensions are included in the list of dimensions.

I think there is another interpretation, consistent with the Conventions:

The dimensionality and extent of the Field's data defines the domain and sampling of the Field in index space.

All Coords map to dimensions of the Field's data, further defining the domain and sampling of the Field. Consistency of shape (dimensionality and extent) is required for valid mappings.

A Field may define one Coordinate for each of its dimensions which explicitly defines that dimension.
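
One way to picture the shape-consistency requirement is the Python sketch below. The function `mapping_is_valid` and the idea of passing the mapping as a tuple of data-array axis numbers are assumptions made for illustration, not part of the interpretation itself.

```python
def mapping_is_valid(data_shape, coord_shape, data_axes):
    """Check that a coordinate of coord_shape can map onto the listed axes
    of a data array of data_shape (hypothetical check, for illustration)."""
    if len(coord_shape) != len(data_axes):
        return False
    return all(
        0 <= axis < len(data_shape) and data_shape[axis] == extent
        for extent, axis in zip(coord_shape, data_axes)
    )


data_shape = (72, 96)                                    # e.g. (lat, lon)
lat_ok = mapping_is_valid(data_shape, (72,), (0,))       # 1-D coord on axis 0
aux_ok = mapping_is_valid(data_shape, (72, 96), (0, 1))  # 2-D coord on both axes
lat_bad = mapping_is_valid(data_shape, (73,), (0,))      # extent mismatch
```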

This feels to me like a less complex model than the definition of a Dimension construct.

The consistency and differences between the two types of Coord are explicitly defined as is the Role of the Field in defining a Coord as a DimensionCoordinate. I think this encapsulates information with the correct types in a consistent manner.

I have updated PotentialDataModelTypes to better reflect this interpretation.

comment:71 follow-up: ↓ 78 Changed 9 years ago by markh

A further thought: I do not think that the interpretation I posted (markh) prevents an implementation from providing a Field without data. I'm not convinced this is a requirement for the model, but it has been viewed as a positive feature for some implementations in comments.

In general, an implementation may provide features whilst implementing the model consistently.

For example, a Field's data array could be implemented to allow a 'Null Array', which has dimensionality, extent and order properties but does not define values for any array location. This is distinct from an array of 'missing data'.

In this way an implementation may deliver a 'data free' Field.
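
A minimal Python sketch of such a 'Null Array' might look like the following. The class name `NullArray` and its behaviour on indexing are hypothetical choices for this illustration only.

```python
from math import prod


class NullArray:
    """Hypothetical array with dimensionality and extent but no values."""

    def __init__(self, shape):
        self.shape = tuple(shape)

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def size(self):
        return prod(self.shape)

    def __getitem__(self, index):
        # Unlike 'missing data', there is no value at all to return here.
        raise LookupError("a NullArray defines no values")


# A 'data free' array describing a 72 x 96 index space.
space_only = NullArray((72, 96))
```

Raising an error on access, rather than returning a fill value, is one way to keep the distinction from an array of 'missing data' explicit.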

Our primary implementation (NetCDF) does not support this feature, as far as I understand, but other implementations may be able to.

comment:72 in reply to: ↑ 70 Changed 9 years ago by jonathan

Dear Mark

In your model, how is a scalar coordinate variable interpreted? It does not correspond to a dimension of the data array.

Here's a CF-netCDF example:

  dimensions:
    lon=96;
    lat=72;
  variables:
    float precipitation(lat,lon);
      precipitation:standard_name="precipitation_flux";
      precipitation:units="kg m-2 s-1";
      precipitation:coordinates="time";
    float time; // scalar coordinate variable
      time:units="days since 2012-10-1";
      time:standard_name="time";
    float lat(lat);
      lat:units="degrees_north";
      lat:standard_name="latitude";
    float lon(lon);
      lon:units="degrees_east";
      lon:standard_name="longitude";

In the data model David and I proposed, this would be interpreted as follows:

The Field has
- Dimensions time, lat and lon.
- a data array dimensioned (lat,lon).
- a standard_name property and a units property.
The time Dimension has
- size 1.
- a data array of this size.
- a standard_name property and a units property.
The lat Dimension has
- size 72.
- a data array of this size.
- a standard_name property and a units property.
The lon Dimension has
- size 96.
- a data array of this size.
- a standard_name property and a units property.

Note that the Dimension construct contains the dimension coordinate array.

If the CF-netCDF file contained a dimension time=1, coordinate variable time(time) and data variable precipitation(time,lat,lon), the data array of the Field would also be dimensioned (time,lat,lon), but nothing else would be different.
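
A minimal Python sketch of this interpretation follows. The class and attribute names (`Dimension`, `Field`, `data_dims`) are chosen for illustration and the coordinate values are placeholder zeros; only the sizes and properties come from the CDL above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Dimension:
    """Dimension construct: a size plus the dimension coordinate array."""

    size: int
    values: List[float]
    properties: Dict[str, str]

    def __post_init__(self):
        if len(self.values) != self.size:
            raise ValueError("coordinate array must have the Dimension's size")


@dataclass
class Field:
    properties: Dict[str, str]
    dimensions: Dict[str, Dimension]
    data_dims: Tuple[str, ...]  # Dimensions spanned by the data array, in order

    @property
    def data_shape(self) -> Tuple[int, ...]:
        return tuple(self.dimensions[name].size for name in self.data_dims)


def make_dimensions() -> Dict[str, Dimension]:
    # Placeholder coordinate values (zeros); sizes are those of the example.
    return {
        "time": Dimension(1, [0.0], {"standard_name": "time",
                                     "units": "days since 2012-10-1"}),
        "lat": Dimension(72, [0.0] * 72, {"standard_name": "latitude",
                                          "units": "degrees_north"}),
        "lon": Dimension(96, [0.0] * 96, {"standard_name": "longitude",
                                          "units": "degrees_east"}),
    }


props = {"standard_name": "precipitation_flux", "units": "kg m-2 s-1"}
# Scalar coordinate variable: the data array does not span time.
scalar_form = Field(props, make_dimensions(), ("lat", "lon"))
# Size-one time dimension: the data array spans time as well.
size_one_form = Field(props, make_dimensions(), ("time", "lat", "lon"))
```

In this sketch the two forms differ only in `data_dims`, and hence in the shape of the data array; the Dimensions themselves compare equal.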

Cheers

Jonathan

comment:73 in reply to: ↑ 66 ; follow-up: ↓ 75 Changed 9 years ago by caron

Replying to jonathan:

Replying to jonathan:

My last posting (late last night!) ended rather abruptly! To summarise: I agree that in a CF-netCDF file there is a natural association and a special status for data variables which share coordinate variables. However, I feel that in the CF data model we should take a more general approach which doesn't rely on the variables concerned being stored in a single netCDF file. This requires that the relationship, of sharing the coordinate space, has to be recognised just as readily between data variables that do not share files or coordinate variables in the data source that they come from. The most economical way of doing this, I think, is to regard Fields as independent, and to test whether their coordinates, or a subset of their coordinates, are equal, when it is relevant to do so in processing or analysing the data. When they do share coordinate variables, it is particularly easy to recognise that the coordinates are equal. I also feel that the idea of independent data variables is a reasonable interpretation of the CF standard as a whole, in which data variables are generally seen as independently self-describing.

I hope that's better ... what do you think?

Cheers

Jonathan

Hi Jonathan:

Maybe another way to look at it is: should the CF data model have a way to indicate that two fields share the same domain/sampling/coordinate system? If so, then one can say that, for two fields, if all of their coordinate variables match, then they have the same coordinate system. One can tell if they match either because the coordinate variables are the same, or because their values and metadata are identical.

Would that work?
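
One possible reading of that rule, sketched in Python. The `Coord` type and the function names are hypothetical, and the coordinate values are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Coord:
    values: Tuple[float, ...]
    metadata: Dict[str, str]


def coords_match(a: Coord, b: Coord) -> bool:
    # Same object (e.g. one shared coordinate variable) ...
    if a is b:
        return True
    # ... or distinct objects with identical values and metadata.
    return a.values == b.values and a.metadata == b.metadata


def same_coordinate_system(coords_a: List[Coord], coords_b: List[Coord]) -> bool:
    if len(coords_a) != len(coords_b):
        return False
    a_in_b = all(any(coords_match(a, b) for b in coords_b) for a in coords_a)
    b_in_a = all(any(coords_match(a, b) for a in coords_a) for b in coords_b)
    return a_in_b and b_in_a


lat = Coord((-45.0, 45.0), {"standard_name": "latitude", "units": "degrees_north"})
lat_copy = Coord((-45.0, 45.0), {"standard_name": "latitude", "units": "degrees_north"})
lat_other = Coord((-45.0, 45.0), {"standard_name": "latitude", "units": "degrees"})

shared = same_coordinate_system([lat], [lat])             # same object
equal = same_coordinate_system([lat], [lat_copy])         # equal by inspection
different = same_coordinate_system([lat], [lat_other])    # metadata differs
```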

John

comment:74 follow-up: ↓ 88 Changed 9 years ago by markh

Replying to jonathan:

Dear Mark

In your model, how is a scalar coordinate variable interpreted? It does not correspond to a dimension of the data array.

Hello Jonathan

Using the types I have advocated, I would specify your example as:

Field:
- standard_name="precipitation_flux"
- units="kg m-2 s-1"
- data = arr(shape(72,96))
- DimCoords =
  - dim0 -> Coordinate:
    - units="degrees_east"
    - standard_name="longitude"
    - points=arr(shape(96))
  - dim1 -> Coordinate:
    - units="degrees_north"
    - standard_name="latitude"
    - points=arr(shape(72))
- AuxCoords =
  - [none] -> Coordinate:
    - units="days since 2012-10-1"
    - standard_name="time"
    - points = arr(shape(1,))

In this case the CF NetCDF Scalar Coordinate Variable is interpreted as an instance of a Coordinate type, meeting the constraints for that type. It is referenced by the Field via [none] dimensions, as there are no Field dimensions for it to map to.

comment:75 in reply to: ↑ 73 ; follow-ups: ↓ 76 ↓ 77 ↓ 79 Changed 9 years ago by jonathan

Dear John

Replying to caron:

Maybe another way to look at it is: should the CF data model have a way to indicate that two fields share the same domain/sampling/coordinate system? If so, then one can say that, for two fields, if all of their coordinate variables match, then they have the same coordinate system. One can tell if they match either because the coordinate variables are the same, or because their values and metadata are identical.

Would that work?

Yes! I think so. It would be a good idea for us to include in the data model a statement of rules to determine whether the spaces are the same for two data variables. By space David and I mean everything about the Field except its data and the properties of the data, so actually that's not just coordinates but also, for instance, CellMeasures. For example, I would not want to add two Fields together if they had different cell areas arrays, even if their coordinates were the same. Would you agree?

The rules for equivalence of spaces would probably be similar to, but simpler than, the rules for whether Fields can be aggregated. (We proposed some rules for that in ticket 78 and an accompanying document.) However, aggregation is not part of the data model; it's something you can do with the Fields which the data model describes.

Best wishes

Jonathan

comment:76 in reply to: ↑ 75 Changed 9 years ago by edavis

Hi Jonathan,

Replying to jonathan:

Yes! I think so. It would be a good idea for us to include in the data model a statement of rules to determine whether the spaces are the same for two data variables. By space David and I mean everything about the Field except its data and the properties of the data, so actually that's not just coordinates but also, for instance, CellMeasures. For example, I would not want to add two Fields together if they had different cell areas arrays, even if their coordinates were the same. Would you agree?

Are you thinking about the rules to determine if two fields share the same space as an implementation detail? For instance,

"A client must check <some conditions> to determine if two fields exist in the same coordinate reference system with the same extent, sampling, etc. (and thus can be directly compared)."

Or as something that is explicitly captured in the abstract data model and can be mapped to by encodings of the CF data model? For instance, the data model might say something like:

"Two fields that reference the same space exist in the same coordinate reference system with the same extent, sampling, etc. (and thus can be directly compared)."

and, building on that, the CF netCDF encoding might give a few mappings:

"Data variables that share the same (auxiliary) coordinate variables can be mapped to fields that reference the same space."

"Multiple data variables that have identical coordinate variables can be mapped to fields that reference the same space."

I guess one problem with capturing all this in the data model is that the notion of a space would have to get more complicated to capture the case where multiple variables are on the same X and Y coordinates but have different Z coordinates (e.g., temperature on pressure levels vs on sigma levels).

Cheers,

Ethan

comment:77 in reply to: ↑ 75 Changed 9 years ago by Oehmke

Replying to jonathan:

Dear John

Replying to caron:

Maybe another way to look at it is: should the CF data model have a way to indicate that two fields share the same domain/sampling/coordinate system? If so, then one can say that, for two fields, if all of their coordinate variables match, then they have the same coordinate system. One can tell if they match either because the coordinate variables are the same, or because their values and metadata are identical.

Would that work?

Yes! I think so. It would be a good idea for us to include in the data model a statement of rules to determine whether the spaces are the same for two data variables. By space David and I mean everything about the Field except its data and the properties of the data, so actually that's not just coordinates but also, for instance, CellMeasures. For example, I would not want to add two Fields together if they had different cell areas arrays, even if their coordinates were the same. Would you agree?

The rules for equivalence of spaces would probably be similar to, but simpler than, the rules for whether Fields can be aggregated. (We proposed some rules for that in ticket 78 and an accompanying document.) However, aggregation is not part of the data model; it's something you can do with the Fields which the data model describes.

This would be very useful for us (ESMF) as well. It would help us to determine when a regrid matrix computed for one Field could be applied to another Field, saving the cost of computing the matrix again. When reading in and creating Fields within ESMF, it would also help us to determine when two Fields could be created pointing to the same grid, saving the cost of creating another grid unnecessarily.

I also agree about having the CellMeasures match: variables like cell area would affect certain kinds of regridding (like conservative), and so would determine whether a matrix could be shared.
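
To illustrate the kind of reuse meant here, a rough Python sketch. This is not ESMF's API: `space_key`, `RegridCache` and the stand-in weight computation are all hypothetical, and the grid values are invented.

```python
from typing import Callable, Dict, Hashable, Tuple

Values = Tuple[float, ...]


def space_key(coords: Dict[str, Values], cell_measures: Dict[str, Values]) -> Hashable:
    # The key covers coordinates AND cell measures, so two Fields with equal
    # coordinates but different cell areas do not share a regrid matrix.
    return (tuple(sorted(coords.items())), tuple(sorted(cell_measures.items())))


class RegridCache:
    def __init__(self, compute: Callable[[Hashable, Hashable], object]):
        self._compute = compute
        self._weights: Dict[Tuple[Hashable, Hashable], object] = {}
        self.computations = 0

    def weights_for(self, src: Hashable, dst: Hashable) -> object:
        key = (src, dst)
        if key not in self._weights:
            self._weights[key] = self._compute(src, dst)
            self.computations += 1
        return self._weights[key]


cache = RegridCache(lambda src, dst: "weights")  # stand-in for the real computation
grid = {"lat": (-45.0, 45.0), "lon": (0.0, 180.0)}
area = {"area": (1.0, 1.0, 1.0, 1.0)}
other_area = {"area": (1.0, 1.0, 2.0, 2.0)}
target = space_key({"lat": (0.0,), "lon": (90.0,)}, {})

cache.weights_for(space_key(grid, area), target)        # computed
cache.weights_for(space_key(grid, area), target)        # reused: same space
cache.weights_for(space_key(grid, other_area), target)  # recomputed: areas differ
```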

comment:78 in reply to: ↑ 71 Changed 9 years ago by Oehmke

Replying to markh:

A further thought: I do not think that the interpretation I posted (markh) prevents an implementation from providing a Field without data. I'm not convinced this is a requirement for the model, but it has been viewed as a positive feature for some implementations in comments.

In general, an implementation may provide features whilst implementing the model consistently.

For example, a Field's data array could be implemented to allow a 'Null Array', which has dimensionality, extent and order properties but does not define values for any array location. This is distinct from an array of 'missing data'.

In this way an implementation may deliver a 'data free' Field.

Our primary implementation (NetCDF) does not support this feature, as far as I understand, but other implementations may be able to.

To express a space without a data field you could use just a scalar variable and associate coordinates with it (similar to how the UGrid work defines a mesh). Like:

double space;
  space:coordinates = "lat lon" ;

double lat(num_lat);
double lon(num_lon);

However, maybe you could consider this just an implementation detail in netCDF, and in this case use the dimensions of the coordinate arrays as the sizes of the Field in the data model?

comment:79 in reply to: ↑ 75 ; follow-up: ↓ 80 Changed 9 years ago by caron

Replying to jonathan:

Dear John

Replying to caron:

Maybe another way to look at it is: should the CF data model have a way to indicate that two fields share the same domain/sampling/coordinate system? If so, then one can say that, for two fields, if all of their coordinate variables match, then they have the same coordinate system. One can tell if they match either because the coordinate variables are the same, or because their values and metadata are identical.

Would that work?

Yes! I think so. It would be a good idea for us to include in the data model a statement of rules to determine whether the spaces are the same for two data variables. By space David and I mean everything about the Field except its data and the properties of the data, so actually that's not just coordinates but also, for instance, CellMeasures. For example, I would not want to add two Fields together if they had different cell areas arrays, even if their coordinates were the same. Would you agree?

yes, i agree

The rules for equivalence of spaces would probably be similar to, but simpler than, the rules for whether Fields can be aggregated. (We proposed some rules for that in ticket 78 and an accompanying document.) However, aggregation is not part of the data model; it's something you can do with the Fields which the data model describes.

agree!

comment:80 in reply to: ↑ 79 ; follow-up: ↓ 81 Changed 9 years ago by davidhassell

Hello,

Replying to caron:

Maybe another way to look at it is: should the CF data model have a way to indicate that two fields share the same domain/sampling/coordinate system? If so, then one can say that, for two fields, if all of their coordinate variables match, then they have the same coordinate system. One can tell if they match either because the coordinate variables are the same, or because their values and metadata are identical.

Would that work?

Yes! I think so. It would be a good idea for us to include in the data model a statement of rules to determine whether the spaces are the same for two data variables. By space David and I mean everything about the Field except its data and the properties of the data, so actually that's not just coordinates but also, for instance, CellMeasures. For example, I would not want to add two Fields together if they had different cell areas arrays, even if their coordinates were the same. Would you agree?

yes, i agree

I am a little wary of including just the rules for equality of space in the data model, because it is just one of a number of different types of commonly used consistency that we might be interested in between two spaces. All or None, I say. Four I can think of:

Equality (Are two spaces equal in every aspect?)
Aggregatability (Can two spaces be joined to form a larger space?)
Combinability (Do the spaces of two fields allow them to be combined, e.g. added?)
Assignability (Does the space of one field allow its data to be assigned to a subspace of a larger field?)

The last two need to support the notion of broadcasting, each in different ways. For example, we might want to subtract a 3-d field with a size 1 height from each level of a multilevel 3-d field, or assign a 2-d sea surface temperature (SST) field spanning the Nino-3 region of the Pacific to each global SST field in a timeseries.
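
For the combinability case, a shape-only sketch in Python. It assumes both fields have already been put in the same dimension order, and it ignores the coordinate comparison that a real test would also need; the function name and the level count of 19 are made up for the example.

```python
from typing import Optional, Tuple


def combined_shape(a: Tuple[int, ...], b: Tuple[int, ...]) -> Optional[Tuple[int, ...]]:
    """Return the broadcast shape of two aligned fields, or None if they
    cannot be combined. A size-1 dimension broadcasts against any size."""
    if len(a) != len(b):
        return None
    out = []
    for m, n in zip(a, b):
        if m == n or n == 1:
            out.append(m)
        elif m == 1:
            out.append(n)
        else:
            return None
    return tuple(out)


multilevel = (19, 72, 96)    # (height, lat, lon)
single_level = (1, 72, 96)   # size-1 height

ok = combined_shape(multilevel, single_level)
mismatch = combined_shape(multilevel, (17, 72, 96))
```

Assignability would need a different, one-directional test on coordinate values (is one space a subspace of the other?), which this sketch does not attempt.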

All the best,

David

comment:81 in reply to: ↑ 80 ; follow-up: ↓ 82 Changed 8 years ago by markh

Replying to davidhassell:

I am a little wary of including just the rules for equality of space in the data model, because it is just one of a number of different types of commonly used consistency that we might be interested in between two spaces. All or None, I say.

I agree with your sentiment here David. I wonder if the examples you and others are highlighting are useful cases of how the model will be used and hence what it needs to support informationally, but not part of the model.

I support the position that:

The CF data model defines Fields as independent entities.

If we take this position then all comparisons, consistency, operations, etc. are evaluated by inspection of the metadata definitions of the respective Field instances and interpreting similarities and differences appropriately.

Could we come up with a generalised statement to that effect and leave the interpretation of how to implement this for particular use cases out of scope for the codification of the data model?

comment:82 in reply to: ↑ 81 ; follow-up: ↓ 83 Changed 8 years ago by caron

Replying to markh:

Replying to davidhassell:

I am a little wary of including just the rules for equality of space in the data model, because it is just one of a number of different types of commonly used consistency that we might be interested in between two spaces. All or None, I say.

I agree with your sentiment here David. I wonder if the examples you and others are highlighting are useful cases of how the model will be used and hence what it needs to support informationally, but not part of the model.

I support the position that:

The CF data model defines Fields as independent entities.

If we take this position then all comparisons, consistency, operations, etc. are evaluated by inspection of the metadata definitions of the respective Field instances and interpreting similarities and differences appropriately.

Could we come up with a generalised statement to that effect and leave the interpretation of how to implement this for particular use cases out of scope for the codification of the data model?

Hi Mark:

Could you say what you mean by "The CF data model defines Fields as independent entities", and what motivates you to want to say that?

thanks, John

comment:83 in reply to: ↑ 82 ; follow-ups: ↓ 85 ↓ 86 Changed 8 years ago by markh

Replying to caron:

Hi Mark:

Could you say what you mean by "The CF data model defines Fields as independent entities", and what motivates you to want to say that?

thanks, John

Hello John

I think a useful abstraction of CF 1.5 data variables in NetCDF files is to unpack each of the data variables, as Fields, with all of the referenced metadata provided to each one of the Fields. This involves conceptually copying instances of coordinate variables etc. to each of the Fields individually.

This is clearly possible from a file and I think it is the most general approach.

I would consider sharing of coordinate variables, or any other metadata instances, in NetCDF files as a specialisation of a general concept, that if two instances of a type are equal by inspection, they are conceptually equivalent.

Identification of equivalent metadata instances may always be achieved by inspection and evaluation of equality.

I can look at any two coordinates, inspect their definitions and values, and identify that they are the same. Once I have done this I can choose to implement a representation which uses a singleton for these two coordinates.

I can treat this as a mechanism only, with no conceptual sharing, such that any change made by one 'owner' first makes a copy, then changes its own copy.

Alternatively, I can choose to state that the metadata instance is truly shared, one instance between many owners. However, by choosing to do this, I have added another conceptual constraint: if any metadata is changed, it is changed for all of the Fields which share it, so that their coherence is preserved. Put another way, operations must always operate on the whole coherent set, not individual members.

I do not think this constraint should always be applied, as in many cases the CF NetCDF file creator has utilised this mechanism as a storage optimisation, with no implied connectivity between the metadata.

I would prefer to model the concept of 'shared metadata' separately from the implementation detail of storage optimisation. I think it is too problematic to impose this concept on a large range of datasets which may well not have intended this interpretation.

I think that we may well look to model this concept of explicit sharing, as part of the expansion into discrete sampling geometries and the extra data types introduced by CF 1.6 (currently out of scope).

For now, I think we should leave this concept, and just model what is explicit in CF 1.5: this leads me to state that sharing of metadata instances across Fields in CF 1.5 is an implementation detail with no implied concept of shared ownership. Co-location by evaluation is equivalent to co-location by implemented sharing.
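
The 'mechanism only' option (a singleton as storage optimisation, with copy on change) might look like this in Python. The names `Coord`, `Field` and `set_coord_long_name` are invented for the sketch.

```python
import copy
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Coord:
    values: Tuple[float, ...]
    long_name: str


class Field:
    def __init__(self, name: str, coords: Dict[str, Coord]):
        self.name = name
        # Each Field owns its mapping; the Coord objects in it may be shared.
        self.coords = dict(coords)

    def set_coord_long_name(self, key: str, long_name: str) -> None:
        # Copy-on-write: the changing 'owner' first makes its own copy, so
        # other Fields holding the same object are unaffected.
        own = copy.copy(self.coords[key])
        own.long_name = long_name
        self.coords[key] = own


lat = Coord((-45.0, 45.0), "latitude")
temperature = Field("temperature", {"lat": lat})
pressure = Field("pressure", {"lat": lat})

shared_before = temperature.coords["lat"] is pressure.coords["lat"]
temperature.set_coord_long_name("lat", "grid latitude")
shared_after = temperature.coords["lat"] is pressure.coords["lat"]
```

Truly shared ownership would instead mutate the single object in place, which is exactly the extra constraint described above: every Field holding it would see the change.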

all the best

mark

comment:84 Changed 8 years ago by markh

Coordinate and Field Relations

I would like to provide a couple of links to a UML concept, which I think is relevant to the discussion on how Fields relate to their Coordinates: Qualified Association.

Qualified Associations

I think this is a reasonable conceptual match to the approach taken by CF of providing a semantics-free qualified relationship between each dimension of a Field and i) a DimensionCoordinate (or None); ii) any AuxiliaryCoordinates.

In a NetCDF file, this is achieved using the NetCDF Dimension as a reference. CF applies no semantics to the Dimension names, rather it uses Dimensions to define a Data Variable's dimensionality, ordering and extent and to associate a Data Variable with its Coordinate Variables and Auxiliary Coordinates. It also provides the required singleton (or none) relationship between a Data Variable and each Data Variable Dimension's Coordinate Variable.
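
As a sketch of the qualified association in Python, with the dimension position acting as the qualifier. The types and method names are assumptions for illustration, and the small shapes are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Coord:
    standard_name: str
    shape: Tuple[int, ...]


@dataclass
class Field:
    shape: Tuple[int, ...]
    # Qualifier -> at most one DimensionCoordinate: a dict keyed by dimension
    # position cannot hold two entries for the same dimension.
    dim_coords: Dict[int, Coord] = field(default_factory=dict)
    # Any number of AuxiliaryCoordinates, each tied to one or more dimensions.
    aux_coords: List[Tuple[Tuple[int, ...], Coord]] = field(default_factory=list)

    def dim_coord(self, dim: int) -> Optional[Coord]:
        return self.dim_coords.get(dim)

    def aux_coords_for(self, dim: int) -> List[Coord]:
        return [coord for dims, coord in self.aux_coords if dim in dims]

    def is_consistent(self) -> bool:
        # Each coordinate's extent must match the dimensions it is tied to.
        dims_ok = all(coord.shape == (self.shape[dim],)
                      for dim, coord in self.dim_coords.items())
        aux_ok = all(coord.shape == tuple(self.shape[d] for d in dims)
                     for dims, coord in self.aux_coords)
        return dims_ok and aux_ok


precip = Field(
    shape=(2, 3),
    dim_coords={0: Coord("projection_y_coordinate", (2,))},   # dimension 1 has none
    aux_coords=[((0, 1), Coord("longitude", (2, 3)))],
)
```

Note that the positions 0 and 1 carry no meaning of their own here, much as the NetCDF Dimension names carry none in CF.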

comment:85 in reply to: ↑ 83 Changed 8 years ago by davidhassell

Replying to markh:

Hello Mark,

I think that you have described the shared metadata issues well - thanks. I agree with your analysis.

All the best,

David

comment:86 in reply to: ↑ 83 Changed 8 years ago by caron

Replying to markh:

Replying to caron:

Hi Mark:

Could you say what you mean by "The CF data model defines Fields as independent entities", and what motivates you to want to say that?

thanks, John

Hello John

I think a useful abstraction of CF 1.5 data variables in NetCDF files is to unpack each of the data variables, as Fields, with all of the referenced metadata provided to each one of the Fields. This involves conceptually copying instances of coordinate variables etc. to each of the Fields individually.

This is clearly possible from a file and I think it is the most general approach.

I would consider sharing of coordinate variables, or any other metadata instances, in NetCDF files as a specialisation of a general concept, that if two instances of a type are equal by inspection, they are conceptually equivalent.

Identification of equivalent metadata instances may always be achieved by inspection and evaluation of equality.

I can look at any two coordinates, inspect their definitions and values, and identify that they are the same. Once I have done this I can choose to implement a representation which uses a singleton for these two coordinates.

I can treat this as a mechanism only, with no conceptual sharing, such that any change made by one 'owner' first makes a copy, then changes its own copy.

Alternatively, I can choose to state that the metadata instance is truly shared, one instance between many owners. However, by choosing to do this, I have added another conceptual constraint: if any metadata is changed, it is changed for all of the Fields which share it, so that their coherence is preserved. Put another way, operations must always operate on the whole coherent set, not individual members.

I do not think this constraint should always be applied, as in many cases the CF NetCDF file creator has utilised this mechanism as a storage optimisation, with no implied connectivity between the metadata.

I would prefer to model the concept of 'shared metadata' separately from the implementation detail of storage optimisation. I think it is too problematic to impose this concept on a large range of datasets which may well not have intended this interpretation.

I think that we may well look to model this concept of explicit sharing, as part of the expansion into discrete sampling geometries and the extra data types introduced by CF 1.6 (currently out of scope).

For now, I think we should leave this concept, and just model what is explicit in CF 1.5: this leads me to state that sharing of metadata instances across Fields in CF 1.5 is an implementation detail with no implied concept of shared ownership. Co-location by evaluation is equivalent to co-location by implemented sharing.

all the best

mark

Hi Mark:

From my POV, you are mixing the data model a little bit with what software has to do to implement it.

I would say that it's useful to capture shared coordinate systems in the CF abstract data model. CF also describes how to encode that. I think the notion that the coordinates are the same when they are identical or when their values and metadata are identical is a good one.

What software has to do to keep track of that while processing multiple variables and/or multiple files shouldn't really drive the data model, if possible. But of course in practice, developers will have to pay attention to those kinds of things.

Regards, John

comment:87 follow-up: ↓ 89 Changed 8 years ago by jonathan

Dear John, Mark, David, Bob, Ethan

I too agree with Mark's arguments for the independence of Fields. I think that logically the CF data model regards each Field as having its own set of coordinates, although in a CF-netCDF file they may actually be shared. John suggested that the data model should have a way to indicate that two fields share the same space. I interpreted that as meaning "a way to decide whether two fields share the same space" and agreed with it, but David and Mark somewhat disagreed.

If this notion of equality of spaces is not obvious in the CF-netCDF standard, it could be argued that we shouldn't include it in the data model. It is true that the CF standard doesn't say anything about it explicitly. On the other hand, the CF standard makes use of the Unidata convention that defines a coordinate variable, and that convention can naturally be interpreted as implying that data variables which share coordinate variables have a special relationship. I don't agree with that interpretation, because I think data variables might have just the same special relationship if they had distinct (but equal) coordinate variables in the same file, or if they were in different files but had equal coordinate variables.

Therefore I would argue that if the CF data model asserts that each Field is independent, and sharing coordinate variables doesn't indicate a special relationship, it would be helpful to supplement this by saying how you can instead decide whether two Fields share the same space. I agree with David that there are other kinds of condition, necessary for aggregation etc., but I feel that they're not quite so basic as this. In its abstract, the CF standard says that it "enables users of data from different sources to decide which quantities are comparable". Deciding whether they have the same space is one aspect of that, isn't it? I don't think it's too much of a stretch to include it in the data model. It could be a case where the data model says something more clearly than the CF-netCDF standard so far does. In contributions that various people have made, we already seem to have quite a lot of agreement about what equality of spaces should entail.

Cheers

Jonathan

comment:88 in reply to: ↑ 74 Changed 8 years ago by jonathan

Replying to markh:

Using the types I have advocated, I would specify your example as:

Field:
- standard_name="precipitation_flux"
- units="kg m-2 s-1"
- data = arr(shape(72,96))
- DimCoords =
  - dim0 -> Coordinate:
    - units="degrees_east"
    - standard_name="longitude"
    - points=arr(shape(96))
  - dim1 -> Coordinate:
    - units="degrees_north"
    - standard_name="latitude"
    - points=arr(shape(72))
- AuxCoords =
  - [none] -> Coordinate:
    - units="days since 2012-10-1"
    - standard_name="time"
    - points = arr(shape(1,))

In this case the CF NetCDF Scalar Coordinate Variable is interpreted as an instance of a Coordinate type, meeting the constraints for that type. It is referenced by the Field via [none] dimensions, as there are no Field dimensions for it to map to.

This differs from the way David and I describe it in various ways:

The numerical value of the dimensions is repeated. I feel that it is neater to store it only once. Is that what you would do, in fact?
The scalar coordinate appears to be of a different type from the multivalued coordinates, whereas the CF-netCDF convention implies (I think) that a scalar coordinate is logically equivalent to a size-one coordinate (and thus of the same type as a multivalued coordinate).
In our model, because the dimension sizes and coordinate values belong together in the Dimension construct, an auxiliary coordinate variable with that Dimension shares the coordinate values as well as the size of a dimension. In your model, I suppose it wouldn't. For instance, lat(y,x) is really a function of the y and x coordinates, not just of the sizes of their dimensions. It could conceivably even be a data variable in its own right, with those dimensions and coordinates.

Best wishes

Jonathan

comment:89 in reply to: ↑ 87 ; follow-up: ↓ 90 Changed 8 years ago by stevehankin

Replying to jonathan:

Dear John, Mark, David, Bob, Ethan

I too agree with Mark's arguments for the independence of Fields. I think that logically the CF data model regards each Field as having its own set of coordinates, although in a CF-netCDF file they may actually be shared. John suggested that the data model should have a way to indicate that two fields share the same space. I interpreted that as meaning "a way to decide whether two fields share the same space" and agreed with it, but David and Mark somewhat disagreed.

If this notion of equality of spaces is not obvious in the CF-netCDF standard, it could be argued that we shouldn't include it in the data model. It is true that the CF standard doesn't say anything about it explicitly. On the other hand, the CF standard makes use of the Unidata convention that defines a coordinate variable, and that convention can naturally be interpreted as implying that data variables which share coordinate variables have a special relationship. I don't agree with that interpretation, because I think data variables might have just the same special relationship if they had distinct (but equal) coordinate variables in the same file, or if they were in different files but had equal coordinate variables.

Therefore I would argue that if the CF data model asserts that each Field is independent, and sharing coordinate variables doesn't indicate a special relationship, it would be helpful to supplement this by saying how you can instead decide whether two Fields share the same space.

The ambiguity in the title of this ticket, "CF data model and reference implementation", may be the root of this debate. All agree that variables have the special (shared) coordinate relationship if

1. they share named coordinate variable(s); or
2. they have distinct, but identical coordinate variables in the same dataset

No. 1 is a degenerate form of no. 2, so from a data modeling perspective I hear Jonathan arguing that no. 2 alone is sufficient. On the other hand from an implementation pov no. 1 is overwhelmingly the more common, and it can be implemented far more efficiently than no. 2, as it bypasses the need to inspect coordinate values. The conflict is Model vs Implementation. Which leads to a statement like "helpful to supplement this by saying how you can instead decide whether two Fields share the same space" -- suggesting that implementation guidance be given while documenting the model.

I'm wondering whether the more interesting (and harder) case is discussed elsewhere in this lengthy ticket: Namely closure principles in CF (have they been stated?) give us that when we perform allowable subsetting operations on a CF dataset, they yield a new (virtual) CF dataset. Two subset fields have the shared coordinate relationship when their subsetted coordinates meet criterion no. 2. From an implementation pov this is an important new case, because it is not about files at all. This relationship between Fields needs to be evaluated after subset fields are defined. (See text of Jonathan's para. with multiple references to 'files'.)

[the preceding is not really a discussion about aggregation rules, it's true]
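
A toy Python illustration of the subsetting case. The latitude values and the `subset` helper are invented; the point is only that the test has to be made on the subsetted coordinates, by value.

```python
from typing import Sequence, Tuple


def subset(values: Sequence[float], start: int, stop: int) -> Tuple[float, ...]:
    """Return the coordinate values of an index-range subset."""
    return tuple(values[start:stop])


# Both parent fields use the same named coordinate variable in the file.
lat = (-60.0, -30.0, 0.0, 30.0, 60.0)

field_a_lat = subset(lat, 0, 3)
field_b_lat = subset(lat, 0, 3)
field_c_lat = subset(lat, 2, 5)

# Sharing the parent's coordinate variable no longer settles the question;
# the subsetted coordinates have to be compared.
a_and_b_share = field_a_lat == field_b_lat
a_and_c_share = field_a_lat == field_c_lat
```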

I agree with David that there are other kinds of condition, necessary for aggregation etc., but I feel that they're not quite so basic as this. In its abstract, the CF standard says that it "enables users of data from different sources to decide which quantities are comparable". Deciding whether they have the same space is one aspect of that, isn't it? I don't think it's too much of a stretch to include it in the data model. It could be a case where the data model says something more clearly than the CF-netCDF standard so far does. In contributions that various people have made, we already seem to have quite a lot of agreement about what equality of spaces should entail.

Cheers

Jonathan

comment:90 in reply to: ↑ 89 Changed 8 years ago by markh

Replying to stevehankin:

The ambiguity in the title of this ticket, "CF data model and reference implementation", may be the root of this debate. All agree that variables have the special (shared) coordinate relationship if

1. they share named coordinate variable(s); or
2. they have distinct, but identical coordinate variables in the same dataset

No. 1 is a degenerate form of no. 2, so from a data modeling perspective I hear Jonathan arguing that no. 2 alone is sufficient. On the other hand from an implementation pov no. 1 is overwhelmingly the more common, and it can be implemented far more efficiently than no. 2, as it bypasses the need to inspect coordinate values. The conflict is Model vs Implementation. Which leads to a statement like "helpful to supplement this by saying how you can instead decide whether two Fields share the same space" -- suggesting that implementation guidance be given while documenting the model.

I'm wondering whether the more interesting (and harder) case is discussed elsewhere in this lengthy ticket: namely, closure principles in CF (have they been stated?) give us that when we perform allowable subsetting operations on a CF dataset, they yield a new (virtual) CF dataset. Two subset fields have the shared coordinate relationship when their subsetted coordinates meet criterion no. 2. From an implementation pov this is an important new case, because it is not about files at all. This relationship between Fields needs to be evaluated after subset fields are defined. (See text of Jonathan's para. with multiple references to 'files'.)

Hello Steve

I agree with your analysis about the ambiguity of the title of the ticket. It was with this firmly in mind that an attempt to agree terms of reference for the data model was put onto a separate ticket (#88). This discussion yielded agreement, including:

The data model provides a logical, implementation neutral, abstraction of the concepts defined by CF.

On this basis, I think that:

they share named coordinate variable(s);

should be treated as an implementation detail, valuable, but below the required level of abstraction of the model. The model should deal with the general case and leave freedom for implementations to provide powerful features such as this one in !NetCDF files.
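As a rough illustration of this layering, here is a minimal Python sketch. The `Coordinate` class and `same_coordinate` function are hypothetical names invented for this example; they are not part of the CF conventions or of any existing library. The general, model-level test compares metadata and values, while the identity check is the kind of shortcut that an encoding with shared named coordinate variables makes possible.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(eq=False)  # keep identity-based ==; value comparison is explicit below
class Coordinate:
    """Hypothetical stand-in for a coordinate: metadata plus values."""
    standard_name: str
    units: str
    points: Tuple[float, ...]


def same_coordinate(a: Coordinate, b: Coordinate) -> bool:
    """General test: two coordinates match if metadata and values are equal."""
    # Implementation shortcut: the very same object is trivially equal,
    # analogous to two variables naming one shared coordinate variable.
    if a is b:
        return True
    # General case: inspect the metadata and every coordinate value.
    return (
        a.standard_name == b.standard_name
        and a.units == b.units
        and a.points == b.points
    )


def make_latitudes():
    """Build 72 latitude values, 2.5 degrees apart (illustrative only)."""
    return tuple(-88.75 + 2.5 * i for i in range(72))


lat = Coordinate("latitude", "degrees_north", make_latitudes())
lat_copy = Coordinate("latitude", "degrees_north", make_latitudes())
lat_shifted = Coordinate(
    "latitude", "degrees_north", tuple(p + 0.5 for p in make_latitudes())
)
```

Here `same_coordinate(lat, lat)` returns through the identity check, whereas `same_coordinate(lat, lat_copy)` reaches the same answer only after comparing every value.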

mark

comment:91 follow-up: ↓ 109 Changed 8 years ago by markh

Replying to jonathan:

...

This differs from the way David and I describe it in various ways:

The numerical value of the dimensions is repeated. I feel that it is neater to store it only once. Is that what you would do, in fact?

The shape(size, order) of each array of values, be they data or coordinate values, is a property of that array. Thus the only place that the dimensions' extent and ordering is stored is as a property of the Field's data. Coordinates of a Field are constrained to match sizes as part of their relationship.

A Dimension construct appears to repeat the storage of this information, in the Dimension instance and in the data arrays, introducing more complex constraints on the Field's data and its coordinates' values.

The scalar coordinate appears to be of a different type from the multivalued coordinates, whereas the CF-netCDF convention implies (I think) that a scalar coordinate is logically equivalent to a size-one coordinate (and thus of the same type as a multivalued coordinate).

The type of all 3 is the same; they are Coordinate instances.

They differ in how the Field refers to them. In the example, I have defined a 2D data array, so only 2 DimCoords are permitted. If the example dataset were a !NetCDF file the conventions enable (but do not force?) the Field to be defined as 3D, with one of the dimensions of size 1. In this case the Field may refer to the Time Coordinate as a DimCoord, e.g.:

Field:
- data = arr(shape(1,72,96))
- DimCoords =
  - dim2 -> Coordinate:
    - units="days since 2012-10-1"
    - standard_name="time"
    - points = arr(shape(1,))

I think these are two subtly different Fields, both of which are useful. It just happens that the !NetCDF encoding is not clear which one is intended, the conventions talk of a 'convenience feature'. As changing a field from 2D to 3D by adding size 1 dimensions is a well defined and manageable operation I think it is a reasonable position to decide on a default behaviour and allow downstream users to reinterpret the Field instance if they require different behaviour.
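A minimal sketch of why this reinterpretation is manageable, assuming (for illustration only) that the data are held as a flat, row-major buffer together with a shape. The `FieldData` class and both helper functions are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldData:
    """Hypothetical container: flat row-major values plus a shape."""
    values: Tuple[float, ...]
    shape: Tuple[int, ...]


def add_size_one_dim(data: FieldData, axis: int = 0) -> FieldData:
    """Return the same values with an extra size-1 dimension at `axis`."""
    shape = data.shape[:axis] + (1,) + data.shape[axis:]
    # A size-1 dimension changes neither the number nor the order of
    # elements, so the value buffer is reused untouched.
    return FieldData(data.values, shape)


def drop_size_one_dims(data: FieldData) -> FieldData:
    """Return the same values with every size-1 dimension removed."""
    shape = tuple(n for n in data.shape if n != 1)
    return FieldData(data.values, shape)


# The 2D case from the example: 72 latitudes by 96 longitudes.
field_2d = FieldData(
    values=tuple(float(i) for i in range(72 * 96)), shape=(72, 96)
)
# Reinterpreted as 3D with a leading size-1 time dimension.
field_3d = add_size_one_dim(field_2d, axis=0)
```

Only the shape differs between `field_2d` and `field_3d`; converting in either direction loses nothing.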

In our model, because the dimension sizes and coordinate values belong together in the Dimension construct, an auxiliary coordinate variable with that Dimension shares the coordinate values as well as the size of a dimension. In your model, I suppose it wouldn't. For instance, lat(y,x) is really a function of the y and x coordinates, not just of the sizes of their dimensions. It could conceivably even be a data variable in its own right, with those dimensions and coordinates.

I'm afraid I didn't follow this properly; please could you restate your point so that I may understand the perspective.

thank you

mark

comment:92 Changed 8 years ago by edavis

We all seem to agree that the data model should be able to capture the notion of two fields sharing the same space. I think we also agree that the data model should define the set of conditions that are sufficient to recognize that two fields share the same space.

Given that, a dataset containing two fields that are both on the same space could be represented with a diagram something like this:

 +--------+   +--------+
 | field1 |   | field2 |
 +--------+   +--------+
    |            |
    |            |    +--------+
    \------------\--> | space1 |
                      +--------+

That relationship of a shared space is the abstract concept we want to model.

How one maps that concept to an actual dataset stored in a file or files is an implementation detail for each encoding to decide. Some encodings could use shared coordinate variables or dimensions to indicate a shared space, others may require data users/providers to explicitly check all the appropriate conditions.

Where it seems there is some disagreement is whether the current CF-netCDF encoding requires data users/providers to explicitly check all the appropriate conditions, or whether it instead provides some encoding conveniences, in the form of shared dimensions and shared coordinate variables, that indicate the same relationship.

Ethan

comment:93 follow-up: ↓ 94 Changed 8 years ago by stevehankin

The netCDF data model captures, not just the fact that two fields share the same coordinate reference system (a.k.a. "space"), but also that they may be related by sharing degenerate subsets of that coordinate system. Restated in more concrete terms, it is important in CF to recognize that a surface variable has a special relationship to a 3D variable that shares the same lat-long coordinates.

Can someone clarify where we are retaining these subtleties in the current discussions?

In general, I find my Spidey-senses quivering over the degree to which the CF data model wants to detach itself from the netCDF data model. Detaching from the netCDF file format has clear value. What advantages are we after in detaching from the netCDF data model? The netCDF data model has an impressive, successful track record expressing the multi-dimensional space concepts that are embedded in CF.

comment:94 in reply to: ↑ 93 ; follow-up: ↓ 95 Changed 8 years ago by davidhassell

Replying to stevehankin:

The netCDF data model captures, not just the fact that two fields share the same coordinate reference system (a.k.a. "space"), but also that they may be related by sharing degenerate subsets of that coordinate system. Restated in more concrete terms, it is important in CF to recognize that a surface variable has a special relationship to a 3D variable that shares the same lat-long coordinates.

Can someone clarify where we are retaining these subtleties in the current discussions?

In general, I find my Spidey-senses quivering over the degree to which the CF data model wants to detach itself from the netCDF data model. Detaching from the netCDF file format has clear value. What advantages are we after in detaching from the netCDF data model? The netCDF data model has an impressive, successful track record expressing the multi-dimensional space concepts that are embedded in CF.

I rather would say that the data model should provide the framework for ascertaining whether one field has a relationship to another. In your example, the data model should make it easy to formulate the tests required to see if the 2-d field is broadcastable across the 3-d field. There are many different types of special relationship.
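One possible form of such a test, as a hedged Python sketch. The representation (a dict mapping dimension names to tuples of coordinate values) and the function name `is_broadcastable` are assumptions made for this example, and the rule used, that every dimension of the smaller field must appear with identical coordinate values in the larger one, is only one of several definitions that could be adopted.

```python
def is_broadcastable(small, large):
    """True if every dimension of `small` appears in `large` with equal values.

    Both arguments map dimension names to tuples of coordinate values.
    """
    for name, points in small.items():
        # A missing dimension gives None, which never equals a tuple.
        if large.get(name) != points:
            return False
    return True


lat = tuple(-88.75 + 2.5 * i for i in range(72))
lon = tuple(1.875 + 3.75 * i for i in range(96))
height = (10.0, 100.0, 1000.0)

surface = {"lat": lat, "lon": lon}                    # 2-d field
volume = {"height": height, "lat": lat, "lon": lon}   # 3-d field
shifted = {"lat": tuple(p + 1.0 for p in lat), "lon": lon}
```

Under this rule the 2-d `surface` field is broadcastable across `volume`, but not the other way round, and `shifted` is rejected because its latitudes differ.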

Please forgive me if I'm missing something obvious here, but does the netCDF data model include the notion of shared spaces? It is nearly always implied in real netCDF files, but is not required or enforced by the Common Data Model. For example, there is nothing stopping me from duplicating, with different names, dimensions and coordinates for every variable in a netCDF file. It would still be a CF compliant netCDF file.

All the best,

David

comment:95 in reply to: ↑ 94 ; follow-up: ↓ 103 Changed 8 years ago by davidhassell

Replying to davidhassell:

Replying to stevehankin:

Please forgive me if I'm missing something obvious here, but does the netCDF data model include the notion of shared spaces? It is nearly always implied in real netCDF files, but is not required or enforced by the Common Data Model. For example, there is nothing stopping me from duplicating, with different names, dimensions and coordinates for every variable in a netCDF file. It would still be a CF compliant netCDF file.

In addition to what I wrote earlier, would it be fair to say that the Common Data Model (CDM) is one for the storage of data, whilst the CF data model we are discussing is one for the manipulation of data?

All the best,

David

comment:96 follow-up: ↓ 97 Changed 8 years ago by stevehankin

We're into some subtle areas of terminology and (human) semantics, but here is what I would say:

The netCDF Data Model (which is a foundational part of CDM) provides a way to *express* the relationship between spaces (coordinate reference systems) through sharing of coordinate variables. Having expressed it thusly, it also provides a way for software implementations to ascertain whether one field has a relationship to another.

It would be very unusual for a CF data provider to create two coordinate variables that contained identical coordinate information in the self-same dataset. If s/he were to do so, it would likely imply a deliberate differentiating between spaces (for some reason).

My guess of what is behind the current confusions is this (am I correct?): Datasets that have been created in formats other than netCDF (say, format "foobar") may contain duplicated coordinate information in them with no explicit linkages indicating the coordinates are the same. In order to treat these files as if they are CF files, a software implementation needs to examine the coordinates in detail and determine where they may be identically duplicated. This situation should imho fall under the heading of "CF library support for format foobar". This weaker form of modeling shared coordinate spaces should not imho be regarded as a feature of the CF data model. We would be degrading the data model to do so.

Your question about storage versus manipulation: The notion of shared coordinate spaces and subspaces is fundamental to the manipulation of CF variables. Consider, for example, the common case of computing anomalies by subtracting a climatological mean field (X-Y-Z space) from individual model time steps (identical X-Y-Z plus T).
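The anomaly case can be sketched in a few lines of plain Python. Everything here is illustrative: the function name `anomalies`, the flattened X-Y-Z ordering of the values, and the dict-of-tuples coordinate representation are assumptions made for the example, and the check for a shared space is done by value.

```python
def anomalies(time_steps, climatology, step_coords, clim_coords):
    """Subtract a climatological mean from each time step.

    time_steps  : list of per-time-step value lists (flattened X-Y-Z order)
    climatology : one value list in the same flattened X-Y-Z order
    *_coords    : dicts mapping axis name -> tuple of coordinate values
    """
    # The subtraction is only meaningful if both fields sit on the same
    # X-Y-Z coordinates, so that is checked first.
    if step_coords != clim_coords:
        raise ValueError("fields do not share the same X-Y-Z coordinates")
    return [
        [value - mean for value, mean in zip(step, climatology)]
        for step in time_steps
    ]


xyz = {"x": (0.0, 1.0), "y": (0.0, 1.0), "z": (5.0,)}  # 2 * 2 * 1 = 4 points
climatology = [10.0, 20.0, 30.0, 40.0]
time_steps = [
    [11.0, 19.0, 30.0, 42.0],  # first time step
    [9.0, 22.0, 31.0, 40.0],   # second time step
]
result = anomalies(time_steps, climatology, xyz, dict(xyz))
```

The time axis plays no part in the coordinate check: the climatology is applied to every time step once the X-Y-Z coordinates are known to match.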

comment:97 in reply to: ↑ 96 ; follow-up: ↓ 104 Changed 8 years ago by Oehmke

Replying to stevehankin:

We're into some subtle areas of terminology and (human) semantics, but here is what I would say:

The netCDF Data Model (which is a foundational part of CDM) provides a way to *express* the relationship between spaces (coordinate reference systems) through sharing of coordinate variables. Having expressed it thusly, it also provides a way for software implementations to ascertain whether one field has a relationship to another.

It would be very unusual for a CF data provider to create two coordinate variables that contained identical coordinate information in the self-same dataset. If s/he were to do so, it would likely imply a deliberate differentiating between spaces (for some reason).

My guess of what is behind the current confusions is this (am I correct?): Datasets that have been created in formats other than netCDF (say, format "foobar") may contain duplicated coordinate information in them with no explicit linkages indicating the coordinates are the same. In order to treat these files as if they are CF files, a software implementation needs to examine the coordinates in detail and determine where they may be identically duplicated. This situation should imho fall under the heading of "CF library support for format foobar". This weaker form of modeling shared coordinate spaces should not imho be regarded as a feature of the CF data model. We would be degrading the data model to do so.

I'm a little confused by what this means. Are you saying by this paragraph that the CF model shouldn't consider spaces which are automatically examined and calculated to be the same to be shared? Therefore formats which don't support some way of indicating sharing (e.g. foobar) would lack this feature of the data model.

I have to say I'm a little torn about automatically calculating sharing. We do it in some cases with regrid calculations in ESMF where if two sets of Fields have identical looking grids then there's no reason to recalculate the weights, because the results would be exactly the same. However, I could see it being a problem in other cases, for example, where the user is reading in a set of Fields and the software keeps insisting that two have the same space just because they accidentally happen to have the same set of coordinates, or maybe as you suggest above the user deliberately wants to have them separate despite them being the same.
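To make the trade-off concrete, here is a small Python sketch of reusing weights when two grids look identical. It is not ESMF's API or its actual mechanism; `grid_key`, `get_weights`, `placeholder_compute` and the `reuse` flag are all hypothetical names introduced for this example. A real implementation would probably key on a digest of the coordinates rather than on the full tuples.

```python
_weights_cache = {}


def grid_key(lat, lon):
    """Hashable key built from the coordinate values themselves."""
    return (tuple(lat), tuple(lon))


def get_weights(src_grid, dst_grid, compute, reuse=True):
    """Return regridding weights, reusing them for identical-looking grids.

    `compute` is a caller-supplied function (src_grid, dst_grid) -> weights.
    Passing reuse=False forces a fresh calculation, for the case where the
    user wants two grids kept separate even though their values coincide.
    """
    if not reuse:
        return compute(src_grid, dst_grid)
    key = (grid_key(*src_grid), grid_key(*dst_grid))
    if key not in _weights_cache:
        _weights_cache[key] = compute(src_grid, dst_grid)
    return _weights_cache[key]


calls = []


def placeholder_compute(src_grid, dst_grid):
    """Stand-in for an expensive weight calculation; records each call."""
    calls.append((src_grid, dst_grid))
    return {
        "n_src": len(src_grid[0]) * len(src_grid[1]),
        "n_dst": len(dst_grid[0]) * len(dst_grid[1]),
    }


src_a = ([0.0, 10.0, 20.0], [0.0, 90.0, 180.0, 270.0])
src_b = ([0.0, 10.0, 20.0], [0.0, 90.0, 180.0, 270.0])  # separate object, same values
dst = ([5.0, 15.0], [45.0, 225.0])

w1 = get_weights(src_a, dst, placeholder_compute)
w2 = get_weights(src_b, dst, placeholder_compute)  # served from the cache
```

The `reuse=False` escape hatch is one way to let the user, rather than the software, decide whether coincidentally equal coordinates should count as shared.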

Your question about storage versus manipulation: The notion of shared coordinate spaces and subspaces is fundamental to the manipulation of CF variables. Consider, for example, the common case of computing anomalies by subtracting a climatological mean field (X-Y-Z space) from individual model time steps (identical X-Y-Z plus T).

comment:98 follow-up: ↓ 99 Changed 8 years ago by caron

I'm wondering what the mysterious "!NetCDF" means?

comment:99 in reply to: ↑ 98 Changed 8 years ago by bnl

Half way down this ticket we stopped mentioning identity, and I'm not sure we resolved whether we wanted dimensions and spaces to have identity. I know we believe that grids have identity, since we invented gridspec. So what about spaces?

Is some of this discussion confusing whether we know, via identifiers or *convention*, whether two fields share the same space, or whether we have to guess, (1) via inspection (as we might have to do in the case of aggregation), or (2) via *custom* (the same named dimensions in the same file)?

We need to decide whether the data model or the implementation, or both, ought to distinguish these different choices. Of course we still have the choice of spaces being integral to fields (i.e. composed in), in which case we'd have to invent a new abstraction to allow us to assert equality of spaces, if only for having rules for aggregation.

comment:100 follow-up: ↓ 101 Changed 8 years ago by stevehankin

Hi Brian. I'm confused over points that are more basic:

What's our definition of a "space"? How is it distinct from "coordinate reference system" and "grid"? (see (*) below)
How did we morph from an effort to write down and clarify the CF model (which has been built upon the netCDF data model since its earliest days as COARDS) into a reconsideration of the foundation netCDF data model concepts? What problems are we hoping to solve by this?
What does "!NetCDF" mean?

(*) August 24 discussions above indicate confusion over the term "space". http://www.met.rdg.ac.uk/~jonathan/CF_metadata/cfdm.html says a space is a coordinate system.

comment:101 in reply to: ↑ 100 ; follow-up: ↓ 102 Changed 8 years ago by markh

Replying to caron:

I'm wondering what the mysterious "!NetCDF" means?

Replying to stevehankin:

What does "!NetCDF" mean?

tl;dr The '!' means nothing; it is a mistake I kept making. It should say NetCDF.

I'm afraid to say that I think this is my foul up. I have used this term in a number of my posts, replied to by others. I have used Trac for a long time and I am conscious of its facility to provide 'wiki' type links for camel case terms. This can be turned off by prefixing the camelcase text with a '!'.

It looks like some clever soul tweaked this version of Trac not to pick up NetCDF as a link term, but I hadn't noticed this, so I blithely used !NetCDF to render NetCDF to the screen, not realising that the '!' markup was coming through intact.

As such I have confused contributors, sorry. I have never meant anything by this term other than NetCDF. If anyone else has wilfully used it with meaning please speak here, but I never meant anything by it.

mark

comment:102 in reply to: ↑ 101 Changed 8 years ago by caron

Replying to markh:

Replying to caron:

I'm wondering what the mysterious "!NetCDF" means?

Replying to stevehankin:

What does "!NetCDF" mean?

tl;dr The '!' means nothing; it is a mistake I kept making. It should say NetCDF.

I'm afraid to say that I think this is my foul up. I have used this term in a number of my posts, replied to by others. I have used Trac for a long time and I am conscious of its facility to provide 'wiki' type links for camel case terms. This can be turned off by prefixing the camelcase text with a '!'.

It looks like some clever soul tweaked this version of Trac not to pick up NetCDF as a link term, but I hadn't noticed this, so I blithely used !NetCDF to render NetCDF to the screen, not realising that the '!' markup was coming through intact.

mystery solved!

comment:103 in reply to: ↑ 95 Changed 8 years ago by caron

In addition to what I wrote earlier, would it be fair to say that the Common Data Model (CDM) is one for the storage of data, whilst the CF data model we are discussing is one for the manipulation of data?

Hi David:

As the author of the CDM, I would say that the CDM is an abstract data model, in which the details of the storage and the specifics of the API are supposed to be minimized. Many storage file formats are supported; to date only a Java API has been developed.

The CDM is expressed in UML, a language for object oriented design, because that's the common language of computer science for this purpose, and because Java is an OO language. OTOH, I haven't developed the UML very deeply to show the available methods (which would better describe the possible manipulations); really the Java API is the source of that. So the UML is really a very high level description, but probably at a similar level of detail as your efforts.

From my POV, ironically, I see the CF model as a bit more influenced by storage concerns. Of course, in the end, the data model had better have a very clean mapping to the actual storage, if you want to actually get anything done!

John

comment:104 in reply to: ↑ 97 Changed 8 years ago by caron

Hi Robert:

Replying to Oehmke:

Replying to stevehankin:

We're into some subtle areas of terminology and (human) semantics, but here is what I would say:

The netCDF Data Model (which is a foundational part of CDM) provides a way to *express* the relationship between spaces (coordinate reference systems) through sharing of coordinate variables. Having expressed it thusly, it also provides a way for software implementations to ascertain whether one field has a relationship to another.

It would be very unusual for a CF data provider to create two coordinate variables that contained identical coordinate information in the self-same dataset. If s/he were to do so, it would likely imply a deliberate differentiating between spaces (for some reason).

My guess of what is behind the current confusions is this (am I correct?): Datasets that have been created in formats other than netCDF (say, format "foobar") may contain duplicated coordinate information in them with no explicit linkages indicating the coordinates are the same. In order to treat these files as if they are CF files, a software implementation needs to examine the coordinates in detail and determine where they may be identically duplicated. This situation should imho fall under the heading of "CF library support for format foobar". This weaker form of modeling shared coordinate spaces should not imho be regarded as a feature of the CF data model. We would be degrading the data model to do so.

I'm a little confused by what this means. Are you saying by this paragraph that the CF model shouldn't consider spaces which are automatically examined and calculated to be the same to be shared? Therefore formats which don't support some way of indicating sharing (e.g. foobar) would lack this feature of the data model.

My understanding of what Steve is saying is that if we throw away shared dimensions and force applications to examine coordinate values in order to establish that two variables "share the same space", we would be degrading the data model from what we have now.

OTOH, it seems we have a rough consensus to allow both methods (shared dimensions, examination of values)

I have to say I'm a little torn about automatically calculating sharing. We do it in some cases with regrid calculations in ESMF where if two sets of Fields have identical looking grids then there's no reason to recalculate the weights, because the results would be exactly the same. However, I could see it being a problem in other cases, for example, where the user is reading in a set of Fields and the software keeps insisting that two have the same space just because they accidentally happen to have the same set of coordinates, or maybe as you suggest above the user deliberately wants to have them separate despite them being the same.

I think we need to insist that if two variables share the exact same coordinate values, then they "have the same space". If there are caveats to that, let us list them.

It's possible you have another use case in mind, but I've never seen two variables in the same dataset accidentally share coordinates.

HDF does not have shared dimensions, IMO with unfortunate results to interoperability. Read the 4-part series for more info than you want, starting here:

http://www.unidata.ucar.edu/blogs/developer/en/entry/dimensions_scales

Regards, John

comment:105 follow-up: ↓ 107 Changed 8 years ago by markh

I think we should be careful in talking of sharing a Space: many of the cases discussed here are of data variables which share some Coordinates, but not all. I prefer to talk of shared Coordinates etc. as I think it is clearer.

I would like to understand if there are views on any implied constraints on how a set of CF NetCDF Data Variables which share some Coordinate variables must be interpreted.

for example:

Can a number of CF Data variables share Coordinate variables if they exist in different files?

Do I change the meaning of my data variables if I extract them and their referenced metadata so I have one data variable per file with all the relevant metadata (coords etc.) reproduced across the files?

Is the sharing of a CF NetCDF Coordinate variable a transient phenomenon, a snapshot of my data's domain at a particular instant/state, or is it a characteristic of my data, like the unit of measure?

If I write a CF NetCDF file manipulation tool which resamples data variables onto a coarser grid in a new file, does it have to resample every data variable which 'shares a Coordinate' or can it operate on individual data variables independently?

comment:106 follow-up: ↓ 108 Changed 8 years ago by markh

I'd also like to take this opportunity to try and stress that I think that shared Coordinates are a really useful feature with powerful capabilities and I do not want to see that diminished, that is not the intent at all.

The angle I am focussed on is modelling CF, independent of NetCDF. For this model to be useful it has to be implemented, and NetCDF is an excellent implementation, bringing with it many powerful features.

I would like to be able to look at the CF and NetCDF as working together to deliver data and metadata storage and exchange capabilities. The models for the two are complementary, not in conflict, or with one in domination over the other.

In parallel to this, I would like to use CF concepts independent of NetCDF, for example to assist in metadata aware data processing. This drives the search for CF abstractions from the NetCDF implementation with general interpretations, which can be implemented in different ways by different applications.

This does not change how I would choose to use NetCDF, the activities and implementations are complementary.

comment:107 in reply to: ↑ 105 Changed 8 years ago by stevehankin

Replying to markh:

I think we should be careful in talking of sharing a Space: many of the cases discussed here are of data variables which share some Coordinates, but not all. I prefer to talk of shared Coordinates etc. as I think it is clearer.

I would like to understand if there are views on any implied constraints on how a set of CF NetCDF Data Variables which share some Coordinate variables must be interpreted.

for example:

Can a number of CF Data variables share Coordinate variables if they exist in different files?

I believe that we cannot escape the need for the concept of a "dataset" in our discussions -- a "dataset" being the virtual file in which all of the data and metadata under discussion lie in the same container. Aggregation machinery creates datasets (virtual files) out of physical files (more granular datasets).

A netCDF union aggregation (http://www.unidata.ucar.edu/software/netcdf/ncml/v2.2/Cookbook.html) takes variables that are in separate files, but have identical coordinates, and turns them into a new, larger dataset (virtual file) in which they have normal netCDF shared dimensions. Gridspec datasets are another interesting (and more complex) case (https://ice.txcorp.com/trac/modave/wiki/CFProposalGridspec).
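A toy Python sketch of a union along the lines described above. It follows the description in this comment rather than the behaviour of any particular NcML implementation; the dict layout for a "file" and the function name `union` are assumptions made for the example.

```python
def union(files):
    """Combine per-file variables into one dataset with shared coordinates.

    Each file is a dict with:
      "coords":    coordinate name -> tuple of coordinate values
      "variables": variable name   -> tuple of coordinate names it uses
    """
    dataset = {"coords": {}, "variables": {}}
    for f in files:
        for name, points in f["coords"].items():
            # Keep the first copy; later copies must match it by value.
            existing = dataset["coords"].setdefault(name, points)
            if existing != points:
                raise ValueError(f"coordinate {name!r} differs between files")
        for var, dims in f["variables"].items():
            if var in dataset["variables"]:
                raise ValueError(f"variable {var!r} appears in more than one file")
            dataset["variables"][var] = dims
    return dataset


lat = (-45.0, 0.0, 45.0)
lon = (0.0, 90.0, 180.0, 270.0)

file_a = {"coords": {"lat": lat, "lon": lon},
          "variables": {"tas": ("lat", "lon")}}
# Same coordinate values, stored again in a second file.
file_b = {"coords": {"lat": tuple(lat), "lon": tuple(lon)},
          "variables": {"sst": ("lat", "lon")}}

dataset = union([file_a, file_b])
```

In the resulting `dataset` both variables refer, by name, to a single `lat` and a single `lon` entry, which is the shared-dimension relationship that the separate files could only imply through equal values.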

Do I change the meaning of my data variables if I extract them and their referenced metadata so I have one data variable per file with all the relevant metadata (coords etc.) reproduced across the files?

This would be the inverse of the union operation. For sure there is a risk of (or maybe an intentional) loss of subtle semantics when variables are spread across independent files. For example, the person doing this may know that some of the variables were satellite observations and others were model outputs on the same coordinate system and want to make sure that distinction is made clearer.

Is the sharing of a CF NetCDF Coordinate variable a transient phenomenon, a snapshot of my data's domain at a particular instant/state, or is it a characteristic of my data, like the unit of measure?

I doubt there is a fixed answer to this. When one regrids a variable, for example, that may be an act of violence (a transient action), or a high fidelity operation (a legitimate, permanent change), depending upon underlying knowledge of the mathematical characteristics of the variable and its coordinates.

If I write a CF NetCDF file manipulation tool which resamples data variables onto a coarser grid in a new file, does it have to resample every data variable which 'shares a Coordinate' or can it operate on individual data variables independently?

As of v1.6 CF says little about the relationships *between* variables. The current CF discussions about vector quantities and staggered grids illustrate that even the most obvious relationships are not yet firmly characterized. I'd suppose a flexible utility program such as you describe should leave control over such decisions to the user.

comment:108 in reply to: ↑ 106 Changed 8 years ago by stevehankin

Replying to markh:

I'd also like to take this opportunity to try and stress that I think that shared Coordinates are a really useful feature with powerful capabilities and I do not want to see that diminished, that is not the intent at all.

The angle I am focussed on is modelling CF, independent of NetCDF. For this model to be useful it has to be implemented, and NetCDF is an excellent implementation, bringing with it many powerful features.

I would like to be able to look at the CF and NetCDF as working together to deliver data and metadata storage and exchange capabilities. The models for the two are complementary, not in conflict, or with one in domination over the other.

In parallel to this, I would like to use CF concepts independent of NetCDF, for example to assist in metadata aware data processing. This drives the search for CF abstractions from the NetCDF implementation with general interpretations, which can be implemented in different ways by different applications.

This does not change how I would choose to use NetCDF, the activities and implementations are complementary.

Good that you raised this point. What is "netCDF"? Is it a file format? Or is it a data model with associated APIs (behaviors)? Today it is both, though for many years the file format was private/unpublished and the emphasis was firmly on the data model. How are you intending the word "netCDF" above? Questions such as you have raised seem to me ambiguous unless you qualify the word "netCDF" by saying "netCDF data model" or "netCDF file format".

The CF data model (conceptually) is imho built on the foundation of the netCDF data model. The highly prized interoperability of CF datasets comes from its underpinnings in the netCDF data model. This is why today legacy satellite data files can be made to look and feel like modern CF datasets when served through OPeNDAP via TDS and can be read by generic CF applications.

By the way -- IMHO the process of writing down the CF data model should as much as possible try to understand and erase the distinctions between "CF" and "CDM". This is the path to broader interoperability. Similarly, we should strive for a single method of expressing aggregations. If in the end we were to have different aggregation machinery in Python and Java it would be a serious blow to interoperability.

comment:109 in reply to: ↑ 91 ; follow-up: ↓ 119 Changed 8 years ago by jonathan

Dear Mark

I am not sure whether we are debating non-existent distinctions regarding the Dimension constructs, nor whether I can make it any clearer, but I'll try again in words. :-)

David and I propose:

The Field has
- Dimensions time, lat and lon.
- a data array dimensioned (lat,lon)
The lat Dimension has
- size 72
- a data array of this size

You propose:

The Field has
- a data array dimensioned (72,96)
The lat Coordinate has
- a data array dimensioned (72)

You write, "The shape(size, order) of each array of values, be they data or coordinate values, is a property of that array. Thus the only place that the dimensions' extent and ordering is stored is as a property of the Field's data. Coordinates of a Field are constrained to match sizes as part of their relationship." I understand this to mean that in your proposal there is also a mechanism which requires the two occurrences of 72 to be equal. To make that happen, you must know there is some logical association between them; in a netCDF file, the relationship is that they refer to the same netCDF dimension of course.

This is what our arrangement does too. Our Dimension construct contains the netCDF dimension and the 1D coordinate variable of the same name. You comment, "A Dimension construct appears to repeat the storage of this information, in the Dimension instance and in the data arrays." That is not our intention. The Dimension construct contains the size of the dimension. The Field data array is declared using these Dimensions, in just the same way that a netCDF data variable is declared using the dimensions. There is no explicit mechanism to link the two 72s, because (logically) there is only one 72, which is stored in the Dimension construct. The difference between our points of view this far is small. It is just that the Dimension, in our model, has a distinct identity.

I think there may be a somewhat greater difference when we come to the further point which I made, not clearly enough. I wrote, "In our model, because the dimension sizes and coordinate values belong together in the Dimension construct, an auxiliary coordinate variable with that Dimension shares the coordinate values as well as the size of a dimension." For instance, if we have Dimensions of x and y (projection coordinates), we might have a longitude Auxiliary coordinate construct whose data array is dimensioned (y,x). In our model, the Dimensions y and x provide not only the sizes of the dimensions, but also 1D x- and y-coordinate variables. Thus the 2D longitude variable is a function of these coordinate variables. That seems fine to me because it genuinely reflects the situation. The association is there in your model as well, but slightly less direct, I suppose. Because the 72 and 96 of the 2D auxiliary coordinate variable are constrained somehow to correspond to the 72 and 96 of the field, and to the 72 and 96 of the 1D coordinate variables, the 1D coordinate variables are indirectly associated with the dimensions of the 2D coordinate variable.

On the scalar or size-1 coordinate variables, again we almost agree, but perhaps not quite. We take the view that they are Dimensions of the Field with size 1, but not necessarily used to dimension the data variable. In the example I gave at the top, time is a Dimension of the Field and has size 1, and is omitted from the dimensions of the Field data array. As you say, the CF convention says that size-1 and scalar coordinate variables are equivalent. We agree that inserting and removing them from the Field data array is easy. In our model, therefore, size-1 Dimensions are optionally used to dimension the Field data array. We think the user may want the choice. When you slice one level from a 3D Field, for instance, you may wish the new Field to be 3D, with the vertical Dimension having size 1. The API could provide a separate method to eliminate size-1 dimensions in case the user does not want them, or it could be a choice to be made in the slice method. The bottom line is that in our data model the Field has Dimensions, but its data array can omit Dimensions of size 1. That's another reason why the Dimensions have a distinct identity in the Field in our model.
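A minimal sketch of the choice described above, using plain nested lists rather than any real array library. The function name `slice_level` and its `keep_size_one` flag are hypothetical; they only illustrate the option of putting the choice in the slice method.

```python
def slice_level(data, level_index, keep_size_one=True):
    """Select one vertical level from data indexed as data[z][y][x].

    With keep_size_one=True the result stays 3D, with a vertical axis of size 1.
    With keep_size_one=False the size-1 axis is dropped from the data array.
    """
    level = data[level_index]
    return [level] if keep_size_one else level


def shape(nested):
    """Return the shape of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# A small array with shape (z=3, y=2, x=2).
data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]

kept = slice_level(data, 1)
dropped = slice_level(data, 1, keep_size_one=False)
print(shape(kept))     # (1, 2, 2)
print(shape(dropped))  # (2, 2)
```

Either way the sliced Field would still have a vertical Dimension of size 1; the flag only decides whether the data array carries it.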

Best wishes

Jonathan

comment:110 follow-up: ↓ 116 Changed 8 years ago by jonathan

Dear Steve, Mark, et al.

Comments on a couple of points recently raised.

Departure from the netCDF data model

Regarding tests for equality of spaces, Steve asks where we want to depart from the netCDF data model. Subsequently Steve comments, "It would be very unusual for a CF data provider to create two coordinate variables that contained identical coordinate information in the self-same dataset. If s/he were to do so, it would likely imply a deliberate differentiating between spaces (for some reason)."

I agree that this is unlikely in the same file, but it could happen, and it might be accidental. It might happen if one file had been made by assembling variables from separate files, without checking whether the coordinate variables would be duplicated.

But a more important case is the one of aggregation of multiple files into one virtual dataset. I think this is the reason why we must go beyond the netCDF data model and the CF convention as it currently stands, since they refer only to single files in principle. Aggregation may involve concatenating many data variables into one, but before you get to that stage, you have to recognise that the variables may have some identical coordinates. I would put it as a requirement that two variables in the same file, with the same lat and lon (shared) coordinates (but different time coordinates, say) are in exactly the same logical relationship as two variables in different files, with equal (but necessarily distinct) lat and lon coordinate variables. It is more work to verify the second case, but that's a matter of implementation, as has already been said.

A third case, as Steve has also mentioned, is if the data is not netCDF anyway, and has no way of sharing coordinate variables. In that case you have to check the hard way. But again, this is an implementation issue which should not prevent the recognition of the same logical relationship.
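The distinction between identical (shared) and equal coordinates can be sketched as below. This is only an illustration under assumptions: coordinates are plain Python lists, and the helper name `coordinates_equal` and its `tolerance` parameter are hypothetical.

```python
def coordinates_equal(a, b, tolerance=0.0):
    """Return True if two 1D coordinate arrays describe the same positions."""
    if a is b:
        # Identical (shared) coordinates: the cheap case, as within one netCDF file.
        return True
    if len(a) != len(b):
        return False
    # Distinct objects: check "the hard way", value by value.
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


# Two variables in the same file sharing one latitude coordinate variable.
shared_lat = [-45.0, 0.0, 45.0]
var1_lat = shared_lat
var2_lat = shared_lat

# A variable from a different file, with an equal but necessarily distinct coordinate.
other_file_lat = [-45.0, 0.0, 45.0]

print(var1_lat is var2_lat)                            # True
print(other_file_lat is shared_lat)                    # False
print(coordinates_equal(var1_lat, var2_lat))           # True
print(coordinates_equal(shared_lat, other_file_lat))   # True
```

Both comparisons report the same logical relationship; only the amount of work needed to verify it differs.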

What is a space?

Steve asks, What's our definition of a "space"? How is it distinct from "coordinate reference system" and "grid"? In our proposed data model, David and I use "space" to mean "Field without data", and without metadata describing the contents of the data array such as standard_name. The space has the dimension coordinates, auxiliary coordinates, cell methods and cell measures. When we described it as being a "coordinate system", we meant only, "a set of coordinates in a particular relationship." We did not mean "coordinate reference system" in the more technical and restrictive senses of OGC and so on.

Mark comments, "I think we should be careful in talking of sharing a Space: many of the cases discussed here are of data variables which share some Coordinates, but not all. I prefer to talk of shared Coordinates etc. as I think it is clearer." These are distinct, and we should be careful, I agree. One issue is comparison of coordinates: are they identical (shared) or equal? The equality of two spaces means also that they have the same set of coordinates, cell measures, etc.

Cheers

Jonathan

comment:111 follow-ups: ↓ 117 ↓ 118 Changed 8 years ago by jonblower

Hi all,

Just a brief comment (sorry, I can't pretend I've read and understood all the above in detail!) I wonder if the general ISO/OGC terminology might be helpful here, particularly if this discussion somehow transcends NetCDF and intends to cover other file formats that may be encoding forms of the same conceptual data model.

(For those of you who are dying to read an ISO spec, the most relevant one here is ISO19123, which deals with Coverages. This spec is known to be incomplete and a little ambiguous with respect to this community's needs - however it's not a bad starting point and codifies a number of relevant concepts. This reminds me that I really should write up my work on applying and modifying the Coverage model to CF-NetCDF data...)

In particular, the above term "space" seems to be equivalent to the concept of a "domain" in the ISO world. A domain is simply the set of positions for which we may have values (the values are encoded in the "range", which is the equivalent of a NetCDF data array). For gridded data, the domain will be a ReferenceableGrid (which is a grid that may be curvilinear or rectilinear but is referenced to some real-world coordinate reference system). Grids are made up of cells, but the ISO model lacks an equivalent of CF's cell_methods, unless I've missed something.

There's a whole heap of concepts that could be relevant, but I would start by suggesting the use of the term "domain".

comment:112 Changed 8 years ago by jonathan

Dear Jon

"Coverage" sounds like a 2D concept to me, and "coordinate reference system" is usually understood to be 2D. Hence I think "space" is better, because it is at least 3D, and often used to mean n-dimensional, which is what we mean.

Cheers

Jonathan

comment:113 follow-up: ↓ 114 Changed 8 years ago by jonblower

Dear Jonathan,

It's true that the term "coverage" has history in 2D satellite imagery - but the modern definition is simply a function that maps positions in a domain to data values. The domain can be an nD grid or any other type of geometry, and can be applied to in situ observations or swaths as much as to gridded data.

Coordinate reference systems can also be nD. There is no 2D restriction.

(Actually "space" is equivalent to "domain", not "coverage". Personally I prefer the term "domain" to "space", as it feels more precise to me.)

Best wishes, Jon

comment:114 in reply to: ↑ 113 ; follow-up: ↓ 115 Changed 8 years ago by markh

Replying to jonblower:

Dear Jonathan,

It's true that the term "coverage" has history in 2D satellite imagery - but the modern definition is simply a function that maps positions in a domain to data values. The domain can be an nD grid or any other type of geometry, and can be applied to in situ observations or swaths as much as to gridded data.

Coordinate reference systems can also be nD. There is no 2D restriction.

(Actually "space" is equivalent to "domain", not "coverage". Personally I prefer the term "domain" to "space", as it feels more precise to me.)

Best wishes, Jon

I am with Jon on this

I agree with the general point about harmonising terminology and the specific point about the term domain.

I would be very pleased if we would state that a Field defines a domain

I think this terminology is consistent with that used by caron in earlier parts of the discussion.

Perhaps:

Fields are finite samplings of a continuous scalar function.

A Field defines a domain and defines a phenomenon which is sampled according to the defined domain.

comment:115 in reply to: ↑ 114 Changed 8 years ago by caron

Replying to markh:

Replying to jonblower:

Dear Jonathan,

It's true that the term "coverage" has history in 2D satellite imagery - but the modern definition is simply a function that maps positions in a domain to data values. The domain can be an nD grid or any other type of geometry, and can be applied to in situ observations or swaths as much as to gridded data.

Coordinate reference systems can also be nD. There is no 2D restriction.

(Actually "space" is equivalent to "domain", not "coverage". Personally I prefer the term "domain" to "space", as it feels more precise to me.)

Best wishes, Jon

I am with Jon on this

I agree with the general point about harmonising terminology and the specific point about the term domain.

I would be very pleased if we would state that a Field defines a domain

I think this terminology is consistent with that used by caron in earlier parts of the discussion.

Perhaps:

Fields are finite samplings of a continuous scalar function.

A Field defines a domain and defines a phenomenon which is sampled according to the defined domain.

Yes, I agree with Jon and Mark here. The argument is that we should try to use the terminology that's already in existence in similar work, as it will lead to broader acceptance of ours. So, if "space" is really a synonym for "domain", then "domain" is probably a better choice. If it's not, then let's clarify that.

Regards, John

comment:116 in reply to: ↑ 110 Changed 8 years ago by stevehankin

Replying to jonathan:

Dear Steve, Mark, et al.

Comments on a couple of points recently raised.

Departure from the netCDF data model

Regarding tests for equality of spaces, Steve asks where we want to depart from the netCDF data model. Subsequently Steve comments, It would be very unusual for a CF data provider to create two coordinate variables that contained identical coordinate information in the self-same dataset. If s/he were to do so, it would likely imply a deliberate differentiating between spaces (for some reason).

I agree that this is unlikely in the same file, but it could happen, and it might be accidental. It might happen if one file had been made by assembling variables from separate files, without checking whether the coordinate variables would be duplicated.

Support for the CF code may become very difficult with this outlook. The preceding paragraph suggests that the CF libraries have requirements to make sense of malformed datasets. This seems a slippery slope. How do you bound a requirement like this?

But a more important case is the one of aggregation of multiple files into one virtual dataset. I think this is the reason why we must go beyond the netCDF data model and the CF convention as it currently stands, since they refer only to single files in principle. Aggregation may involve concatenating many data variables into one, but before you get to that stage, you have to recognise that the variables may have some identical coordinates. I would put it as a requirement that two variables in the same file, with the same lat and lon (shared) coordinates (but different time coordinates, say) are in exactly the same logical relationship as two variables in different files, with equal (but necessarily distinct) lat and lon coordinate variables. It is more work to verify the second case, but that's a matter of implementation, as has already been said.

A third case, as Steve has also mentioned, is if the data is not netCDF anyway, and has no way of sharing coordinate variables. In that case you have to check the hard way. But again, this is an implementation issue which should not prevent the recognition of the same logical relationship.

I wonder if there may be much less disagreement than it seems here. We are lumping many distinct behaviors under the single heading of "CF and Python": support for non-netCDF datasets, for malformed datasets, and for aggregations. We might find the disagreement disappearing if we shared a mental picture of the architecture of the code. Thinking of the code as extensible layers of functionality, here is the code architecture that I am imagining:

        >>> non-NetCDF file formats <<<
                  |
                  |   #1.  Code package that
                  |   supports various non-netCDF formats
                  |   and malformed CF data.  This code makes
                  |   them behave like well-formed CF-netCDF
                 \/
        >>> well-formed CF netCDF files <<<
                  |
                  |   #2. Code package that
                  |   aggregates.  Well-formed CF-netCDF datasets
                  |   become larger virtual datasets
                  |
                 \/
        >>> CF virtual files <<<
                  |
                  |  #3.  Code package that
                  |  makes data in the CF data model
                  |  easily accessible to clients
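
Purely as a sketch of the layering in the diagram above, the three packages could be pictured as three composable functions. Everything here is hypothetical: the function names, the dictionary layout, and the toy "raw" records stand in for real file formats and do not correspond to any existing library.

```python
def to_well_formed(raw):
    """Layer 1 (hypothetical): adapt a non-netCDF record to a CF-like dictionary."""
    return {"name": raw["var"], "time": list(raw["t"]), "data": list(raw["values"])}


def aggregate(datasets):
    """Layer 2 (hypothetical): join datasets of one variable along time."""
    names = {d["name"] for d in datasets}
    if len(names) != 1:
        raise ValueError("can only aggregate datasets of the same variable")
    ordered = sorted(datasets, key=lambda d: d["time"][0])
    return {
        "name": ordered[0]["name"],
        "time": [t for d in ordered for t in d["time"]],
        "data": [v for d in ordered for v in d["data"]],
    }


def value_at(dataset, time):
    """Layer 3 (hypothetical): simple client access to the virtual dataset."""
    return dataset["data"][dataset["time"].index(time)]


raw_files = [
    {"var": "tas", "t": [2, 3], "values": [281.0, 282.0]},
    {"var": "tas", "t": [0, 1], "values": [279.0, 280.0]},
]
virtual = aggregate([to_well_formed(r) for r in raw_files])
print(virtual["time"])
print(value_at(virtual, 2))
```

The point of the layering is that `aggregate` and `value_at` never see the raw format; they only handle data that already looks well formed.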

What is a space?

Steve asks, What's our definition of a "space"? How is it distinct from "coordinate reference system" and "grid"? In our proposed data model, David and I use "space" to mean "Field without data", and without metadata describing the contents of the data array such as standard_name. The space has the dimension coordinates, auxiliary coordinates, cell methods and cell measures. When we described it as being a "coordinate system", we meant only, "a set of coordinates in a particular relationship." We did not mean "coordinate reference system" in the more technical and restrictive senses of OGC and so on.

Mark comments, I think we should be careful in talking of sharing a Space: many of the cases discussed here are of data variables which share some Coordinates, but not all. I prefer to talk of shared Coordinates etc. as I think it is clearer." These are distinct, and we should be careful, I agree. One issue is comparison of coordinates: are they identical (shared) or equal. The equality of two spaces means also that they have the same set of coordinates, cell measures, etc.

Cheers

Jonathan

comment:117 in reply to: ↑ 111 Changed 8 years ago by edavis

Hi all,

Replying to jonblower:

... I wonder if the general ISO/OGC terminology might be helpful here, particularly if this discussion somehow transcends NetCDF and intends to cover other file formats that may be encoding forms of the same conceptual data model.

Just wanted to mention that the OGC CF-netCDF Standards Working Group is working on a document that describes a mapping of CF-netCDF to the ISO/OGC coverage data model. I can't remember if that has been discussed on this list yet.

Ben, Stefano: Is that document publicly available?

Ethan

comment:118 in reply to: ↑ 111 Changed 8 years ago by edavis

Hi again,

Replying to jonblower:

In particular, the above term "space" seems to be equivalent to the concept of a "domain" in the ISO world. A domain is simply the set of positions for which we may have values (the values are encoded in the "range", which is the equivalent of a NetCDF data array).

The "range" in the ISO/OGC coverage model that Jon mentions can hold multiple values for each position in the "domain". So, multiple CF-netCDF data variables that share the same "domain" can be modeled as a single coverage.

On the other hand, I don't believe the coverage model could be used to capture the notion that two variables share a subset of their domain. An idea mentioned by both Steve:

On 30 Oct 2012, Steve Hankin stevehankin wrote:

The netCDF data model captures, not just the fact that two fields share the same coordinate reference system (a.k.a. "space"), but also that they may be related by sharing degenerate subsets of that coordinate system. Restated in more concrete terms, it is important in CF to recognize that a surface variable has a special relationship to a 3D variable that shares the same lat-long coordinates.

and Mark:

On 6 Nov 2012, Mark Hedley markh wrote:

I think we should be careful in talking of sharing a Space: many of the cases discussed here are of data variables which share some Coordinates, but not all. I prefer to talk of shared Coordinates etc. as I think it is clearer.

Ethan

comment:119 in reply to: ↑ 109 Changed 8 years ago by markh

Replying to jonathan: Hello Jonathan

I think that there are subtle differences about information storage between our perspectives, but I do not think these are the crucial factor.

Part of my concern is terminology: I do not like the name Dimension that you have given to the construct you have defined. I have struggled to come up with an alternative name, and I am struggling to capture what this construct is.

I have concerns about the definition of identity for Dimension constructs; I do not think we have ever had a vocabulary within CF to do this for NetCDF files: NetCDF Dimensions are used as references within CF with no implied semantics. As such they do not feel like good candidate constructs to me. I prefer the ordered list of 'dims' to be a simple property of the Field.

I prefer to consider NetCDF dimensions as a way of linking Fields to their coordinates. I think the CF data model should take an abstracted view of this and just mandate that coordinates must be related to the dimensions of the data.

I think this brings particular benefit by providing similarities between Coordinates. Specifically, there are two types: the general type (Coord) and the constrained type (Coordinate), which must be numerical, one dimensional and monotonic.

How these are used by a Field is the responsibility of the Field, which may define 0 or 1 Coordinates for each of its dims and as many Coords as required, linked to the dims appropriately.
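One way to picture the two coordinate types and their use by a Field is the sketch below. The names `Coord`, `Coordinate`, `dims`, `dim_coords` and `aux_coords` come from this comment; the constructor signatures and validation details are assumptions for illustration, and "monotonic" is interpreted here as strictly monotonic.

```python
from numbers import Real


class Coord:
    """General coordinate type: any values, any dimensionality."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)


class Coordinate(Coord):
    """Constrained type: numerical, one-dimensional and monotonic."""

    def __init__(self, name, values):
        super().__init__(name, values)
        v = self.values
        # Nested lists fail this check, so the values must be 1D and numerical.
        if not v or not all(isinstance(x, Real) for x in v):
            raise ValueError(f"{name}: values must be 1D and numerical")
        increasing = all(a < b for a, b in zip(v, v[1:]))
        decreasing = all(a > b for a, b in zip(v, v[1:]))
        if not (increasing or decreasing):
            raise ValueError(f"{name}: values must be monotonic")


class Field:
    """A Field with ordered dims, 0 or 1 Coordinates per dim, and any number of Coords."""

    def __init__(self, dims, dim_coords=None, aux_coords=None):
        self.dims = list(dims)
        self.dim_coords = dict(dim_coords or {})  # dim name -> Coordinate
        self.aux_coords = list(aux_coords or [])  # (Coord, tuple of dim names)
        for dim, coord in self.dim_coords.items():
            if dim not in self.dims:
                raise ValueError(f"unknown dim: {dim}")
            if not isinstance(coord, Coordinate):
                raise TypeError(f"{dim}: a dim coordinate must be a Coordinate")


lat = Coordinate("latitude", [-45.0, 0.0, 45.0])
lon2d = Coord("longitude", [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]])
field = Field(dims=["y", "x"], dim_coords={"y": lat}, aux_coords=[(lon2d, ("y", "x"))])
print(sorted(field.dim_coords))
```

Here the dim `x` has no Coordinate at all, which is the "0" in "0 or 1", while the 2D longitude is attached as a general Coord.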

I have prepared a sketch of this in UML to illustrate Field and Coordinate.

This uses qualified associations, as described in markh.

I feel this is a better description of the current conventions for NetCDF than the Dimension construct and provides flexibility to implementers of the model to manage the relationships using appropriate means for that implementation.

I think it has particular benefit in providing clear definition of how Coords may be identified as potential Coordinates and how a Field explicitly defines its dim_coords and aux_coords.

Regarding single-valued coordinates, I see no need to provide a construct of a Dimension of size 1 which does not match up to part of the Field's data; I think it is enough to allow a Coordinate to be used to define a Field's dimension. The Field may be changed, for example increasing dimensionality, and the referencing will be updated at that point.

comment:120 follow-up: ↓ 124 Changed 8 years ago by markh

Replying to stevehankin:

I wonder if there may be much less disagreement than it seems here. We are lumping many distinct behaviors under the single heading of "CF and Python": support for non-netCDF datasets, for malformed datasets, and for aggregations. We might find the disagreement disappearing if we shared a mental picture of the architecture of the code.

Hello Steve

I think there is some misalignment about what we are aiming for. I think in hindsight I should have started a new ticket instead of:

davidhassell:

I suggest that this ticket is kept for discussing the detail of the model.

I agree.

which I think has led to a number of miscommunications, but never mind.

At that point we agreed the scope of this ticket to be defined by #88

The scope is thus to define an abstract model, consistent with the CF conventions for NetCDF files at version 1.5.

Within this scope I don't think there is a code architecture. Instead I think there is a conceptual model, a set of types/constructs with clearly defined attributes and relations.

This model will map to how CF datasets are stored in NetCDF files, an implementation of the CF model.

Any other implementations, such as ones inferred by your diagrams, I would like to consider out of scope for this ticket and concentrate efforts on defining types and relations independent of any implementation. Do you think this is a practical approach?

comment:121 Changed 8 years ago by mgschultz

Dear all,

now, come on please! This discussion has left planet Earth and is bound to frustrate everyone but a tiny bunch of nerds. Who in his/her right mind is going to care about a distinction between "dimensions", "dims", "Coords", and "Coordinates" (and possibly also "coordinates", "Dimensions", etc.)? I can see that this issue as such requires clarification, but more than anything else, I believe it requires a decision and then an understandable text that provides clear guidance what is meant plus an explanation why things were defined the way they are.

It may help to adopt a tabular approach here and list the various properties or features of what is meant with each concept, and then decide which of these is most appropriate for the data model as such, and which concepts are implementation rules. As I was simply unable to read through all the detailed comments here, I may fail in putting this together correctly, but from my humble understanding, one could start as follows:

let A be "dimension" as implemented in netcdf, let B be "dimension" as "one component of something defining the extent of a field", let C be a "spatio-temporal dimension" [and there may be more]

Property                                            A    B    C
defines extent of variable                          X    X    X
links variable to space and/or time                           X
allows linking of variables ("shared space")        X         ?
allows common sub-setting (slicing) of variables    X

etc.

Sorry, if you consider my comment too direct (and please yell at me privately if you think it inadequate), but please don't leave the community behind. We also want to fly to the moon ;-)

Cheers,

Martin

comment:122 Changed 8 years ago by jonblower

In case it helps, here's the terminology I'm currently using in our internal APIs. It's based on the ISO terminology but I've had to fill in some gaps where ISO was not adequate. Feel free to use or ignore as you wish!

Term	Meaning	Equivalent in NetCDF	Origin
Domain	A set of spatiotemporal positions	Derived from Dimensions and Coordinate Variables	ISO19123
Range	A set of data values	Variable	ISO19123
Coverage	Maps positions in a Domain to values from a Range. Can contain multiple variables as long as they share a Domain.	Variable with georeferencing	ISO19123
Grid	nD array with no data and no georeferencing implied	List of Dimensions	ISO19123
GridAxis	One of the axes of a Grid	Dimension	I made this up
ReferenceableGrid	A Grid with spatiotemporal referencing (but no data values). It is also a Domain.	No direct equivalent	ISO19123
ReferenceableAxis	A GridAxis that is referenced to x,y,z or t. Can be composed to form ReferenceableGrids	Dimension with associated xyzt Coordinate Variable	I made this up
GridValuesMatrix	A Range that organizes its values in a Grid topology	Multidimensional Variable	ISO19123

For curvilinear grids, I model these as a 2D ReferenceableGrid that cannot be decomposed into 1D ReferenceableAxes.

Following up on Ethan's point, you can detect that different Coverages might share some of their Domains with other Coverages (e.g. a 3D Coverage might share two axes with a 2D Coverage) by inspection of the axes involved. However, looking at other kinds of subsetting relationships is hard, and is also hard to express in a data model.
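Detecting shared axes "by inspection" might look something like the sketch below. The class name `ReferenceableAxis` is taken from the table above, but its fields and the helper `shared_axes` are assumptions for illustration; this is not the internal API being described.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReferenceableAxis:
    """Hypothetical axis: a name plus its coordinate values, compared by value."""
    name: str
    values: Tuple[float, ...]


def shared_axes(domain_a, domain_b):
    """Return the axes of domain_a that also appear (by value) in domain_b."""
    return [axis for axis in domain_a if axis in domain_b]


# Domain of a 3D Coverage.
domain_3d = [
    ReferenceableAxis("depth", (0.0, 10.0, 50.0)),
    ReferenceableAxis("lat", (-45.0, 0.0, 45.0)),
    ReferenceableAxis("lon", (0.0, 180.0)),
]

# Domain of a 2D surface Coverage, built independently with equal horizontal axes.
domain_2d = [
    ReferenceableAxis("lat", (-45.0, 0.0, 45.0)),
    ReferenceableAxis("lon", (0.0, 180.0)),
]

print([axis.name for axis in shared_axes(domain_3d, domain_2d)])
```

Because the frozen dataclass compares by value, the two Coverages do not need to reference the same axis objects for the overlap to be found. Other kinds of subsetting relationships are not covered by a check this simple.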

Hope this is somehow helpful. In designing our internal APIs it's certainly been helpful to me to figure out the distinction between all these concepts.

Jon

comment:123 follow-up: ↓ 125 Changed 8 years ago by mgschultz

Thanks, Jon! This helps indeed!

Just a quick question, though: what would you call things, and how would you code the data, in the following two cases:

1) spectral variables (netcdf dimensions (time, level, 2, wavenumber) -- the 2 to distinguish between real and imaginary part); in a certain sense these do have spatio-temporal referencing

2) output from "irregular" grids such as icosahedric, where you basically have lists of points with lon/lat information, but no "system" to decompose them into lon and lat vectors ?

Martin

comment:124 in reply to: ↑ 120 ; follow-up: ↓ 127 Changed 8 years ago by davidhassell

Replying to markh:

Hi Mark, et al.,

It was mentioned that ticket #88 says that the data model should be consistent with the CF conventions for netCDF files. However, ticket #88 explicitly removes netCDF considerations from the CF data model.

In #88 it was agreed that Independence of file format distinguishes it from the CF-netCDF data model and the CDM, and that this will set the ground work to expand CF beyond netCDF files.

I think that this is really important, and bearing in mind Martin's insightful comments, perhaps noting this again could help clarify matters.

All the best,

David

comment:125 in reply to: ↑ 123 ; follow-up: ↓ 126 Changed 8 years ago by jonblower

Replying to mgschultz: Hi Martin! To be honest I'm not sure how I would deal with spectral data or "irregular" grids - I think they would both require different constructs since the classes above are just for rectangular grids. I think there has been work on CF irregular grids elsewhere - I guess the data model just needs to be able to record the topology of interconnections. Jon

comment:126 in reply to: ↑ 125 Changed 8 years ago by Oehmke

Replying to jonblower:

Replying to mgschultz: Hi Martin! To be honest I'm not sure how I would deal with spectral data or "irregular" grids - I think they would both require different constructs since the classes above are just for rectangular grids. I think there has been work on CF irregular grids elsewhere - I guess the data model just needs to be able to record the topology of interconnections. Jon

You might be thinking of the UGrid group (of which I'm a member). They have been working on a CF proposal for representing unstructured grids in CF. We've been iterating on a CF proposal for a while, but it hasn't been submitted yet. You can find a recent version here if you're interested.

comment:127 in reply to: ↑ 124 ; follow-up: ↓ 128 Changed 8 years ago by stevehankin

Replying to davidhassell:

Replying to markh:

Hi Mark, et al.,

It was mentioned that ticket #88 says that the data model should be consistent with the CF conventions for netCDF files. However, ticket #88 explicitly removes netCDF considerations from the CF data model.

In #88 it was agreed that Independence of file format distinguishes it from the CF-netCDF data model and the CDM, and that this will set the ground work to expand CF beyond netCDF files.

Hi Mark,

The preceding quote lies deep within the discussions in the body of ticket #88 and does not imho necessarily represent the spirit of the terms of reference or the level of agreement that exists.

Frankly, our entire CF discussion feels like it is exposing a gulf in communications that exists between two parallel universes: the world of Java netCDF and the world of non-Java netCDF. How can we be speaking of setting the ground work to expand CF beyond netCDF files, when we have been actively working with CF datasets in other formats for 5 years or more (GRIB, legacy satellite programs, etc.)? Ticket #78 (Aggregation rules) states "The CF-netCDF convention at present applies only to individual files, but there is a common and increasing need to be able to treat a collection files as a single dataset", but this statement is patently untrue. We have had working solutions to aggregations since COARDS days (over a decade ago), and we have been performing aggregations of virtual datasets (not just physical files) since those ancient times.

A central goal of the CF data modeling effort should be to help unite these two universes as quickly, painlessly, and effectively as it can. The universe of Java netCDF-CF is presently a quantum leap ahead of the parallel non-Java universe. We will not achieve our goal of uniting these universes unless we understand what has already been developed in the parallel Java universe, and include those advances in the baseline of our thinking and planning.

I think that this is really important, and bearing in mind Martin's insightful comments, perhaps noting this again could help clarify matters.

All the best,

David

comment:128 in reply to: ↑ 127 ; follow-up: ↓ 129 Changed 8 years ago by markh

Replying to stevehankin:

Hello Steve

That is a really helpful clarification and, I think, something of a restatement of comments you made previously that I don't think I had properly taken on board.

I will consider this carefully and how it affects the issues we are discussing.

thank you mark

comment:129 in reply to: ↑ 128 ; follow-ups: ↓ 131 ↓ 142 Changed 8 years ago by stevehankin

Replying to markh:

Replying to stevehankin:

Hello Steve

That is a really helpful clarification and, I think, something of a restatement of comments you made previously that I don't think I had properly taken on board.

I will consider this carefully and how it affects the issues we are discussing.

thank you mark

For what it's worth, my own group at PMEL has for years done development work in both the Java and the non-Java universes. We are acutely aware of the gulf that exists between these two. For example, in order to harness the aggregation capabilities of the Java netCDF from non-Java clients, we must host the data in a TDS web server and read it through OPeNDAP. Ditto for reading the non-netCDF formats. Today we are seeing some of the reverse; advances in CF gridspec and CF ugrid are taking place in the non-Java universe and leaving the Java side behind. There's a need (and an opportunity) for a break-through in interoperability!

comment:130 follow-up: ↓ 133 Changed 8 years ago by markh

Whilst pondering the open questions, I would like to raise one more.

I would like to consider a separation of concepts, to try and keep each one as simple as possible.

The first is a derived quantity, which I characterise as a construct with a number of inputs and a single, well defined answer, the result from processing the set of input parameters appropriately. An example of this:
- a parameterised vertical coordinate, providing for the calculation of a 3D height coordinate from input parameters of a defined formula, an orography field and two parameters.
The second is a collection of reference information, which may be used for many different purposes, to comprehend, transform, re-interpret the data. This operates similarly to a unit of measure for a quantity. An example of this:
- a specific geospatial coordinate reference system which a coordinate is defined with respect to.

I would treat these as two different types:

DerivedCoord
CoordRefSystem

or some other names, to be agreed upon.

I have updated the diagram I posted previously to reflect my view on this in a potential cf data model diagram.

In these terms, I view

CF NetCDF Grid Mapping variables as CoordRefSystem entities, providing intelligent software with the required context to perform many operations.

CF NetCDF dimensionless vertical coordinate variables as DerivedCoord entities.
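
To make the separation concrete, here is a minimal sketch of the two types. The type names `DerivedCoord` and `CoordRefSystem` are the ones proposed above; the attributes, the `compute` method, and the helper `hybrid_height` are hypothetical. The formula assumed is of the form z(k,j,i) = a(k) + b(k) * orog(j,i), as used for a hybrid height coordinate, and the reference-system parameters are illustrative values only.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CoordRefSystem:
    """Reference information that a coordinate is defined with respect to."""
    name: str
    parameters: Dict[str, float]


@dataclass
class DerivedCoord:
    """A coordinate with a single well-defined result computed from its inputs."""
    name: str
    formula: Callable[..., list]
    inputs: Dict[str, object]

    def compute(self):
        return self.formula(**self.inputs)


def hybrid_height(a, b, orog):
    """Assumed formula: z[k][j][i] = a[k] + b[k] * orog[j][i]."""
    return [[[a_k + b_k * h for h in row] for row in orog] for a_k, b_k in zip(a, b)]


# Reference information only: nothing is computed from it here.
crs = CoordRefSystem("latitude_longitude", {"semi_major_axis": 6371229.0})

# Two levels over a 1 x 2 orography field.
height = DerivedCoord(
    "altitude",
    hybrid_height,
    {"a": [10.0, 100.0], "b": [1.0, 0.5], "orog": [[0.0, 200.0]]},
)
print(height.compute())
```

The DerivedCoord yields values when evaluated, whereas the CoordRefSystem is only context for interpreting coordinates, much like a unit of measure.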

comment:131 in reply to: ↑ 129 ; follow-up: ↓ 132 Changed 8 years ago by bnl

IMHO one of the reasons why this has happened is that NCML has become a de facto standard by usage, but not by definition. You would have thought one could have built an NCML-type aggregator in other languages, but two of the many reasons stopping that are the ongoing worries that NCML might a) not be well defined (from a coding point of view), and b) be a moving target.

Replying to stevehankin:

Replying to markh:

Replying to stevehankin:

Hello Steve

That is a really helpful clarification and, I think, something of a restatement of comments you made previously that I don't think I had properly taken on board.

I will consider this carefully and how it affects the issues we are discussing.

thank you mark

For what it's worth, my own group at PMEL has for years done development work in both the Java and the non-Java universes. We are acutely aware of the gulf that exists between these two. For example, in order to harness the aggregation capabilities of the Java netCDF from non-Java clients, we must host the data in a TDS web server and read it through OPeNDAP. Ditto for reading the non-netCDF formats. Today we are seeing some of the reverse; advances in CF gridspec and CF ugrid are taking place in the non-Java universe and leaving the Java side behind. There's a need (and an opportunity) for a break-through in interoperability!

comment:132 in reply to: ↑ 131 Changed 8 years ago by caron

Replying to bnl:

IMHO one of the reasons why this has happened is that NCML has become a de facto standard by usage, but not by definition. You would have thought one could have built an NCML type aggregator in other languages, but two of the many reasons stopping that are the ongoing worries that NCML might a) not be well defined (from a coding point of view), and b) be a moving target.

Yes, I agree that NcML is not completely defined and has continued to evolve. I think that the non-aggregation part of NcML could mostly be standardized, and I would welcome that effort.

Standardization on aggregation, OTOH, would probably be premature. In essence, NcML aggregation is just "version 1" of what needs to happen. We have the start of "version 2" in the latest TDS/CDM version 4.3, and it works better on the server, partly because in some cases, an aggregation results in multiple logical datasets, whereas NcML is about a single logical dataset.

It's also important to note that NcML is just a configuration language for the CDM data model. One really wants to define aggregation on the data model.

John

comment:133 in reply to: ↑ 130 Changed 8 years ago by caron

Replying to markh:

Whilst pondering the open questions, I would like to raise one more.

I would like to consider a separation of concepts, to try and keep each one as simple as possible.

The first is a derived quantity, which I characterise as a construct with a number of inputs and a single, well defined answer, the result from processing the set of input parameters appropriately. An example of this:
- a parameterised vertical coordinate, providing for the calculation of a 3D height coordinate from input parameters of a defined formula, an orography field and two parameters.

The second is a collection of reference information, which may be used for many different purposes, to comprehend, transform, re-interpret the data. This operates similarly to a unit of measure for a quantity. An example of this:
- a specific geospatial coordinate reference system which a coordinate is defined with respect to.

I would treat these as two different types:

DerivedCoord
CoordRefSystem
or some other names, to be agreed upon.

I have updated the diagram I posted previously to reflect my view on this in a potential cf data model diagram.

In these terms, I view

CF NetCDF Grid Mapping variables as CoordRefSystem entities, providing intelligent software with the required context to perform many operations.

CF NetCDF dimensionless vertical coordinate variables as DerivedCoord entities.

FYI, the CDM coordinate system object model defines an abstract "coordinate transformation" object, with subclasses "horizontal transform" and "vertical transform", which are specified by a grid_mapping variable and a dimensionless vertical coordinate, respectively.

It's helpful to separate the CF encodings, which for historical reasons look somewhat different, from the data model abstractions, which I think are not so different.
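To make the distinction between the two kinds of construct concrete, here is a minimal Python sketch. The class names `CoordRefSystem` and `DerivedCoord` follow Mark's suggestion; the attribute names and the `compute` method are hypothetical, and this is neither the CDM API nor cf-python. The formula used is the CF atmosphere hybrid height formula, z(k,j,i) = a(k) + b(k)*orog(j,i), with the time index dropped for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class CoordRefSystem:
    """Reference information only: it gives context, it computes nothing."""
    grid_mapping_name: str
    parameters: dict = field(default_factory=dict)


@dataclass
class DerivedCoord:
    """A derived quantity: named inputs plus a formula with a single answer."""
    a: list      # a(k), in metres
    b: list      # b(k), dimensionless
    orog: list   # orog(j, i), surface height in metres

    def compute(self):
        """Return z[k][j][i] = a[k] + b[k] * orog[j][i]."""
        return [
            [[a_k + b_k * h for h in row] for row in self.orog]
            for a_k, b_k in zip(self.a, self.b)
        ]


# Example values are made up for illustration.
crs = CoordRefSystem("latitude_longitude", {"earth_radius": 6371229.0})
heights = DerivedCoord(a=[10.0, 100.0], b=[1.0, 0.5],
                       orog=[[0.0, 200.0]]).compute()
```

In this sketch a grid mapping would populate a `CoordRefSystem`, while a dimensionless vertical coordinate and its formula terms would populate a `DerivedCoord`.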

John

comment:134 follow-ups: ↓ 138 ↓ 161 Changed 8 years ago by jonathan

Dear all

This has become a rather wide-ranging discussion, which is interesting and useful, but at the same time I'd like to recall that the aim of this ticket is to agree a CF data model description. The ticket began with the document that David and I proposed. I wonder if we could focus on this document, and discuss our agreements and disagreements about it, like we do with changes to the convention?

I think there are several threads which relate specifically to aspects of this document. If I have been sufficiently clear below, it might be useful if anyone who dissents from these summaries could say so. If we agree on them, we have made progress.

1. What to call the space in which the field exists. David and I called it a space, but we are OK with calling it a domain instead, following Jon's suggestion, which various people support. As far as I can see, no-one disagrees with domain. But note that it is not purely spatiotemporal (unlike Jon's table), since CF allows non-spatiotemporal coordinates. Also, it is not defined only by coordinates, but also cell measures, cell methods and transforms (i.e. grid mapping and formula terms), as has already been discussed.

2. Whether the data model should specifically mention the possibility of a domain existing without data in it, i.e. a field with no data. David and I included this by saying that the data array is optional in the field. There has been agreement that this could be useful, but Mark has argued that since CF-netCDF doesn't currently have a convention for a field with no data, we shouldn't mention it in the data model. I can't argue against this, since it's the usual view we take about CF, that we don't add to it until we need to. So we should remove it for the moment from the proposed data model (but keep it in mind as an appealing and probably useful generalisation). OK?

3. Whether dimensions have an independent identity in the field. This is a subtle question, it seems. If it has no consequence at all for how we write applications, then it's a useless question and there's no need to settle it. I am not sure if this is the case. In my last posting, I said that I thought the idea of a 1D dimension and 1D coordinate belonging together comes from the Unidata netCDF convention of equality of name between coordinate variable and dimension. Mark disagrees. However, part of the problem is nomenclature, Mark feels, and I agree in not liking to use the word dimension for this concept. I note that Jon uses the term GridAxis, which he defines as one of the axes of a Grid. This feels like the same concept as David and I are getting at. We would have liked to use the word axis for it, but unfortunately in another CF ticket it has been agreed that the CF axis attribute can be given to auxiliary (possibly multidimensional) coordinate variables, so axis alone would now be a confusing word for a specifically 1D idea. However, inspired by Jon's table, I'd like to make a new suggestion. Could we call a 1D axis a domain axis? Our CF data model would then distinguish domain axis construct (corresponding to netCDF dimension and the 1D coordinate variable of the same name, if there is one) and auxiliary coordinate construct (corresponding to a CF-netCDF auxiliary coordinate variable). Both the auxiliary coordinates and the field itself refer to the domain axes. What do others think?

4. Whether fields are independent as far as the data model is concerned. I think that everyone agrees with this, provided that we also note that it is possible to test, using the information described by the data model, whether two fields are defined with the same domain, and that if the fields are actually contained in a single CF-netCDF file, identity of dimensions and coordinates may make it easier to verify that two fields are defined with the same domain. Is that OK with everyone? (As an aside, in response to Steve, I do not agree that a CF-netCDF file which has duplicate coordinate variables is malformed. It is not ideal, but it's legal, and I expect it could be produced by generally well-behaved software.)

5. I think we all agree that there is a need for the CF data model to envisage fields which are distributed over several CF-netCDF files or which reside in non-netCDF files or in memory. Steve argues that we are already doing it, and he's right. Nonetheless the CF-netCDF convention does not deal with this. It is for individual netCDF files and says nothing about aggregation rules or non-netCDF files. The question is whether, despite this limitation, we should write the CF data model description to permit the idea of fields spread over several datasets or in non-netCDF files. Do we all agree that we should do that?

6. I suspect we all agree that we do not want the CF data model to contain rules for aggregation. John has just made such a remark. We acknowledge the need to be able to do this, but we are not ready to standardise it. The CDM and NCML support aggregation according to certain rules. David and I have proposed a different set of aggregation rules in ticket 78, which David has implemented in cf-python. Aggregation rules are a step beyond the CF data model at the moment. Is that OK?
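The test mentioned in this summary, whether two independent fields are defined with the same domain, can be sketched in Python. This is a minimal illustration and not cf-python: the `Field` class, its attribute names and `same_domain` are hypothetical. Only domain axis sizes and 1D coordinates are compared here, whereas a full test would also have to compare auxiliary coordinates, cell measures, cell methods and transforms.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical field: each one carries its own domain description."""
    name: str
    axis_sizes: dict                                  # domain axis -> size
    coordinates: dict = field(default_factory=dict)   # domain axis -> values
    data: list = None                                 # data array (not used here)


def same_domain(f, g):
    """True if both fields describe the same domain, wherever they are stored."""
    return f.axis_sizes == g.axis_sizes and f.coordinates == g.coordinates


lat = [-45.0, 0.0, 45.0]
temperature = Field("air_temperature", {"lat": 3}, {"lat": lat})
# A duplicate coordinate array with equal values, as if read from another file.
pressure = Field("air_pressure", {"lat": 3}, {"lat": list(lat)})
humidity = Field("specific_humidity", {"lat": 2}, {"lat": [0.0, 45.0]})
```

Here `temperature` and `pressure` compare as having the same domain by value, even though they do not share a coordinate object. Sharing one netCDF coordinate variable would only make the check cheaper.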

I note that we have another thread about transforms, grid mappings and formula terms, but I need to think more about that one!

Best wishes

Jonathan

comment:135 Changed 8 years ago by mgschultz

Dear Jonathan,

Very nice summary from my point of view. Only one philosophical question concerning the point about "fields distributed over several CF-netcdf files": in my perhaps naive understanding, the data model should treat this as one field, and it is an implementation issue to allow for the field being distributed across several "sources". One can think of a generic memory mapping concept here, where the application must be aware of the total dimensionality of the domain and manage separately which chunks of this domain are currently loaded (as well as which chunks may be available at all). In fact, I recently came across some website where people made a point about ensembles which can be regarded as a typical sixth dimension (3*space, time, variable, ensemble member). This might require another distribution of fields across files, and in this case the fact that the fields belong together might not even be apparent from any netcdf dimension value.
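A rough Python sketch of this memory mapping idea follows. It is an assumption-laden illustration rather than any existing library: the class `DistributedField` and its methods are hypothetical, the sources are plain callables standing in for files, and chunks are split along the first axis only.

```python
class DistributedField:
    """One logical field whose data are held in several sources.

    Hypothetical sketch: each source covers a contiguous range along the
    first axis and is read only when a row in that range is requested.
    """

    def __init__(self, shape):
        self.shape = shape      # total shape of the domain
        self._sources = []      # list of (start, stop, loader)
        self._loaded = {}       # start index -> chunk already in memory

    def add_source(self, start, stop, loader):
        """Register a callable returning rows start..stop-1 when called."""
        self._sources.append((start, stop, loader))

    def available(self):
        """Index ranges along the first axis for which a source exists."""
        return sorted((start, stop) for start, stop, _ in self._sources)

    def row(self, index):
        """Return one row, loading its chunk on first access."""
        for start, stop, loader in self._sources:
            if start <= index < stop:
                if start not in self._loaded:
                    self._loaded[start] = loader()
                return self._loaded[start][index - start]
        raise KeyError(f"no source holds row {index}")


# Two in-memory stand-ins for files, each holding two of six time steps.
field = DistributedField(shape=(6, 3))
field.add_source(0, 2, lambda: [[0, 1, 2], [3, 4, 5]])
field.add_source(2, 4, lambda: [[6, 7, 8], [9, 10, 11]])
```

The application sees a single field of shape (6, 3). Which chunks are loaded, and which exist at all, is bookkeeping that stays out of the data model.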

Best regards,

Martin

comment:136 Changed 8 years ago by stevehankin

Thanks for the nice summary, Jonathan.

I agree with Martin that the data model should treat [a field distributed over multiple files] as one field. It seems intuitively that the data model should not talk about "files" at all -- only about the abstractions of "domains", "fields" and "datasets". Can someone articulate the counter-argument?

Regarding bullet 4 above -- the significance of a domain that gets defined more than once, repeated independently through the definition of multiple fields (a.k.a. Whether fields ["domains"?] are independent as far as the data model is concerned): I don't think this is an issue for the data model at all. Including duplicated domain definitions in the data model will generate complexity downstream. For what benefit? Multiply defined domains are an artifact of the specific files that have been created (yes, mal-formed from the perspective of the data model). They are best dealt with as an implementation issue, imho. My bias is to keep the data model as simple (and "pure") as possible.

comment:137 follow-up: ↓ 141 Changed 8 years ago by bnl

Great summary Jonathan.

Taking your bullets one at a time:

1. I think it's ok to generalise domain beyond spatio-temporal coordinates alone.
2. A couple of use cases to consider for the dataless case: gridspec definitions, and irregular grids (I think in the instance of defining a data model, we should be a little more future looking than we are in our normal CF behaviour, there isn't much point in developing a new abstraction that can't cope with things we know are in our queue right now).
3. I think domain axis nicely avoids the difficulty that dimension variously can be interpreted as an axis and a length ... +1 from me.
4. I'm happy with this, I rather think it'd be nice to allow in the data model for a specific identity to apply to the domain though, so you can specifically share if you want to (rather than implicitly). This of course immediately allows the idea of a domain definition in one file, and the data in others, and plays straight into the dataless option ... even if we do not in our current implementation allow this!
5. So my previous point is stating that I agree with this!
6. Agreed. Aggregation rules are an implementation detail wrt the data model.

comment:138 in reply to: ↑ 134 ; follow-up: ↓ 139 Changed 8 years ago by markh

Replying to jonathan:

Thank you for the summary Jonathan. I am happy with all but one of the statements you make; I think this represents real progress in our model development.

I am not yet convinced about the Dimension/DomainAxis construct. I agree that Dimension is an unhelpful name but I do not think this is just an issue of nomenclature.

I'll try to summarise my concerns:

By defining a type in the data model, I feel that we are mandating that all applications implement this type, yet I do not feel it is necessary. I think the model is cleaner leaving the management of the relationship between Fields and all varieties of coordinate to the implementation and just mandating the nature of the relationship. As such I prefer a less specified model, with a qualified association.
We appear to be mandating that an instance of a DomainAxis must exist, even when it is defined by no coordinate, this feels overly prescriptive and unhelpful to me. I think it is valid to answer the question 'what defines this dimension of the Field' with the response 'Nothing'.
- The information carried by a DomainAxis not containing a coordinate appears to me to be nothing (as I think the definition of extent of the array is held by the Field)
I think the difference between an AuxiliaryCoordinate construct and a DomainAxis type, containing what seems to me to be a constrained AuxiliaryCoordinate is overstated by the model. These types are very similar in my mind, I would really like the model to reflect this similarity.
- They mainly differ in how they are used by their containing Field, I would prefer this to be managed more clearly by the Field.

You rightly bring up the question: 'What difference does this make to an implementing application?'

My concern is that we are overly constraining an implementation to handle the relationships between Fields and their coordinates using this construct. The model can be less prescriptive whilst still capturing all of the constraints which the CF model mandates. It just feels unnecessary to me, I don't think we should model unnecessary things.

I think we should err on the side of less prescriptive modelling to enable implementation freedom where we can. I think that the qualified association approach offers a better balance for this.

comment:139 in reply to: ↑ 138 ; follow-up: ↓ 143 Changed 8 years ago by davidhassell

Replying to markh:

Dear Mark,

Domain axis is fine with me, too.

The DomainAxis is a thorny one. I offer some alternate view points to your concerns.

By defining a type in the data model, I feel that we are mandating that all applications implement this type, yet I do not feel it is necessary.

I agree that it is not necessary, but I don't think that we are suggesting such mandatory behaviour. As an example, cf-python, which is faithful to the proposed data model, doesn't have a DomainAxis (python) object.

(Skipable detail alert!) I have preferred so far to be lightweight and store strings for axis identities and a simple table relating each string to its size. The field contains an unordered collection of coordinates and the field knows whether each one belongs conceptually to a domain axis or is an auxiliary coordinate (in which case the field knows the ordered list of domain axes which it requires). Conversely, it is easy for the field to check whether a domain axis has a coordinate array.

This captures the spirit and functionality of the data model because domain axes do exist independently of coordinate arrays, even though there is no DomainAxis object which explicitly contains a coordinate.
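A minimal Python sketch of the lightweight approach just described, assuming nothing about cf-python's real internals: the `Field` class, its attributes and its method names are all hypothetical. Domain axes are plain strings in a size table, and there is no DomainAxis object.

```python
class Field:
    """Hypothetical field holding domain axes as plain strings."""

    def __init__(self, axis_sizes):
        self.axis_sizes = dict(axis_sizes)   # axis identity -> size
        self.dim_coords = {}                 # axis identity -> 1D values
        self.aux_coords = {}                 # name -> (ordered axes, values)

    def set_dim_coord(self, axis, values):
        """Attach a 1D coordinate array to one domain axis."""
        if len(values) != self.axis_sizes[axis]:
            raise ValueError(f"coordinate does not match size of {axis!r}")
        self.dim_coords[axis] = values

    def set_aux_coord(self, name, axes, values):
        """Attach an auxiliary coordinate spanning an ordered list of axes."""
        unknown = [axis for axis in axes if axis not in self.axis_sizes]
        if unknown:
            raise ValueError(f"unknown domain axes: {unknown}")
        self.aux_coords[name] = (tuple(axes), values)

    def has_coordinate(self, axis):
        """Check whether a domain axis has a coordinate array."""
        return axis in self.dim_coords


f = Field({"dim0": 2, "dim1": 3, "dim2": 1})
f.set_dim_coord("dim0", [10.0, 20.0])
f.set_aux_coord("lon", ["dim0", "dim1"], [[0, 1, 2], [3, 4, 5]])
```

In this sketch `dim2` has a size and an identity but no coordinate array, so the domain axis exists independently of any coordinate.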

The information carried by a DomainAxis not containing a coordinate appears to me to be nothing (as I think the definition of extent of the array is held by the Field)

I think a domain axis not containing a coordinate array carries information on its size and its identity. Its size will be 1 and its identity is required by other constructs which need an ordered list of domain axes, such as any construct which contains a data array (cell measure, auxiliary coordinate, field, etc.)

I think the difference between an AuxiliaryCoordinate construct and a DomainAxis type, containing what seems to me to be a constrained AuxiliaryCoordinate is overstated by the model. These types are very similar in my mind, I would really like the model to reflect this similarity.

It seems to me that a DomainAxis construct is quite different to an AuxiliaryCoordinate construct. It may contain what is essentially a constrained AuxiliaryCoordinate, but I don't think that it is one itself.

I don't think I fully understand the 'dim' and 'dims' boxes in your very helpful UML diagram, but it seems to me that your 'dim' is in fact very similar to our DomainAxis construct in that it has an identity and it is not a coordinate but may (or may not) contain a constrained coordinate. If this is all wrong, please correct me.

All the best,

David

comment:140 Changed 8 years ago by jonblower

To follow up Mark and David's last comments. I think there are often differences between data models and real code - just because a class exists in a data model doesn't mean that it has to exist in an API. Although pure "model-driven developers" might disagree with me, I find that adhering too literally to a data model in code can lead to very awkward code that doesn't follow good practice or idioms of the language in question (such code also tends to be verbose). So I would be completely comfortable having, for example, DomainAxis in a data model, but having different ways of achieving this.

(For example, I'm writing an API to deal with data using some of the ISO data model, but I'm certainly not going to implement all those classes literally! The important thing is to make sure that the API can express the same concepts accurately for the intended applications.)

Literal translations of abstract data models tend to be more useful in building data-exchange formats than APIs. (E.g. The NetCDF data model translates quite simply and uncontroversially into NcML, but the same data model translates into many different APIs, which reflect the characteristics of the language involved.)

Just some thoughts, Jon

comment:141 in reply to: ↑ 137 Changed 8 years ago by Oehmke

Replying to bnl:

Great summary Jonathan.

Taking your bullets one at a time:

1. I think it's ok to generalise domain beyond spatio-temporal coordinates alone.
2. A couple of use cases to consider for the dataless case: gridspec definitions, and irregular grids (I think in the instance of defining a data model, we should be a little more future looking than we are in our normal CF behaviour, there isn't much point in developing a new abstraction that can't cope with things we know are in our queue right now).

I would like to second this. Our ESMF regridding software can already generate interpolation weights for a subset of the GRIDSPEC convention where we don't require a data field. I know that this is for a version of CF beyond the one under consideration, but for the sake of consistency between different versions of the data model it seems better to build a property like not requiring data in from the beginning rather than change it later.

comment:142 in reply to: ↑ 129 Changed 8 years ago by markh

Replying to stevehankin:

Replying to markh:

Replying to stevehankin:

Hello Steve

That is a really helpful clarification and, I think, something of a restatement of comments you made previously that I don't think I had properly taken on board.

I will consider this carefully and how it affects the issues we are discussing.

thank you mark

For what it's worth, my own group at PMEL has for years done development work in both the Java and the non-Java universes. We are acutely aware of the gulf that exists between these two. For example, in order to harness the aggregation capabilities of the Java netCDF from non-Java clients, we must host the data in a TDS web server and read it through OPeNDAP. Ditto for reading the non-netCDF formats. Today we are seeing some of the reverse; advances in CF gridspec and CF ugrid are taking place in the non-Java universe and leaving the Java side behind. There's a need (and an opportunity) for a break-through in interoperability!

Hello Steve

I have been considering these topics over the last week and reading up on netCDF and the CDM.

I think that where I have referred to netCDF in posts I have referred to the API, the CDM and the encoding, but primarily the API.

My perspective on the world is that CF is currently a convention for using the netCDF API and storing information in netCDF files. If we can deliver a data model specification then CF may be considered as an implementation neutral model with controlled terminologies. In this case, netCDF is an implementation of CF, but one of potentially many.

This relationship between CF and netCDF could be characterised by saying:

CF is an implementation agnostic data model with a set of controlled terminologies.
All CF concepts can be represented in netCDF, although netCDF may represent concepts outside of CF;
CF provides a generalised way of modelling its concepts, along with a set of specialisations for how these may be used in netCDF.

The conventions document exists for netCDF file encoding, perhaps further information would help for how the CF model interacts with netCDF's CDM.

As you have stated, there is a challenge in how to keep the communities and their tool sets connected. I wonder if the development and maintenance of the abstracted CF model is the tool to do this, if approached correctly. If each implementing community is able to keep in contact with the model, and show how it is doing so, then this could facilitate the cross fertilisation of ideas and concepts.

There may be an activity to develop further the aggregation rules that David has posted about, once the core model is in place; I see real value in doing this as a community activity, involving implementers from different tool development communities.

As an example, I have taken my sketch of a potential CF data model, and superimposed on it a number of constructs from the CDM, to try and illustrate how an implementer might provide information on how their implementation works with the model. I see such things as the responsibility of the implementers, but I say this on the assumption that implementers are also modellers (most of us possess many hats) involved in the data model development community.

I think that if we keep the expectations of the model clear and keep talking them over, as we are doing, while allowing sufficient freedom and flexibility to implementations by modelling only what is required, then we can use this process to move implementation communities closer together.

comment:143 in reply to: ↑ 139 Changed 8 years ago by markh

Replying to davidhassell:

I agree that it is not necessary, but I don't think that we are suggesting such mandatory behaviour. As an example, cf-python, which is faithful to the proposed data model, doesn't have a DomainAxis (python) object.

This feels to me like an example of where I would prefer to see the data model say less, and implementers make choices. If the data model has a DomainAxis type and cf-python implements the model, I would look for a DomainAxis python object in the implementation.

If you haven't used one, I would suggest that the type might not be required for the model.

I think a domain axis not containing a coordinate array carries information on its size and its identity. Its size will be 1 and its identity is required by other constructs which need an ordered list of domain axes, such as any construct which contains a data array (cell measure, auxiliary coordinate, field, etc.)

This doesn't feel like identity to me. There is no controlled terminology or definition; what you are describing feels to me like a reference, a fine detail of plumbing which doesn't need to be captured in a model, other than by referencing; UML does this quite well without getting into specifics.

I don't think I fully understand the 'dim' and 'dims' boxes in your very helpful UML diagram, but it seems to me that your 'dim' is in fact very similar to our DomainAxis construct in that it has an identity and it is not a coordinate but may (or may not) contain a constrained coordinate. If this is all wrong, please correct me.

It is very similar, you are absolutely right; I would say the key difference is that the dim and dims boxes, defining 'qualified associations' are more loosely modelled than a DomainAxis construct, they are an instruction to an implementer to manage the association somehow, with little specification on how, only on what the outcome should be: a constrained relationship based on data-dimensions. The key here is the constraint on the relationship, so that is all that is modelled.

I prefer this 'lighter touch' approach, reducing the complexity of the construct from a DomainAxis to a Coordinate (just a constrained AuxCoord).

Replying to jonblower:

To follow up Mark and David's last comments. I think there are often differences between data models and real code - just because a class exists in a data model doesn't mean that it has to exist in an API. Although pure "model-driven developers" might disagree with me, I find that adhering too literally to a data model in code can lead to very awkward code that doesn't follow good practice or idioms of the language in question (such code also tends to be verbose). So I would be completely comfortable having, for example, DomainAxis in a data model, but having different ways of achieving this.

I understand this perspective, but I would suggest that what you are describing is a way of implementing a poorly constructed model. If the model already exists, with a type I find unhelpful, I might also choose not to implement it, but I would be wary of doing so, as the modellers may be capturing something I don't understand, or that may be built upon in future iterations.

However, we are in the situation where the model is being developed; I would be sad to see a Type I feel is unnecessary make its way into the model, only for me to ignore it and reimplement that relationship in a different way. I would have to think hard before taking this approach with an implementation.

In this case I would like to keep the detail of modelling the relationship between a Field and the constrained Coordinates which define some or all of its degrees of freedom to a minimum. I think that my approach does this by only defining that:

a relationship exists to optionally define each of a Field's data dimensions;
the relationship constrains the number of defining Coordinates per data dimension to 1 or 0;
the Coordinates are 1-dimensional, strictly monotonic and numerical;
the length of each Coordinate must be consistent with the length of the relevant data dimension of the Field.

I think there is value in doing this rather than defining a specific type which has these responsibilities.
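The four constraints listed above could be checked by a single function rather than by a dedicated type. The following Python sketch is only one possible reading: the function name `check_coordinates` and the dict-based representation are hypothetical. Keying the dict by data dimension index gives "1 or 0 Coordinates per data dimension" by construction.

```python
def check_coordinates(data_shape, coordinates):
    """Check the constraints on a Field's Coordinates; return problems found.

    data_shape:   tuple with the length of each data dimension of the Field
    coordinates:  dict mapping a data dimension index to a flat list of
                  values; dimensions without an entry have no Coordinate
    """
    problems = []
    for dim, values in coordinates.items():
        if not 0 <= dim < len(data_shape):
            problems.append(f"dimension {dim}: no such data dimension")
            continue
        # A nested (multidimensional) list also fails this numerical test.
        if not all(isinstance(v, (int, float)) for v in values):
            problems.append(f"dimension {dim}: values are not all numerical")
            continue
        if len(values) != data_shape[dim]:
            problems.append(
                f"dimension {dim}: length {len(values)} != {data_shape[dim]}"
            )
        pairs = list(zip(values, values[1:]))
        increasing = all(a < b for a, b in pairs)
        decreasing = all(a > b for a, b in pairs)
        if not (increasing or decreasing):
            problems.append(f"dimension {dim}: not strictly monotonic")
    return problems


ok = check_coordinates((2, 3), {1: [0.0, 0.5, 1.0]})
bad = check_coordinates((2, 3), {0: [1.0, 1.0], 1: [0.0, 1.0]})
```

In the first call dimension 0 is simply left undefined, which the constraints allow. The second call reports a repeated value on dimension 0 and a length mismatch on dimension 1.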

I think the CF convention's description of scalar coordinates fits nicely into this approach, enabling any AuxCoord with one numerical value to be thought of as representing an uninstantiated data dimension. This becomes particularly useful when merging datasets together, all operations can be achieved by metadata inspection and promotion of suitable AuxCoord instances to Coordinates.
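As a sketch of that merging operation, and only under simplifying assumptions, the Python below promotes a scalar coordinate to a Coordinate of a new leading data dimension. The function `merge_on_scalar` and the dict layout of a field are hypothetical, and the data are never inspected, only stacked.

```python
def merge_on_scalar(fields, name):
    """Merge fields that differ only in the scalar coordinate `name`.

    Each field is a dict with 'data' (a nested list) and 'scalars'
    (coordinate name -> single numerical value). The named scalar is
    promoted to the Coordinate of a new leading data dimension.
    """
    ordered = sorted(fields, key=lambda f: f["scalars"][name])
    values = [f["scalars"][name] for f in ordered]
    if len(set(values)) != len(values):
        raise ValueError("promoted Coordinate would not be strictly monotonic")
    others = {k: v for k, v in ordered[0]["scalars"].items() if k != name}
    for f in ordered[1:]:
        rest = {k: v for k, v in f["scalars"].items() if k != name}
        if rest != others:
            raise ValueError("fields differ in their other scalar coordinates")
    return {
        "data": [f["data"] for f in ordered],
        "coordinates": {0: values},   # new dimension 0 is defined by `name`
        "scalars": others,
    }


t0 = {"data": [1.0, 2.0], "scalars": {"time": 0.0, "height": 2.0}}
t1 = {"data": [3.0, 4.0], "scalars": {"time": 6.0, "height": 2.0}}
merged = merge_on_scalar([t1, t0], "time")
```

The decision to merge is taken from the metadata alone. The `height` scalar stays a scalar because it is common to both inputs.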

comment:144 Changed 8 years ago by markh

The CF Conventions for NetCDF files inherit their specification of Coordinate Variables from the NetCDF User Guide, although the CF phrasing is slightly stronger:

A Coordinate Variable is a one-dimensional variable with the same name as its dimension [e.g., time(time)], and it is defined as a numeric data type with values that are ordered monotonically. Missing values are not allowed in coordinate variables.

For the data model, is it sufficient to state that a Coordinate must be strictly monotonic?

This criterion mandates that there are no repeating values and an ordering process is available which has been used on the data. As long as equality, greater than and less than operators are defined for a data type, a set of values may be monotonic, whether or not they are numeric.
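To illustrate, a strict monotonicity test needs nothing beyond the comparison operators. The Python sketch below uses a hypothetical helper, `is_strictly_monotonic`, and applies it to both numbers and calendar dates from the standard library.

```python
from datetime import date


def is_strictly_monotonic(values):
    """True if values strictly increase or strictly decrease.

    Relies only on the < and > operators of the value type, so it works
    for any type with a well defined ordering, numeric or not.
    """
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


numbers = [0.5, 1.5, 2.5]
dates = [date(2012, 1, 1), date(2012, 2, 1), date(2012, 3, 1)]
repeated = [1, 2, 2, 3]
```

`numbers` and `dates` both pass the test, while `repeated` fails because a repeated value breaks strictness.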

Am I missing something crucial here?

It seems a little over-prescriptive for the model to define that all Coordinates must be numeric to me.

comment:145 follow-up: ↓ 146 Changed 8 years ago by stevehankin

Insisting that coordinate variables be numeric greatly simplifies the writing of client code and documentation; it allows one to side-step the messy issues about collating sequences of text string (treatment of case, etc.). Can we articulate some compelling real world use cases for non-numeric coordinates that would balance the increased complexity they would carry along?

Do we agree on this philosophy?: our goal is not to be as general as possible, but instead to find the appropriate balance point between generality and simplicity. (Finding the "appropriate balance point" requires us to articulate and weigh use cases.)

comment:146 in reply to: ↑ 145 Changed 8 years ago by markh

Replying to stevehankin:

Insisting that coordinate variables be numeric greatly simplifies the writing of client code and documentation; it allows one to side-step the messy issues about collating sequences of text string (treatment of case, etc.). Can we articulate some compelling real world use cases for non-numeric coordinates that would balance the increased complexity they would carry along?

A case I have in mind is regarding the definition of correctly referenced dates.

The CF Specification for NetCDF files has a particular approach to defining dates and times, which provides coordinates with numeric values, complex units and associated reference calendars. I do not think this approach should be replicated as is in the data model.

I think the model can afford to be more generic, merely stating the requirements for dates and times and enabling many implementations of dates and times, as long as they support these requirements: dates, calendars, times, time-zones etc.

This keeps the data model at an appropriately high level, but it does cause a slight concern with data model Coordinates, if these are constrained to be numerical.

My view on this is that if a Coordinate's data values are definitively sortable, then strict monotonicity can always be evaluated. Thus we can constrain Coordinates to be strictly monotonic and one dimensional within the scope of the data model.

Whilst the allure of sorting strings exists, I understand the myriad of complications this can bring, so I can see why strings might be deemed not to be definitively sortable, while other things may be.

comment:147 follow-ups: ↓ 155 ↓ 156 Changed 8 years ago by stevehankin

Two lines of discussion, both pointing in the same direction

I understand the appeal you see in keeping the data model as general as possible. As previously discussed, the title of this ticket muddies the waters ("data model and reference implementation"). Is it your intention for this topic to apply only in the documentation of the data model, but not in the Python code? (Which would already be a source of confusion ...)

Significant elements of CF suffer from too much generality rather than too little. The merits of offering multiple ways to encode the same information need to be weighed against greater difficulty for users to learn and for designers and implementers to support the standard. The community's ability to support CF is resource-limited -- has been for many years. Cases like this that add complexity without adding expressive power should be evaluated against those pragmatic considerations. Stated another way, there's a down side to new directions that you successfully pioneer in Python; they will either generate new requirements for writers of Java code and supporters of legacy applications or they will reduce interoperability.

signed - Mr. Wet Blanket ('tis better to say 'no' in standards discussions)

comment:148 follow-up: ↓ 150 Changed 8 years ago by caron

Hi Mark, Steve:

I agree that arbitrary strings for coordinates are a big pain, and I wouldn't advise them. However, for the specific case of date/time, I would support ISO-8601 formatted strings (but not any ole' string) as datetime coordinates, as I have mentioned before. I think they are superior to the "offset since <date>" encoding that we have now.

Note that this would be a change from the current "udunits only" dates. Since we are nearing a proposal on modifying the calendar attribute, it's worth weighing a change in datetime representation.
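
To make the two encodings concrete, here is a small sketch using only the Python standard library. The reference date and offsets are made-up values, and Python's `datetime` assumes the proleptic Gregorian calendar, so this says nothing about model calendars.

```python
from datetime import datetime, timedelta

# Hypothetical time axis in the current style: "days since 2000-01-01 00:00:00"
REFERENCE = datetime(2000, 1, 1)
offsets_in_days = [0.0, 0.5, 31.0]

# Offset encoding -> ISO-8601 strings
iso_strings = [(REFERENCE + timedelta(days=d)).isoformat() for d in offsets_in_days]

# ISO-8601 strings -> offset encoding
decoded = [
    (datetime.fromisoformat(s) - REFERENCE) / timedelta(days=1)
    for s in iso_strings
]

print(iso_strings)
print(decoded)
```

In this simple case the round trip is lossless, so the choice between the two forms is about interoperability and library support rather than expressive power.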

(However, I will be on vacation for 6 weeks, so I can't say much more until after that.)

regards, John

comment:149 Changed 8 years ago by jonathan

Dear all

I think we should stick with numerically encoded dates. I would argue that support for strings as a means to specify date/times is a task for APIs, rather than that the standard should allow them. I agree with Mr Wet Blanket that it is more robust to keep the standard as simple as possible (but no simpler).

Best wishes

Jonathan

comment:150 in reply to: ↑ 148 ; follow-up: ↓ 151 Changed 8 years ago by bnl

Replying to caron:

However, for the specific case of date/time, I would support ISO-8601 formatted strings (but not any ole' string) as datetime coordinates, as I have mentioned before. I think they are superior to the "offset since <date>" encoding that we have now.

Absolutely agree. The problems with using "offset since date" are legion, particularly in comparisons, and using ISO-8601 would hand many date problems to libraries. That is not to say one couldn't keep using offset dates, but best practice should be to use ISO-8601 (which is why it's encoded as a standard), except for weird model calendars.
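
One way to picture the comparison problem is the sketch below, with made-up units and reference dates and a hypothetical `decode` helper. Two raw offset values can only be compared after each has been decoded against its own units string.

```python
from datetime import datetime, timedelta


def decode(value, unit, reference):
    """Decode one offset value; unit is a timedelta such as timedelta(days=1)."""
    return reference + value * unit


# Axis A: "days since 2000-01-01"; axis B: "hours since 2000-01-02"
a_raw, b_raw = 10, 216
a = decode(a_raw, timedelta(days=1), datetime(2000, 1, 1))
b = decode(b_raw, timedelta(hours=1), datetime(2000, 1, 2))

print(a_raw == b_raw)                  # the raw numbers differ
print(a == b)                          # the decoded instants agree
print(a.isoformat() == b.isoformat())  # so do the ISO-8601 strings
```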

comment:151 in reply to: ↑ 150 ; follow-up: ↓ 152 Changed 8 years ago by stevehankin

Replying to bnl:

Replying to caron:

However, for the specific case of date/time, I would support ISO-8601 formatted strings (but not any ole' string) as datetime coordinates, as I have mentioned before. I think they are superior to the "offset since <date>" encoding that we have now.

Absolutely agree. The problems with using "offset since date" are legion, particularly in comparisons, and using ISO-8601 would hand many date problems to libraries. Which is not to say one couldn't keep doing offset date, but best practice should be to use ISO-8601 (which is why it's encoded as a standard) (except for weird model calendars).

(Hi Brian. We've had this discussion before. :-) ) I agree on the up sides. But we need to balance against the down sides:

1. "Weird model calendars" are common in the climate community. Is there a way to encode weird model calendars in ISO-8601? If not, then a client CF application would need to support both "offset since date" and ISO-8601 in order to avoid creating an interoperability breach.
2. Ditto point 1 for climatological time axes.
3. We should articulate and understand the "legion" problems of "offset since date" encoding. I can think of 2 problems that come up over and over: i) using a mixed Gregorian-Julian calendar and crossing the 1582 boundary (do ISO-8601 libraries behave "scientifically" across this boundary?); ii) folks who lobby for units of "months" and "years" in order to reduce effort writing CF files (a relatively minor convenience issue).
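
As an illustration of the first point, the sketch below decodes an offset in a 360-day model calendar (12 months of 30 days). The helper name and reference year are assumptions made for the example. The result is a perfectly good model date that a Gregorian-only date type rejects.

```python
from datetime import date


def decode_360_day(days_since, ref_year=2000):
    """Decode whole days since <ref_year>-01-01 in a 360-day calendar."""
    year, day_of_year = divmod(days_since, 360)
    month, day = divmod(day_of_year, 30)
    return ref_year + year, month + 1, day + 1


model_date = decode_360_day(59)  # 30 February in the model calendar

try:
    date(*model_date)  # the standard library only knows Gregorian dates
    gregorian_ok = True
except ValueError:
    gregorian_ok = False

print(model_date, gregorian_ok)
```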

comment:152 in reply to: ↑ 151 ; follow-up: ↓ 153 Changed 8 years ago by jonblower

Replying to stevehankin:

Strictly speaking, ISO8601 only supports Gregorian dates. The standard requires dates to be consecutive, so mixed Gregorian/Julian is disallowed (good!). But you could still use a similar syntax (yyyy-mm-ddThh:mm:ss.SSSZ) for other calendars, even "weird model ones" - they wouldn't be ISO8601, but they would be unambiguous if coupled with a "calendar" attribute.

comment:153 in reply to: ↑ 152 ; follow-up: ↓ 154 Changed 8 years ago by stevehankin

Replying to jonblower:

Replying to stevehankin:

Strictly speaking, ISO8601 only supports Gregorian dates. The standard requires dates to be consecutive, so mixed Gregorian/Julian is disallowed (good!). But you could still use a similar syntax (yyyy-mm-ddThh:mm:ss.SSSZ) for other calendars, even "weird model ones" - they wouldn't be ISO8601, but they would be unambiguous if coupled with a "calendar" attribute.

Brian notes that using ISO-8601 would hand many date problems to libraries. A big plus. But presumably the encodings you propose would not be correctly interpreted by existing libraries and the clients that use them. Is there some community that has already done this? Can you articulate a roll-out sequence for the use of ISO dates in CF that wouldn't create as many problems as it solves? All strategies seem to lead to the need for applications to support BOTH time encodings or create a major breach in CF interoperability. An application that supports ISO but not "offset since date" could hardly be said to be "CF compliant".

If there's a clean answer, it seems like it would have to lie in a significant "climate time test bed" activity that in the process wrote the code libraries and proposed climate enhancements to ISO-8601.

comment:154 in reply to: ↑ 153 Changed 8 years ago by jonblower

Replying to stevehankin:

Just to clarify, I'm not pushing one way or the other for CF supporting strings as time axis elements - just stating the facts regarding ISO8601. But, since you ask, here's what I might do if I were advocating for date strings:

1. Don't call them ISO8601 date strings. That would define them to be Gregorian and might lead to the wrong results from existing ISO libraries. Instead, give them another name, like "CF date strings" (which use an ISO-like syntax that we should define).
2. Consider restricting the full ISO grammar (which allows for time periods, year-week syntax and partial time strings among other things), so that implementations only have to implement the parts that we really need in CF (I don't know exactly what these are, we'd need to think).
3. The calendar attribute would of course be mandatory (or have some defined default).
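
A rough sketch of what these points could look like in Python follows. Everything here is hypothetical: the function names, the exact grammar accepted, and the choice of three calendars ("360_day", "noleap", "proleptic_gregorian") are assumptions for illustration, not an agreed definition of "CF date strings".

```python
import re

# Restricted ISO-like grammar: yyyy-mm-ddThh:mm:ss, optional fraction, optional "Z".
# Time periods, year-week forms and partial strings are deliberately rejected.
_CF_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d+)?Z?")

_MONTH_LENGTHS = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def days_in_month(year, month, calendar):
    """Length of a month (1-12) in one of three calendars."""
    if calendar == "360_day":
        return 30
    if calendar == "noleap":
        return _MONTH_LENGTHS[month - 1]
    if calendar == "proleptic_gregorian":
        leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        return 29 if month == 2 and leap else _MONTH_LENGTHS[month - 1]
    raise ValueError(f"unsupported calendar: {calendar!r}")


def parse_cf_date_string(text, calendar):
    """Parse a hypothetical "CF date string"; the calendar argument is mandatory."""
    match = _CF_DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not in the restricted grammar: {text!r}")
    year, month, day, hour, minute = (int(match.group(i)) for i in range(1, 6))
    second = float(match.group(6) + (match.group(7) or ""))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {text!r}")
    if not 1 <= day <= days_in_month(year, month, calendar):
        raise ValueError(f"day not valid in calendar {calendar!r}: {text!r}")
    if hour > 23 or minute > 59 or second >= 60:
        raise ValueError(f"time of day out of range: {text!r}")
    return year, month, day, hour, minute, second


print(parse_cf_date_string("2000-02-30T00:00:00Z", "360_day"))
```

The same string raises `ValueError` when parsed with `"proleptic_gregorian"`, which is the point of making the calendar mandatory: the string alone does not determine whether the date exists.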

I don't know about Python, but the implementation of the above is already done in Java-land. Both ncWMS and the Unidata CDM use the excellent joda-time libraries to implement date string parsing and formatting for all the CF calendars (I think). However, I've never looked at climatological time axes.

Hope this helps, Jon

comment:155 in reply to: ↑ 147 Changed 8 years ago by markh

Replying to stevehankin:

I understand the appeal you see in keeping the data model as general as possible. As previously discussed, the title of this ticket muddies the waters ("data model and reference implementation").

I agree, the title and description block are not helpful.

Is it your intention for this topic to apply only in the documentation of the data model, but not in the Python code? (Which would already be a source of confusion ...)

Absolutely, I think this topic must only apply to the documentation of the format independent data model. Implementation discussions should be for context only.

Perhaps it is time to close this ticket and open up a new child ticket with that scope explicitly defined to attempt to draw these discussions to a conclusion.

(I fear this ticket has gone far beyond the scope of anyone reading it from start to end.)

comment:156 in reply to: ↑ 147 ; follow-up: ↓ 159 Changed 8 years ago by mgschultz

Replying to stevehankin:

I would like to second Mark here and suggest closing this ticket and organizing the further discussion in at least two different streams. Ideally, the time between closing this ticket and opening new ones should be used to update the document (http://www.met.rdg.ac.uk/~jonathan/CF_metadata/cfdm.html) and associated information based on what seems to have come out of this discussion. That process would be very helpful in identifying which aspects have been resolved and where further discussion is needed.

On a more general line, I am wondering if it wouldn't be more productive to adopt the general CF philosophy of pragmatism and focus efforts first on the reference Python implementation. This will certainly lead to specific design questions, which should then be brought to light and discussed also in the context of a language-independent data model. Of course this data model may then be somewhat less general, but it may therefore become simpler, which I find is generally a good principle.

comment:157 follow-up: ↓ 158 Changed 8 years ago by markh

Replying to jonblower:

I don't know about Python, but the implementation of the above is already done in Java-land. Both ncWMS and the Unidata CDM use the excellent joda-time libraries to implement date string parsing and formatting for all the CF calendars (I think). However, I've never looked at climatological time axes.

This illustrates my point. I am not advocating a discussion here on how to represent time; instead I am advocating that the data model does not take on this question at all. It is for implementations to deal with.

I would like the model to merely state:

Dates and times, with respect to a variety of Calendars, may be defined as values (points and bounds) for Coordinates and AuxiliaryCoordinates.

(perhaps with an addendum to highlight aggregates of discontinuous bounded values such as climatologies if required).

If we can agree this we can save the discussion of date and time implementations for another day (I can hardly wait ;), the data model shouldn't define such details.
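
To illustrate how little such a statement would need to pin down, here is a hypothetical container in which the values are opaque to the model. The class and field names are invented for this sketch and are not taken from the data model document or from any existing implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Coordinate:
    """Points, optional bounds and an optional calendar; value types are opaque."""

    points: Sequence[Any]
    bounds: Optional[Sequence[Tuple[Any, Any]]] = None
    calendar: Optional[str] = None

    def __post_init__(self):
        if self.bounds is not None and len(self.bounds) != len(self.points):
            raise ValueError("need one (lower, upper) pair per point")


# One implementation might store numeric offsets (days since 2000-01-01) ...
numeric_time = Coordinate(
    points=[15.0, 45.0],
    bounds=[(0.0, 30.0), (30.0, 60.0)],
    calendar="360_day",
)

# ... another might store (year, month, day) tuples for the same instants.
tuple_time = Coordinate(
    points=[(2000, 1, 16), (2000, 2, 16)],
    bounds=[((2000, 1, 1), (2000, 2, 1)), ((2000, 2, 1), (2000, 3, 1))],
    calendar="360_day",
)
```

Both objects satisfy the same abstract description; how the date and time values are represented is left entirely to the implementation.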

comment:158 in reply to: ↑ 157 Changed 8 years ago by stevehankin

Replying to markh:

Replying to jonblower:

I don't know about Python, but the implementation of the above is already done in Java-land. Both ncWMS and the Unidata CDM use the excellent joda-time libraries to implement date string parsing and formatting for all the CF calendars (I think). However, I've never looked at climatological time axes.

This illustrates my point. I am not advocating a discussion here on how to represent time; instead I am advocating that the data model does not take on this question at all. It is for implementations to deal with.

I would like the model to merely state:

Dates and times, with respect to a variety of Calendars, may be defined as values (points and bounds) for Coordinates and AuxiliaryCoordinates.

(perhaps with an addendum to highlight aggregates of discontinuous bounded values such as climatologies if required).

If we can agree this we can save the discussion of date and time implementations for another day (I can hardly wait ;), the data model shouldn't define such details.

I second the idea of closing this ticket and making new tickets with a careful separation of implementation and data model. Our current understanding of CF is that netCDF is the data model and all else is conventions -- i.e. the data model is *much* more general than the CF implementations. Your task in crafting a CF data model is to narrow that gap. The point you've raised -- whether the abstract data model needs to stipulate that coordinate axes are numeric -- is a valid point to consider.

You've probably noticed that Jonathan and I come from different philosophical camps on these discussions. My position is that the design-by-committee standards process often leads to standards that are too complex -- often much too complex. CF is significantly too complex already. When weighing topics that seem "nice" or "general" or "potentially useful" but no one is articulating a practical need for them, I believe the jury (ourselves) should be instructed to weigh those topics using a "guilty until proven innocent" guideline. That is our bulwark against creeping complexity.

comment:159 in reply to: ↑ 156 Changed 8 years ago by markh

Replying to mgschultz:

I would like to second Mark here and suggest closing this ticket and organizing the further discussion in at least two different streams. Ideally, the time between closing this ticket and opening new ones should be used to update the document (http://www.met.rdg.ac.uk/~jonathan/CF_metadata/cfdm.html) and associated information based on what seems to have come out of this discussion. That process would be very helpful in identifying which aspects have been resolved and where further discussion is needed.

Replying to stevehankin:

I second the idea of closing this ticket and making new tickets with a careful separation of implementation and data model. Our current understanding of CF is that netCDF is the data model and all else is conventions -- i.e. the data model is *much* more general than the CF implementations. Your task in crafting a CF data model is to narrow that gap. The point you've raised -- whether the abstract data model needs to stipulate that coordinate axes are numeric -- is a valid point to consider.

On that basis I have attempted to summarise the current status in a new ticket, #95; I hope I haven't acted too hastily.

Perhaps we can move the relevant discussions there and attempt to finalise the details for Data Model 1.5.

You've probably noticed that Jonathan and I come from different philosophical camps on these discussions. My position is that the design-by-committee standards process often leads to standards that are too complex -- often much too complex. CF is significantly too complex already. When weighing topics that seem "nice" or "general" or "potentially useful" but no one is articulating a practical need for them, I believe the jury (ourselves) should be instructed to weigh those topics using a "guilty until proven innocent" guideline. That is our bulwark against creeping complexity.

I agree: if the model is to be useful it must strive for simplicity. Just enough specification to enable interoperability seems a good objective to me.

Replying to mgschultz:

On a more general line, I am wondering if it wouldn't be more productive to adopt the general CF philosophy of pragmatism and focus efforts first on the reference Python implementation. This will certainly lead to specific design questions, which should then be brought to light and discussed also in the context of a language-independent data model. Of course this data model may then be somewhat less general, but it may therefore become simpler, which I find is generally a good principle.

My view on the approach is the opposite: I think we get more benefit from focussing attention on the data model.

I agree with your desire for simplicity; I restate: I think we must strive to make the data model as simple as possible whilst still being useful.

Changed 8 years ago by davidhassell

Attachment newCF_0.7.pdf added

UML diagram of version 0.7 of the proposed CF data model

Changed 8 years ago by davidhassell

Attachment cfdm_0.7.html added

version 0.7 of the proposed CF data model

comment:160 Changed 8 years ago by davidhassell

Hello,

It will be useful to read the latest draft of the data model, which incorporates many of the points raised in this discussion. There is also a UML diagram of it (see the attachments of this ticket). I encourage you to have a glance at it - it should help to refocus our efforts.

Only the most recent changes are highlighted in this document, not every change since version 0.1. However, the highlighted parts are mainly the areas that have changed since the very beginning.

Many thanks,

David

comment:161 in reply to: ↑ 134 Changed 8 years ago by jonathan

I have posted on ticket 95 about the changes in our data model document since the last time I summarised outstanding issues. Following that summary, there was a discussion about time coordinate variables. I think that is an important issue to discuss. It is not currently explicitly dealt with in the proposed data model document.

Jonathan
