Opened 8 years ago
Last modified 8 years ago
#94 accepted enhancement
Proposal for a CF String Syntax (CFSS)
Reported by: | ceceliadid | Owned by: | lowry |
---|---|---|---|
Priority: | medium | Milestone: | |
Component: | cf-conventions | Version: | |
Keywords: | Cc: | arlindo.dasilva@…, theurich@… |
Description
Proposal for a CF String Syntax (CFSS)
CF String Syntax (CFSS) is a format for expressing semantically meaningful assemblages of CF metadata in strings suitable for manipulation and comparison.
CFSS defines a syntax for associating standard names with qualifiers that may be coordinate variables or cell methods. It is intended to be extensible to constructs such as non-cell-based operators.
A driver for creating CFSS strings is to enable the semantic mediation of fields passed between components during model run-time. Standard names alone often cannot express the features of the data that are necessary for this sort of brokering.
Important features of CFSS are that it uses only CF-compliant content and is backward-compatible with existing standard names.
CFSS strings are structured as follows:
<standard name of data 1>, <standard name of data 2>, ... ,<standard name of data n> <standard name of coordinate or cell method 1>: <value 1> [<unit 1>] <standard name of coordinate or cell method 2>: <value 2> [<unit 2>] ... <standard name of coordinate or cell method m>: <value m> [<unit m>]
where standard names of data are optional; qualifiers are optional; and units are included when they are required for a coordinate. The ordering of standard names of data and the ordering of qualifiers does not affect equivalence of strings.
Examples of compliant strings using CFSS are:
x_wind
x_wind height: 10 m
x_wind height: 10 m time: mean region: atlantic_ocean
x_wind, y_wind height: 10 m time: mean region: atlantic_ocean
height: 10 m time: mean region: atlantic_ocean
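The statement that ordering does not affect equivalence can be illustrated with a minimal Python sketch. It is not part of the proposal: the names `cfss_key` and `cfss_equivalent` are hypothetical, and the sketch assumes that each qualifier name appears at most once and that no value contains a colon.

```python
def cfss_key(cfss: str):
    """Return an order-insensitive key for a CFSS string (illustrative only)."""
    names = set()
    qualifiers = {}
    current = None
    for word in cfss.split():
        if word.endswith(":"):
            # A word ending in ":" opens a new qualifier, e.g. "height:".
            current = word[:-1]
            qualifiers[current] = []
        elif current is None:
            # Before the first qualifier: a data standard name, possibly
            # with a comma separator attached to it.
            name = word.strip(",")
            if name:
                names.add(name)
        else:
            # Value (and optional unit) belonging to the current qualifier.
            qualifiers[current].append(word)
    return frozenset(names), frozenset((k, tuple(v)) for k, v in qualifiers.items())


def cfss_equivalent(a: str, b: str) -> bool:
    """Compare two CFSS strings, ignoring the order of names and qualifiers."""
    return cfss_key(a) == cfss_key(b)
```

With this sketch, `x_wind, y_wind height: 10 m time: mean` and `y_wind, x_wind time: mean height: 10 m` compare as equivalent, while changing any value or unit makes them differ.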
This proposal originated from discussions amongst Cecelia Deluca, Tim Campbell, Gerhard Theurich, Jonathan Gregory, Roy Lowry, Bryan Lawrence, and Balaji.
Change History (14)
comment:1 follow-up: ↓ 2 Changed 8 years ago by ngalbraith
comment:2 in reply to: ↑ 1 Changed 8 years ago by bnl
Replying to ngalbraith:
In what circumstances do measurement heights need to be described as 'x_wind height: 10 m'?
We go to a fair amount of trouble coding all the heights for our met measurements as coordinates, usually one per instrument, with a numeric value and a 'units' attribute. Using strings for this seems to take CF in a questionable direction, so if there's a real need for this flexibility, it should probably be described in the ticket.
The data itself should indeed be coded like that, what's needed here is a way of labelling the assemblage. Two example use cases:
- I have a model intercomparison project, and one of the required outputs is something to compare with real observations ... which are 'air_temperature height:2m'; so I want to be able to describe that requirement, and then label the output ... similarly
- I want to couple two model components together, and I need to describe what is coupled, and that description will include not just the standard_name, but coordinate properties, and values ... so I need a way of labelling the assemblage of CF properties.
comment:3 Changed 8 years ago by lowry
This ticket has yet to acquire a moderator. If there are no objections I would be happy to volunteer.
comment:4 Changed 8 years ago by ceceliadid
Thank you, that would be great. (I didn't realize each ticket needed a moderator.)
- Cecelia
comment:5 Changed 8 years ago by lowry
- Owner changed from cf-conventions@… to lowry
- Status changed from new to assigned
Accepting this as there was clear support from Bryan and Jonathan in addition to myself in a series of off-list e-mails that preceded Cecelia's submission of the ticket.
First task is to capture the discussions on CFSS syntax that were sent to the CF list rather than the Trac ticket.
Chris Barker's post
From: CF-metadata [cf-metadata-bounces@…] On Behalf Of Chris Barker [chris.barker@…] Sent: 02 November 2012 16:08 To: cf-metadata@… Subject: Re: [CF-metadata] [CF Metadata] #94: Proposal for a CF String Syntax (CFSS)
I know I should be commenting in TRAC, but I don't think I have a login...
First -- if this is already well established, and simply being codified here, then "never mind" but if there is still room for discussion:
CFSS strings are structured as follows:
<standard name of data 1>, <standard name of data 2>, ... ,<standard name of data n> <standard name of coordinate or cell method 1>: <value 1> [<unit 1>] <standard name of coordinate or cell method 2>: <value 2> [<unit 2>] ... <standard name of coordinate or cell method m>: <value m> [<unit m>]
Examples of compliant strings using CFSS are:
x_wind
x_wind height: 10 m
x_wind height: 10 m time: mean region: atlantic_ocean
x_wind, y_wind height: 10 m time: mean region: atlantic_ocean
height: 10 m time: mean region: atlantic_ocean
These strike me as being a pain to parse. For example:
"x_wind, y_wind height: 10 m time: mean region: atlantic_ocean"
there are three delimiters there: commas, colons and whitespace. But whitespace can also be used to separate the units from the value. Also, can there be whitespace in any of the values (probably not names)? To parse this, I guess I would:
- look for colons
- look for whitespace before the colon; what's between is the cell name?
- look at what's at the beginning, before the whitespace, before the cell name. Split that on commas, giving me the variable names.
- look for the next colon, then look before that for whitespace. Between the whitespace and the colon is the next cell name.
- between the previous colon and the cell name is the value and units.
- ...
I'm sure there is a smarter way to write that code, but I even find it hard to parse with my eyes.
So I suggest another delimiter -- does netcdf allow line feeds?
x_wind, y_wind
height: 10 m
time: mean
region: atlantic_ocean
or maybe semi-colons?
"x_wind, y_wind; height: 10 m; time: mean; region: atlantic_ocean"
If I've split that example up wrong, then it really proves my point!
Just my $0.2
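For comparison with the colon-scanning procedure described in the post, here is a minimal Python sketch of how the semicolon-delimited variant might be parsed. It is an illustration only: the function name `parse_semicolon_cfss` is hypothetical, and it assumes that clauses are separated by ";", that data names are separated by "," and that no value contains a colon or semicolon.

```python
def parse_semicolon_cfss(cfss: str):
    """Split a semicolon-delimited CFSS string into names and qualifiers."""
    names = []
    qualifiers = {}
    for clause in cfss.split(";"):
        clause = clause.strip()
        if not clause:
            # Tolerate an empty clause, e.g. after a trailing semicolon.
            continue
        if ":" in clause:
            # "height: 10 m" -> key "height", value "10 m" (value and unit together).
            key, _, value = clause.partition(":")
            qualifiers[key.strip()] = value.strip()
        else:
            # A clause without a colon is taken to be the list of data names.
            names = [n.strip() for n in clause.split(",") if n.strip()]
    return names, qualifiers
```

Each clause is handled independently, so no look-behind for whitespace is needed; separating the value from its unit would still require a further split on whitespace.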
Seth McGinnis's post
May I suggest using a semicolon to separate name-value pairs? That reads naturally and is easy to parse:
x_wind
x_wind height: 10 m
x_wind height: 10 m; time: mean; region: atlantic_ocean
x_wind, y_wind height: 10 m; time: mean; region: atlantic_ocean;
height: 10 m; time: mean; region: atlantic_ocean
(I also favor allowing an optional trailing semicolon on the final name-value pair.)
comment:6 Changed 8 years ago by lowry
Now for my view...
We have been trying in the NERC Vocabulary Server to find a way of structuring the information delivered in concept definitions. In the previous version we used XML snippets - not a good idea as they break the XML schema of any XML document in which the definition element is included.
Consequently, we switched to JSON (http://www.json.org/) which seems to work quite well. In this encoding the entire CFSS is bounded by '{}' making its boundaries clear. A very basic JSON syntax using a simple list of objects (which are delimited by commas) each comprising a name/value pair (separated by colons) would be:
{"standard_names": "x_wind y_wind", "height": "10 m" , "time": "mean", "region": "atlantic_ocean"}
A common useful practice is to constrain the object names using a controlled vocabulary.
JSON is a widely used standard encoding with associated tooling (formatters, validators, parser APIs) and is very similar to the other syntax suggestions being discussed. My view is why invent a standard when one already exists that seems to do the job?
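As an illustration of the tooling argument, the JSON example above can be read with Python's standard `json` module. This is only a sketch: the key `standard_names` is taken from the example, while the variable names are hypothetical.

```python
import json

# The JSON encoding from the example above, unchanged.
text = ('{"standard_names": "x_wind y_wind", "height": "10 m" , '
        '"time": "mean", "region": "atlantic_ocean"}')

record = json.loads(text)

# The data names share one string in this encoding, so they still
# need to be split on whitespace.
names = record["standard_names"].split()

# The same content with the keys in a different order.
reordered = json.loads(
    '{"region": "atlantic_ocean", "time": "mean", '
    '"height": "10 m", "standard_names": "x_wind y_wind"}'
)

# Python dict comparison ignores key order, so this is True.
same = record == reordered
```

Key order is ignored in the comparison, but the order of names inside the `standard_names` string is not; a real implementation would need to decide how to treat that.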
comment:7 Changed 8 years ago by ceceliadid
I am trying to understand the implications of using a JSON representation for creating more descriptive names for fields passed between models. Do you imagine each JSON string being encoded as a character string in the Fortran or C/C++ code, with escapes before special characters? My sense is that the string format proposed would be easier to work with inside a model. JSON would be especially unappealing if you just wanted to add a qualifier here or there to an existing set of standard name strings.
My gut feeling is that the representation should not matter. We are talking about equivalent content, and the character string and JSON formats are so close that translating would be supereasy. Can we think of this proposal as a CF syntactical schema and CV that can take multiple, well-defined and equally acceptable forms, with one being a string and another being JSON? What is the best way to formalize that in CF?
Best, -- Cecelia
comment:8 Changed 8 years ago by lowry
Hello Cecelia,
I don't like the idea of offering multiple forms. That just makes things unnecessarily complex for people writing code to use the CFSS strings, as they would have to allow for both encodings.
We've checked out the JSON character set and there are no issues with Fortran because all the characters have ASCII codes <127. So, no need for escapes. Talking to the developers in BODC, the Matlab/Fortran camp are what I would call 'JSON neutral', with support for JSON coming from the Java/Python camp because they have the APIs to handle it.
So, in summary we have a very good idea with four suggested encodings, each of which seems to have one supporter. Does anybody else in the CF community have an opinion on the preferred encoding option?
comment:9 Changed 8 years ago by ceceliadid
Hi Roy,
Thanks much to your team for looking into the implications of JSON in Fortran.
Something to consider is that the simple string representation proposed represents the interests of a class of users, not just our group. It would be backward compatible with a variety of multi-component modeling systems that currently base inter-component field matching on standard name strings (COAMPS, NOGAPS/NavGEM, NEMS, CESM, GEOS-5, ...). Adopting CFSS would be a trivial change: replacing a few standard name strings with specific, order-dependent CFSS strings in a field dictionary. Super-easy.
Making the approach super-easy to adopt is important. Please be aware that others have already rejected CF for information exchange within models: see "the CF Convention Standard Names were not well-suited to the needs of component-based modeling" in http://csdms.colorado.edu/wiki/CSDMS_Standard_Names
Personally, I think discussion with CF would have made sense before arriving at that conclusion. Ideally (to my mind) CF would eventually vet and accommodate necessary features for the surface modeling community, and spare the Earth system modeling community the need to interact with two similar standards in closely related domains.
Anyway, it sounds like we have multiple applications for a CFSS-like entity ... and that JSON is truly best for some purposes. Maybe multiple encodings are inevitable, and not so bad?
Best, Cecelia
comment:10 Changed 8 years ago by jonathan
Dear Cecelia
Thank you for making the proposal. I support the idea. I am also most happy with agreeing one representation and making it a really simple one. Therefore I like your original proposal.
It could be made even simpler, by omitting the commas, so you would have "x_wind y_wind height: 10 m time: mean region: atlantic_ocean". This is like other CF attributes in using only blank space and colon as punctuation.
In order for the proposal to be accepted, it should supply the exact text to be added to the CF standard, because the process of actually updating the standard document should be purely editorial. The explicit text has to be agreed in the ticket. We would also need corresponding rules to be added to the conformance document.
Best wishes
Jonathan
comment:11 Changed 8 years ago by ceceliadid
- Cc arlindo.dasilva@… theurich@… added
Dear Jonathan and Roy,
I'm sorry it's taken so long to respond.
Before proposing the exact text to be added to the CF standard, I would like to review the comments received.
Here is a statement in the CFSS syntax originally proposed, for reference: x_wind, y_wind height: 10 m time: mean region: atlantic_ocean
Jonathan, you proposed removal of the comma separator: x_wind y_wind height: 10 m time: mean region: atlantic_ocean
Others recommended adding separators, rather than removing them. Below is a summary of suggestions (Arlindo's came from one of our project lists):
Chris Barker - 'a pain to parse', suggests: x_wind, y_wind; height: 10 m; time: mean; region: atlantic_ocean
Seth McGinnis - 'May I suggest using a semicolon': x_wind, y_wind height: 10 m; time: mean; region: atlantic_ocean "I also favor an optional trailing semicolon"
Arlindo da Silva - 'This conflicts with our current usage of ":" in resource files. How about something like this:' x_wind, y_wind: height=10 m; time=mean; region=atlantic_ocean
My personal preference is for Arlindo's version. My rationale is that the design of CFSS should place a higher priority on suitability for intended purpose than similarity in style to other CF constructs, or even simplicity. Arlindo's version is to me the most readable, and the additional separators will make it easier to parse.
Best, Cecelia
comment:12 Changed 8 years ago by jonathan
Dear Cecelia
The exact definition of syntax is not something I feel very strongly about myself, but nonetheless I would like to argue the case a little further. I do agree with the principle that suitability for purpose is more important than consistency with other CF attributes, if one has to choose between these priorities. Your original statement of requirements for CFSS strings doesn't mention human readability, but obviously human readability is a desirable characteristic.
These are the two extremes of syntax, I suppose:
Arlindo's "maximal" version: x_wind, y_wind: height=10 m; time=mean; region=atlantic_ocean
My minimal version: x_wind y_wind height: 10 m time: mean region: atlantic_ocean
which is motivated by existing CF syntaxes such as
example cell_methods string: longitude: latitude: mean where land time: maximum within years time: mean over years
example formula_terms string: sigma: lev ps: PS ptop: PTOP
I would say that consistency with other CF attributes is also desirable, since it makes CF easier to remember as a whole, and because the same operations can be used to parse different CF attributes if they have similar syntax.
Perhaps what makes my version and other CF strings more difficult to parse by eye is that the end of each "clause" is not explicitly marked. This is not necessary because it is implied by the fact that each clause begins with a word that ends with ":". So the algorithm to parse my version is this:
- Split the string into blank-separated words (discarding all the blanks).
- Identify the words which end in ":", namely height:, time: and region:. Let us call these words names.
- Assign all words following a name to be its list of values, up to but not including the next name.
- Assume any words preceding the first name, if any, to be a list of standard_names of data variables to which the string applies.
This seems quite easy to me.
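The four steps above can be sketched in Python as follows. This is a minimal illustration, not agreed text: the function name `parse_minimal_cfss` is hypothetical, and qualifiers are returned as a list of pairs (an assumption) so that a repeated name such as `time:` in a cell_methods string is preserved.

```python
def parse_minimal_cfss(cfss: str):
    """Parse a blank-and-colon CFSS string into (standard_names, qualifiers)."""
    # Step 1: split into blank-separated words, discarding the blanks.
    words = cfss.split()

    standard_names = []
    qualifiers = []   # list of (name, [values]) pairs, in order of appearance
    current = None

    for word in words:
        if word.endswith(":"):
            # Step 2: a word ending in ":" is a name.
            current = (word[:-1], [])
            qualifiers.append(current)
        elif current is None:
            # Step 4: words before the first name are data standard names.
            standard_names.append(word)
        else:
            # Step 3: words after a name are its values, up to the next name.
            current[1].append(word)

    return standard_names, qualifiers
```

Applied to "x_wind y_wind height: 10 m time: mean region: atlantic_ocean", this gives the standard names `["x_wind", "y_wind"]` and the pairs `("height", ["10", "m"])`, `("time", ["mean"])` and `("region", ["atlantic_ocean"])`.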
Here's a suggestion. We could take my minimal syntax as the standard version of CFSS strings, but permit the writer to insert ";" and "," wherever they think it will improve readability, on the assumption that these punctuation marks will be deleted before parsing i.e. they have no semantic content. The string could be processed thus before parsing: sed 's/[,;]//g'. That transformation converts Chris's and Seth's versions into mine.
Arlindo's is different because it uses "=", which is not used in existing CF attributes, so I am more uncomfortable about it. I do not think this version would be easier to parse. I suppose you would use "=" instead of ":" to identify the names. This could be accommodated as well by making "A=B" equivalent to "A: B" by doing sed 's/ *= */: /g'. Applying this transformation and the one of the preceding paragraph converts Arlindo's version into mine.
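The two pre-processing steps described above (deleting every "," and ";", then rewriting "A=B" as "A: B") can be sketched in Python with the standard `re` module. The function name `normalise_cfss` is hypothetical, and the sketch assumes that punctuation is always accompanied by a blank, since "x_wind,y_wind" would otherwise be fused into one word.

```python
import re


def normalise_cfss(cfss: str) -> str:
    """Reduce a punctuated CFSS string towards the blank-and-colon form."""
    # Delete every comma and semicolon; they carry no semantic content.
    stripped = re.sub(r"[,;]", "", cfss)
    # Rewrite "A=B" (with optional blanks around "=") as "A: B",
    # mirroring sed 's/ *= */: /g'.
    return re.sub(r" *= *", ": ", stripped)


minimal = "x_wind y_wind height: 10 m time: mean region: atlantic_ocean"
chris = "x_wind, y_wind; height: 10 m; time: mean; region: atlantic_ocean"
seth = "x_wind, y_wind height: 10 m; time: mean; region: atlantic_ocean;"
arlindo = "x_wind, y_wind: height=10 m; time=mean; region=atlantic_ocean"
```

Here `normalise_cfss(chris)` and `normalise_cfss(seth)` both equal `minimal`. For `arlindo`, the colon after `y_wind` survives both substitutions as written, so an additional rule would seem to be needed for that form.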
Best wishes
Jonathan
comment:13 follow-up: ↓ 14 Changed 8 years ago by lowry
Dear Cecelia,
From a moderator's point of view Jonathan's suggestion in the second to last paragraph of his last posting makes a lot of sense. The idea of having a syntax underpinned by a parsing algorithm with specified additional delimiters to enhance human readability is one that hadn't occurred to me.
Whilst it presents an interoperability risk (e.g. Seth's parser might not be able to read an encoding by Jonathan) I think this will be addressed through the profiling process of community usage. In other words, if the initial usage has commas and semi-colons inserted as per Seth's and Chris's suggestion then it will become a de facto standard profile that subsequent usage will follow.
Would you be happy to prepare the text for insertion into the standard on this basis?
Cheers, Roy.
comment:14 in reply to: ↑ 13 Changed 8 years ago by markh
This is an interesting proposal. I wonder if I may enquire about the objectives and approach before engaging on detail points?
Considering purpose, the ticket states:
CF String Syntax (CFSS) is a format for expressing semantically meaningful assemblages of CF metadata in strings suitable for manipulation and comparison.
A driver for creating CFSS strings is to enable the semantic mediation of fields passed between components during model run-time.
Does this constitute the scope of the objectives?
How do the proposers expect strings defined in this way to be used?
The language "A driver" suggests that there are others; is this so?
Is the proposal aiming to deliver:
- machine readable string?
- human readable string?
- an attribute on a CF variable in a NetCDF file?
- a syntax for presenting metadata of a CF NetCDF variable by inspection?
- a syntax for summarising the defining characteristics of a CF variable?
I would find it helpful to have such context in order to provide my view on the details of the proposal.
thank you mark
This discussion has moved forward in the email list, can those comments be added here? The email concerns have to do with ambiguous delimiters; my question is more basic.
In what circumstances do measurement heights need to be described as 'x_wind height: 10 m'?
We go to a fair amount of trouble coding all the heights for our met measurements as coordinates, usually one per instrument, with a numeric value and a 'units' attribute. Using strings for this seems to take CF in a questionable direction, so if there's a real need for this flexibility, it should probably be described in the ticket.