Opened 9 years ago

Last modified 8 years ago

#79 reopened enhancement

Handling and formatting of vector quantities in CF

Reported by: lavergne Owned by: cf-conventions@…
Priority: medium Milestone:
Component: cf-conventions Version:
Keywords: vector Cc: markh

Description

1. Title

Handling and formatting of vector quantities in CF

2. Moderator

  1. Lavergne (Norwegian Met Institute - MET.NO)

3. Requirement

At the time of writing, the CF convention allows datasets/variables that are components of 2D (or 3D) vectors to be defined. For example, a CF file might contain both a wind_speed and a wind_direction variable.

However, it lacks the capability to indicate that the two variables above are truly the components of a higher-dimensional vector (the wind vector, which in this case would be 2D). Such a capability would be quite useful for:

  • Allowing a single status_flag to apply to the vector as a whole, that is, to both of its components. Such a flag variable should have wind_vector status_flag as its standard_name (if the vector object was given the standard name wind_vector).
  • Enabling third-party software (such as map plotting software) to identify that there is a vector object in the file, and that it can thus be displayed with arrows (or barbs).
  • Generally easing any vector-based operations such as rotation/scaling of components when changing Earth-mapping projection.

If feasible, the implementation of this proposal should be made backward compatible.

4. Initial Statement of Technical Proposal

The requirements above can be implemented using an umbrella variable for the vector variable. Such a variable would hold no data, have no dimension, and be of arbitrary type, just like the grid mapping variable (5.6. Horizontal Coordinate Reference Systems, Grid Mappings, and Projections).

The (proposed) vector variable would require only two string attributes: standard_name and components. It can hold others, such as long_name, if relevant.

The standard_name identifies what the vector quantity is about, and allows for later cross-referencing (e.g. in the status_flag example above).

The components attribute is a space-separated list of variable names that are all components of the vector variable. It is noteworthy that components is intended as any decomposition of the vector along some axis (e.g. speed/dir, x/y, u/v, north/east, etc...). To be valid, a vector variable shall have at least as many components as the dimensionality of the vector. A 2D wind vector shall thus list at least 2 components (e.g. u and v) but we allow for the speed to also be in the file, and listed as a component.

As is the case now, each component variable shall define its dimensions, units, standard_name, grid_mapping, etc... The vector variable only holds the necessary attributes to find what the components are.

Note that all the component variables named by the vector variable must have the same set of coordinate axes, identified by the standard_names of their coordinate variables, although they do not have to have the same sets of coordinate values. This is to exclude, for instance, one component variable having time-latitude-longitude and another time-altitude-latitude-longitude as coordinate variables, but it does permit components to be on an Arakawa C-grid.

It should probably not be allowed to list two variables in the components attribute that both have the same 'standard_name'. The vector variable would then have two (possibly different) directions.

The proposal could be implemented as a section "3.6 Vector quantities".
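
For illustration, the rules in this section could be checked along the following lines. This is only a minimal sketch in Python, assuming the file's metadata has already been read into plain dictionaries. The layout of `variables`, the helper name `resolve_components` and the axis standard names used for the example grid are all hypothetical and not part of the proposal.

```python
def resolve_components(variables, vector_name):
    """Return the component variable names listed by a vector variable.

    `variables` maps variable name -> {"attrs": {...}, "axes": (...)},
    where "axes" holds the standard_names of the coordinate variables.
    This layout is an assumption made for the sketch.
    """
    attrs = variables[vector_name]["attrs"]
    # The components attribute is a space-separated list of variable names.
    names = attrs["components"].split()

    missing = [n for n in names if n not in variables]
    if missing:
        raise ValueError(f"components not found in file: {missing}")

    # Proposed restriction: all components share the same set of coordinate axes.
    axis_sets = {frozenset(variables[n]["axes"]) for n in names}
    if len(axis_sets) > 1:
        raise ValueError("components do not share the same coordinate axes")

    # Proposed restriction: no two components carry the same standard_name.
    named = [
        variables[n]["attrs"]["standard_name"]
        for n in names
        if "standard_name" in variables[n]["attrs"]
    ]
    if len(named) != len(set(named)):
        raise ValueError("two components share the same standard_name")

    return names


# Hypothetical axis standard names for a projected grid (assumption).
GRID_AXES = ("time", "projection_y_coordinate", "projection_x_coordinate")

variables = {
    "dX": {"attrs": {"standard_name": "sea_ice_x_displacement"}, "axes": GRID_AXES},
    "dY": {"attrs": {"standard_name": "sea_ice_y_displacement"}, "axes": GRID_AXES},
    "dir": {"attrs": {"standard_name": "direction_of_sea_ice_displacement"}, "axes": GRID_AXES},
    "ice_drift_vector": {
        "attrs": {
            "standard_name": "sea_ice_displacement_vector",
            "components": "dX dY dir",
        },
        "axes": (),  # the vector variable holds no data and has no dimension
    },
}

components = resolve_components(variables, "ice_drift_vector")
```

The sketch compares axis sets rather than coordinate values, which matches the intent that components may sit on staggered grids while still sharing the same kinds of axes.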

5. Example

Example in the case of a sea ice drift dataset:

// The two X and Y datasets and the direction.
float dX(time, yc, xc) ;
 dX:long_name = "component of the displacement along the x axis of the grid" ;
 dX:standard_name = "sea_ice_x_displacement" ;
 dX:units = "km" ;
 dX:_FillValue = -1.e+10f ;
 dX:coordinates = "lat lon" ;
 dX:grid_mapping = "Polar_Stereographic_Grid" ;

float dY(time, yc, xc) ;
 dY:long_name = "component of the displacement along the y axis of the grid" ;
 dY:standard_name = "sea_ice_y_displacement" ;
 dY:units = "km" ;
 dY:_FillValue = -1.e+10f ;
 dY:coordinates = "lat lon" ;
 dY:grid_mapping = "Polar_Stereographic_Grid" ;

float dir(time, yc, xc) ;
 dir:long_name = "direction of the displacement" ;
 dir:standard_name = "direction_of_sea_ice_displacement" ;
 dir:units = "degrees" ;
 dir:_FillValue = -1.e+10f ;
 dir:coordinates = "lat lon" ;
 dir:grid_mapping = "Polar_Stereographic_Grid" ;

// The new vector variable:

int ice_drift_vector ;
 ice_drift_vector:standard_name = "sea_ice_displacement_vector" ;
 ice_drift_vector:long_name = "sea ice drift vector" ;
 ice_drift_vector:components = "dX dY dir" ;

// A status flag for the vector:

byte status_flag(time, yc, xc) ;
 status_flag:standard_name = "sea_ice_displacement_vector status_flag" ;
 status_flag:long_name = "rejection and quality level flag" ;
 status_flag:valid_min = 0b ;
 status_flag:valid_max = 30b ;
 status_flag:grid_mapping = "Polar_Stereographic_Grid" ;
 status_flag:coordinates = "lat lon" ;
 status_flag:flag_values = 0b, 1b,..., 22b, 30b ;
 status_flag:flag_meanings = "missing_input_data over_land ... interpolated nominal_quality" ;
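
As a rough illustration of the cross-referencing in this example, a reader could locate the flag variable that applies to the whole vector by matching standard names. The sketch below is in Python and assumes the attributes have been loaded into a dictionary mapping each variable name to its attributes. That layout and the helper name `find_vector_flags` are hypothetical.

```python
def find_vector_flags(variables, vector_name):
    """Return names of variables whose standard_name is
    '<vector standard_name> status_flag'.

    `variables` maps variable name -> attribute dictionary (an assumption
    made for this sketch).
    """
    vector_std = variables[vector_name]["standard_name"]
    wanted = vector_std + " status_flag"
    return [
        name
        for name, attrs in variables.items()
        if attrs.get("standard_name") == wanted
    ]


variables = {
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"standard_name": "sea_ice_y_displacement"},
    "dir": {"standard_name": "direction_of_sea_ice_displacement"},
    "ice_drift_vector": {
        "standard_name": "sea_ice_displacement_vector",
        "components": "dX dY dir",
    },
    "status_flag": {
        "standard_name": "sea_ice_displacement_vector status_flag",
        "long_name": "rejection and quality level flag",
    },
}

flags = find_vector_flags(variables, "ice_drift_vector")
```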

6. Benefits

This proposal will ease the use of any vector quantities, such as winds, currents, sea ice motion, etc...

It enables the data producer to document that there are some vector quantities hosted in this file. This should greatly help third-party software, for example, to locate the vectors and handle them appropriately (plot with arrows, rotate them, etc...).

7. Status Quo and other approaches

This proposal stems from a discussion on the main CF list, originally posted as "[CF-metadata] Proposal for better handling vector quantities in CF" on Nov 24th 2011. The discussion contains both PROS and CONS to the proposal, as well as some alternative approaches. We will not summarize these discussions here, but let us nonetheless list the alternatives:

  • Extensive use of ancillary_variables;
  • Define a vector dimension;
  • Introduce Groups (Common Data Model-2) in CF;
  • Introduce Compound Data Types (HDF5) in CF.

Change History (69)

comment:1 follow-up: Changed 9 years ago by jonathan

Dear Thomas

Thank you for writing this proposal, which I support. Following the discussion on the email list, I like this solution best because it is lightweight, flexible and backwards-compatible. I agree with what you have written, except

  • I think we should be more precise about which CF attributes are allowed on the umbrella variable. Looking at Appendix A, I would suggest that it could have any of these ones: standard_name, long_name, (the new att) components, ancillary_variables, comment, history, institution, references, source. It doesn't make sense for it to have any of the attributes that describe data values or coordinates. Rather than listing the allowed attributes in a new section, I suggest that we should add a new symbol U in the Use column of Appendix A to define which CF attributes are permitted on umbrella variables. (Of course, since CF always allows other non-standardised attributes, without giving them an interpretation, it would not be an error to have other attributes, but it could deserve a warning.)
  • I don't think we need to impose a restriction of a minimum number of components. This would not be possible to check unless we defined that minimum number for each umbrella standard name. Since the application using the data has to inspect the components to find out what decomposition is being used, I think we can leave it up to the application to give an error if it can't find what it needs, rather than making it a defect of the file.
  • I suggest that in your example the umbrella variable should have an ancillary_variables attribute pointing to the status flag variable. This is not mandatory, but it is a courtesy.

I like your two proposed restrictions, that the components should have axes with the same standard names, and that all the components should be distinct.

I think it is a part of the proposal, isn't it, that we will be introducing a new class of standard name, which can only be used for umbrella variables. This will require a flag in the standard name table.

I agree that a new section 3.6 would be a logical place to introduce this new feature. An amendment to section 3.3 would be needed as well. In order for us to agree it, we need a proposal of exactly the text to be added to the conventions document and the changes to be made to the conformance document.

Best wishes

Jonathan

comment:2 Changed 9 years ago by mgschultz

Dear Thomas,

I am also in favour of your suggestion. However, I am not sure if I like "A 2D wind vector shall thus list at least 2 components (e.g. u and v) but we allow for the speed to also be in the file, and listed as a component." In this example, wind can be fully described by either u and v or speed and direction. A "component" should in my view be limited to specifying the components that are required to construct the vector. Additional "related" variables should carry some other attribute that can refer to the vector, but do you think we need the association in the reverse direction (from vector to additional quantities)?

Best,

Martin

comment:3 follow-up: Changed 9 years ago by ngalbraith

Allowing speed and direction in addition to x and y vectors makes this proposal much more useful to me. We often store/serve all of these, because they're generated differently within an instrument. For example, speed is often under-reported when wind is provided only as a pair of vectors.

So, I'd like to suggest amending the above:

A 2D wind vector shall thus list at least 2 components (e.g. u and v) but we allow for the speed and/or direction to also be in the file, and listed as components.

I'm not sure if there's another existing convention to specify this relationship, but also don't see how allowing nominally redundant components lessens the usefulness of the vector "container variable".

comment:4 follow-up: Changed 9 years ago by mgschultz

Just to clarify: I have no objections whatsoever to include additional quantities in the file. I would just think that the concept of defining a "vector" shall allow a data processing application to identify the situation when it needs more than one quantity to produce a plot (e.g. wind vector plot, wind vanes or whatever). This would become difficult if there are "optional" additional fields listed in the components. If the reference is made only in the reverse direction, then a program that analyses for example "wind_speed" would have a chance to find out that it could use the vector quantities "u" and "v" instead.

If you can explain to me why the additional fields shall be listed in the components, then I would be happy with that, provided that we can somehow draw this distinction between "mandatory" and "optional" components.

comment:5 in reply to: ↑ 1 Changed 9 years ago by lavergne

Replying to jonathan:

Good point with identifying which attributes are valid for a vector variable in Appendix A. I would however propose we use the letter V (vector) rather than U (umbrella). The concept of umbrella variable was handy to introduce what I wanted, but we should now probably use the concept of vector and component variable.

I think it is a part of the proposal, isn't it, that we will be introducing a new class of standard name, which can only be used for umbrella variables. This will require a flag in the standard name table.

Yes, this is my understanding. I will be happy to propose/request a new standard name for sea ice motion vectors. Others will later request the standard names required by their application.

I agree that a new section 3.6 would be a logical place to introduce this new feature. An amendment to section 3.3 would be needed as well. In order for us to agree it, we need a proposal of exactly the text to be added to the conventions document and the changes to be made to the conformance document.

Yes. It was not clear to me whether the trac ticket should directly start with an "official" text or with a general message introducing the proposal. I'll try to work on a text in the coming weeks.

Thomas

comment:6 in reply to: ↑ 3 Changed 9 years ago by lavergne

Replying to ngalbraith:

Allowing speed and direction in addition to x and y vectors makes this proposal much more useful to me. We often store/serve all of these, because they're generated differently within an instrument. For example, speed is often under-reported when wind is provided only as a pair of vectors.

Yes. This was only given as an example to make the point that we do not require all the possible components/decompositions of a given vector field to be specified in the file. It is valid (although maybe not very smart) to omit the direction. We want to allow any number of components for a given vector field, just as is the case now (one can have x, y, speed, direction, u, and v all in the same CF file as of today).

The only new thing this proposal would bring is that we now link and unite all these datasets and identify them as components of a single vector field.

So, I'd like to suggest amending the above:

A 2D wind vector shall thus list at least 2 components (e.g. u and v) but we allow for the speed and/or direction to also be in the file, and listed as components.

As I mentioned above in an answer to Jonathan, we could drop the "at least 2 components" part. The file containing only x for describing the full wind vector would be a valid CF file, but most applications would fail and report missing information.

I'm not sure if there's another existing convention to specify this relationship, but also don't see how allowing nominally redundant components lessens the usefulness of the vector "container variable".

We could drop that requirement too (although Jonathan supports it). But I would still like to make users aware that if they have 2 speed datasets in the same file that do not have identical values, and claim that both are components of a single vector field, they have an issue with their file.

Thomas

comment:7 in reply to: ↑ 4 Changed 9 years ago by lavergne

Replying to mgschultz:

If you can explain to me why the additional fields shall be listed in the components, then I would be happy with that, provided that we can somehow draw this distinction between "mandatory" and "optional" components.

I do not think there are such things as "mandatory" or "optional" components. This proposal was not meant to decide which types of components are more useful than others. The only thing we want is to unite all the component variables in a given file and associate them with a vector field.

Yes, there might be redundancy, and I have other ideas on how to limit it, but I do not think that is within the scope of this first proposal.

comment:8 follow-up: Changed 9 years ago by markh

I think this is a very useful extension to the CF standard and I support its inclusion.

I feel that the 'umbrella variable' approach provides sufficient scope for defining vector quantities made up of components.

A vector quantity differs from a scalar in that it has magnitude and direction. The direction requires a little care in definition.

I wonder if we should constrain the behaviour to either have components of the vector, or magnitude and direction, but not both. The example of defining 2 components and a direction in a components attribute does not seem correct to me. However, I am not sure that this can be tested for compliance.

I certainly agree that one should be able to either define magnitude and direction, or components.

Perhaps we could have three attributes available for the vector variable:

  • components: a space separated list of variable names
  • magnitude: a single variable name
  • direction: a space separated list of variable names

This could be defined as 'any of the above' but it may be constrained to 'magnitude and direction' or 'components' but not both if that was deemed valuable.

To define direction we need to be careful how the grid_mapping attribute is interpreted.

As such I wonder whether the vector umbrella variable should be allowed to have a grid_mapping attribute. Is this attribute only inferred from the components?

I think there should be a limitation that components of a vector should not have different grid_mapping attributes, otherwise it may not be possible to correctly combine the components to define the vector.

This could be achieved by providing a capability for a grid_mapping attribute on the vector variable and requiring consistency with all components referenced.

Replying to jonathan:

  • I don't think we need to impose a restriction of a minimum number of components. This would not be possible to check unless we defined that minimum number for each umbrella standard name. Since the application using the data has to inspect the components to find out what decomposition is being used, I think we can leave it up to the application to give an error if it can't find what it needs, rather than making it a defect of the file.

This could be addressed for an individual variable by requiring that the vector variable defines its dimensionality: it could be defined as a two vector or a three vector. If 'dimensionality' was a required attribute, with an integer value, then this could be used to ensure sufficient components (d) or directional components (d-1) are defined.

comment:9 in reply to: ↑ 8 Changed 9 years ago by lavergne

Replying to markh: Dear Mark,

Thank you for your support.

I wonder if we should constrain the behaviour to either have components of the vector, or magnitude and direction, but not both. The example of defining 2 components and a direction in a components attribute does not seem correct to me. However, I am not sure that this can be tested for compliance.

I certainly agree that one should be able to either define magnitude and direction, or components.

In today's CF, one can define magnitude (e.g. speed), directions, and the x/y (or u/v) components all in one file. We do not want to break or amend this. This proposal solely aims at uniting all the component variables (that already exist in today's CF files) under a single vector "object" (umbrella variable).

Perhaps we could have three attributes available for the vector variable:

  • components: a space separated list of variable names
  • magnitude: a single variable name
  • direction: a space separated list of variable names

I tend not to support this. First, I want to make clear that magnitude and direction are to me not different from 'x', 'y', 'u' or 'v'. Magnitude and direction are just regular components... on a polar basis. Second, and it is one of the strengths of the umbrella variable approach, each of the component variables has a standard name that describes what it is (a magnitude, a direction, an x, a y, etc...). So we do not need the umbrella variable to document this. To find the direction of the vector field, a user/software just needs to go through the list of ':components' variables listed by the umbrella, and check their ':standard_name' until the direction is found.
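
A minimal sketch of that lookup, in Python, might look as follows. It assumes the attributes are held in a dictionary mapping each variable name to its attributes, and it treats any standard name containing the word "direction" as a direction. Both the layout and that matching rule are assumptions for illustration only, and the helper name `find_direction_component` is hypothetical.

```python
def find_direction_component(variables, vector_name):
    """Return the first component whose standard_name looks like a direction.

    The test used here (the word 'direction' appearing in the standard_name)
    is a heuristic assumed for this sketch, not a rule from the proposal.
    """
    for name in variables[vector_name]["components"].split():
        std = variables[name].get("standard_name", "")
        if "direction" in std.split("_"):
            return name
    return None  # no direction component listed


variables = {
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"standard_name": "sea_ice_y_displacement"},
    "dir": {"standard_name": "direction_of_sea_ice_displacement"},
    "ice_drift_vector": {
        "standard_name": "sea_ice_displacement_vector",
        "components": "dX dY dir",
    },
}

direction = find_direction_component(variables, "ice_drift_vector")
```

A component without a standard_name is simply skipped, so the lookup only works when the data producer has supplied standard names for the components.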

I think there should be a limitation that components of a vector should not have different grid_mapping attributes, otherwise it may not be possible to correctly combine the components to define the vector.

Yes. Such a limitation is included in the proposal (although maybe not directly on the grid_mapping, but on the dimensions). I would tend to say it achieves the same goal.

comment:10 follow-up: Changed 9 years ago by jonathan

Dear Thomas

I agree with your view that we should not distinguish the functions of the "components" or between mandatory and optional. I see this proposal simply as a way to group related quantities. If there is redundancy, that is fine. I would be concerned about redundancy in metadata, but redundancy in data is a normal situation. It is already permissible in a CF-conforming file to provide u, v, direction and speed. This proposal does not change that at all; it just provides a means to indicate that these quantities are related. I wonder if Martin can agree to that.

V for vector might be too restrictive, since this proposal also applies to tensors. Could we use T?

Cheers

Jonathan

comment:11 follow-up: Changed 9 years ago by jonblower

Just a note that people are working on ways of representing uncertain data in NetCDF (called "NetCDF-U") and have come up with a similar approach of an "umbrella variable" (which they call a "concept without values" variable) to group variables together. For example, the mean and variance of a field are grouped to indicate they form a single "concept".

It would probably be a good idea to harmonize the syntax in some way, or think about whether there could be a generic means to group 2 or more variables in a semantic fashion, which would handle vectors, uncertainties and perhaps other use cases.

HTH, Jon

comment:12 follow-up: Changed 9 years ago by markh

Following investigations with implementing analyses using such vector/tensor umbrella variables we have identified two limitations within this proposal:

  • How to identify the dimensionality of a vector
    • this information is important for many operations, e.g. curl of a 2d vector and curl of a 3d vector are very different calculations.
  • How to identify which component a variable listed in 'components' is with respect to the vector quantity:
    • standard_name is optional and complex to parse, making this a fragile approach to component identification

Also of note: Replying to lavergne:

We could drop that requirement too (although Jonathan supports it). But I would still like to make users aware that if they have 2 speed datasets in the same file that do not have identical values, and claim that both are components of a single vector field, they have an issue with their file.

To address these issues I propose two amendments:

  1. A new attribute for vector variables: Dimensionality:
    • Dimensionality: an integer defining the order of the vector or tensor quantity
  2. Explicit statement of each component's relationship to the vector field:
    • providing identification of a component's relationship to the vector field
    • enforcing the requirement that a component must not be stated multiple times for a vector field
    • available components:
      • x_component
      • y_component
      • z_component
      • magnitude
      • xy_direction
      • z_direction
    • this could be implemented either as a list of fixed-name attributes or as a complex string of key-value pairs.

This would alter the proposed example to:

// The new vector variable:

int ice_drift_vector ;
 ice_drift_vector:standard_name = "sea_ice_displacement_vector" ;
 ice_drift_vector:long_name = "sea ice drift vector" ;
 ice_drift_vector:dimensionality = 2 ;
 ice_drift_vector:components = "x_component:dX y_component:dY xy_direction:dir" ;
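
For illustration, the key-value form of the components string could be read as sketched below in Python. The sketch assumes the "role:variable" pairs are separated by whitespace, as in the example above, and it rejects a role that is stated more than once. The helper name `parse_components` is hypothetical, and the set of roles is taken from the list of available components in this comment.

```python
# Roles taken from the "available components" list in this comment.
KNOWN_ROLES = {
    "x_component",
    "y_component",
    "z_component",
    "magnitude",
    "xy_direction",
    "z_direction",
}


def parse_components(value):
    """Parse 'role:variable role:variable ...' into a {role: variable} dict."""
    roles = {}
    for item in value.split():
        role, sep, var_name = item.partition(":")
        if not sep or not var_name:
            raise ValueError(f"expected 'role:variable', got {item!r}")
        if role not in KNOWN_ROLES:
            raise ValueError(f"unknown component role {role!r}")
        # A component must not be stated multiple times for a vector field.
        if role in roles:
            raise ValueError(f"role {role!r} is stated more than once")
        roles[role] = var_name
    return roles


roles = parse_components("x_component:dX y_component:dY xy_direction:dir")
```

With this form, a reader can ask for `roles["xy_direction"]` directly instead of inspecting the standard_name of every component.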

comment:13 follow-up: Changed 9 years ago by markh

  • Cc markh added

I have also given further consideration to the definition of a grid_mapping, and I think that it would be useful to enable a vector quantity to define a grid_mapping attribute.

The extension to the grid_mapping attribute syntax agreed in #70 enables a data variable to define multiple grid_mappings. As such, it may be necessary to state explicitly which grid mapping the vector quantity is defined with respect to.

I wonder whether it is useful to always define a grid_mapping, to explicitly define the meaning of x, y and z within the scope of the vector variable. This may help with conformance checking, ensuring that data variables defined as components are not stating different grid_mappings.

As such, the definition of a grid_mapping could be mandated for spatial vectors, if it is felt that would be necessary.

comment:14 in reply to: ↑ 11 ; follow-ups: Changed 9 years ago by markh

Replying to jonblower:

Just a note that people are working on ways of representing uncertain data in NetCDF (called "NetCDF-U") and have come up with a similar approach of an "umbrella variable" (which they call a "concept without values" variable) to group variables together. For example, the mean and variance of a field are grouped to indicate they form a single "concept".

It would probably be a good idea to harmonize the syntax in some way, or think about whether there could be a generic means to group 2 or more variables in a semantic fashion, which would handle vectors, uncertainties and perhaps other use cases.

HTH, Jon

I have looked into the proposal for NetCDF-U, as posted here: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2011/049494.html but mailman doesn't keep hold of the attachments.

The use of umbrella variables is pretty generic: they use the ancillary_variables attribute on a data variable with no data array payload to indicate some grouping taking place. It is left up to the user to infer meaning from this grouping based on uncertML concepts; no further semantics are implied by the linking NetCDF metadata as far as I can make out.

I think this is quite different from the case of vector component definitions, where a particular set of semantics are to be explicitly defined within the metadata.

It is my view that this proposal can provide the required semantics. It may be that encoding of uncertainty could use a similar approach, rather than ancillary data, but it may be that there are too many disparate cases, so it is better to retain much more flexibility.

I would not be in favour of adopting the approach used for uncertainty in NetCDF for vector quantity definitions, I think the requirements are for a focussed and specialist approach, as advocated here.

I think it is also worth noting that NetCDF-U was (is?) put forward as consistent with CF, not part of CF.

comment:15 in reply to: ↑ 14 Changed 9 years ago by jonblower

Replying to markh:

I would not be in favour of adopting the approach used for uncertainty in NetCDF for vector quantity definitions, I think the requirements are for a focussed and specialist approach, as advocated here.

Sure. I was actually thinking the other way around, that NetCDF-U could follow a similar approach to this proposal for vectors, which would involve using stronger names for NetCDF attributes than the generic "ancillary_variables". But this is a different discussion of course.

It may be that encoding of uncertainty could use a similar approach, rather than ancillary data, but it may be that there are too many disparate cases

As far as I understand, I think there are only really three structural cases in NetCDF-U: distributions, summary statistics and samples. Each of these has the concept of something like "components", although the meaning of the "component" is different in each case.

I think it is also worth noting that NetCDF-U was (is?) put forward as consistent with CF, not part of CF.

Yes, although people will undoubtedly try to use them together so I'd like to try to ensure that this can be done relatively cleanly. (No doubt there will be use cases for recording vector components as uncertain quantities at some point...)

Anyway, I don't want to distract from the core discussion in this ticket, just wanted to make sure that opportunities aren't missed for spotting related work.

comment:16 follow-up: Changed 9 years ago by rhorne@…

Folks:

I saw an email the other day saying the review period is over, and there have been no objections. Be aware that I am working on the GOES-R program and this system includes level 1b space weather products where the data values are associated with directional fields of view. My goal is, using the fundamental building block CF constructs of coordinate variables, cells, etc., to extend the conventions to support the construction of level 1b space weather products that are self-describing and standards-based.

This relates to this vector enhancement because unit vectors are a simple and convenient mechanism to define the location of the level 1b space weather data. The problem is that the CF vector extension as defined here falls short of what is needed to support these level 1b space weather products. For example, these fields of view, whose boresight (center) angle can be expressed as a three-dimensional unit vector, have a 3D angular extent around this boresight angle that is needed to complete the definition of the field of view (associated with product data values). An example field of view pattern is a circular cone. This angular extent can be conceptualized as being specified with a boundary variable (cell construct), but the vector construct as currently proposed does not provide the capability for the vector, as a singular entity, to be associated with a boundary variable. There are even cases where a level 1b space weather product data value is associated with a directional field of view that is moving. In this case, there are two "levels" of cells: (1) one for the range that accounts for the center angle of the directional field of view moving, and (2) one for the angular extent associated with the field of view pattern (e.g. circular cone).

I don't want to gum up the works and slow down the evolution of the CF conventions, but, at the same time, I wanted to let you all know about the ongoing effort to come up with a CF metadata conventions "like" approach to defining self-describing and standards-based level 1b space weather products.

Please advise.

comment:17 in reply to: ↑ 16 ; follow-up: Changed 9 years ago by markh

Replying to rhorne@excaliburlabs.com:

Folks:

I saw an email the other day saying the review period is over, and there have been no objections.

Hello Randy

I do not think the review period is over for this ticket; you may be referring to my post regarding #89, which is somewhat related to this ticket, but definitely a separate issue.

I am hoping that Thomas will return to this ticket soon and attempt to assimilate the comments to date.

I feel there is some discussion still required before this ticket is finalised; driving this from a perspective of real use is very powerful.

What alterations would you propose to this vector specification to enable you to store your data and metadata effectively?

Are you confident that the data types you are interested in are vectors, datasets with magnitude and direction across a sampling feature?

Could these 1b space weather datasets be a different type, which happens to use a vector as part of their definition?

I don't think this is 'gumming up the works' at all; I'd like to see if these space weather data types need to be in scope, and if so, how to accommodate them.

all the best mark

comment:18 Changed 9 years ago by rhorne@…

There are some more pertinent details on these level 1b space weather products....

Unlike the use of vectors above, which uses vectors to define the data elements in the products (i.e. the data variables) ...

The use of vectors for level 1b space weather is solely to locate the data in space (i.e. coordinate variables). In this case the vector is a 3D unit vector in a celestial coordinate system (i.e. J2000 ECI). This unit vector specifies a direction, and magnitude is irrelevant. This unit vector defines the boresight angle for the instrument's aperture.

To fully specify the location associated with the data here requires more than just a unit vector. There is also a need to specify additional information to make it clear what the entire field of view is. The field of view is an angular range. For example, a field of view angular range might be 30 degrees (with the unit vector being the centerline).

The 3D unit vector requires three coordinate variables. Although, conceptually, the 30 degrees is related to the CF cell construct (i.e. the use of the "bounds" attribute), there is no CF convention that allows a cell to be associated with a vector. Also note that there is no CF convention for angular bounds.

Note that there are other fields of view that are more complicated than the circular cone identified in the previous paragraph. Some space weather instruments have rectangular frustum and elliptical cone fields of view.

I am currently working on some strawman CF conventions that will work for these level 1b space weather products.

comment:19 in reply to: ↑ 17 Changed 9 years ago by lavergne

Replying to markh:

Replying to rhorne@excaliburlabs.com:

Folks:

I saw an email the other day saying the review period is over, and there have been no objections.

Hello Randy

I do not think the review period is over for this ticket; you may be referring to my post regarding #89, which is somewhat related to this ticket, but definitely a separate issue.

I am hoping that Thomas will return to this ticket soon and attempt to assimilate the comments to date.

Hi all,

Good news for this ticket: Thomas is back in business. I will answer/follow on the recent comments and tentatively propose a text later this week.

Cheers,

comment:20 in reply to: ↑ 14 Changed 9 years ago by lavergne

Replying to markh:

Replying to jonblower:

Just a note that people are working on ways of representing uncertain data in NetCDF (called "NetCDF-U") and have come up with a similar approach of an "umbrella variable" (which they call a "concept without values" variable) to group variables together. For example, the mean and variance of a field are grouped to indicate they form a single "concept".

It would probably be a good idea to harmonize the syntax in some way, or think about whether there could be a generic means to group 2 or more variables in a semantic fashion, which would handle vectors, uncertainties and perhaps other use cases.

HTH, Jon

I have looked into the proposal for NetCDF-U, as posted here: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2011/049494.html but mailman doesn't keep hold of the attachments.

The use of umbrella variables is pretty generic: they use the ancillary_variables attribute on a data variable with no data array payload to indicate some grouping taking place. It is left up to the user to infer meaning from this grouping based on uncertML concepts; no further semantics are implied by the linking NetCDF metadata as far as I can make out.

Hi Mark and Jon,

I had a look at the netCDF-U documents myself (mainly slides from workshops). I first agreed with Jon that the definition of a generic umbrella variable with only "ancillary_variables" was meeting our needs. After all, our vector variable is purely a container. The semantics of, say, variable dX w.r.t. variable ice_drift_vector is not to be specified by ice_drift_vector, but by dX, via its standard_name.

But after some time, I now tend to follow Mark on this. Not because of pure semantics, but more for human readability. A human would expect a vector to contain components and will look for them. In fact, should the vector come with additional flags and/or uncertainties, then these two variables could be listed as ancillary_variables, but the components attribute would still be the place to look for... the components.

So I would advocate for keeping components for the vector/tensor.

Thomas

comment:21 in reply to: ↑ 12 Changed 9 years ago by lavergne

Replying to markh:

Following investigations with implementing analyses using such vector/tensor umbrella variables we have identified two limitations within this proposal:

  • How to identify the dimensionality of a vector
    • this information is important for many operations, e.g. curl of a 2d vector and curl of a 3d vector are very different calculations.
  • How to identify which component a variable listed in 'components' is with respect to the vector quantity:
    • standard_name is optional and complex to parse, making this a fragile approach to component identification

Also of note: Replying to lavergne:

We could drop that requirement too (although Jonathan supports it). But I still would like to make the user aware that if he has 2 speed datasets in the same file, which have not identical values, and that he claims are both the components of a unique vector field, this user has an issue with his file.

To address these issues I propose two amendments:

  1. A new attribute for vector variables: Dimensionality:
    • Dimensionality: an integer defining the order of the vector or tensor quantity
  2. Explicit statement of the components relationship to the vector field:
    • providing identification of a components relationship to the vector field
    • enforcing the requirement that a component must not be stated multiple times for a vector field
    • available components:
      • x_component
      • y_component
      • z_component
      • magnitude
      • xy_direction
      • z_direction
  1. could be implemented either as a list of fixed name attributes or a complex string, with key value pairs.

This would alter the proposed example to:

// The new vector variable:

int ice_drift_vector;
 ice_drift_vector:standard_name = "sea_ice_displacement_vector" ;
 ice_drift_vector:long_name = "sea ice drift vector" ;
 ice_drift_vector:dimensionality = 2 ;
 ice_drift_vector:components = "x_component:dX y_component:dY xy_direction:dir" ;
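For illustration, the key/value form of the components attribute above could be read with a few lines of Python. This is only a sketch: the function name parse_components is hypothetical, and the "role:variable" syntax is assumed to be exactly as written in the example, with no escaping rules.

```python
def parse_components(value):
    """Parse space-separated 'role:variable' pairs into a dict (assumed syntax)."""
    roles = {}
    for token in value.split():
        role, sep, varname = token.partition(":")
        if not sep or not role or not varname:
            raise ValueError(f"malformed component entry: {token!r}")
        # The proposal requires that a component is not stated twice.
        if role in roles:
            raise ValueError(f"component role stated twice: {role!r}")
        roles[role] = varname
    return roles


components = parse_components("x_component:dX y_component:dY xy_direction:dir")
```

The duplicate check mirrors the requirement that a component must not be stated multiple times for a vector field.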

Mark,

This is a good proposal, thanks for bringing it forward. Both the dimensionality and the identification of components (key/value pairs) are helping the readability. However, I am concerned they both introduce some redundancy.

For dimensionality: The definition text of the standard_name sea_ice_velocity_vector would probably start as "Sea ice velocity vector is a 2D vector defined at sea ice surface". The exact text is not important per se, but there are no sea_ice_velocity_vector that are 3D. The same for wind_vector: "Wind is defined as a two-dimensional (horizontal) air velocity vector, with no vertical component. (Vertical motion in the atmosphere has the standard name upward_air_velocity.)". So what I mean is that the dimensionality of a given standard name for vectors is (will be) embedded in its definition. In fact, in the Standard Name Table, the column "canonical unit" (or a new one) could read "2D vector", "3D vector".

The same goes for the identification of the type of component in a key/value list: each variable listed as a component will have its own standard_name, which will make explicit whether it is an x_component or an xy_direction. Hence redundancy.

I am not sure redundancy is B.A.D. but at least the first proposal did not have any (I think). The redundancy you propose to introduce surely helps the human reader, but as far as a machine is concerned, it is not much harder to parse a list of key/value pairs than to read the standard_name attributes of each component variable. I have no definite answer here. Your proposal appeals to me for human readability, but I do not feel confident my comfort justifies redundancy.

I can anyway start on a text including your two points, and we can remove them later if we feel they are not worth the redundancy. Or we can have them as optional, but again I am not sure what is W.O.R.S.E: options or redundancy!

Thomas

comment:22 in reply to: ↑ 13 Changed 9 years ago by lavergne

Replying to markh:

I have also given further consideration to the definition of a grid_mapping, and I think that it would be useful to enable a vector quantity to define a grid_mapping attribute

The extension to the grid_mapping attribute syntax agreed in #70 enables a data variable to define multiple grid_mappings. As such it may be necessary to be explicit the grid mapping which the vector quantity is with respect to.

I wonder whether it is useful to always define a grid_mapping, to explicitly define the meaning of x, y and z within the scope of the vector variable. This may help with conformance checking, ensuring that data variables defined as components are not stating different grid_mappings.

As such, the definition of a grid_mapping could be mandated for spatial vectors, if it is felt that would be necessary.
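The conformance check mentioned above (components must not state different grid_mappings) could look roughly like the following Python sketch. The in-memory layout, a dict mapping each variable name to a dict of its attributes, and the function name are assumptions made for illustration only.

```python
def component_grid_mappings(variables, vector_name):
    """Return the set of grid_mapping values declared by a vector's components."""
    names = variables[vector_name]["components"].split()
    # A component without a grid_mapping attribute contributes None.
    return {variables[name].get("grid_mapping") for name in names}


# Hypothetical attribute dictionaries, loosely following the sea ice example.
variables = {
    "dX": {"grid_mapping": "Polar_Stereographic_Grid"},
    "dY": {"grid_mapping": "Polar_Stereographic_Grid"},
    "ice_drift_vector": {"components": "dX dY"},
}
mappings = component_grid_mappings(variables, "ice_drift_vector")
is_consistent = len(mappings) == 1
```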

Mark,

Again a very good point, thank you. I bumped on this myself when having to re-grid/re-project my vector fields. How do I tell the user if my vector components are still aligned with the old x/y axis, or if they have been rotated to the new grid?

But I think this is a gap in CF that was there prior to my wish to gather the components under an umbrella variable. I would say the best place to specify this grid_mapping would be at component variable level, not at vector level. And maybe the best thing to do would be to open a new trac ticket where we could start a new, focused discussion on this? I would follow you there :) What do you think?

Thomas

comment:23 in reply to: ↑ 10 Changed 9 years ago by lavergne

Replying to jonathan:

Dear Thomas

I agree with your view that we should not distinguish the functions of the "components" or between mandatory and optional. I see this proposal simply as a way to group related quantities. If there is redundancy, that is fine. I would be concerned about redundancy in metadata, but redundancy in data is a normal situation. It is already permissible in a CF-conforming file to provide u, v, direction and speed. This proposal does not change that at all; it just provides a means to indicate that these quantities are related. I wonder if Martin can agree to that.

See! Jonathan is "concerned about redundancy in metadata". That sheds a new light on my latest answers to Mark.

V for vector might be too restrictive, since this proposal also applies to tensors. Could we use T?

I am not used to working with tensor quantities. It might be that they are more frequent than I expect and that T is a better letter. Or use V and define it as vectors and tensors. It might be that what Randy is looking at for his l1b unit vectors are tensors?

Thomas

Cheers

Jonathan

comment:24 follow-up: Changed 9 years ago by lavergne

  • Resolution set to worksforme
  • Status changed from new to closed

Proposed implementation of #79 in CF document:

Note to Mark: I went for the minimum implementation, that is with as little metadata redundancy as possible.

Note to Jonathan: The Arakawa-C grid fell out of this proposal (see first comments). Could you post an example CDL description of how you were thinking this? Then we could amend the proposal to include that case as well?

Add section "3.5 Vectors and Tensors".

3.5. Vectors and Tensors

Many geophysical quantities are vectors (e.g. a wind vector, a sea ice velocity vector, etc.) or tensors. As multi-dimensional objects, both vectors and tensors have a set of components that can be used on their own (e.g. a wind speed) or be subjected to multi-dimensional operations (e.g. divergence, curl, etc.). A vector (or tensor) field has fields of component variables, that is, variables that take values on a grid.

In CF, a vector (or tensor) field variable is a variable of arbitrary type since it contains no data (similar to the grid mapping variable defined in section 5.6). Its role is to act as a container for the attributes that describe the vector field. Vector variables shall have (at least) two string attributes: standard_name, and components (see Appendix A).

The standard_name of a vector (or tensor) variable is like any other standard_name in CF, and can notably be used in standard name modifier constructs. The components attribute takes a space-separated list of variable names, each of them being henceforth recorded as the name of a component of the vector/tensor. The component variables are regular CF data variables, and thus have a standard_name. A user or application looking for the xy_direction component of a vector variable can thus browse through the list of its components and identify the requested data variable using the latter's standard name.
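The lookup described in this paragraph can be sketched in Python as follows. The layout (a dict mapping each variable name to its attribute dict) and the function name find_component are assumptions for illustration; the variable and standard names are taken from the sea ice example of this proposal.

```python
def find_component(variables, vector_name, standard_name):
    """Return the name of the component carrying the given standard_name, or None."""
    for name in variables[vector_name]["components"].split():
        if variables[name].get("standard_name") == standard_name:
            return name
    return None


variables = {
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"standard_name": "sea_ice_y_displacement"},
    "dir": {"standard_name": "direction_of_sea_ice_displacement"},
    "ice_drift_vector": {
        "standard_name": "sea_ice_displacement_vector",
        "components": "dX dY dir",
    },
}
direction_var = find_component(
    variables, "ice_drift_vector", "direction_of_sea_ice_displacement"
)
```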

Note that all the data variables listed in components shall have the same set of dimensions. This is to exclude, for instance, one component variable having dimension (time,lat,lon) and another (time,level,lat,lon) as for the same vector/tensor field. It is not allowed to list two variables in the components attribute that both have the same 'standard_name'.
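A checker for the two rules in this paragraph might look like the sketch below. It assumes each variable is held as a dict with a "dims" tuple and an "attrs" dict, which is a hypothetical representation and not part of CF, and it reads "same set of dimensions" literally as set equality of dimension names.

```python
def check_components(variables, vector_name):
    """Return a list of rule violations for one vector variable."""
    problems = []
    names = variables[vector_name]["attrs"]["components"].split()

    # Rule 1: all components share the same set of dimensions.
    dim_sets = {frozenset(variables[name]["dims"]) for name in names}
    if len(dim_sets) > 1:
        problems.append("components do not share the same set of dimensions")

    # Rule 2: no two components carry the same standard_name.
    seen = set()
    for name in names:
        standard_name = variables[name]["attrs"].get("standard_name")
        if standard_name is None:
            continue
        if standard_name in seen:
            problems.append(f"standard_name {standard_name!r} listed more than once")
        seen.add(standard_name)
    return problems


variables = {
    "dX": {"dims": ("time", "yc", "xc"),
           "attrs": {"standard_name": "sea_ice_x_displacement"}},
    "dY": {"dims": ("time", "yc", "xc"),
           "attrs": {"standard_name": "sea_ice_y_displacement"}},
    "ice_drift_vector": {"dims": (), "attrs": {"components": "dX dY"}},
}
problems = check_components(variables, "ice_drift_vector")
```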

Example 3.6: A sea ice drift vector field:

// NOTE: the dimensions, grid_mapping and axis variables are not
//    included here to keep focus on the vector variable

// The two X and Y datasets and the direction.
float dX(time, yc, xc) ;
 dX:long_name = "component of the displacement along the x axis of the grid" ;
 dX:standard_name = "sea_ice_x_displacement" ;
 dX:units = "km" ;
 dX:_FillValue = -1.e+10f ;
 dX:coordinates = "lat lon" ;
 dX:grid_mapping = "Polar_Stereographic_Grid" ;

float dY(time, yc, xc) ;
 dY:long_name = "component of the displacement along the y axis of the grid" ;
 dY:standard_name = "sea_ice_y_displacement" ;
 dY:units = "km" ;
 dY:_FillValue = -1.e+10f ;
 dY:coordinates = "lat lon" ;
 dY:grid_mapping = "Polar_Stereographic_Grid" ;

float dir(time, yc, xc) ;
 dir:long_name = "direction of the displacement" ;
 dir:standard_name = "direction_of_sea_ice_displacement" ;
 dir:units = "degrees" ;
 dir:_FillValue = -1.e+10f ;
 dir:coordinates = "lat lon" ;
 dir:grid_mapping = "Polar_Stereographic_Grid" ;

// The vector variable:
int ice_drift_vector;
 ice_drift_vector:standard_name = "sea_ice_displacement_vector" ;
 ice_drift_vector:long_name = "sea ice drift vector" ;
 ice_drift_vector:components = "dX dY dir" ;
 ice_drift_vector:ancillary_variables = "status_flag" ;

// A status flag for the vector:
byte status_flag(time, yc, xc) ;
 status_flag:standard_name = "sea_ice_displacement_vector status_flag" ;
 status_flag:long_name = "rejection and quality level flag" ;
 status_flag:valid_min = 0b ;
 status_flag:valid_max = 30b ;
 status_flag:grid_mapping = "Polar_Stereographic_Grid" ;
 status_flag:coordinates = "lat lon" ;
 status_flag:flag_values = 0b, 1b,..., 22b, 30b ;
 status_flag:flag_meanings = "missing_input_data over_land ... interpolated nominal_quality" ;

Appendix A. Attributes

Introduction text:

The "Type" values are S for string, N for numeric, and D for the type of the data variable. The "Use" values are G for global, C for variables containing coordinate data, <add>V for vectors and tensors variables</add> and D for variables containing non-coordinate data.

In the table:

1) add the line: Attribute: 'components' / Type:S / Use:V / Links: Section 3.5, "Vectors and Tensors" / Description: "Space-separated list of names of data variables containing components values for the vector (tensor)".

2) add letter 'V' in the 'Use' column for: standard_name, long_name, ancillary_variables, comment, history, institution, references, source

comment:25 Changed 9 years ago by lavergne

  • Resolution worksforme deleted
  • Status changed from closed to reopened

Well... I surely did not want to CLOSE the ticket... re-opening it.

comment:26 follow-up: Changed 9 years ago by ngalbraith

Thanks for not actually closing this - are you sure that wasn't just meant to bring the lurkers out of the woodwork?

My concern with the example above is that existing data servers would blithely return the direction variable with no indication that there's a status variable associated with it. Finding that connection requires too many steps, since there's nothing about any of the components themselves to indicate that they're members of a vector variable.

If I were writing this NetCDF file, I'd add the status_flag as an ancillary variable to all the component variables. But really, shouldn't each component also have some sort of attribute to indicate that it's a member of the vector group, so that at some point servers can check for that relationship and provide the appropriate info to the user?

comment:27 follow-up: Changed 9 years ago by jonblower

Quick suggestion - would it be a good idea to consider using "vector_components" instead of just "components"?

(The idea is that there are other possible types of components - for example, NetCDF-U could propose "uncertainty_components" to contain groups of statistics.)

comment:28 in reply to: ↑ 27 ; follow-up: Changed 9 years ago by markh

Replying to jonblower:

Quick suggestion - would it be a good idea to consider using "vector_components" instead of just "components"?

(The idea is that there are other possible types of components - for example, NetCDF-U could propose "uncertainty_components" to contain groups of statistics.)

I like this idea, in that it makes the umbrella variable concept available to be applied to many cases, one of which is vectors.

It seems to me that this suggestion is to have an umbrella variable concept, with a specialisation for vectors.

A vector umbrella variable would be identified by the presence of an attribute:

vector_components
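As a sketch of how software could recognise such umbrella variables, the Python below selects every variable that carries a vector_components attribute. The dict-of-attribute-dicts layout and the function name are assumptions for illustration.

```python
def vector_umbrellas(variables):
    """Return the names of variables identified as vector umbrella variables."""
    return sorted(
        name for name, attrs in variables.items() if "vector_components" in attrs
    )


variables = {
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"standard_name": "sea_ice_y_displacement"},
    "ice_drift_vector": {"vector_components": "dX dY"},
}
umbrellas = vector_umbrellas(variables)
```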

comment:29 in reply to: ↑ 26 ; follow-up: Changed 9 years ago by markh

Replying to ngalbraith:

Thanks for not actually closing this - are you sure that wasn't just meant to bring the lurkers out of the woodwork?

My concern with the example above is that existing data servers would blithely return the direction variable with no indication that there's a status variable associated with it. Finding that connection requires too many steps, since there's nothing about any of the components themselves to indicate that they're members of a vector variable.

If I were writing this NetCDF file, I'd add the status_flag as an ancillary variable to all the component variables. But really, shouldn't each component also have some sort of attribute to indicate that it's a member of the vector group, so that at some point servers can check for that relationship and provide the appropriate info to the user?

I am wary about this approach. I think that this two way relationship between umbrella variables and data variables is fragile and prone to inconsistency.

I fear that subsetting or aggregating NetCDF files would be made more complex by this.

It would also require careful checking, putting a further onus on the conformance checker, and raising risks for the many users who do not use the checker.

I think that a component should be defined as a data variable, exactly as CF currently does.

The responsibility for grouping should lie with the umbrella variable only. Software will need to be CF umbrella variable aware, to ensure that the correct metadata and relationships propagate. This feels very manageable to me.

I think this one-way aggregation is consistent with the CF approach to coordinates referenced by data variables and bounds referenced by coordinates. I am concerned that making the relationship two-way will bring more issues than it addresses.

comment:30 in reply to: ↑ 29 Changed 9 years ago by lavergne

Replying to markh:

Replying to ngalbraith:

I tend to agree with Mark here and would propose to keep a one-way relationship.

comment:31 in reply to: ↑ 28 ; follow-up: Changed 9 years ago by lavergne

Replying to markh:

Replying to jonblower:

Quick suggestion - would it be a good idea to consider using "vector_components" instead of just "components"?

(The idea is that there are other possible types of components - for example, NetCDF-U could propose "uncertainty_components" to contain groups of statistics.)

I like this idea, in that it makes the umbrella variable concept available to be applied to many cases, one of which is vectors.

It seems to me that this suggestion is to have an umbrella variable concept, with a specialisation for vectors.

A vector umbrella variable would be identified by the presence of an attribute:

vector_components

I must say I like the idea as well. Let's leave that open for more comments/inputs but I am ready to take it on board.

comment:32 in reply to: ↑ 31 Changed 9 years ago by lavergne

Replying to lavergne:

Replying to markh:

Replying to jonblower:

Quick suggestion - would it be a good idea to consider using "vector_components" instead of just "components"?

(The idea is that there are other possible types of components - for example, NetCDF-U could propose "uncertainty_components" to contain groups of statistics.)

I like this idea, in that it makes the umbrella variable concept available to be applied to many cases, one of which is vectors.

Well... I now remember Jonathan's point that this ticket could solve both vectors AND tensors... should we allow both tensor_components and vector_components? Are they just aliases? Or should a compliance checker send a warning if a vector uses tensor_components?

Is human readability worth the extra burden?

comment:33 in reply to: ↑ 24 ; follow-up: Changed 9 years ago by markh

Replying to lavergne:

Proposed implementation of #79 in CF document:

I feel this proposal does not provide enough information to make the vector really useful to downstream software.

As previously stated, I would like to see:

  • explicit dimensionality of the vector
  • identification of the individual components:
    • x-component, magnitude etc

as part of the vector variable attributes. The list of component variables:

ice_drift_vector:components = "dX dY dir" ;

does not provide sufficient information.

I do not think it is adequate to try and use standard_name to infer such information as:

  • standard_name is an optional field
  • cross referencing and parsing standard names and aspects of their description is at best fragile, but I would suggest unworkable.

comment:34 Changed 9 years ago by mcginnis

Still reading through the comments, but I would just like to drop a quick comment in favor of using just "components" regardless of what kind of bundle we're putting under the umbrella. I can think of plenty of use cases where it would be nice to be able to treat something as a vector (or tensor, or whatever) even though it's not, and having to check for a bunch of different attribute names ending in "_components" would impede that. Contrariwise, I'm having trouble coming up with a case where having different names for the component list is a big win.

comment:35 in reply to: ↑ 33 Changed 9 years ago by mcginnis

Replying to markh:

Replying to lavergne:

Proposed implementation of #79 in CF document:

As previously stated, I would like to see:

  • explicit dimensionality of the vector
  • identification of the individual components:
    • x-component, magnitude etc

as part of the vector variable attributes. The list of component variables:

ice_drift_vector:components = "dX dY dir" ;

does not provide sufficient information.

I really like the idea of having umbrella variables to bundle together and define the relationships between different data variables, not just for vector components, so with that end in mind, I'd like to propose a framework that provides a solution to Mark's need (which I agree with) to identify the components.

We can go from generic umbrella to specialized vector by adding a type attribute. (Or maybe group_type, if type is confusing.) The value of type not only tells you what you're dealing with, it also determines a list of other attributes to identify the components. This is the same approach used for map projection metadata with the grid_mapping_name attribute. It's easy to extend, so to add tensors, you just propose a new value for type and names for the tensor components.

For the two-dimensional vector case, as used in the example, we would have type="2d_vector", and it would have two to four of the attributes x_component, y_component, magnitude and direction defined, each of which names another variable. By declaring a list of dependent attributes, we then also have an appropriate place to discuss their meanings and to declare restrictions, like that if you only define two components, they must be the pair x and y or the pair magnitude and direction.

Rather than explicitly declaring dimensionality, I think it's better to define types with a specific dimensionality (i.e., 2-vectors and 3-vectors) because then we can choose the most sensible names for the components. For example, the community might prefer to use "direction" for 2-vectors, but "azimuth" for 3-vectors.

Randy can then propose a new "viewfield" umbrella type for space weather products, and the NetCDF-U folks can propose a bunch of umbrella types for different statistical distributions, and all the umbrella types will have consistent forms, and they'll all have sensible names for their purpose. In the meantime, for the purposes of this proposal, all we need to do is hash out the name of the type for 2-D vectors and the names and definitions of the dependent attributes.

So here's what the example would look like using this approach:

char ice_drift_vector;
 ice_drift_vector:standard_name = "sea_ice_displacement_vector" ;
 ice_drift_vector:long_name = "sea ice drift vector" ;
 ice_drift_vector:components = "dX dY dir" ;
 ice_drift_vector:type = "2d_vector" ;
 ice_drift_vector:x_component = "dX" ;
 ice_drift_vector:y_component = "dY" ;
 ice_drift_vector:direction = "dir" ;
 ice_drift_vector:ancillary_variables = "status_flag" ;

If this seems like a good solution, I'd suggest defining just the overall umbrella form and the specific details for 2-D vectors in this proposal, and splitting off 3-D vectors into a new proposal, since I think naming may be more complicated in that case.
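To make the type-driven approach concrete, here is a minimal Python sketch in which the value of type selects the list of role attributes to read. The table contents for "2d_vector" follow the text above; the table names, the function name and the error messages are hypothetical.

```python
# Role attributes defined for each umbrella type (only the 2-D vector case here).
ROLE_ATTRIBUTES = {
    "2d_vector": ("x_component", "y_component", "magnitude", "direction"),
}
# Pairs that are sufficient on their own to define the vector.
COMPLETE_PAIRS = {
    "2d_vector": (("x_component", "y_component"), ("magnitude", "direction")),
}


def read_roles(attrs):
    """Map role attribute names to variable names for one umbrella variable."""
    group_type = attrs["type"]
    if group_type not in ROLE_ATTRIBUTES:
        raise ValueError(f"unknown umbrella type: {group_type!r}")
    roles = {r: attrs[r] for r in ROLE_ATTRIBUTES[group_type] if r in attrs}
    # Any three or four of the roles always include a complete pair, so this
    # single test covers the "two to four, and a valid pair if two" rule.
    if not any(all(r in roles for r in pair) for pair in COMPLETE_PAIRS[group_type]):
        raise ValueError("need the x/y pair or the magnitude/direction pair")
    return roles


roles = read_roles({
    "type": "2d_vector",
    "x_component": "dX",
    "y_component": "dY",
    "direction": "dir",
})
```

Adding a new umbrella type would then amount to adding one entry to each table, which is the extensibility argument made above.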

comment:36 Changed 9 years ago by bnl

Just skimmed through the comments on this ...

I believe the last proposal introduced the concept of generic umbrella variables, and two levels of controlled vocabularies, specific umbrella variables, and then a further controlled vocabulary for what the relationship is between the variables under that specific umbrella.

(e.g. type="2d_vector" is a first level controlled term, and "x_component", "y_component" and "direction" are second level controlled ... )

(Whether these are the right terms is not my worry at the moment ... it's whether this is the right approach, and it feels right to me.)

So, I rather like this, since one of my reservations reading through was that without some specific guidance, one wouldn't really be guaranteed unambiguity when trying to sort out which variables were (say) direction, magnitude, and which were dx,dy (oh, I can see that for some cases one could do it via inspection but it all gets quite torturous).

We can then proceed with our usual method of handling only the cases for which we have specific use now, with the knowledge that we have a mechanism for dealing with a wide range of future uses.

I appreciate that there is room for redundancy in metadata with this approach, but why on earth is that a problem? If it's possible to work out that it's redundant, and it's contradictory, isn't that a good thing to know? If it's not possible to work out easily that it's redundant, then perhaps it's helping avoid software errors?

(p.s. I vote for one-way too!)

comment:37 Changed 9 years ago by jonblower

I like these recent proposals and am happy with plain "components" supported by other attributes.

What becomes of "ancillary_variables"? If we define "umbrella variables" we will also need to define better the use of existing "ancillary_variables" construct, as this is being used in existing data for a similar purpose. The rule could be something like:

  1. Use umbrella variables when there is a logical "contains" relationship (e.g. vector components, groups of related statistics).
  2. Use ancillary_variables when a variable is associated with another but is not "contained" by it (e.g. a quality flag).

Thoughts?

comment:38 follow-ups: Changed 9 years ago by jonathan

Dear all

I like Thomas's proposal as he wrote it out on 21st August. I agree with the preference others have expressed for the one-way relationship; the data variables stand alone, and the umbrella is a new layer which points to them, but not the other way.

Mark commented that it wouldn't be adequate to rely on the standard_names of the component data variables to identify their roles in the group, because "the standard_name is an optional field, and cross-referencing and parsing standard names and aspects of their description is at best fragile." I'd suggest that the first of these is quite easy to deal with: we could make it mandatory for a data variable to have a standard_name if any umbrella variable points to it. This would be easy for the CF checker to verify.
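The check suggested here (a standard_name is mandatory for any data variable an umbrella variable points to) could be sketched in Python as below. The dict-of-attribute-dicts layout and the function name are assumptions, and this is not the actual CF checker code.

```python
def components_missing_standard_name(variables):
    """Return (umbrella, component) pairs where the component lacks a standard_name."""
    missing = []
    for name, attrs in variables.items():
        for component in attrs.get("components", "").split():
            # A component that is absent from the file is reported as well.
            if "standard_name" not in variables.get(component, {}):
                missing.append((name, component))
    return missing


variables = {
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"long_name": "component of the displacement along the y axis"},
    "ice_drift_vector": {"components": "dX dY"},
}
missing = components_missing_standard_name(variables)
```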

The second point is more difficult. Seth has suggested the alternative of including more attributes in the umbrella variable, as well as the component list, in order to identify the roles of the components. This method would be using the umbrella variable to get round the fact of the standard_name being difficult to analyse, as Mark is right to say. But although it's hard to analyse, it is surely the case that eastward_wind is necessarily the x_component of a wind vector, isn't it? Hence Seth's and Mark's suggestion of identifying the role, e.g. x_component, through an attribute name would be duplicating information which is already implicit in the standard_name.

Another alternative to meet this need would be to make it explicit in the definition of a standard_name if it could serve as a component in a vector and if so which component. Thomas's proposal already implies introducing new standard_names which may only be used for umbrella variables. That means we have to modify the format of the standard_name table in some way. The definition of each of these new standard_names could list the standard_names which are permissible for components of the group. At the same time, we could add to the definition of those standard_names the role that they would play in the group. So, for instance, the new standard_name for sea_ice_displacement_vector will identify itself as a vector/tensor standard_name, and it could list sea_ice_x_displacement and sea_ice_y_displacement among its permissible components. In addition, we would add to the definition of sea_ice_x_displacement that it is the x_component of a vector.

If the standard_name is mandatory for components named by umbrella variables, and if the standard_name table is modified to say which components they are, I think Mark's need would be met with just the list of components in the components attribute, wouldn't it? The CDL would be as in Thomas's proposal, and there would not be any redundant metadata. The redundancy would be contained in the standard_name table itself, which would say explicitly that sea_ice_x_displacement is an x_component. Even though that is obvious to humans on inspection, it's not obvious to programs, so this redundancy is useful. It should be easy to keep under control, because it's all in one place within the definition of the standard_name, so unlikely to become inconsistent.
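A rough Python sketch of how a program could use such an extended standard_name table follows. The table VECTOR_STANDARD_NAMES is entirely hypothetical (no such structure exists in the current standard name table); only the names and roles mentioned above are filled in.

```python
# Hypothetical extract of an extended standard_name table: for each vector
# standard_name, the permissible component standard_names and their roles.
VECTOR_STANDARD_NAMES = {
    "sea_ice_displacement_vector": {
        "sea_ice_x_displacement": "x_component",
        "sea_ice_y_displacement": "y_component",
    },
}


def component_roles(variables, vector_name):
    """Resolve each component's role from the table, using only standard_names."""
    vector = variables[vector_name]
    permitted = VECTOR_STANDARD_NAMES[vector["standard_name"]]
    roles = {}
    for name in vector["components"].split():
        standard_name = variables[name]["standard_name"]
        if standard_name not in permitted:
            raise ValueError(f"{standard_name!r} is not a permissible component")
        roles[permitted[standard_name]] = name
    return roles


variables = {
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"standard_name": "sea_ice_y_displacement"},
    "ice_drift_vector": {
        "standard_name": "sea_ice_displacement_vector",
        "components": "dX dY",
    },
}
roles = component_roles(variables, "ice_drift_vector")
```

With this design the file itself carries only the plain list of components, and the role information lives in one place, the table.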

I support Seth's suggestion of a group_type attribute for the umbrella variable, to allow umbrellas to be used for other purposes in future. I'd suggest that the value of the group_type should appear in the definition of the group standard_name. If it is necessary to distinguish vectors and tensors of different dimensionality, that would mean we'd need different standard_names of e.g. wind_horizontal_vector and wind_3d_vector with different group_types of horizontal_vector and 3d_vector; eastward_wind is a possible component of both of these, but upward_air_velocity is not allowed for wind_horizontal_vector.

Partly answering Jon's and Nan's points: One of the advantages of umbrella variables is that they can have ancillaries that apply to all components at once. A new mechanism for this is needed because the ancillaries for individual components would require different standard_names e.g. "sea_ice_x_displacement status_flag" and "sea_ice_y_displacement status_flag". Therefore they'd have to be different variables, even if they had exactly the same contents! Ancillaries and umbrellas don't have the same purpose.

Thomas, I don't think it's necessary to include a description of grid-staggering e.g. Arakawa. That is extra information which could be added if people wanted it to be. In your proposal you write, "Note that all the data variables listed in components shall have the same set of dimensions." I think that should be clarified: "same" means "having the same standard_names". For instance, the components may each have a longitude and a latitude axis, but if they're on an Arakawa C-grid the dimensions and coordinates of the axes will be different for the two components. Would you agree with that?

Cheers

Jonathan

comment:39 in reply to: ↑ 38 Changed 9 years ago by lavergne

Dear Jonathan,

Replying to jonathan:

I like Thomas's proposal as he wrote it out on 21st August. I agree with the preference others have expressed for the one-way relationship; the data variables stand alone, and the umbrella is a new layer which points to them, but not the other way.

Thanks for supporting the August 21st proposal. As you got from the comments, I was advocating for the simplest and most self-contained changeset, although I must agree that adding more attributes (Seth, Mark) is very appealing for a human reader/user. I still think a machine would be fine with the plain list of components and mandatory standard_name for the components variables.

I did not exactly get if you see the "group_type" information (as introduced by Seth) as an attribute to the umbrella variable (in addition to "components") or if you'd rather have the information explicit from the standard_name table.

Since this proposal is not about _any_ umbrella variable, I would implement your restrictions on standard_name as: "all variables appearing as components of a vector variable should have a standard_name", in the text of the new section.

Thomas, I don't think it's necessary to include a description of grid-staggering e.g. Arakawa. That is extra information which could be added if people wanted it to be. In your proposal you write, "Note that all the data variables listed in components shall have the same set of dimensions." I think that should be clarified: "same" means "having the same standard_names". For instance, the components may each have a longitude and a latitude axis, but if they're on an Arakawa C-grid the dimensions and coordinates of the axes will be different for the two components. Would you agree with that?

Yes, I agree. The reason why I dropped reference to the "standard_name" for dimensions is embarrassingly enough because I was not up to date with the 'latitude' or 'grid_latitude' standard names. My files do not use them and this is why I did not think of them as appropriate in this case. Now that I am updated, I'll implement this in my files, and in the next version of the vector proposal.

Concerning #79, I propose we wait until the end of this week for concluding before I submit an updated proposal, based on the one from August 21st.

How/when in practice do you see the implementation of the change in the standard name table? Do we have to prepare something now, ahead of acceptance of this ticket, or do we take this when - #79 being agreed upon - I ask for a first vector standard name (probably sea_ice_displacement_vector)?

Cheers Thomas

comment:40 follow-up: Changed 9 years ago by jonathan

Dear Thomas

Thanks for your point about group_type. Seth proposed this as an attribute of the umbrella variable. However, your point implies that this might not be necessary. If you and others agreed with the idea of putting vector/tensor and component tags into the standard_name table itself, it would be possible to deduce from the standard_name of the umbrella e.g. sea_ice_displacement, that it is a vector, by looking it up in the standard_name table. That would be using the standard_name table as a dictionary, which is coincidentally related to the thread which Martin Schultz has started on the email list. What do you and others think about this, I wonder?

How should we distinguish 2D and 3D vector umbrellas? I wonder whether this distinction might be more convenient to make on the umbrella itself, so that we didn't have to have separate standard_names for 2D and 3D vectors.

I think that the proposal should specify the changes in principle to be made to the standard_name table, since the table is described by the conventions document, but the new contents of the table can be left out of this trac ticket, and proposed on the email list once the new format of the table has been agreed.

Best wishes

Jonathan

comment:41 in reply to: ↑ 40 ; follow-up: Changed 9 years ago by jonblower

Replying to jonathan:

Personally I would prefer to have an explicit group_type attribute (or similar) on the umbrella variable, so it is easier for a software tool to deduce the type of the umbrella. If this information is encoded in the standard_name it makes it much harder for software to make this inference, since it would not only need access to the standard name table but also the rules for constructing standard names.

A second reason is that umbrellas might be used for other purposes (for example, grouping uncertainty information) and it doesn't seem sensible to encode this in the standard_name, since you can have uncertainty information about anything at all. But it would be very easy with an appropriate group_type attribute.

(Of course, the values of the group_type attribute must be selected from a controlled vocabulary of its own.)

comment:42 in reply to: ↑ 41 Changed 9 years ago by mcginnis

Replying to jonblower:

Replying to jonathan:

I'm opposed to the idea of putting vector component tags into the standard_name table.

I think Jon Blower's points are very good, and I have to agree with him. Checking the value of an attribute with a known name is easy; downloading and parsing the standard_name table is hard (maybe even effectively impossible, depending on the environment you're working in). It also spreads the logic of the umbrella types out across different locations, making it easier for inconsistencies to sneak in and cluttering up the standard_name table.

Plus, it seems more in the spirit of CF to record structural information explicitly in the file itself, rather than implicitly in the spec.
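
To illustrate the point about an explicit attribute being easy to check, here is a minimal sketch. It assumes the file's variables have already been read into a plain dictionary of attribute dictionaries; the group_type value "2d_vector" and the helper find_groups are hypothetical and not part of any agreed convention.

```python
# Hypothetical in-memory stand-in for a file's variables: name -> attributes.
variables = {
    "ice_drift_vector": {
        "standard_name": "sea_ice_displacement_vector",
        "group_type": "2d_vector",  # assumed value from a controlled vocabulary
        "components": "dX dY",
    },
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"standard_name": "sea_ice_y_displacement"},
}


def find_groups(variables, group_type):
    """Return the names of variables whose group_type attribute matches."""
    return [
        name
        for name, attrs in variables.items()
        if attrs.get("group_type") == group_type
    ]


vector_names = find_groups(variables, "2d_vector")
```

No lookup in the standard_name table is involved: the test is a single attribute comparison per variable.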

comment:43 in reply to: ↑ 38 Changed 9 years ago by ngalbraith

Replying to jonathan:

Partly answering Jon's and Nan's points: One of the advantages of umbrella variables is that they can have ancillaries that apply to all components at once. A new mechanism for this is needed because the ancillaries for individual components would require different standard_names e.g. "sea_ice_x_displacement status_flag" and "sea_ice_y_displacement status_flag". Therefore they'd have to be different variables, even if they had exactly the same contents! Ancillaries and umbrellas don't have the same purpose.

By having the status flag, or other ancillary data, connected ONLY to the umbrella variable, you are making it much more difficult to find these important ancillary variables. It makes much more sense to me to allow both sea_ice_x_displacement and sea_ice_y_displacement to share an ancillary variable named "status_flag". I realize that this is being discussed on ticket 74 too, but not in the context of vectors.

If you are moving the 'ancillary variable' attribute to the vector, and not attaching things like flag values directly to the data variable, it seems to me that you really should provide some indication on the member data variables to state that they are part of the vector, so the user can at least suspect that ancillary variables might be found there.

If, for example, I'm reading the wind speed data from a file, the only way I can find the ancillary variables is by inspecting every umbrella variable to see if wind speed is a member. This seems completely counter-productive to me - am I missing something?
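
The search described above can be sketched as follows, assuming a one-way relationship in which only the umbrella lists its members. The variable names (wind_vector, wspd, wdir, wind_flag) and the helper are hypothetical, and the file is modelled as a plain dictionary rather than read with a real netCDF library.

```python
# Hypothetical stand-in for a file's variables: name -> attribute dictionary.
variables = {
    "wind_vector": {
        "components": "wspd wdir",
        "ancillary_variables": "wind_flag",
    },
    "wspd": {"standard_name": "wind_speed"},
    "wdir": {"standard_name": "wind_from_direction"},
    "wind_flag": {"long_name": "quality flag for the wind vector"},
}


def ancillaries_via_umbrellas(variables, member):
    """Scan every variable for a components list that names `member`,
    and collect the ancillary variables attached to those umbrellas."""
    found = []
    for attrs in variables.values():
        if member in attrs.get("components", "").split():
            found.extend(attrs.get("ancillary_variables", "").split())
    return found


flags_for_wspd = ancillaries_via_umbrellas(variables, "wspd")
```

The loop has to visit every variable in the file because wspd itself carries no pointer back to wind_vector, which is the cost being objected to here.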

Mark commented that it wouldn't be adequate to rely on the standard_names of the component data variables to identify their roles in the group, because "the standard_name is an optional field, and cross-referencing and parsing standard names and aspects of their description is at best fragile." I'd suggest that the first of these is quite easy to deal with: we could make it mandatory for a data variable to have a standard_name if any umbrella variable points to it. This would be easy for the CF checker to verify.

A mandatory standard name for any member of a vector group would be a problem for working files that contain, for example, variously treated directional components (rotated or not, height-adjusted or not, or something like wind relative to currents vs observed wind). We normally distinguish these by using a standard name for the variable that most correctly represents the concept as described by the standard name, and leaving off any standard name for "working versions" of the variable.

As far as I know, it's possible to have a CF-compliant file that includes some variables that have no standard name; this restriction (requiring a standard name for any variable that's part of a vector group) would make the vector concept less useful to some of us.

comment:44 Changed 9 years ago by jonathan

Dear all

Nan's first point is an argument against one of the motivations for this proposal, which is to provide a way for a group of variables to share ancillary variables. She argues that they are found more easily if they are attached to the component variables. This problem could be got round by requiring the component variable to have a link to a particular ancillary variable if the umbrella variable does. The CF checker could verify that, but it makes me uncomfortable because it's redundant. Pointing both ways component<>umbrella is also redundant.

I wonder how important this motivation is for the proposal, compared with the more general motivation of providing a way to group components? In fact components can only share ancillary variables when they are on the same grid, which is not always the case (especially in models).

I take the point that it would be harder work to look at the standard name table to find out the role of a component. I guess we could provide software with APIs in many languages for parsing the standard name table, and perhaps we should do that anyway, but that's another story.

Going back to Seth's idea, then (and omitting the ancillary variables, which may be a different issue)

char ice_drift_vector ;
 ice_drift_vector:standard_name = "sea_ice_displacement_vector" ;
 ice_drift_vector:long_name = "sea ice drift vector" ;
 ice_drift_vector:components = "dX dY dir" ;
 ice_drift_vector:type = "2d_vector" ;
 ice_drift_vector:x_component = "dX" ;
 ice_drift_vector:y_component = "dY" ;
 ice_drift_vector:direction = "dir" ;

I think it would be nicer if the components were not mentioned twice (once in the components attribute, and once in the attributes indicating roles). If we see this like the grid_mapping variables, then the type of umbrella implies which component attributes it can have. Is there a need then to list them all together as well? Nan's posting suggests that she might want to include "unofficial" components in the umbrella as well - ones which don't even have standard names, and several varieties of each. Generic software could not be expected to know what to do with all these. Maybe this could be addressed by providing an attribute which could list all the components of the vector which do not have a standardised purpose. Generic software would ignore these, but Nan's software would make use of them.

I also tend to think that the umbrella variable is becoming so different from a data variable that it should not have a standard_name. It is a new kind of construct. What do you think of this:

char ice_drift_vector ;
 ice_drift_vector:umbrella_name = "sea_ice_displacement_vector" ;
 ice_drift_vector:long_name = "sea ice drift vector" ;
 ice_drift_vector:type = "2d_vector" ; // see below
 ice_drift_vector:x_component = "dX" ;
 ice_drift_vector:y_component = "dY" ;
 ice_drift_vector:direction = "dir" ;
 ice_drift_vector:other = "dX_rotated dY_rotated" ;

where the other attribute is a blank-separated list of data variables which are non-standardised components. If we do something like this, we should have a new Appendix in the standard document, which lists for each umbrella type which components it is allowed to have (like the grid_mapping appendix F), and we should have another table which lists the legal values for the umbrella_name attribute. Should this table also specify the type of the umbrella variable? That would be redundant with the attribute, but they could be checked for consistency, in the way that the CF checker verifies that the units of a data variable are consistent with the canonical units for its standard_name.
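
The kind of consistency check suggested here could look roughly like the sketch below. Both tables are invented for illustration: the allowed attributes per type and the entry for sea_ice_displacement_vector are assumptions, since no such appendix or umbrella_name table exists yet, and check_umbrella is a hypothetical helper rather than part of the CF checker.

```python
# Hypothetical appendix: role attributes each umbrella type may carry.
ALLOWED_ROLES = {
    "2d_vector": {"x_component", "y_component", "direction"},
    "3d_vector": {"x_component", "y_component", "z_component"},
}

# Hypothetical umbrella_name table: name -> expected umbrella type.
UMBRELLA_NAME_TYPES = {"sea_ice_displacement_vector": "2d_vector"}

# Attributes that are not role attributes.
GENERAL_ATTRIBUTES = {"umbrella_name", "long_name", "type", "other"}


def check_umbrella(attrs):
    """Return a list of problems found in an umbrella's attributes."""
    utype = attrs.get("type")
    if utype not in ALLOWED_ROLES:
        return [f"unknown umbrella type: {utype!r}"]
    problems = []
    expected = UMBRELLA_NAME_TYPES.get(attrs.get("umbrella_name"))
    if expected is not None and expected != utype:
        problems.append(f"type {utype!r} conflicts with table entry {expected!r}")
    for name in sorted(attrs):
        if name not in GENERAL_ATTRIBUTES and name not in ALLOWED_ROLES[utype]:
            problems.append(f"attribute {name!r} not allowed for type {utype!r}")
    return problems


good = {
    "umbrella_name": "sea_ice_displacement_vector",
    "long_name": "sea ice drift vector",
    "type": "2d_vector",
    "x_component": "dX",
    "y_component": "dY",
    "direction": "dir",
    "other": "dX_rotated dY_rotated",
}
bad = dict(good, z_component="dZ")  # z_component is not a 2d_vector role
```

This mirrors the way units are compared with the canonical units of a standard_name: the redundancy between the table and the type attribute is tolerated because it can be verified mechanically.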

Cheers

Jonathan

comment:45 follow-up: Changed 9 years ago by markh

I would like to echo jonblower and mcginnis:

I would much prefer to have explicit components defined, and not have to cross reference a standard name.

Replying to jonathan:

Mark commented that it wouldn't be adequate to rely on the standard_names of the component data variables to identify their roles in the group, because "the standard_name is an optional field, and cross-referencing and parsing standard names and aspects of their description is at best fragile." I'd suggest that the first of these is quite easy to deal with: we could make it mandatory for a data variable to have a standard_name if any umbrella variable points to it. This would be easy for the CF checker to verify.

I don't like the approach of a mandatory standard name:

  • I feel that it is important to keep the flexibility not to have a standard_name for a variable, particularly in the cases where one does not exist:
  • vector quantities may be derived from scalar fields, in general, so the potential for a vast array of different vector components exists:
    • valid standard names cannot be generated on the fly, so this would make most such derivations banned by CF, rather than just lacking a standard name for a component;
  • a relational requirement would be introduced on data variables because they are referenced, which goes against the 'one way' relationship idea.

The second point is more difficult. ...

I think the issue is that software should not have to analyse the standard_name of a component to make correct use of it within an umbrella context. The fact that this is also hard compounds the issue, but even if it were easier I would counsel against it.

A vector quantity should identify its components explicitly, so that any downstream user (human or machine) correctly interprets the interrelations between the component data variables.

I think standard names are already heavily loaded with capabilities, adding another one feels unwise to me.

I would instead put forward the perspective that by correctly defining an umbrella variable, with explicit components, the standard names for each component become secondary, and can be left off without losing information. In many cases, the reason for having component standard names is to ensure that they can be used independently of the umbrella, for example, extracted singly from a file. Such duplication can be useful and should not be disallowed, but it is secondary to the umbrella definition of the vector.

In summary, I don't think this approach adequately meets our requirements for vector processing; I prefer the definition as exemplified in jonathan .

Replying to jonathan:

I think it would be nicer if the components were not mentioned twice (once in the components attribute, and once in the attributes indicating roles).

This makes sense to me, I think only the explicit attributes, such as x_component are required.

I also tend to think that the umbrella variable is becoming so different from a data variable that it should not have a standard_name. It is a new kind of construct.

I'm not sure about this, the standard_name vocabulary could be extended to include vectors, once we know how to represent them. However standard names are currently only defining scalar quantities, I think. A new vocabulary would require management, could this become part of the remit of the standard name maintenance, even if they are not standard names? I don't know how to call this one.

comment:46 in reply to: ↑ 45 ; follow-up: Changed 9 years ago by jonathan

Dear Mark

I'm glad you are generally happy with this.

I don't think we should see the invention of umbrellas as a reason for standard names to be left out. The proposal was to add an extra layer of information, not to make the existing data variable less self-describing. However I appreciate, as Nan also advocated, that you might want to include variables under the umbrella for quantities which do not have standard_names. So I'd like to propose a reduced requirement, that any data variable must have a standard_name if it is pointed to by an attribute of the umbrella apart from other (in my example) i.e. any of the attributes that are part of the definition of the umbrella type, such as x_component. This doesn't rule out including other variables, and it doesn't seem too much to expect. It would be reasonable, when a new umbrella name (or standard name for an umbrella) is introduced e.g. "wind", that we also make sure there are standard names for all the components it might have, in the expectation that they will be needed. What do you think?
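
As a rough sketch of how a checker might apply this reduced requirement: variables named by role attributes must carry a standard_name, while those listed only in other are exempt. The role attribute names follow the example earlier in this ticket; the helper function and the dictionary model of the file are hypothetical.

```python
# Role attributes taken from the 2d_vector example earlier in this ticket.
ROLE_ATTRIBUTES = ("x_component", "y_component", "direction")


def missing_standard_names(umbrella, variables):
    """Return the variables referenced by role attributes that lack a
    standard_name. Variables listed only in `other` are not checked."""
    missing = []
    for role in ROLE_ATTRIBUTES:
        target = umbrella.get(role)
        if target is not None and "standard_name" not in variables.get(target, {}):
            missing.append(target)
    return missing


umbrella = {
    "umbrella_name": "sea_ice_displacement_vector",
    "type": "2d_vector",
    "x_component": "dX",
    "y_component": "dY",
    "other": "dX_rotated",
}
variables = {
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"long_name": "working version of the y displacement"},
    "dX_rotated": {"long_name": "rotated x displacement"},
}

violations = missing_standard_names(umbrella, variables)
```

Here dY is flagged because it fills a defined role without a standard_name, whereas dX_rotated passes because it appears only under other.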

The reason for suggesting that umbrella names should go in a new table is because umbrella variables are now quite different from data variables. They have a whole different set of new attributes, rather than (as originally proposed) a subset of the possible attributes of data variables (and one new one). But certainly this list of names could be maintained by the same "vocabulary" process as standard names, just like the lists for region names and area type names.

Best wishes

Jonathan

comment:47 Changed 9 years ago by jonblower

Dear Jonathan, all,

It would be reasonable, when a new umbrella name (or standard name for an umbrella) is introduced e.g. "wind", that we also make sure there are standard names for all the components it might have, in the expectation that they will be needed. What do you think?

I think this may work practically for vectors (they only have 2 or 3 components, and not all standard_names are inherently vector quantities) but won't be very practical for other types of grouping.

Apologies for going on about uncertainty again, but I think umbrellas could be a very good way of grouping statistics about a phenomenon or parameters of a probability density function (cf. UncertML). If we followed your suggestion above (and if I've understood it!), we would need new standard names for every statistic of every measurable quantity (mean_of, standard_deviation_of, variance_of, kurtosis_of, etc.)

I'm not sure of the solution to this. I thought that cell_methods may be an answer, but cell_methods seems to address a different problem (statistics over a cell, rather than statistics over a set of measurements). What do you think?

Jon

comment:48 follow-up: Changed 9 years ago by pbentley

Based on an admittedly cursory reading of this ticket it would seem that at least some of the functionality that is being proposed here (e.g. defining containers for variables) could be provided by features within the enhanced netCDF-4 data model. I'm thinking specifically of groups and user-defined data types. Have these been considered as a practicable solution to the original user requirement? It would seem a shame to devise an entirely new set of data structures and rules if the existing netCDF-4 features can do the job (though I suspect that some extra rules/semantics would still be required even in the case of netCDF-4).

Phil

comment:49 Changed 9 years ago by jonblower

Phil - I had also wondered this and I guess it's the ideal solution in some ways, but I wonder if this is too big a leap for a lot of CF users. I guess it's a chicken-and-egg thing, but I suppose we need to have the "big conversation" about NetCDF-4 at some point!

Jon

comment:50 Changed 9 years ago by jonathan

Dear Jon

Uncertainty measures like mean, standard_deviation, etc. are described in CF by cell_methods, not standard names. The fields which the umbrella points to should use cell_methods to distinguish these statistics. However, I agree that you might want to have such statistics for a quantity which does not itself have a standard_name. Although it's right to have an eye to future generalisation, could we for the moment discuss umbrella variables for the purpose originally proposed for vectors and tensors? I'd happily agree that my proposed requirement for standard names for components with an identified role only applies to the vectors and tensors, not to all kinds of umbrella that might be proposed (but haven't yet been proposed specifically).

Cheers

Jonathan

comment:51 in reply to: ↑ 46 ; follow-up: Changed 9 years ago by markh

Replying to jonathan:

Dear Mark

I'm glad you are generally happy with this.

I'd like to propose a reduced requirement, that any data variable must have a standard_name if it is pointed to by an attribute of the umbrella apart from other (in my example) i.e. any of the attributes that are part of the definition of the umbrella type, such as x_component. This doesn't rule out including other variables, and it doesn't seem too much to expect. It would be reasonable, when a new umbrella name (or standard name for an umbrella) is introduced e.g. "wind", that we also make sure there are standard names for all the components it might have, in the expectation that they will be needed. What do you think?

I'm afraid I do not see the value for this and I think there are costs which I do not like. I think that a particular merit of this proposal is that all vector components are CF data variables in their own right. The umbrella is merely a useful way of aggregating already well defined CF constructs.

As such, any constraints on these data variables, such as mandating a standard name, seem unwise to me. It implies that a data variable needs to know if it is referenced by an umbrella variable in order to check its validity. I would suggest this is unhelpful.

I think the simple approach of 'all components are CF data variables', with all the freedoms and constraints currently in place, offers the best solution.

The responsibility for the aggregation lies solely with the aggregator: the vector variable. This responsibility includes how a particular component is interpreted.

With this in mind, I propose a change of terminology for component labels, to use i,j,k rather than x,y,z:

  • x_component → i_component
  • y_component → j_component
  • z_component → k_component
  • x_y_direction → i_j_direction
  • z_direction → k_direction

This explicitly differentiates the basis vectors for the vector quantity from the standard names of the components.

This would enable the definition of an arbitrary 2D vector (with no standard name) where, e.g.:

i_component => a y_wind component

j_component => an upward_air_velocity component

and another arbitrary 2D vector (with no standard name) where, e.g.:

i_component => a data variable (with no standard name but a long name and unit)

j_component => another data variable (with no standard name but a long name and unit)

Perhaps this distinction makes clearer the different roles of standard name and vector component identification.

The reason for suggesting that umbrella names should go in a new table is because umbrella variables are now quite different from data variables. They have a whole different set of new attributes, rather than (as originally proposed) a subset of the possible attributes of data variables (and one new one). But certainly this list of names could be maintained by the same "vocabulary" process as standard names, just like the lists for region names and area type names.

I think I understand the logic, and I can see an appeal. As long as there is an agreed maintenance approach, I think this is workable.

comment:52 in reply to: ↑ 48 Changed 9 years ago by markh

Replying to pbentley:

Based on an admittedly cursory reading of this ticket it would seem that at least some of the functionality that is being proposed here (e.g. defining containers for variables) could be provided by features within the enhanced netCDF-4 data model. I'm thinking specifically of groups and user-defined data types. Have these been considered as a practicable solution to the original user requirement? It would seem a shame to devise an entirely new set of data structures and rules if the existing netCDF-4 features can do the job (though I suspect that some extra rules/semantics would still be required even in the case of netCDF-4).

You make a valid point here Phil

As we do not have a CF data model we are forced to discuss concepts and encodings at the same time. I am very keen to see the concepts described here available, hence we have to discuss encoding options.

I think there is a coherent activity for the CF community to adopt the NetCDF-4 features. However, using them in this proposal explicitly disallows an encoding of such concepts using the NetCDF-3 concepts. This may not be desirable.

I wonder whether we could manage this situation by having an open ticket on migrating CF to NetCDF-4. If such a ticket were agreed before this ticket is concluded, then the options for encoding could change to make use of these features. On the other hand, if no conclusion is reached and the vector concepts are agreed, they could be implemented using NetCDF-3 features and migrated to NetCDF-4 when the community adopts the extra features.

In this way we can decouple the two issues.

It may be that this ticket may provide a useful use case for the NetCDF-4 migration ticket.

My view is that these concepts are encoding independent. If we can agree the concepts, and a particular encoding at this stage that would be an excellent result.

Adapting the encoding in the future to take account of new features available should then be a relatively minor change.

comment:53 Changed 9 years ago by Cathy Smith

I would advise against using the netCDF-4 model to handle vector quantities. First, as has been said, users often look at the parts separately (u wind, v wind, zonal momentum flux...). Having them in a group, different from all other variables, makes them non-standard and harder for the typical user to deal with. Second, vectors can be 3D and not just 2D. So, a user may look at the u,w vector or the vw one in addition to the more traditional 2D uv. So, the group would have to be set up to handle that but that makes the data harder to use in general. Incorporating wind speed or similar variables into this model would be hard, as well. Even those who go to netCDF-4 aren't necessarily using the group model; we certainly aren't here.

A solution that used additional attributes to describe the relationship but otherwise used standard CF is probably the most flexible: it is easier for the user, while carrying enough information for automated programs to interpret the relationship. I believe this has been one of the suggestions proposed.

comment:54 in reply to: ↑ 51 Changed 9 years ago by jonblower

Replying to markh:

I think the simple approach of 'all components are CF data variables', with all the freedoms and constraints currently in place, offers the best solution.

I agree with Mark here.

comment:55 follow-up: Changed 8 years ago by markh

This ticket has been untouched in 6 months, but it was not clear at this point that there was agreement on the way forward.

I am really keen to see this enhancement agreed and implemented; I think it is a particularly useful feature with significant possibilities for CF.

I will present my feelings on where this proposal sits, in the hope that this will rekindle the discussion and move it to a resolution.

Explicit identification of components and their roles has been stated as important by me and others.

An umbrella_name has been suggested in preference to a standard_name, to highlight the difference between the umbrella variable and data or coordinate variables and to keep the standard name table from expanding further (which has to be a good thing). This would suggest that umbrella variables may not have a standard_name attribute.

Proposal

A new variable type is introduced to CF NetCDF: the Umbrella Variable.

An umbrella variable may have attributes but it must not have data associated with it and it must not be defined with respect to any NetCDF dimensions.

Umbrella variables do not inherit the ability to use any CF data variable attributes.

A new attribute:

  • umbrella_name

is introduced, to be used only with umbrella variables. This will be a new controlled vocabulary for CF.

A specific type of umbrella variable is defined for the purpose of defining vector quantities.

Only spatial vectors are supported.

The vector umbrella variable defines a set of attributes for the definition of components:

  • i_component
  • j_component
  • k_component
  • i_j_direction
  • k_direction
  • magnitude

Each of these attributes points to one and only one data variable within the dataset. All referenced components are CF data variables which exist in the dataset.

Components must be defined on the same domain as each other (all coord definitions must be the same), but the sampling regimes may be different.

This is to exclude, for instance, one component variable having time-latitude-longitude and another time-altitude-latitude-longitude as coordinate variables, but it does permit components to be on an Arakawa C-grid.
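
One possible reading of this rule, sketched below, is that the components must share the same set of coordinate standard_names, while the dimension sizes and coordinate values are free to differ (as on a staggered grid). That interpretation, the helper same_domain and the example component names are assumptions for illustration only.

```python
def same_domain(coord_names_by_component):
    """True if every component has the same set of coordinate standard_names.
    Dimension lengths and coordinate values are deliberately not compared."""
    distinct = {frozenset(names) for names in coord_names_by_component.values()}
    return len(distinct) <= 1


# Two components on a staggered grid: same kinds of coordinates,
# even though the underlying dimensions would differ in a real file.
staggered = {
    "u": ["time", "latitude", "longitude"],
    "v": ["time", "latitude", "longitude"],
}

# One component carries an extra altitude coordinate: excluded by the rule.
mismatched = {
    "u": ["time", "latitude", "longitude"],
    "w": ["time", "altitude", "latitude", "longitude"],
}
```

With these inputs, same_domain(staggered) is true and same_domain(mismatched) is false.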

Example

int ice_drift_vector ;
 ice_drift_vector:umbrella_name = "sea_ice_displacement_vector" ;
 ice_drift_vector:long_name = "sea ice drift vector" ;
 ice_drift_vector:i_component = "dX" ;
 ice_drift_vector:j_component = "dY" ;
 ice_drift_vector:i_j_direction = "dir" ;
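
As an illustration of how generic software might use such a variable, the sketch below follows the i_component and j_component attributes to the data and derives a magnitude. It assumes the two components are orthogonal and in the same units; the data values are made up, and the file is modelled with plain dictionaries rather than a real netCDF library.

```python
import math

# Attributes of the umbrella variable from the example above.
container = {
    "umbrella_name": "sea_ice_displacement_vector",
    "i_component": "dX",
    "j_component": "dY",
}

# Made-up data for the two referenced variables.
data = {
    "dX": [3.0, 0.0, -6.0],
    "dY": [4.0, 2.0, 8.0],
}


def magnitude_2d(container, data):
    """Follow the component attributes and return the vector magnitude
    at each point, assuming orthogonal components in the same units."""
    i_values = data[container["i_component"]]
    j_values = data[container["j_component"]]
    return [math.hypot(i, j) for i, j in zip(i_values, j_values)]


magnitudes = magnitude_2d(container, data)
```

The software never inspects the names or standard_names of dX and dY; the roles come entirely from the umbrella's attributes.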

comment:56 follow-up: Changed 8 years ago by pbentley

Hi Mark,

I wonder, perhaps, if 'umbrella name/variable' is not really the right terminology here. Taken in isolation, the attribute umbrella_name suggests that its value is providing the name of an umbrella! Which clearly is not the intention ;-) I suspect that what you were wanting here is umbrella_variable_name or umbrella_quantity_name?

Personally, however, I'd prefer that the term 'umbrella' was dropped. Within the geoinformatics realm I think there are more commonly used terms which more accurately define the role being played by 'umbrella variable'. For example: parent, container, composite or aggregate (no doubt there are others).

A careful (re)analysis of the role/purpose of the proposed new construct is likely, I think, to lead to the selection of a more precise and meaningful term.

Alternatively, rather than targeting a (hazily-defined?) generic use-case, one could target the specific use-case suggested by the title of this ticket (i.e. vector quantities). Thus, one might have, say, vector_quantity_name in place of umbrella_name. (In which case it might also be useful to debate whether or not there is value in identifying a new CF featureType of vector to reinforce the intent.)

Regards,

Phil

comment:57 in reply to: ↑ 56 ; follow-up: Changed 8 years ago by markh

Replying to pbentley:

Hi Phil

(In which case it might also be useful to debate whether or not there is value in identifying a new CF featureType of vector to reinforce the intent.)

I'm not sure that a new feature type definition is helpful here. These containers may be used for different feature types: a structured grid may define a vector, as may a trajectory or a time series. I think the concepts are independent.

I wonder, perhaps, if 'umbrella name/variable' is not really the right terminology here. Taken in isolation, the attribute umbrella_name suggests that its value is providing the name of an umbrella! Which clearly is not the intention ;-) I suspect that what you were wanting here is umbrella_variable_name or umbrella_quantity_name?

Personally, however, I'd prefer that the term 'umbrella' was dropped. Within the geoinformatics realm I think there are more commonly used terms which more accurately define the role being played by 'umbrella variable'. For example: parent, container, composite or aggregate (no doubt there are others).

I like the term container here, it seems quite intuitive to me; a marked improvement on umbrella.

Alternatively, rather than targeting a (hazily-defined?) generic use-case, one could target the specific use-case suggested by the title of this ticket (i.e. vector quantities). Thus, one might have, say, vector_quantity_name in place of umbrella_name.

I think that the aim of this ticket should be to deliver the specific use case of vector quantities. That said, I think it is worth keeping an eye on the future and on other uses for a container/umbrella, so vector should be the first of a potential number of types, with scope for extension, not treated in complete isolation. Hence I propose the generic type and one explicit instance of that type.

Perhaps we could define the following to deliver to the scope of this ticket.

  • Container variable
    • a new type of CF variable
  • container_type
    • an attribute on a Container variable
    • a controlled vocabulary within CF, defined in the specification
    • initially the only valid value is:
      • vector
    • the container type then defines a list of further container attributes for that type (as above)
  • container_name
    • an attribute on a Container variable
    • a controlled vocabulary within CF, maintained as a list like standard_names
    • the container_name would be linked to a valid container type, e.g.:
      • container_name: "sea_ice_displacement", type: "vector"
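For illustration only, the scheme in the list above could be checked by a client roughly as follows. This is a minimal sketch, not part of the proposal: the two vocabularies and the function name check_container are assumptions, and variable attributes are modelled as a plain dictionary rather than read from a netCDF file.

```python
# Hypothetical controlled vocabularies; neither exists in CF today.
CONTAINER_TYPES = {"vector"}
CONTAINER_NAMES = {"sea_ice_displacement": "vector"}  # container_name -> required type


def check_container(attrs):
    """Return a list of problems found in a container variable's attributes."""
    problems = []
    ctype = attrs.get("container_type")
    cname = attrs.get("container_name")
    if ctype not in CONTAINER_TYPES:
        problems.append(f"unknown container_type: {ctype!r}")
    if cname not in CONTAINER_NAMES:
        problems.append(f"unknown container_name: {cname!r}")
    elif CONTAINER_NAMES[cname] != ctype:
        # The name is known, but it is linked to a different container type.
        problems.append(
            f"container_name {cname!r} requires type {CONTAINER_NAMES[cname]!r}"
        )
    return problems


ok = check_container(
    {"container_type": "vector", "container_name": "sea_ice_displacement"}
)
```

Keeping the type list separate from the name list mirrors the idea that new container types could be added later without touching existing names.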

what do you think?

mark

comment:58 in reply to: ↑ 57 Changed 8 years ago by pbentley

Replying to markh:

Perhaps we could define the following to deliver to the scope of this ticket.

  • Container variable ...
  • container_type ...
  • container_name ...

what do you think?

I think that your proposed scheme, or something close to it, would be workable. As I hinted at back in comment:48, however, I do wonder to what extent we could - or should - exploit the basic machinery for handling groups of variables as provided by the netCDF-4 library (in combination with the enhanced data model). It would seem to make sense to exploit that capability, rather than retro-fit it to the netCDF-3 world.

If we were to endorse a netCDF-3 classic implementation - either exclusively, or as an option - then I can foresee a scenario whereby countless developers of netCDF client tools will each roll their own custom (and potentially incompatible) solutions for handling groups of variables.

From comments made earlier against this ticket, I realise that there is some resistance to going down the full netCDF-4 path. But if we are unwilling to adopt a feature as compelling as netCDF-4 groups to handle, erm, grouping of variables, then I wonder what, if anything, might prompt us to make the transition?

I realise that I have drifted off topic somewhat. Apologies for that. To reiterate: yes, I think the metadata/data model being proposed could be made to work using either the netCDF-3 or netCDF-4 data models. But if we were to endorse/support both approaches then I suspect that developers probably will only target the former.

Phil

comment:59 in reply to: ↑ 55 ; follow-up: Changed 8 years ago by lavergne

Replying to markh:

This ticket has been untouched in 6 months, but it was not clear at this point that there was agreement on the way forward.

Mark,

I appreciate you re-opening this issue.

As you may remember, I had proposed an implementation of the vectors/tensors in CF on 21st August (2012). This proposal was however deemed difficult to use in practice by others, and the explicit definition of component attributes (i_component, ij_direction, magnitude,...) was proposed. Even Jonathan, who was at first arguing against the implied redundancy of the attributes, was convinced, and, I must admit, so was I.

So I do support your initiative to re-open the issue, following your idea. I am looking forward to commenting on and editing a text implementing this changeset in CF (main text and appendix).

The term "umbrella" was really one I found for clarifying my ideas, but I am not attached to it. "container" is better, although it is funny how the netCDF implementation will "contain" very little (no data, no link to dimensions, just plain text attributes). And I cannot foresee what other usages of "umbrella/container" will later be proposed that do not fit the "container" realm.

Isn't an "umbrella/container" just a "gluer" or a "stitcher"? (now that's ugly :)

I support "container" when compared to "umbrella", but am still looking for a more generic term.

Cheers, Thomas

comment:60 in reply to: ↑ 59 ; follow-up: Changed 8 years ago by davidhassell

Replying to lavergne:

I support "container" when compared to "umbrella", but am still looking for a more generic term.

Another idea that occurs to me, in the spirit of using existing functionality, might be to use the "cf_role" property, as is used for discrete sampling geometries. So your container variable would have cf_role='vector'. Such a variable would then have vector_name='sea_ice_displacement'. A "vector_name" property in the absence of cf_role='vector' would not be treated in any special way.

Or something like that?

All the best,

David

comment:61 in reply to: ↑ 60 ; follow-up: Changed 8 years ago by ngalbraith

Replying to davidhassell:

Another idea that occurs to me, in the spirit of using existing functionality, might be to use the "cf_role" property, as is used for discrete sampling geometries. So your container variable would have cf_role='vector'. Such a variable would then have vector_name='sea_ice_displacement'. A "vector_name" property in the absence of cf_role='vector' would not be treated in any special way.

Using cf_role seems good because it's an existing concept. Could we also replace 'vector_name' with something like 'group_name' to make this easier to expand? Although we're starting with vectors, as the cf_role terms grow, the uses for this construct could be very broad.

Going back to the example above, proposed in Sept '12,

int drift_vector;
 drift_vector:umbrella_name = "sea_ice_displacement_vector" ;
 drift_vector:long_name = "sea ice drift vector" ;
 drift_vector:i_component = "dX" ;
 drift_vector:j_component = "dY" ;
 drift_vector:i_j_direction = "dir" ;

I find the attribute names i_component and j_component completely opaque. CF is supposed to be human-readable, and these don't really convey any information, at least nothing intuitive. Was the idea behind specifying the components individually that we needed to identify whether this was a 2-d or 3-d vector? To me, a shared 'components' attribute would work just as well, and would be more flexible and concise.

Could it be simplified to:

int drift_vector ;
 drift_vector:cf_role = "vector" ;
 drift_vector:group_name = "sea_ice_drift_vector" ;
 drift_vector:long_name = "sea ice drift vector" ;
 drift_vector:components = "dX, dY, dir" ;

The variables dX, dY, and dir would then be responsible for identifying themselves fully as to their role in the vector. I've read through the comments above and I see the i_ and j_component being proposed, but not adopted, and I don't find the rationale behind this.
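As a rough sketch of how a reader might consume such a 'components' attribute: the function below splits the list and looks up each member's standard_name. The function name, the dictionary model of the file, and the standard_name values are assumptions made for illustration; the splitter accepts both comma-separated and blank-separated lists, since both spellings appear in this ticket.

```python
import re


def resolve_components(container_attrs, variables):
    """Map each name in the container's 'components' attribute to its standard_name."""
    # Accept "dX, dY, dir" as well as "dX dY dir".
    names = [n for n in re.split(r"[,\s]+", container_attrs["components"]) if n]
    missing = [n for n in names if n not in variables]
    if missing:
        raise KeyError(f"components not found in file: {missing}")
    # A component without a standard_name maps to None.
    return {n: variables[n].get("standard_name") for n in names}


# Illustrative file contents (attribute values are assumptions).
variables = {
    "dX": {"standard_name": "sea_ice_x_displacement"},
    "dY": {"standard_name": "sea_ice_y_displacement"},
    "dir": {"standard_name": "direction_of_sea_ice_displacement"},
}
drift_vector = {"cf_role": "vector", "components": "dX, dY, dir"}
roles = resolve_components(drift_vector, variables)
```

With this approach the container only groups the variables; deciding which one is the x component or the direction is left entirely to the standard names.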

comment:62 in reply to: ↑ 61 Changed 8 years ago by lavergne

Replying to ngalbraith:

Could it be simplified to:

int drift_vector ;
 drift_vector:cf_role = "vector" ;
 drift_vector:group_name = "sea_ice_drift_vector" ;
 drift_vector:long_name = "sea ice drift vector" ;
 drift_vector:components = "dX, dY, dir" ;

The variables dX, dY, and dir would then be responsible for identifying themselves fully as to their role in the vector. I've read through the comments above and I see the i_ and j_component being proposed, but not adopted, and I don't find the rationale behind this.

Thank you, we are back where we started, 60 comments ago :)

This is exactly what I was advocating for in the text opening this ticket, and later in the implementation I proposed on August 21st (post #33).

The reasons why many then advocated for introducing a more detailed listing of the components (like i_component, magnitude, etc.) are mainly: 1) it proves cumbersome to rely on parsing standard names for deciding if a component is a magnitude, a direction, or an x component; 2) standard_names are not mandatory and sometimes "under approval", and vectors could need a way to be defined with "non standard" components.

I did not fully buy the first argument, but I must say the second convinced me.

I am still puzzled as to how we are going to check the validity of a vector variable that defines "sea_ice_x_displacement" as "ij_direction". How will a CF checker handle that? The only way I see is to introduce the link between component attributes and standard names in the standard name table: "sea_ice_x_displacement" can only appear as an "i_component", a "j_component", or a "k_component". And if we do that, argument 1) in my list above partly falls away.
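To make the question concrete, here is a minimal sketch of such a check, assuming the standard name table were extended with a list of permitted roles per name. The table ALLOWED_ROLES, the set ROLE_ATTRIBUTES and the function check_roles are all hypothetical; no CF checker is known to work this way.

```python
# Role attributes discussed in this ticket (spelling as used above).
ROLE_ATTRIBUTES = {
    "i_component", "j_component", "k_component", "ij_direction", "magnitude",
}

# Hypothetical extension of the standard name table: name -> permitted roles.
ALLOWED_ROLES = {
    "sea_ice_x_displacement": {"i_component", "j_component", "k_component"},
}


def check_roles(container_attrs, variables):
    """Return a list of role/standard_name mismatches for one vector container."""
    problems = []
    for role, var_name in container_attrs.items():
        if role not in ROLE_ATTRIBUTES:
            continue  # long_name and other attributes are not roles
        std_name = variables.get(var_name, {}).get("standard_name")
        allowed = ALLOWED_ROLES.get(std_name)
        # Names absent from the table (or missing) cannot be checked at all.
        if allowed is not None and role not in allowed:
            problems.append(f"{var_name} ({std_name}) may not be used as {role}")
    return problems


variables = {"dX": {"standard_name": "sea_ice_x_displacement"}}
bad = check_roles({"ij_direction": "dX"}, variables)
```

Note that the check silently passes any component whose standard name is missing or not in the table, which is exactly the case argument 2) is about.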

Thomas

comment:63 follow-up: Changed 8 years ago by jonathan

Dear Thomas

I prefer the simplicity of your original idea, recently reproposed by Nan. I do not think it is a good idea to store more information in the container variable than is necessary. It is not necessary to identify the components, amplitude, angle etc. of a vector, since they do that themselves with their standard names. You just want to provide a way to group them. The second argument

standard_names are not mandatory and sometimes "under approval", and vectors could need a way to be defined with "non standard" components

strikes me as an argument against using the container variable to indicate which is the x-component, which the y-component, etc. Providing this facility might mean that users of new quantities would not bother to request standard names for them, and thus the container variable would become an alternative to standard names for identifying these quantities. That would make the use of the CF standard more complicated for software trying to find what it wants in a file. Rather, I think that the variables the container points to should be required to have standard names - I'm sure that we have been through this before :-)

Your original use-case was to make it easier to identify the groups of variables that together form a vector. The simplest kind of container will fulfil that need.

Use of cf_role="vector" looks like a good idea to me. The term "container" might be slightly too general, though. We also call grid_mapping a container variable, but its purpose is different. "Group" could be a good term.

I still think it would be fine to have special standard names for these vector groups, but equally we could have a new controlled vocabulary. In either case, it should be stated in the vocabulary which standard names were allowed as components of a group with a particular group name.

So I would favour something like your original version and Nan's recent one

 int drift_vector;
   drift_vector:group_name = "sea_ice_displacement";
   drift_vector:cf_role="vector";
   drift_vector:components = "dX dY dir" ;

Even more simply, if it has a group_name (rather than a standard_name), perhaps that identifies its purpose, and the cf_role could be omitted.

Best wishes

Jonathan

comment:64 in reply to: ↑ 63 ; follow-up: Changed 8 years ago by lavergne

Jonathan,

Replying to jonathan:

Dear Thomas I prefer the simplicity of your original idea, recently reproposed by Nan.

[...]

Your original use-case was to make it easier to identify the groups of variables that together form a vector. The simplest kind of container will fulfil that need.

Thanks again. I feel this trac ticket is going into an infinite loop, and I do not really see where (or how) to put a "break" and conclude. Do we have to reach 100% agreement? Or do some voices have more weight than others? How do we take the final decision in such a ticket?

If someone still feels the "simple" proposal is not enough, could I propose the following approach:

  1. We finalize then close this ticket with the "simple" approach, introducing the new mechanism for identifying a vector as a "group/umbrella/container" variable that uses a mandatory "components" attribute listing the data variables. We thus require that components of the vector have standard names.
  2. Right after, or later in the lifetime of the project, when we see that the "simple" approach is not enough, we start a new ticket for enhancing the construct. This new ticket could, for example, introduce the "i_component", "magnitude", etc. attributes (but maybe we will come up with an even smarter solution?) and make the "components" one optional. Data files that were built using the "simple" approach will still be CF compliant, but will lack some features that, if critical enough, will prompt the conversion of the files to the new approach.

The other (cleaner, longer) way of getting out of this would be to forget what we have been debating in this ticket (namely HOW to implement vectors) and instead go back to the WHY we need vector constructs. Answers to this WHY should then help us pin-point why the "simple" approach is enough, or not.

What do you think?

Replying to jonathan:

Even more simply, if it has a group_name (rather than a standard_name), perhaps that identifies its purpose, and the cf_role could be omitted.

As for deciding whether such a construct should have a standard name or one from a new vocabulary: one of the reasons I proposed a standard_name in the first place is that it opens up all the additional things one can do with such names, e.g. standard name modifiers. One could for example attach a "status_flag" variable directly to the vector construct, in addition to those of the individual components. I am not saying we cannot construct "group_name modifiers" if we need them, but standard name modifiers are already there, so why not try to use them?

Thomas

comment:65 in reply to: ↑ 64 ; follow-up: Changed 8 years ago by markh

Replying to lavergne:

I do not really see where (or how) to put a "break" and conclude. Do we have to reach 100% agreement? Or do some voices have more weight than others? How do we take the final decision in such a ticket?

My view is that we need to reach a consensus on what represents a good enough solution, without any strong objections to the approach left outstanding.

I feel that there is a good case for adding extra structure to this proposal, in order to meet a set of use cases which we have for specifying and working with vector quantities.

I have tried to stress how important it is for our use of this feature that we can precisely define the role of a component within the scope of the vector container; we feel this is a crucial factor in delivering the benefit we seek.

I also think careful consideration is needed of how this new type of association will scale in the future of CF. The opportunity to semantically group data variables has not been present in CF before; its potential applications are not yet clear, but there may be many. For this reason I think it will be very valuable for us to define a type for the vector container, so that if other containers are required in the future their usage can be clearly separated from what we decide here.

For these reasons I have strong objections to the simplified solution and I urge those who favor the simplified solution to reconsider my position. I do not think that the solution I presented in comment:55 adds a lot of complexity, it feels clear and self describing to me, whilst it brings significant benefit to the people I work with. This is why I am continuing to request that it is implemented by the community.

comment:66 in reply to: ↑ 65 ; follow-up: Changed 8 years ago by lavergne

Mark,

Replying to markh:

Replying to lavergne:

I do not really see where (or how) to put a "break" and conclude. Do we have to reach 100% agreement? Or do some voices have more weight than others? How do we take the final decision in such a ticket?

My view is that we need to reach a consensus on what represents a good enough solution, without any strong objections to the approach left outstanding.

I feel that there is a good case for adding extra structure to this proposal, in order to meet a set of use cases which we have for specifying and working with vector quantities.

Let's discuss it a bit further, then:

Does the problem with the "simple" solution boil down to the standard names being cumbersome to parse by a machine (like sea_ice_x_velocity, u_wind, etc...) ?

If yes, I would argue that the "component" attributes you propose are a way around an issue with the standard name table. I am not saying this rules your approach out, but are we introducing the component attributes because we cannot practically change the standard names?

Now imagine the standard names were "easy" to parse, and it was thus easy to decide if a variable was an "x" or a "direction": would there be anything missing from the "simple" implementation?

Cheers, Thomas

comment:67 Changed 8 years ago by rhorne@…

Folks:

I have spent some time thinking about how to use (and possibly extend) the CF standard for space weather products, which essentially boil down to data sets derived from sensors with one or more apertures looking off into outer space.

As a result, these use cases need 3 dimensional vectors. In the case of the more structured approach, the conventions will need to specifically include support for 3D vectors. But this alone does not completely solve my problem. These space weather product vectors also have a 3D field of view around the sensor aperture's boresight that needs to be captured. With the more structured approach, additional vector component conventions will need to be defined.

The point here is that the more structured approach is likely to be enhanced/tweaked repeatedly, if not on an on-going basis, moving forward.

Thus, the stability of the conventions is enhanced (not to mention the rate at which progress can be achieved by data producers) with the simpler, less structured approach moving forward.

very respectfully,

randy

comment:68 in reply to: ↑ 66 Changed 8 years ago by markh

Replying to lavergne:

Let's discuss it a bit further, then:

thank you

Does the problem with the "simple" solution boil down to the standard names being cumbersome to parse by a machine (like sea_ice_x_velocity, u_wind, etc...) ?

This is a challenge, but it is not the core problem.

I know that standard_names are optional and much data is created/provided without a standard_name.

As such, the dependency of component identification on standard_names is inherently fragile; it only works in certain, well-defined cases.

I would also like to use this facility to define vector fields created from scalar fields and other vector fields.

If yes, I would argue that the "component" attributes you propose are a way around an issue with the standard name table. I am not saying this rules your approach out, but are we introducing the component attributes because we cannot practically change the standard names?

I agree that there is a part of my logic which is wary of further complicating the use of standard names, as I think they are already taking a large amount of the strain of CF and they are a limiting factor in some regards.

But this is secondary to my concern that standard names are not able to deliver to the requirements for vector identification.

Now imagine the standard names were "easy" to parse, and it was thus easy to decide if a variable was an "x" or a "direction": would there be anything missing from the "simple" implementation?

Yes, there would. Specifically, the ability to derive vector fields from other fields in a form that can be used.

If I calculate the gradient of a scalar field, I get a set of vector components. I have no confidence that someone has already prepared a set of standard names for me; in many cases there will be no valid names, but I have created the fields. I need to identify them for them to be useful to my downstream software.

I can link them together as components, but I have no way of defining what their roles are. I do not want to pursue the path of inventing my own semantics for this, I want to work within the standard. But the standard states I must not make up standard names on the fly.

At this point I have a vector field with n components and no clue what these vector components represent. It is guesswork how to provide the correct data to plot, for example, arrows for a 2D slice through my 3D vector field.

I feel it is far too much of a burden to place on data analysts to make them request and discuss a new standard_name every time they want to calculate the gradient or curl of another Field.

Hence, my preferred solution is to define roles.
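As a sketch of what role attributes would enable in the gradient example: plotting code could pick the components for a 2D arrow plot by role alone, without any standard names. The variable names (dTdx, dTdy, dTdz), the dictionary model of the file and the function arrow_components are assumptions made for illustration.

```python
def arrow_components(container_attrs, variables):
    """Return (u, v) data for a 2D arrow plot, chosen by role, not by standard_name."""
    u = variables[container_attrs["i_component"]]["data"]
    v = variables[container_attrs["j_component"]]["data"]
    return u, v


# Hypothetical gradient of a scalar field T; no variable carries a standard_name.
variables = {
    "dTdx": {"data": [0.1, 0.2]},
    "dTdy": {"data": [0.0, -0.1]},
    "dTdz": {"data": [1.0, 1.1]},
}
gradient = {"i_component": "dTdx", "j_component": "dTdy", "k_component": "dTdz"}
u, v = arrow_components(gradient, variables)
```

With only a flat 'components' list, the same code would have to guess which two of the three variables span the horizontal slice.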

comment:69 Changed 8 years ago by markh

I would also like to counsel against the use of cf_role in this context.

This attribute is used as part of the definition of discrete sampling geometries and I think the vector container we are discussing here is as applicable to use with discrete sampling geometries as it is with continuous datasets.

I think re-using this attribute in the context of vectors will lead to problems.

mark
