Opened 13 years ago

Closed 5 years ago

Last modified 5 years ago

#31 closed enhancement (fixed)

Proposal for standard attributes actual_min and actual_max

Reported by: jonblower Owned by: cf-conventions@…
Priority: medium Milestone:
Component: cf-conventions Version:
Keywords: Cc:

Description

Summary

It is very useful for data mining and visualization applications to know the minimum and maximum values of a particular variable in a NetCDF file, without needing to extract the entire variable and calculate this in the application. Here we propose a new pair of standard attributes, actual_min and actual_max, that contain the min and max values of a variable.

Advantages

The proposed new attributes would prevent misuse of the valid_min and valid_max attributes, which are intended to be used to delimit a valid data range, but in fact are often used to denote the actual range of data in a variable. The latter (mis)use leads to incorrect assessment of missing values by tools.

Supports data mining: It is a quick operation to find, say, all those data files that contain temperature values above 30degC.

Supports visualization: Visualization tools can use the actual_min/max to generate a sensible colour scale range for displaying the contents of a file. (However see caveats below.)

In the context of aggregations (by NcML or otherwise), the actual_min/max could be easily calculated by taking the minimum/maximum of the attributes of the components of the aggregation.
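
As a rough illustration of the data mining and aggregation points above, here is a minimal Python sketch. The file names and values are hypothetical, and the per-file pairs are assumed to have already been read from each file's actual_min and actual_max attributes; no real NetCDF library is used.

```python
from typing import Iterable, Tuple

Range = Tuple[float, float]


def aggregate_range(component_ranges: Iterable[Range]) -> Range:
    """Combine the (min, max) pairs of the component files into one pair."""
    ranges = list(component_ranges)
    if not ranges:
        raise ValueError("at least one component range is required")
    lows, highs = zip(*ranges)
    return min(lows), max(highs)


# Hypothetical per-file ranges (degC), as they might be read from attributes.
per_file = {
    "jan.nc": (-12.5, 18.0),
    "feb.nc": (-9.0, 21.5),
    "jul.nc": (4.0, 33.2),
}

# Range of the whole aggregation, found without reading any data values.
overall = aggregate_range(per_file.values())

# Data mining: select the files whose maximum exceeds 30 degC.
hot_files = sorted(name for name, (_, high) in per_file.items() if high > 30.0)
```

Only the attribute pairs are touched in both cases, which is the reason the operation is cheap.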

Disadvantages

The new attributes represent redundant metadata and could be incorrectly generated or otherwise become inconsistent. Mitigation: Allow the CF-checker to check these attribute values if they are present in a file.

Caveats

These attribute values would not be correct for a subset of data from the file and so any data subsetting tools must be aware of this and recalculate or remove these attributes from any data product subset.
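
A minimal sketch of what a subsetting tool might do, assuming the data are held in a plain Python list and using hypothetical helper names: the range is recomputed for the subset, and a result of None signals that the attributes should be removed instead.

```python
def actual_range(values):
    """Return (min, max) of the values, or None if there are no values."""
    if not values:
        return None
    return min(values), max(values)


def subset_with_range(values, start, stop):
    """Slice the data and recompute the range for the slice.

    Returns (subset, range). A range of None means the subset is empty,
    so the attributes should be dropped rather than copied.
    """
    subset = values[start:stop]
    return subset, actual_range(subset)


full = [3.0, 7.5, -2.0, 11.0, 4.5]
full_range = actual_range(full)

# The range of the first two values differs from the range of the full data,
# so copying the original attributes to the subset would be wrong.
subset, subset_range = subset_with_range(full, 0, 2)
```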

Data values outside the valid_range would not be counted in the actual_min/max. An alternative nomenclature could be actual_valid_min/max (although personally I find this more confusing).

For visualization, the actual_min/max will not always represent the optimal scale range, particularly if examining a restricted geographical area, or when looking at data from a particular elevation. A more sophisticated solution could involve expressing actual_min/max as an array quantity, with a value pair for each elevation in the data volume. This increases the complexity of the solution and places extra burdens on data providers and tool developers (and does not completely solve the problems described).
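
To make the array-valued alternative concrete, here is a small sketch that computes one value pair per elevation. The nested-list layout (one list of values per elevation) and the numbers are assumptions made purely for illustration.

```python
def range_per_level(volume):
    """Return one (min, max) pair per elevation.

    volume: a list of levels, each a non-empty list of data values.
    """
    return [(min(level), max(level)) for level in volume]


# Hypothetical temperatures (degC) on three elevations, surface first.
volume = [
    [24.0, 27.5, 30.1],
    [5.2, 8.0, 6.4],
    [-41.0, -38.5, -44.2],
]

per_level = range_per_level(volume)

# The single overall pair is much wider than the pair for any one level,
# which is why it can give a poor colour scale for a single elevation.
overall = (min(lo for lo, _ in per_level), max(hi for _, hi in per_level))
```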

Change History (23)

comment:1 Changed 13 years ago by jonathan

Dear Jon

Thanks for making this proposal. Please could you supply the exact amendments you propose should be made to the text of the CF conventions document and conformance document.

Cheers

Jonathan

comment:2 Changed 13 years ago by jonathan

This proposal needs a moderator. According to the rules, "A member of the conventions committee, or another suitably qualified person, volunteers to moderate the discussion. If no-one volunteers, the chairman of the committee will ask someone to do it."

Jonathan

comment:3 Changed 13 years ago by taylor13

Dear Jon,

Thank you for taking the time to submit a ticket on this. Also for suggesting how the process might be streamlined. I hope we will soon be able to begin modifying the website to make it easier.

I can see that it would be convenient for graphics to have the actual_min and actual_max available. It would be nice to know whether, in practice, determining the max & min from the data itself significantly slows the graphical display of typical output. I would have thought that calculating the contours, etc. would take much longer than determining the extremes.

The data mining advantage is clear, but only if you are interested in exceedance near the extremes, which I see as a rather special case.

I hope there will be graphics tool designers who might weigh in with an opinion of how important these new attributes would be for their graphics packages.

with regards, Karl

P.S. I will appoint a moderator for this ticket unless someone on the Conventions Committee volunteers soon.

comment:4 Changed 11 years ago by caron

I'd like to advocate that we accept this change. Users often instead use valid_min/max, which has the undesirable side effect of requiring conforming software to examine data values to see if they are outside of the range. When they really just want to document the actual range in the file, this gives a clear way to do so.

comment:5 Changed 11 years ago by caron

I propose that we accept this proposal. The valid_min/max attribute keeps tripping people up; let's do what we can to fix that.

If we need to go through the formal process, I am willing to be moderator.

So, if there are any objections or concerns, please speak up!

comment:6 Changed 11 years ago by jonathan

Enough people have already supported it, and long enough has passed without objection, for this proposal to be accepted now. However, it can't be implemented unless someone writes down in this ticket exactly the changes in text proposed to be made to the standards and conformance documents. Could Jon do that?

To clarify this, is it required that the actual_min and actual_max have to equal exactly the minimum and maximum value in the data variable, or is it legal for them to indicate a wider range? If exact equality is required, which makes sense given the intention, then these attributes must have the same data type as their data variable, I suppose, and that should also be stated in the convention and conformance documents.

Jonathan

comment:7 Changed 9 years ago by pbentley

Hi,

I support this proposal too. However, I wonder if it would be preferable to conflate the two attributes actual_min and actual_max into a single attribute called actual_range, mirroring the existing NUG-derived attribute valid_range, and based on the assumption that one would always wish to specify both ends of the range. It also has the advantage of being a bit more compact: one attribute rather than two :-)

As with valid_range, the value of the actual_range attribute would be a two-element vector whose data type matched the data variable to which it is attached. For example:

float tas(time, lat, long) ;
   ...
   tas:actual_range = -75.1, 49.6 ;
   ...

Regards,

Phil

comment:8 in reply to: ↑ 7 Changed 9 years ago by jonathan

I support this proposal too. However, I wonder if it would be preferable to conflate the two attributes actual_min and actual_max into a single attribute called actual_range, mirroring the existing NUG-derived attribute valid_range, and based on the assumption that one would always wish to specify both ends of the range. It also has the advantage of being a bit more compact: one attribute rather than two :-)

Yes, that looks like a good idea to me as well. What do you think about whether the range should be exactly as wide as the data range used, or could it be wider?

Cheers

Jonathan

comment:9 in reply to: ↑ 8 Changed 9 years ago by pbentley

Replying to jonathan:

Yes, that looks like a good idea to me as well. What do you think about whether the range should be exactly as wide as the data range used, or could it be wider?

Yep, I'd concur with your earlier comment 6 - the range should specify the exact minimum and maximum. Then client tools can choose whether or not to 'add a bit extra' for, say, plotting purposes.

If possible, I reckon it would be good if we could slip this small addition into the next version of the CF conventions. It might only need an extra row added to the head of Table A.1.

Phil

comment:10 Changed 9 years ago by ngalbraith

the range should specify the exact minimum and maximum.

Should be "the exact minimum and maximum that are within the valid range" or something similar, so that any unusual null values (e.g. -999) are not included in the calculation. Sorry if I'm belaboring the point!

Nan
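
The rule Nan describes can be sketched as follows. This is a minimal illustration with a hypothetical function name, assuming the valid range is known and that values outside it (such as a -999 null flag) are simply skipped.

```python
def actual_range_within_valid(values, valid_min, valid_max):
    """Return (min, max) over the values inside [valid_min, valid_max].

    Returns None when no value is valid, in which case no range can be stated.
    """
    kept = [v for v in values if valid_min <= v <= valid_max]
    if not kept:
        return None
    return min(kept), max(kept)


# Hypothetical data in which -999.0 marks missing points.
data = [12.5, -999.0, 3.0, 18.25, -999.0]

with_filter = actual_range_within_valid(data, -90.0, 60.0)

# Without the filter the null flag would be reported as the minimum.
without_filter = (min(data), max(data))
```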

comment:11 in reply to: ↑ 7 Changed 9 years ago by dmurray

Replying to pbentley:

As with valid_range, the value of the actual_range attribute would be a two-element vector whose data type matched the data variable to which it is attached. For example:

float tas(time, lat, long) ;
   ...
   tas:actual_range = -75.1, 49.6 ;
   ...

What should the data type of actual_range be for data packed with scale_factor and add_offset? With valid_range, the type must be the same as the packed type (e.g. short). Your proposal here seems to indicate that actual_range should also be short if the data are packed into shorts. If the goal is to have this value be available for visualizations, then it would seem that it should be the data type of the unpacked values to be most useful. I think that would be okay as long as it is documented.

Don

comment:12 in reply to: ↑ 11 Changed 9 years ago by pbentley

Replying to dmurray:

What should the data type of actual_range be for data packed with scale_factor and add_offset? With valid_range, the type must be the same as the packed type (e.g. short). Your proposal here seems to indicate that actual_range should also be short if the data are packed into shorts. If the goal is to have this value be available for visualizations, then it would seem that it should be the data type of the unpacked values to be most useful. I think that would be okay as long as it is documented.

Yes, I think we'd want to add the clarification you suggest to the CF doc. As with scale_factor and add_offset, the values for actual_range "should be of the type intended for the unpacked data" (to quote the NUG).

Phil
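
The point Don and Phil make about packed data can be illustrated with a short sketch. It assumes the usual unpacking formula, unpacked = packed * scale_factor + add_offset, and that missing points are identified by a fill value in the packed data; the function name and all numbers are hypothetical.

```python
def unpacked_actual_range(packed, scale_factor, add_offset, fill_value=None):
    """Return (min, max) of the unpacked values, skipping the fill value.

    The fill value is compared in packed space, before any unpacking.
    Returns None if every packed value equals the fill value.
    """
    unpacked = [p * scale_factor + add_offset for p in packed if p != fill_value]
    if not unpacked:
        return None
    return min(unpacked), max(unpacked)


# Hypothetical data packed into shorts, with -32768 as the fill value.
packed = [-200, 0, 150, -32768, 320]

rng = unpacked_actual_range(packed, scale_factor=0.5, add_offset=10.0,
                            fill_value=-32768)
```

The resulting pair is floating point even though the stored data are integers, which is why the attribute would take the unpacked type.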

comment:13 Changed 9 years ago by jonathan

Dear all, especially Phil, Nan, Jon and John

It would be nice to conclude this ticket now so it can go into CF 1.7, which is about to be compiled. Are we agreed that we should have a two-element actual_range attribute of a data variable, whose values are packed into the same data type as the data if the data is packed, and when unpacked are within the valid range if specified, and exactly equal to the minimum and the maximum values of the data when unpacked? I think that summarises the above. Lack of objections will be interpreted as agreement, according to the rules.

Cheers

Jonathan

comment:14 Changed 9 years ago by caron

I'm good with this, with the caveat of clarifying the interaction with scale/offset packing. Did we get the actual amendment wording?

comment:15 in reply to: ↑ 12 Changed 9 years ago by caron

Replying to pbentley:

Yes, I think we'd want to add the clarification you suggest to the CF doc. As with scale_factor and add_offset, the values for actual_range "should be of the type intended for the unpacked data" (to quote the NUG).

Phil

I'm confused by this, where do you find it? I am seeing this wording at

http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Attribute-Conventions.html#Attribute-Conventions:

"If the variable is packed using scale_factor and add_offset attributes (see below), the _FillValue, missing_value, valid_range, valid_min, or valid_max attributes should have the data type of the packed data."

comment:16 in reply to: ↑ 15 Changed 9 years ago by pbentley

Replying to caron:

Hi John,

I'm confused by this, where do you find it? I am seeing this wording at

http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Attribute-Conventions.html#Attribute-Conventions:

The snippet of text I quoted came from the last sentence of the description of the add_offset attribute in Appendix B. I was looking at my treeware version of the NUG v4, though I notice that the same wording is in the online version you mention above. Anyhows, my reply was merely concurring with Don's original comment - that actual_range should always reflect regular, unpacked data values - so probably best to completely ignore my follow up if it confuses matters :-)

Phil

comment:17 Changed 9 years ago by jonathan

OK, so the actual range is of the unpacked type. Sorry. We do need wording for the proposed change, as John says. Here is my suggestion:

  • Change the title of section 2.5.1 from "Missing data" to "Missing data, valid and actual range of data".
  • Append the following paragraph to section 2.5.1: This convention defines a two-element vector attribute actual_range for variables containing numeric data. If the variable is packed using the scale_factor and add_offset attributes (see section 8.1), the elements of the actual_range should have the type intended for the unpacked data. The elements of actual_range must be exactly equal to the minimum and the maximum data values which occur in the variable (when unpacked if packing is used), and both must be within the valid_range if specified. If the data is all missing or invalid, the actual_range attribute cannot be used.
  • Add an entry to Appendix A for actual_range, numeric type, applicable to coordinates and data, reference section 2.5.1, text: "The smallest and the largest valid non-missing values occurring in the variable".

Is that correct? Cheers

Jonathan

comment:18 in reply to: ↑ 16 Changed 9 years ago by caron

Replying to pbentley:

Replying to caron:

Hi John,

I'm confused by this, where do you find it? I am seeing this wording at

http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Attribute-Conventions.html#Attribute-Conventions:

The snippet of text I quoted came from the last sentence of the description of the add_offset attribute in Appendix B. I was looking at my treeware version of the NUG v4, though I notice that the same wording is in the online version you mention above. Anyhows, my reply was merely concurring with Don's original comment - that actual_range should always reflect regular, unpacked data values - so probably best to completely ignore my follow up if it confuses matters :-)

Phil

Yes, this is indeed confusing. I'm pretty sure the following is true:

1) scale_factor and add_offset should have the data type of the unpacked data

2) _FillValue, missing_value, valid_range, valid_min, or valid_max attributes should have the data type of the packed data

The reason for 2) is to identify missing values without having to apply the scale/offset.

As for actual_range, I can see both sides of the argument. Probably using the unpacked data type is easier, so I'm good with the proposal as Jonathan states it.

comment:19 Changed 9 years ago by jonathan

There is no moderator for this ticket. Enough support has been expressed, there are no unanswered objections since the last summary and enough time has elapsed according to the rules. I assert therefore that this ticket should be accepted and included in the next release.

We need an addition to the requirements in section 2.5.1 of the conformance document, which I propose as follows:

  • The actual_range attribute must be of the same type as its associated variable unless there is a scale_factor and/or add_offset attribute, in which case it must be of the same type as those attributes.
  • The actual_range attribute must have two elements, of which the first exactly equals the minimum non-missing value occurring in the associated variable after any scale_factor and add_offset are applied, and the second exactly equals the maximum value in the same way.
  • There must not be an actual_range attribute if all the data values of the associated variable equal the missing value.
  • If both the actual_range and valid_range/valid_min/valid_max are specified, the values of the actual_range must be valid values.

The above will be assumed to be correct unless anyone objects.

Jonathan
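
The four requirements Jonathan proposes could be checked along the following lines. This is only a sketch under stated assumptions, not the CF-checker's implementation: the function name is hypothetical, data values are passed already unpacked with missing points given as None, types are compared as plain names such as "float" or "short", and valid_range is assumed to be expressed in the same units as the unpacked data.

```python
def check_actual_range(values, actual_range, attr_type, var_type,
                       packing_type=None, valid_range=None):
    """Return a list of problems; an empty list means all checks passed.

    values: unpacked data values, with missing data given as None.
    packing_type: type of scale_factor/add_offset, or None if not packed.
    """
    problems = []

    # Requirement 1: attribute type matches the variable, or the packing
    # attributes when the variable is packed.
    expected_type = packing_type if packing_type is not None else var_type
    if attr_type != expected_type:
        problems.append(f"type is {attr_type}, expected {expected_type}")

    # Requirement 2 (first part): exactly two elements.
    if len(actual_range) != 2:
        problems.append("actual_range must have exactly two elements")
        return problems

    # Requirement 3: no attribute when all data values are missing.
    present = [v for v in values if v is not None]
    if not present:
        problems.append("actual_range must not be present when all data are missing")
        return problems

    # Requirement 2 (second part): exact equality with the data extremes.
    if actual_range[0] != min(present):
        problems.append("first element is not the minimum non-missing value")
    if actual_range[1] != max(present):
        problems.append("second element is not the maximum non-missing value")

    # Requirement 4: both elements must be valid values.
    if valid_range is not None:
        low, high = valid_range
        if not all(low <= v <= high for v in actual_range):
            problems.append("actual_range lies outside valid_range")

    return problems


ok = check_actual_range([1.0, None, 5.0], (1.0, 5.0), "float", "float")
```

Returning a list of messages, instead of stopping at the first failure, lets a checker report several independent problems at once.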

comment:20 in reply to: ↑ 17 Changed 5 years ago by jonathan

Changes to convention implemented in https://github.com/cf-convention/cf-conventions/pull/78

comment:21 Changed 5 years ago by jonathan

Changes to conformance requirements had already been made by mattben - thank you. Change to the title of 2.5.1 in the conformance document made by https://github.com/cf-convention/cf-convention.github.io/pull/50.

David and Jonathan

comment:22 Changed 5 years ago by jonathan

  • Resolution set to fixed
  • Status changed from new to closed

comment:23 Changed 5 years ago by jonathan

Added Jon Blower as an additional author of the CF convention
