[CF-metadata] Usage of histogram_of_X_over_Z from Jonathan Gregory on 2016-10-27 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Jonathan Gregory <j.m.gregory>
Date: Thu, 27 Oct 2016 17:08:11 +0100

Dear Martin

> In broad usage, I have the impression that a "histogram" can be expressed as either a count or a percentage, so we should be explicit in the convention if we want a narrower definition here. A narrower definition is probably needed, as there would otherwise be no way of distinguishing between the two.

I agree with that, but the idea is that a standard name of histogram would be
for a count, while probability would be for a fraction. The latter could be
0-1 or 0-100%; these are dimensionally equivalent but have different units. We
could clarify that in the guidelines.
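
As a rough illustration of that distinction (a sketch only: the sample values, bin bounds and variable names are made up, and NumPy is assumed), the same binned data can be written as a count, a fraction or a percentage:

```python
import numpy as np

# Hypothetical air temperatures in K (made-up values for illustration)
samples = np.array([271.0, 272.5, 273.1, 274.9, 275.0, 276.2, 278.8, 279.5])
edges = np.array([270.0, 273.0, 276.0, 279.0, 282.0])  # bin bounds in K

# "histogram": number of samples falling in each bin, units "1"
counts, _ = np.histogram(samples, bins=edges)

# "probability" expressed as a fraction (0-1), also units "1"
fraction = counts / counts.sum()

# the same probability expressed as a percentage (0-100), units "%"
percent = 100.0 * fraction

print(counts, fraction, percent)
```

The count and the fraction are both pure numbers, so the units alone cannot tell them apart; only the standard name can.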

> There are two further CMIP variables, both of which are bi-variate distributions, with bins of spectral bands and cloud top height ranges, which I'd like to bring into the discussion, but it might be useful to transfer the conclusions of the exchange so far into a ticket first. I think the two additional variables could be covered by a simple extension to "probability_density_function_of_X_and_Y" ... though you might want to insert "joint_" at the beginning of the term.

OK, that's interesting. I agree that it would fit.

Best wishes

Jonathan
>
> Dear Martin and Alejandro (following off-list discussions)
>
> > The CF definitions say '"histogram_of_X[_over_Z]" means histogram (i.e. number of counts for each range of X) of variations (over Z) of X.'
>
> Yes, that's in the guidelines for construction of standard names, and there
> are only two of them at present, as you say. The simplest case is when you
> have some quantity Q depending on only one dimension, Q(Z). Then the histogram
> H(Q) is the number of values of Q which fall into each interval of Q,
> considering variation over Z. In general there could be more than one
> dimension retained, and more than one removed. If the original field was
> Q(P,Y,Z,T), we might construct a histogram H(Q,Z,T), for instance, containing
> the frequencies of values of Q falling into joint intervals of Q, Z and T, for
> variation over P and Y. Following the guideline above, we would call this a
> histogram of Q over P and Y, I think.
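
The construction of H(Q,Z,T) from Q(P,Y,Z,T) described above might be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the array sizes, the random values and the bin bounds are arbitrary, NumPy is assumed, and each retained Z and T index is treated as one cell of the joint interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical field Q(P, Y, Z, T); sizes and values are arbitrary
nP, nY, nZ, nT = 4, 5, 3, 2
Q = rng.uniform(0.0, 10.0, size=(nP, nY, nZ, nT))

q_edges = np.linspace(0.0, 10.0, 6)  # bounds of 5 bins of Q
nQ = len(q_edges) - 1

# H(Q, Z, T): for each (Z, T) cell, count the values of Q found
# anywhere along the collapsed dimensions P and Y
H = np.empty((nQ, nZ, nT), dtype=int)
for k in range(nZ):
    for t in range(nT):
        # np.histogram flattens its input, so P and Y are pooled together
        H[:, k, t], _ = np.histogram(Q[:, :, k, t], bins=q_edges)

print(H.shape)
```

Summing H over its Q dimension gives nP*nY for every (Z, T) cell, which is the sense in which the histogram is "over P and Y".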
>
> It is not necessary to indicate in the standard name the dimensions which
> the histogram depends on (Z and T in my example) because the coordinate
> variables (of Z and T) make that clear. Martin suggests that by this argument
> we could also omit Q from the standard name, and just call it a histogram
> (or frequency distribution) rather than a histogram of Q, where Q is air
> temperature, precipitation amount, backscattering ratio, etc. I think there
> are two reasons why we include Q in the standard name:
>
> * I think a histogram of air temperature is not the same geophysical quantity
> as a histogram of precipitation amount, for instance, so they should be
> distinguished by standard name.
>
> * Although histograms are pure numbers, and so are probabilities, probability
> densities are not. Histograms, probability distributions and probability
> density functions are all related ways of expressing the same information.
> In the guidelines, we foresee that we might need names for all of them (though
> so far we have only histograms) and it would make sense to give them consistent
> names. The probability density function of air temperature has units of K-1,
> and of precipitation amount kg-1 m2, for instance. Because they have different
> canonical units, they must have different standard names, so Q needs to be
> included in the standard name.
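
The relation between the three forms, and the reason only the density carries units, can be sketched like this (the counts and bin bounds are invented for illustration, and NumPy is assumed):

```python
import numpy as np

# Hypothetical histogram of air temperature: counts per bin, units "1"
counts = np.array([2, 3, 2, 1])
edges = np.array([270.0, 273.0, 276.0, 279.0, 282.0])  # bin bounds in K
widths = np.diff(edges)                                 # bin widths in K

# Probability per bin: still a pure number, units "1"
probability = counts / counts.sum()

# Probability density: probability per unit of Q, hence units "K-1"
density = probability / widths

# Integrating the density over all bins recovers a total probability of 1
integral = (density * widths).sum()
print(density, integral)
```

Dividing by the bin width is what introduces the inverse units of Q, so the same step applied to precipitation amount would give kg-1 m2 instead of K-1.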
>
> Cell methods describe how the values represent variation within the cells.
> The transformation from the values of a quantity to a histogram of the
> quantity makes the original quantity into a dimension. This seems more of
> a radical transformation than computing a mean or a standard deviation, which
> doesn't change the dimensions of the variable, but just reduces their size
> (to unity if completely collapsed). A frequency distribution of Q is
> regarded as a different geophysical quantity from Q itself, so we have not
> used cell methods to describe the relationship. Of course, this is a bit
> arbitrary (like everything else in the CF convention!).
>
> I agree with Martin that we could omit the "over" part of the standard name for
> histograms, probabilities and probability densities. It is useful to retain the
> collapsed dimensions as size-1 dimensions, so that their original range can
> be recorded. They could be assigned cell_method of "sum", the default for
> extensive quantities, because the histogram applies to their entire range.
> The same applies to the variable which has been histogrammed and is now a
> dimension; the histogram is a sum for each of its cells.
>
> For example, in the 1D case, suppose the original field is air_temperature
> as a function of time only. Then the histogram variable is
> float hair(tair);
> hair:standard_name="histogram_of_air_temperature";
> hair:units="1";
> hair:cell_methods="time: sum tair: sum";
> hair:coordinates="time";
> float time; // scalar coordinate variable with bounds
> float tair(tair);
> tair:units="K";
>
> As a multidimensional example, suppose the original field is
> float tair(time,altitude,latitude,longitude);
> tair:units="K";
> tair:standard_name="air_temperature";
> tair:cell_methods="altitude: mean area: mean time: mean";
> from which we might construct
> float pair(tair,time,altitude);
> pair:standard_name="probability_density_function_of_air_temperature";
> pair:units="K-1";
> pair:cell_methods="altitude: mean time: mean area: sum tair: mean";
> pair:coordinates="latitude longitude"; // to record the ranges
> Here, I suggest that the cell_method for area is "sum", because the PDF
> applies to the whole area, which is an extensive quantity. For air temperature
> it makes more sense to interpret a PDF as a mean within cells, since a PDF is
> an intensive quantity - you can interpolate it, for example - but not a point
> quantity if it's calculated from a histogram with finite bin-widths.
>
> Best wishes
>
> Jonathan

----- End forwarded message -----
Received on Thu Oct 27 2016 - 10:08:11 BST
