
[CF-metadata] Attributes to describe precision resulting from lossy compression?

From: Charlie Zender <zender>
Date: Wed, 18 Feb 2015 10:27:42 -0800

Dear All,

The latest version (4.4.8) of NCO contains a Precision-Preserving
Compression (PPC) feature that might benefit from wider discussion
before its associated metadata are finalized. If you are interested
in precision, compression, or just procrastination, please join a
discussion on changes or improvements to the scheme I've devised.

More documentation on PPC algorithms and performance details is at
http://nco.sf.net/nco.html#ppc
However, I think any changes to CF would focus on definitions (of
precision) and implementation. For data that are rounded (quantized),
users want to know what that means, not necessarily how it was
performed.

The meaning of data precision, and thus what it means for data to
be "rounded" or "quantized", could be clarified in CF with something
like the text drafted below. These changes adequately represent, I
think, an existing metadata annotation for precision used in nc3tonc4
by Jeff Whitaker, which NCO has adopted (called DSD below), as well
as an annotation for a new method of quantization (called NSD below)
introduced in NCO. You will see that it boils down to adding an
attribute that indicates the type and degree of imposed precision.
One possibility that I considered, and discarded, was to specify the
absolute precision in units of the stored variable (rather than the
number of significant digits). There are arguments both ways...

The suggested CF changes below are a minimal way of specifying how
data have been quantized. A more general metadata framework for
precision might include distinctions for intrinsic precision of
measurement/model (in addition to precision due to post-processing or
rounding), notations helpful for propagating errors, and how to
specify precision lost due to packing/unpacking. None of that is
in the below draft, which simply extends CF to cover precision imposed
by NSD and DSD quantization. Whether or not you want CF to recommend
attributes describing precision and/or lossy compression, please
comment...

Best,
Charlie

Current CF:
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/cf-conventions.html#packed-data
"Methods for reducing the total volume of data include both packing
and compression. Packing reduces the data volume by reducing the
precision of the stored numbers. It is implemented using the
attributes add_offset and scale_factor which are defined in the
NUG. Compression on the other hand loses no precision, but reduces the
volume by not storing missing data. The attribute compress is defined
for this purpose."

Proposed CF:
"Methods for reducing the total volume of data include packing,
rounding, and compression. Packing reduces the data volume by reducing
the range and precision of the stored numbers. It is implemented using
the attributes add_offset and scale_factor which are defined in the
NUG. Rounding preserves data values to a specified level of precision,
with no required loss in range. It is implemented using bitmasking or
other quantization techniques. Compression on the other hand loses no
precision, but reduces the volume by not storing missing data. The
attribute compress is defined for this purpose."
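
(Aside: a minimal sketch of one way the "bitmasking" above can work,
using Python/numpy. This is an illustration only, not necessarily the
exact algorithm NCO implements, and the function name is mine.)

import numpy as np

def bitmask_round(data, nsd):
    """Zero the trailing mantissa bits of float32 values that are not
    needed to retain roughly nsd significant decimal digits."""
    data = np.asarray(data, dtype=np.float32)
    # About log2(10) ~ 3.32 mantissa bits are needed per decimal digit.
    keep_bits = int(np.ceil(nsd * np.log2(10.0)))
    drop_bits = max(23 - keep_bits, 0)  # float32 has 23 explicit mantissa bits
    mask = np.uint32((0xFFFFFFFF >> drop_bits) << drop_bits)
    bits = data.view(np.uint32) & mask  # shave the unneeded trailing bits
    return bits.view(np.float32)

print(bitmask_round(1776.0704, 3))     # 1776.0; trailing 13 mantissa bits zeroed

The zeroed trailing bits are what make the rounded values far more
compressible by lossless compressors such as DEFLATE than the original,
fully populated mantissas.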

...

"Packing quantizes data from a floating point representation into an
integer representation within a limited range that requires only
one-half or one-quarter of the number of floating-point bytes.
For values that occupy a limited range, typically about five orders of
magnitude, packing yields an efficient tradeoff between precision and
size because all bits are dedicated to precision, not to exponents.
A limitation of packing is that unpacking data stored as integers
into the linear range defined by scale_factor and add_offset rapidly
loses precision outside of a narrow range of floating point values.
Variables packed as NC_SHORT, for example, can represent only about
64000 discrete values in the range -32768*scale_factor+add_offset to
32767*scale_factor+add_offset. The precision of packed data equals the
value of scale_factor, and scale_factor must be chosen to span the
range of valid data, not to represent the intrinsic or desired
precision of the values. Values that were packed and then unpacked
have lost precision, although there is no standard way of recording
this other than recording the history of the data processing. [One
solution to this would be to record the former scale_factor of
unpacked data in a precision attribute, e.g., "maximum_precision".
Any champions for this?]
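
(To make the packing-precision arithmetic concrete, here is a minimal
Python/numpy sketch; the data values are invented for illustration.)

import numpy as np

data = np.array([0.0001, 1.5, 9999.9])

# scale_factor must be chosen to span the full range of valid data
# across the ~64000 discrete values an NC_SHORT can hold, so it cannot
# also encode fine precision.
add_offset   = (data.max() + data.min()) / 2.0
scale_factor = (data.max() - data.min()) / (2**16 - 2)

packed   = np.round((data - add_offset) / scale_factor).astype(np.int16)
unpacked = packed * scale_factor + add_offset

print(scale_factor)                    # ~0.153: the precision of the packed data
print(np.abs(unpacked - data).max())   # worst-case error <= scale_factor / 2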

Rounding allows per-variable specification of precision in terms
of significant digits valid across the entire range of the floating
point representation. The precision specification may take one of
two forms, either the total number of significant digits (NSD), or
the number of decimal significant digits (DSD), i.e., digits following
(positive) or preceding (negative) the decimal point. The attributes
"number_of_significant_digits" and "least_significant_digit" indicate
that the variable has been rounded to the specified precision using NSD
or DSD definitions, respectively. The quantized values stored with
these attributes are guaranteed to be within one-half of a unit
increment in the value of the least significant digit. Consider, for
example, a true value of 1776.0704. Approximations valid to a
precision of NSD=2 (or DSD=-2) include 1800.0 and 1750.123, both of
which are within 50 (one-half a unit increment in the hundreds digit)
of the true value. Approximations valid to a precision of NSD=5 (or
DSD=1) include 1776.1 and 1776.03, both of which are within 0.05
(one-half a unit increment in the tenths digit) of the true value.
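
(A minimal sketch, in Python/numpy, of what the NSD and DSD definitions
above mean for the 1776.0704 example, and of how the proposed attribute
might be written. The function names, file, and variable names are
hypothetical, and the decimal rounding shown illustrates the definitions
only, not NCO's bitmask implementation.)

import numpy as np

def quantize_dsd(data, dsd):
    """Round to dsd decimal significant digits: digits after the decimal
    point if dsd is positive, before it if negative."""
    return np.round(data, dsd)

def quantize_nsd(data, nsd):
    """Round to nsd total significant digits, element-wise."""
    data = np.asarray(data, dtype=np.float64)
    exponent = np.floor(np.log10(np.abs(np.where(data == 0, 1.0, data))))
    scale = 10.0 ** (nsd - 1 - exponent)
    return np.round(data * scale) / scale

x = 1776.0704
print(quantize_nsd(x, 2))   # 1800.0, within 50 (half a hundreds increment)
print(quantize_dsd(x, 1))   # 1776.1, within 0.05 (half a tenths increment)

# Annotating a (hypothetical) netCDF variable with the proposed attribute,
# using the netCDF4-python package:
#   import netCDF4
#   nc  = netCDF4.Dataset("out.nc", "a")
#   var = nc.variables["T"]
#   var[:] = quantize_nsd(var[:], 3)
#   var.number_of_significant_digits = 3
#   nc.close()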

8.2 ...
-- 
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine 949-891-2429 )'(
