[CF-metadata] [CF Metadata] #68: CF data model and reference implementation in Python

From: Chris Barker - NOAA Federal <chris.barker>
Date: Thu, 10 Jan 2013 10:26:54 -0800

Folks,

I'm posting this to the list, rather than the ticket, as the ticket
had gotten huge, and it looks like it may be closed/restarted/split up
anyway. Also, this is, perhaps, a more general question/thought.

> This illustrates my point. I am not advocating a discussion here on how
> to represent time, instead I am advocating that the data model does not
> take on this question at all, it is for implementations to deal with.

My idea/confusion is about what the difference is between a data
model and a particular implementation -- i.e., in this case, we have
the netCDF CF standard, which is an implementation of (theoretically)
a CF data model.

But if we define the data model and the netCDF implementation to be
identical, then there is no point in having a data model as a unique
concept.

But I do like having a data model, and the original title of this
ticket outlines why -- with an independently defined data model, we
can have an implementation in Python, or your language of choice, that
is independent of the specific storage format, etc. Ideally, that
would mean that one could work with such an implementation in a nice,
natural way.

So, on to an example -- axes. A datetime axis is a common thing to
want for the kind of data we're trying to deal with here. Working
with an abstraction of a data model, I want to work with the time axis
in a natural way -- that would be a datetime type of some sort. In
fact, pretty much the first thing I ever do when reading in a netCDF
CF file is to convert the time axis to Python datetimes -- I don't
want my client code to have to deal with whether the units are seconds
or days, or ??? -- that's what a datetime type is for.
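
For instance (a minimal sketch using the num2date helper from the
netCDF4 Python package; the units string here is made up for
illustration):

    from netCDF4 import num2date

    # Decode CF-encoded numbers into datetime objects in one step, so
    # client code never sees the "days since ..." bookkeeping.
    times = num2date([0.0, 0.5, 1.0],
                     units='days since 2013-01-10 00:00:00',
                     calendar='standard')
    print(times[1])  # 2013-01-10 12:00:00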

So I think the data model, in this case, should define the concept of
a time axis, which holds datetime types of some sort -- they are
sortable, differentiable, etc. In Python, for instance, they might be
an array of numpy datetime64 values, or a Python list of
datetime.datetime objects.
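
Something along these lines (a toy sketch with numpy's datetime64):

    import numpy as np

    # An abstract time axis: sortable and differentiable by construction.
    t = np.array(['2013-01-10T00', '2013-01-10T12', '2013-01-11T00'],
                 dtype='datetime64[h]')
    t.sort()            # datetimes order naturally
    steps = np.diff(t)  # differences come back as timedelta64
    print(steps)        # [12 12] (hours)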

But in CF-netCDF, we have "timedelta since datetime" as the standard
-- so we can store a datetime axis with standard numerical types.
Fine, but is that the data model? Or is that a particular encoding for
a particular file format? I think the latter.

Example: Unicode provides a data model for text -- a set of code
points, etc. With a modern programming language, you can work with
text as Unicode objects -- a sequence of characters, etc. You don't
need to know how they are internally encoded. However, when you want
to do I/O, you do need to know about encoding. So a file format
specification, and the tools to work with that format, need to specify
and handle the encoding/decoding for you.
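
In Python terms (a minimal sketch):

    # Program logic works with abstract text (str); the encoding is
    # only visible at the I/O boundary.
    raw = b'temp \xc2\xb0C'        # bytes as stored on disk (UTF-8)
    text = raw.decode('utf-8')     # decode once, at the boundary
    assert text == 'temp °C'       # work with abstract characters
    stored = text.encode('utf-8')  # re-encode only when writing out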

I think we should take a similar approach to netCDF-CF -- in a netCDF
file, time axes are encoded as "timedelta since datetime", but in the
data model, they are an abstract datetime type. In other formats that
support the data model, one might well choose a different "encoding"
(maybe ISO strings, though I'd rather NOT have those in netCDF files
-- leave it to libraries to convert between encodings for you).
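
To make the split concrete, here's a hypothetical sketch (the TimeAxis
class and decode_cf_time function are made-up names, and the decoding
ignores calendars and other real-world details):

    from datetime import datetime, timedelta

    class TimeAxis:
        """Data-model object: holds abstract datetimes, not numbers."""
        def __init__(self, values):
            self.values = list(values)

    def decode_cf_time(numbers, units):
        """Decode the CF-netCDF 'timedelta since datetime' encoding."""
        delta, _, epoch = units.partition(' since ')
        start = datetime.strptime(epoch, '%Y-%m-%d %H:%M:%S')
        seconds = {'seconds': 1, 'minutes': 60,
                   'hours': 3600, 'days': 86400}[delta]
        return TimeAxis(start + timedelta(seconds=n * seconds)
                        for n in numbers)

    axis = decode_cf_time([0, 6, 12], 'hours since 2013-01-10 00:00:00')

The file format specifies the encoding; the TimeAxis the user sees is
already decoded.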

I've focused on time axes -- I do think that datetimes are a common
enough case that special treatment is warranted, but maybe there are
others, or a generalization that could be used.

Another example:

This is a bit of a side issue, but may be relevant. On the numpy list
there is a discussion about casting rules -- when to upcast, for
example, an 8-bit integer to a larger integer. One driving example is
HDF files -- apparently it's common to store data as small integers,
along with a scale and offset that transform them to the natural data
coordinates. (This is done in CF, too, yes?) The problem arises when
you load up an array of 8-bit integers, then multiply and add the
scale and offset -- you can very easily end up overflowing your data
type if you're not careful.
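
A toy illustration of the trap (assuming the scale factor is itself
stored as a small integer type):

    import numpy as np

    packed = np.array([100, 120], dtype=np.int8)
    scale = np.int8(10)

    bad = packed * scale                 # int8 * int8 stays int8: wraps around
    good = packed.astype(np.int32) * 10  # upcast first, then apply the scale

    print(bad)   # garbage: [-24 -80]
    print(good)  # [1000 1200]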

My thought on this is that "small integer plus shift and scale" should
be treated like an encoding, and should be hidden from the user end --
the data model should reflect the real values, and the I/O library
should handle the encoding/decoding for you. I.e., user code should be
able to get, for instance, the voltage from a sensor, in an
appropriate type, with the scaling/shifting applied, and not have to
know or care about how it was stored in the file.

Of course, it should also be possible to access the raw data as
stored, but that should require explicit access, which can assume that
the user knows what they are doing.
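
In fact, the netCDF4 Python package already works more or less this
way (a sketch; the file and variable names here are hypothetical):

    from netCDF4 import Dataset

    with Dataset('sensor.nc') as nc:    # hypothetical file
        v = nc.variables['voltage']     # hypothetical packed variable

        # Default: scale_factor/add_offset are applied on read, so the
        # user sees real-world values in a floating-point type.
        volts = v[:]

        # Explicit opt-out for users who really want the packed bytes.
        v.set_auto_maskandscale(False)
        raw = v[:]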

So, in short: the CF data model should be more general, and deal with
data concepts separate from the encoding of those data. File formats
should specify the encoding, and libraries should handle the
encoding/decoding.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker at noaa.gov