
[CF-metadata] Seeking example program for storing surface obs in CF convention

From: John Caron <caron>
Date: Thu, 09 Aug 2007 11:11:06 -0600

Hi Jonathan:

Thanks for taking the time to look at this. Comments are inline.

Jonathan Gregory wrote:
> Dear John
>
>> My own opinion is that CF is not currently adequate for writing observational data to NetCDF. The basic limitation in section 5.4 is that
>>
>> float humidity(time,pressure,station);
>> float pressure(pressure);
>> double time(time);
>>
>> requires the same number and values of the time and pressure coordinates at each station.
>
> Yes, this is wasteful of space if you make all the stations share the
> coordinate variables but they don't all have info at all (time,pressure)
> points. Alternatively you have to create separate coordinate variables for
> each station, which may be inconvenient.
>
> If we put them in common variables, if I have understood your proposal, I
> prefer the contiguous arrangement, something like this:
>
> dimensions:
> record=UNLIMITED;
> station=5;
> stringlen=20;
> variables:
> char station_name(station,stringlen);
> float latitude(station);
> float longitude(station);
> double time(record);
> float humidity(record);
> humidity:coordinates="time";
> float temperature(record);
> temperature:coordinates="time";
>
> where the individual stations are contiguous in the humidity and temperature
> variables. Then the question is how to indicate the range of records which
> belongs to each station. One way, as in your example, is to provide an array
> of start or end pointers into the records. Another way, which takes up a bit
> more space but could be more convenient for using the data, would be to include
>
> int whichstation(record);
> whichstation:coordinate_index="station";
>
> where the presence of the coordinate_index attribute indicates that the value
> of whichstation is an index into the station coordinate dimension. whichstation
> could be identified as an auxiliary coordinate variable by naming it in the
> coordinates attribute:
>
> float humidity(record);
> humidity:coordinates="time whichstation";
>
> E.g. if you have two timeseries, one with temperature data (1.1, 1.2, 1.3) and
> the other with data (2.1, 2.2), you would have:
>
> data:
> temperature=1.1, 1.2, 1.3, 2.1, 2.2;
> whichstation=0, 0, 0, 1, 1;
>
> If it is done this way, rather than with start pointers, the individual
> timeseries actually do not have to be stored contiguously, so any of them can
> be appended to at any time. That might be a useful feature.

Yes, I think it's a good alternative to have each record refer to its owning station, so that the links do not have to be maintained. The parent/child linked (and contiguous array) variant is useful for finding the data quickly; otherwise you have to read through all of the data to find the records for one station (or a small subset of stations).
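
To make the trade-off concrete, here is a minimal Python sketch using the five-record example from Jonathan's message. The `row_start` array and both function names are assumptions made for illustration; they are not part of CF or of our Convention.

```python
# Data from the quoted example: two time series sharing one record dimension.
temperature = [1.1, 1.2, 1.3, 2.1, 2.2]
whichstation = [0, 0, 0, 1, 1]

def records_by_index(values, owner, station):
    """Per-record station index: scan every record and keep the matches."""
    return [v for v, s in zip(values, owner) if s == station]

# Hypothetical start pointers for the contiguous variant: station i owns
# records row_start[i] up to (but not including) row_start[i + 1].
row_start = [0, 3, 5]

def records_by_pointer(values, starts, station):
    """Contiguous layout: a single slice, no scan of other stations' records."""
    return values[starts[station]:starts[station + 1]]

print(records_by_index(temperature, whichstation, 1))
print(records_by_pointer(temperature, row_start, 1))
```

The index form still works if records for a station are appended later in any order; the pointer form relies on each station's records staying contiguous.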

The reference to the station could be either by index or by name; in our typical files of this type, a few bytes won't matter much.

>
> Your proposal appears to me to introduce several extra features which are
> redundant or duplicating other CF attributes. The _CoordinateAxisType attr
> has the same function as the CF axis attribute. I don't see the need for the
> global attributes latitude_coordinate etc. since the lat etc. coordinates can
> be identified by units and by standard_name; also, having a *global* attr
> restricts the file to having only *one* coord variable of each type. The
> attributes giving the max and min of each of the coordinates contain info
> which can be deduced from the coord variables themselves, of course; is that
> an important kind of discovery metadata? I'd be worried about it because it
> is almost certain to be wrong some of the time i.e. inconsistent with the
> coord variables. The cdm_datatype attribute implies a distinction between
> various kinds of data which are formally not really different and would be
> processed in the same way, so I don't see why this is useful.

The Convention wasn't intended to be a proposal for CF, just a stand-alone Convention for this type of data, so we made it rather broad to cover several existing data formats. So there is likely to be some redundancy, and I guess the next step is to decide which parts should be added to CF.

The _CoordinateAxisType enumeration is intended to be a complete listing of georeferencing axis types. We use them instead of parsing the units, looking for "positive", looking for standard names, and the other ways of identifying coordinate axes that have evolved out of COARDS/CF. They are certainly redundant with all of that.

The min/max values are a kind of discovery metadata. We also use them to tell the user which space/time queries are valid for this dataset. Again, this is an optimization for reading/serving data that avoids having to read through the entire file.
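
As a rough Python sketch of both sides of this point: stored ranges let a server reject a query without reading the data, and a writer can guard against the inconsistency Jonathan mentions by checking the attributes against the coordinate values. The attribute names `time_min` and `time_max`, the function names, and the sample values are hypothetical, chosen only for illustration.

```python
def coordinate_range(values):
    """Return the (min, max) actually present in a coordinate variable."""
    return min(values), max(values)

def range_is_consistent(attrs, name, values):
    """True if the stored min/max attributes agree with the coordinate values."""
    lo, hi = coordinate_range(values)
    return attrs.get(name + "_min") == lo and attrs.get(name + "_max") == hi

def query_may_match(attrs, name, query_lo, query_hi):
    """Decide from the attributes alone whether a range query can overlap the data."""
    return query_lo <= attrs[name + "_max"] and query_hi >= attrs[name + "_min"]

# Hypothetical sample: four observation times and the attributes written for them.
time_values = [0.0, 6.0, 12.0, 18.0]
attrs = {"time_min": 0.0, "time_max": 18.0}

print(range_is_consistent(attrs, "time", time_values))
print(query_may_match(attrs, "time", 10.0, 20.0))
```

Running the consistency check whenever records are appended is one way to keep the attributes from going stale.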

The cdm_datatype reflects our experience in how to describe kinds of data ("scientific data types"). This has been a long and ongoing evolution of our understanding. For example, the coordinate system for a "time series of point data" looks just like "trajectory" data, so we use the cdm_datatype to disambiguate. It essentially describes the connectivity of the points. It's needed by visualizers, and useful for discovery.

Our "Observation Convention" introduces the notion of grouping variables into "Structures" by specifying that all variables with a common outer dimension are part of the structure. This works especially well for the record dimension, where the variables really are a Structure (that is, all record variables are stored contiguously for record 0, then record 1, etc). It's also useful for non-record dimensions, e.g. all variables whose outer dimension is "station" comprise the "Station Structure".
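
The grouping rule can be sketched in a few lines of Python. The variable table below is copied from the CDL in Jonathan's example; the dictionary representation and the function name are assumptions for illustration, not how any library stores this.

```python
from collections import defaultdict

# Variable name -> dimensions, outermost first (from the quoted CDL example).
variables = {
    "station_name": ("station", "stringlen"),
    "latitude": ("station",),
    "longitude": ("station",),
    "time": ("record",),
    "humidity": ("record",),
    "temperature": ("record",),
}

def group_by_outer_dimension(variables):
    """Collect variables that share an outer dimension into one 'Structure'."""
    structures = defaultdict(list)
    for name, dims in variables.items():
        if dims:  # scalar variables have no outer dimension and are skipped
            structures[dims[0]].append(name)
    return dict(structures)

structures = group_by_outer_dimension(variables)
for outer, members in structures.items():
    print(outer, members)
```

Here the "station" group corresponds to the Station Structure and the "record" group to the record Structure.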

Anyway, it would be great to get some other heads onto this, especially those who have written or need to write this kind of point observation data. If we can get 3 or 4 interested parties, we could put together a real proposal for CF.

Thanks again, Jonathan!

John
Received on Thu Aug 09 2007 - 11:11:06 BST

