[CF-metadata] stations and trajectories from John Caron on 2005-06-24 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: John Caron <caron>
Date: Fri, 24 Jun 2005 11:01:18 -0600

Jonathan Gregory wrote:

>Dear John
>
>Thanks for this document.
>
>
>>http://my.unidata.ucar.edu/content/software/netcdf-java/formats/RecordsInNetcdf3.html
>>
>>
>
>In your classification, the model we are using in CF for stations and
>trajectories is (3), the multidimensional case, with data variables dimensioned
>(station,nobs) and the per-station metadata, dimensioned (station), being
>attached to the data variable through the coordinates attribute. As you say,
>this is efficient of storage when all stations have the same number of
>observations nobs, and efficient for retrieval if you want to get all the obs
>for a given time at once.
>
>However, it is inconvenient if all the stations have different nobs, as you
>need a different nobs dimension and data variable for each one. If there are
>thousands, as you comment, this is a mess. Then you would like to store them
>in a single variable which is indexed somehow. Your method (2), the linked
>list, is a way to do this. Magnus's database-like scheme is another, in which
>the data variable is dimensioned (record), and there is a station identifier
>variable dimensioned the same way. The timeseries/trajectory for a given
>station is then extracted using data_variable(where(station_id eq station)),
>in pseudo-IDL, or "select from data_variable where station_id eq station" in
>pseudo-SQL. This means all the station metadata is repeated for every obs of
>the station, which is inefficient of storage, unlike in your scheme where it
>is stored only once. But Magnus's scheme is conceptually simpler and that
>makes it appealing, I think.
>
>Your document is about netCDF-3 and I wonder if this is a case where we should
>consider defining a convention based on netCDF-4 (coming soon). If I have
>understood correctly, netCDF-4 permits multidimensional arrays to have ragged
>dimensions. Internally it is mapped onto a 1D array, like your linked list,
>but the complexity is hidden behind the interface. With this feature we could
>use the multidimensional model. Recall that for the case where all stations
>have the same number of obs, this looks like:
>
> float station_latitude(station);
> float station_longitude(station);
> double time(time);
> float data_variable(time,station);
> data_variable:coordinates="station_latitude station_longitude";
>
>With the case of different numbers of obs, we would have
>
> float station_latitude(station);
> float station_longitude(station);
> double time(record,station);
> float data_variable(record,station);
> data_variable:coordinates="station_latitude station_longitude time";
>
>in which record would be a "ragged" dimension, so it would have a different
>size for each station. The only formal difference between the two is that the
>time variable is now auxiliary and 2D, because each station has its own set of
>times. This is OK; an auxiliary coord var can have any of the dimensions
>of the data var. I think therefore that this scheme would be CF-compliant with
>no change to the standard, but we would have to point out that the time has
>to become auxiliary in this case.
>
>What do you think? Have I understood netCDF-4 correctly?
>
>Cheers
>
>Jonathan
>
>
Hi Jonathan:

Thanks for the insightful summary of that document and how it maps to
CF. Yes, I think you are right that with a "variable length" dimension,
you could use the NetCDF-4 data model to handle this case. However, I am
hesitant to declare victory until we have actually implemented it,
tested performance, etc. There may be some subtleties that we haven't
understood yet. For example, I suspect that a vlen dimension has to be
the innermost (rapidly varying) dimension. You also may have to wrap
things in a Structure to get the effect that you want.
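
To make the two single-variable layouts concrete, here is a rough sketch in plain Python (not the netCDF API). It shows the record-indexed scheme with a station identifier per record, and a ragged layout flattened onto a 1D array with per-station start offsets. All names and values below (record_station, to_ragged, ragged_row, the toy data) are made up for illustration, and how netCDF-4 actually stores variable-length data may well differ.

```python
# Hypothetical toy data: three stations with 2, 3 and 1 observations.
# Record-indexed layout: every variable is dimensioned (record).
record_station = [0, 1, 0, 1, 2, 1]
record_time = [0.0, 0.0, 1.0, 1.0, 1.5, 2.0]
record_data = [10.0, 20.0, 11.0, 21.0, 30.0, 22.0]


def select_station(station):
    """Return [(time, value), ...] for one station by scanning all records,
    in the spirit of data_variable(where(station_id eq station))."""
    rows = zip(record_station, record_time, record_data)
    return [(t, v) for s, t, v in rows if s == station]


def to_ragged(n_stations):
    """Group observations by station into one flat list, and return it with
    the per-station start offsets and observation counts."""
    counts = [0] * n_stations
    for s in record_station:
        counts[s] += 1
    starts, total = [], 0
    for c in counts:
        starts.append(total)
        total += c
    # sorted() is stable, so the original order is kept within each station.
    order = sorted(range(len(record_station)), key=lambda i: record_station[i])
    flat = [record_data[i] for i in order]
    return flat, starts, counts


def ragged_row(flat, starts, counts, station):
    """Return one station's observations from the flattened array."""
    begin = starts[station]
    return flat[begin:begin + counts[station]]


flat, starts, counts = to_ragged(3)
print(select_station(1))
print(ragged_row(flat, starts, counts, 1))
```

The record-indexed lookup scans every record for each request, while the ragged lookup is a single slice once the offsets are known. That is the kind of performance difference we would want to measure before settling on a convention.
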

The good news is that the netCDF-4 library is pretty much ready to start
testing, so we should be able to nail this down soon.

One thing I'm unclear about, if we leave the standard unchanged, is how
to identify the station variables. It's not clear to me how a program can
automatically see that this is a station dataset, and figure out what
the station dimension is, etc. What do you think?
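
To illustrate the difficulty, here is a rough sketch of the kind of guess a program would have to make. It uses a hypothetical in-memory description of the ragged example (plain dicts, no real netCDF library), and the function name candidate_station_dims is made up. This is a heuristic, not anything CF defines: it treats a dimension as a possible station dimension if a 1D variable listed in the coordinates attribute uses it and the data variable shares it.

```python
# Hypothetical description of the ragged example:
# variable name -> (dimension names, attributes).
variables = {
    "station_latitude": (("station",), {}),
    "station_longitude": (("station",), {}),
    "time": (("record", "station"), {}),
    "data_variable": (
        ("record", "station"),
        {"coordinates": "station_latitude station_longitude time"},
    ),
}


def candidate_station_dims(variables, data_name):
    """Guess possible station dimensions for a data variable: dimensions of
    1D auxiliary coordinate variables that the data variable also uses."""
    data_dims, attrs = variables[data_name]
    candidates = []
    for name in attrs.get("coordinates", "").split():
        aux_dims, _ = variables[name]
        if len(aux_dims) == 1 and aux_dims[0] in data_dims:
            if aux_dims[0] not in candidates:
                candidates.append(aux_dims[0])
    return candidates


print(candidate_station_dims(variables, "data_variable"))
```

The guess picks out "station" here, but nothing in the file says the dimension indexes stations rather than some other 1D auxiliary coordinate, which is the ambiguity I am worried about.
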

Ciao

John
Received on Fri Jun 24 2005 - 11:01:18 BST