Dear John
Thanks for this document.
> http://my.unidata.ucar.edu/content/software/netcdf-java/formats/RecordsInNetcdf3.html
In your classification, the model we are using in CF for stations and
trajectories is (3), the multidimensional case, with data variables dimensioned
(station,nobs) and the per-station metadata, dimensioned (station), being
attached to the data variable through the coordinates attribute. As you say,
this is efficient in storage when all stations have the same number of
observations nobs, and efficient for retrieval if you want to get all the obs
for a given time at once.
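As a rough illustration of the multidimensional case, here is a small Python sketch using plain lists. The station coordinates, times and values are invented for the example, and obs_at_time is a hypothetical helper, not part of any library. It assumes every station has the same nobs, so the data form a rectangular (station,nobs) array and the per-station metadata are stored once.

```python
# Per-station metadata, dimensioned (station); values are made up.
station_latitude = [51.5, 60.2, 40.7]
station_longitude = [-0.1, 24.9, -74.0]

# Shared observation times, valid only because all stations have the same nobs.
time = [0.0, 6.0, 12.0, 18.0]

# Data variable dimensioned (station, nobs): one row per station.
data_variable = [
    [280.1, 281.4, 283.0, 282.2],
    [270.5, 271.0, 272.3, 271.8],
    [290.0, 291.2, 293.5, 292.1],
]

def obs_at_time(t_index):
    """Return the obs from every station at one time index (hypothetical helper)."""
    return [row[t_index] for row in data_variable]

all_stations_at_noon = obs_at_time(2)
```

Getting all the obs for one time is a single slice across the station dimension, which is what makes this layout convenient when it applies.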
However, it is inconvenient if the stations have different numbers of obs, as
you then need a separate nobs dimension and data variable for each station. If
there are thousands of stations, as you comment, this is a mess. Then you would like to store them
in a single variable which is indexed somehow. Your method (2), the linked
list, is a way to do this. Magnus's database-like scheme is another, in which
the data variable is dimensioned (record), and there is a station identifier
variable dimensioned the same way. The timeseries/trajectory for a given
station is then extracted using data_variable(where(station_id eq station)),
in pseudo-IDL, or "select from data_variable where station_id eq station" in
pseudo-SQL. This means all the station metadata is repeated for every obs of
the station, which is inefficient in storage, unlike in your scheme where it
is stored only once. But Magnus's scheme is conceptually simpler and that
makes it appealing, I think.
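The selection in the database-like scheme can be sketched in Python as follows. This is only an illustration under assumed names and data: the station identifiers, times and values are invented, and select_station is a hypothetical helper standing in for data_variable(where(station_id eq station)).

```python
# Flat "record" layout: one entry per observation, all dimensioned (record).
# All values below are invented for illustration.
station_id = [10, 20, 10, 30, 20, 10]
time = [0.0, 0.0, 1.0, 0.5, 2.0, 3.0]
data_variable = [1.5, 2.5, 1.7, 9.0, 2.9, 1.1]

def select_station(station):
    """Return (times, values) for the records belonging to one station."""
    # Equivalent of where(station_id eq station): indices of matching records.
    idx = [i for i, s in enumerate(station_id) if s == station]
    return [time[i] for i in idx], [data_variable[i] for i in idx]

times_10, values_10 = select_station(10)
```

Each extraction scans the whole station_id variable, and any station metadata stored alongside it would be repeated once per record, which is the storage cost mentioned above.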
Your document is about netCDF-3 and I wonder if this is a case where we should
consider defining a convention based on netCDF-4 (coming soon). If I have
understood correctly, netCDF-4 permits multidimensional arrays to have ragged
dimensions. Internally it is mapped onto a 1D array, like your linked list,
but the complexity is hidden behind the interface. With this feature we could
use the multidimensional model. Recall that for the case where all stations
have the same number of obs, this looks like:
    float station_latitude(station);
    float station_longitude(station);
    double time(time);
    float data_variable(time,station);
      data_variable:coordinates = "station_latitude station_longitude";
With the case of different numbers of obs, we would have
    float station_latitude(station);
    float station_longitude(station);
    double time(record,station);
    float data_variable(record,station);
      data_variable:coordinates = "station_latitude station_longitude time";
in which record would be a "ragged" dimension, so it would have a different
size for each station. The only formal difference between the two is that the
time variable is now auxiliary and 2D, because each station has its own set of
times. This is OK; an auxiliary coord var can have any of the dimensions
of the data var. I think therefore that this scheme would be CF-compliant with
no change to the standard, but we would have to point out that the time has
to become auxiliary in this case.
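To make the idea of a ragged dimension mapped onto a 1D array concrete, here is a minimal Python sketch. It is not a description of how netCDF-4 actually stores anything; it only assumes a flat array plus a per-station count, and the names nobs, start, get and series are all hypothetical.

```python
from itertools import accumulate

# Number of obs for each station (the "ragged" record dimension); invented values.
nobs = [3, 1, 2]

# Offset of each station's first record in the flat 1D arrays: [0, 3, 4].
start = [0] + list(accumulate(nobs))[:-1]

# Flat storage: station 0's records, then station 1's, then station 2's.
flat_time = [0.0, 1.0, 3.0, 0.5, 0.0, 2.0]
flat_data = [1.5, 1.7, 1.1, 9.0, 2.5, 2.9]

def get(flat, station, record):
    """Look up element (record, station) as if the array were 2D but ragged."""
    if not 0 <= record < nobs[station]:
        raise IndexError("record index out of range for this station")
    return flat[start[station] + record]

def series(flat, station):
    """Return the whole timeseries for one station."""
    return flat[start[station]:start[station] + nobs[station]]

last_obs_station_2 = get(flat_data, 2, 1)
times_station_0 = series(flat_time, 0)
```

The caller indexes by (record,station) as in the multidimensional model, the time values are per-station as an auxiliary coordinate would be, and the offsets stay hidden inside the two helper functions.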
What do you think? Have I understood netCDF-4 correctly?
Cheers
Jonathan
Received on Fri Jun 24 2005 - 05:00:59 BST