
[CF-metadata] point observation data in CF 1.4

From: Wright, Bruce <bruce.wright>
Date: Thu, 11 Nov 2010 13:02:21 -0000

Dave Thomson from the Met Office Atmospheric Dispersion Research and
Response Team asked me to post the follow-up comments below on his
behalf.

Regards,
Bruce Wright

Expert Strategic Advisor, Met Office
E-mail: bruce.wright at metoffice.gov.uk
----
Here are some thoughts in response to a number of postings:

In response to Chris Barker (2/Nov): We certainly vary the time step
between particles in our models, but usually as a sub-step. I agree
there is no strong need to be able to store the sub-steps. Trajectories
are quite a common need for us, but usually for a small subset of
particles (several hundred at any one time). These could be output as a
separate stream, and the small files would then make trajectory
reconstruction cheap. Also, the "already proposed standard for
trajectories" mentioned by Chris on 4/Nov would presumably be an option
for a small subset of particles (I don't know anything about this
though).
In response to John Caron (4/Nov): "Those" are indeed reasonable
assumptions. Navg is generally similar to Nmax in our runs (the
exception being during "spin-up times" at the start of runs). In some
runs the particle count can increase roughly linearly with time, so
Navg ~ Nmax/2. These are runs dominated by spin-up with just a short
period after spin-up. Data per particle is typically in the range
10-300 bytes, depending on what is stored (e.g. we may have masses for
~100 chemical species, plus met data at the particle location). Data
might include the release time and travel time of the particle, which
might be best stored in a "time variable".
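As a quick numeric sanity check of the linear-growth case (a sketch
only, with hypothetical run sizes, not any actual model configuration):

```python
# Sketch: if the particle count ramps linearly from 0 to Nmax over a
# run, the time-averaged count Navg works out to Nmax/2.
# Nmax and steps are hypothetical, chosen so the ramp divides exactly.
Nmax = 4_000_000   # particles at the end of the run
steps = 1_000      # number of output time steps

counts = [Nmax * k // steps for k in range(steps + 1)]
Navg = sum(counts) / len(counts)
print(Navg, Nmax / 2)  # the two agree for a linear ramp
```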
In response to John Caron (4/Nov, later post): 20K particles is very
small for us. 4M is not uncommon, and this is likely to increase in the
future. However, as mentioned above, the need for trajectory output (or
indeed, usually, the need for any particle output as opposed to gridded
concentrations) is restricted to a small sample of particles.

In response to Ben Hetland (4/Nov): "if one wants to jump directly to
particle p at timestep t, then a simple prop(index(t)+p) should do the
lookup" assumes that all particles are stored at each time step. In
practice, for a long run with continuously emitting sources, particles
are created and destroyed.

Dave
---
-----Original Message-----
From: cf-metadata-bounces at cgd.ucar.edu
[mailto:cf-metadata-bounces at cgd.ucar.edu] On Behalf Of Ben Hetland
Sent: 04 November 2010 19:42
To: cf-metadata at cgd.ucar.edu
Subject: Re: [CF-metadata] point observation data in CF 1.4
On 04.11.2010 16:07, John Caron wrote:
> On 11/4/2010 5:50 AM, Ben Hetland wrote:
[...]
>> In a recent case I had, the data were to be transferred to a client 
>> computer over the Internet for viewing locally. In that case reducing
>> the content of the file to the absolute minimum set of properties 
>> (that the client needed in order to visualize) became paramount. Even
>> a fast Internet connection does have bandwidth limitations... :-)
> 
> I'm thinking more of the file as it's written by the model.
Our model currently uses a proprietary format to hold all this
information, but we are investigating changing that to netCDF instead,
in an attempt to be a little more "standards compliant" and to
facilitate exchange with other software.

In the indicated "Internet use case" we actually extract data from that
native file into a netCDF file, and the netCDF file is then used as the
"exchange format". We still found the current CF 1.4 convention too
limiting for what we would ideally like to represent, and the proposed
1.6 wasn't quite up to it either.
> but it seems like an interesting use case is to then be able to 
> transform it automatically to some other optimized layout.
Yes!
However, it is just as interesting for us if we could deal directly with
only that single file containing "everything". We wouldn't know the
"optimized layout" anyway before the end-user made a choice of what and
how to view something, so at that point incurring a delay (for
transforming/optimizing) is somewhat less interesting.
>> Yes, often sequentially by time as well.
> 
> sequential is very fast.
I'm not familiar with the internals of the netCDF storage, but doesn't
this assume that the values are also _stored_ sequentially by time?
> Well, it's impossible to optimize for both "get me all data for this 
> time step" and "get me all data for this trajectory" unless you write 
> the data twice.
Why is that impossible?
Isn't such a thing normally achieved by suitable file organization and
indexing? It's the retrieval time for the data query that is critical,
not how the data is organized sequentially (or internally, for that
matter).
> So I'm hearing we should do the former, and make the trajectory 
> visualization as fast as possible.
Acceptable to me, if there has to be a trade-off.
> without knowing Nmax, you couldn't use netCDF-3 multidimensional arrays, 
> but you could use the new "ragged array" conventions.
Yes, I already discovered that limitation. Is that a restriction of
netCDF itself, or is it related only to the CF conventions?
> because the grouping is reversed (first time, then station dimension) 
> from the current trajectory feature type (first trajectory, then 
> time), i need to think how to indicate that so that the software knows
> how to optimize it.
I have one idea of how to do it, but it may conflict with some
convention as it currently stands, or with basic netCDF read
optimization. Anyway, here goes...
Suppose we consider representing a time series of particle clouds whose
particle count is highly variable. Here I represent a single particle
property by the imaginary variable 'prop'. If we have several
properties (very likely: lat, lon, depth, mass, just to indicate a
few), then we would simply have several such 'prop' arrays with
identical dimensions.
dimensions:
	time = UNLIMITED;
	record = UNLIMITED; // yes, ideally!
variables:
	double time(time); // the usual one
	int index(time);
	float prop(record);
In this case, the array 'index' simply holds the starting record index
of a given time step. The particles of time step 't' are stored
consecutively from 'index(t)' up to 'index(t+1) - 1'.
The number of particles at time step 'i' can then be calculated as:
   index(i+1) - index(i)
(Optionally, 'index' could be 2-dimensional, holding both a start and a
count value.)
Example:
- 4 time steps:
	at time 2.0, 3 particles (values 'a')
	at time 2.5, 0 particles (no values)
	at time 3.0, 4 particles (values 'c')
	at time 3.5, 1 particle (values 'd')
- 8 particles total.
Thus:
	time = 4
	record = 8
then:
   time = { 2.0, 2.5, 3.0, 3.5 }
   index = { 0, 3, 3, 7 }
   prop = { a1, a2, a3, c1, c2, c3, c4, d1 }
Straightforward to read sequentially (by time), I should say, but if
one wants to jump directly to particle p at timestep t, then a simple
	prop(index(t)+p)
should do the lookup.
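To make the indexing concrete, here is the same example worked through
in plain Python (a sketch only; plain lists stand in for the netCDF
variables, and the names mirror the CDL above):

```python
# Sketch of the index-based lookup described above. 'index' holds the
# starting record of each time step; 'prop' holds all particle values
# packed consecutively.
time = [2.0, 2.5, 3.0, 3.5]
index = [0, 3, 3, 7]
prop = ["a1", "a2", "a3", "c1", "c2", "c3", "c4", "d1"]

def n_particles(t):
    """Particle count at time step t: index(t+1) - index(t)."""
    end = index[t + 1] if t + 1 < len(index) else len(prop)
    return end - index[t]

def lookup(t, p):
    """Particle p at time step t: prop(index(t) + p)."""
    return prop[index[t] + p]

print([n_particles(t) for t in range(len(time))])  # [3, 0, 4, 1]
print(lookup(2, 0))  # "c1", first particle at time 3.0
```

Note that the last time step needs the total record count in place of
'index(t+1)', which is why the sketch falls back to 'len(prop)'.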
> 20K particles x 100 bytes/particle = 2M / time step. So every access 
> to a time step would cost you a disk seek = 10 ms.
How do you know this would be a single disk seek if these 100 bytes are
stored in, say, 25 different float variables? There would also be
variables storing values pertaining to the time step as a whole, rather
than to any particular particle (e.g. a bounding rectangle for the
entire particle cloud, or things like "total mass").
(This also assumes the file is stored contiguously on the disk itself.)
> So 1000 time steps = 10 seconds to retrieve a trajectory. Fast enough?
Not if there are hundreds of trajectories to process, I guess... ;-)
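The back-of-envelope numbers being debated can be checked directly (a
sketch only; the 10 ms seek cost and per-particle size are the figures
quoted above, not measurements):

```python
# Back-of-envelope cost of trajectory retrieval under the layout
# discussed above: one disk seek per time step visited.
n_particles = 20_000      # particles per time step (quoted figure)
bytes_per_particle = 100  # quoted figure
seek_ms = 10              # assumed cost of one disk seek
n_steps = 1_000           # time steps in the run

step_bytes = n_particles * bytes_per_particle
total_s = n_steps * seek_ms / 1000
print(step_bytes, total_s)  # 2 MB per step, 10 s per trajectory
```

Scaling that to several hundred trajectories puts retrieval roughly
into the hour range, which is the concern raised here.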
--
Regards,  -+- Ben Hetland <ben.a.hetland at sintef.no> -+-
Opinions expressed are my own, not necessarily those of my employer.
_______________________________________________
CF-metadata mailing list
CF-metadata at cgd.ucar.edu
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
Received on Thu Nov 11 2010 - 06:02:21 GMT
