On 04.11.2010 16:07, John Caron wrote:
> On 11/4/2010 5:50 AM, Ben Hetland wrote:
[...]
>> In a recent case I had, the data were to be transferred to
>> a client computer over the Internet for viewing locally. In that case
>> reducing the content of the file to the absolute minimum set of
>> properties (that the client needed in order to visualize) became
>> paramount. Even a fast Internet connection does have bandwidth
>> limitations... :-)
>
> I'm thinking more of the file as it's written by the model.
Our model currently uses a proprietary format to hold all this
information, but we are investigating switching to netCDF instead, in
an attempt to be a bit more standards-compliant and to facilitate
exchange with other software.
In the indicated "Internet use case" we actually extract data from that
native file into a netCDF file, and the netCDF file is then used as the
"exchange format". We still found the current CF 1.4 convention too
limiting for what we would ideally like to represent, and the proposed
1.6 wasn't quite up to it either.
> but it seems like an interesting use case is to then be able to
> transform it automatically to some other optimized layout.
Yes!
However, it is just as interesting for us to be able to deal directly
with only that single file containing "everything". We wouldn't know the
"optimized layout" anyway before the end user had chosen what to view
and how, so incurring a delay at that point (for
transforming/optimizing) is somewhat less attractive.
>> Yes, often sequentially by time as well.
>
> sequential is very fast.
I'm not familiar with the internals of netCDF storage, but doesn't
this assume that the values are also _stored_ sequentially by time?
> Well, it's impossible to optimize for both "get me all data for
> this time step" and "get me all data for this trajectory" unless
> you write the data twice.
Why is that impossible?
Isn't such a thing normally achieved by suitable file organization and
indexing? It's the retrieval time for the data query that's critical,
not how the data are organized sequentially (or internally, for that
matter).
> So im hearing we should do the former, and make the trajectory
> visualization as fast as possible.
Acceptable to me, if there has to be a trade-off.
> Without knowing Nmax, you couldn't use netCDF-3 multidimensional arrays,
> but you could use the new "ragged array" conventions.
Yes, I already discovered that limitation. Is that a restriction of
netCDF itself, or is it related to the CF conventions only?
> because the grouping is reversed (first time, then station dimension)
> from the current trajectory feature type (first trajectory, then time),
> I need to think about how to indicate that so that the software knows how
> to optimize it.
I have one idea for how to do it, but it may conflict with some
convention as it currently stands, or with basic netCDF read
optimization. Anyway, here goes...
Suppose we want to represent a time series of particle clouds whose
size is highly variable. Here I represent a single particle property by
the imaginary variable 'prop'. If we have several properties (very
likely: lat, lon, depth, mass, to name a few), then we would simply
have several such 'prop' arrays with identical dimensions.
dimensions:
    time = UNLIMITED;
    record = UNLIMITED; // yes, ideally!
variables:
    double time(time);  // the usual one
    int index(time);    // starting record of each timestep
    float prop(record); // one value per particle, all timesteps concatenated
In this case, the array 'index' simply holds the starting record index
of a given timestep. The particles of that step are stored consecutively
from that index through 'index(time+1) - 1'.
The number of particles at timestep 'i' can then be calculated as:
    index(i+1) - index(i)
(For the last timestep, where there is no 'index(i+1)', the total size
of the 'record' dimension takes its place.)
(Optionally, 'index' could be 2-dimensional, holding both a start and a
count value.)
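As an aside, here is how that structure might be declared with the
netCDF4-python API; a minimal sketch, assuming the netCDF-4 file format
(the classic format allows only a single UNLIMITED dimension), and the
file name 'particles.nc' is of course just illustrative:

    from netCDF4 import Dataset

    # netCDF-4 format is needed to get two unlimited dimensions.
    ds = Dataset("particles.nc", "w", format="NETCDF4")
    ds.createDimension("time", None)    # UNLIMITED
    ds.createDimension("record", None)  # UNLIMITED
    ds.createVariable("time", "f8", ("time",))    # the usual coordinate
    ds.createVariable("index", "i4", ("time",))   # start record per timestep
    ds.createVariable("prop", "f4", ("record",))  # one array per property
    ds.close()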
Example:
- 4 time steps:
    at time 2.0, 3 particles (values 'a')
    at time 2.5, 0 particles (values 'b', so none are stored)
    at time 3.0, 4 particles (values 'c')
    at time 3.5, 1 particle  (values 'd')
- 8 particles in total.
Thus:
    time = 4
    record = 8
then:
    time  = { 2.0, 2.5, 3.0, 3.5 }
    index = { 0, 3, 3, 7 }
    prop  = { a1, a2, a3, c1, c2, c3, c4, d1 }
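To make the bookkeeping concrete, here is a sketch of how a model might
append exactly these four steps (again netCDF4-python; the helper
'append_timestep' and the numeric stand-ins for a1...d1 are my own
invention):

    from netCDF4 import Dataset

    def append_timestep(ds, t, values):
        # The current sizes of the two unlimited dimensions tell us
        # where the next timestep and its particle records go.
        i = len(ds.dimensions["time"])        # next timestep slot
        start = len(ds.dimensions["record"])  # first free record
        ds.variables["time"][i] = t
        ds.variables["index"][i] = start      # starting record of this step
        if values:
            ds.variables["prop"][start:start + len(values)] = values

    ds = Dataset("particles.nc", "a")  # file from the earlier sketch
    append_timestep(ds, 2.0, [0.1, 0.2, 0.3])       # a1, a2, a3
    append_timestep(ds, 2.5, [])                    # 0 particles: index repeats
    append_timestep(ds, 3.0, [0.4, 0.5, 0.6, 0.7])  # c1..c4
    append_timestep(ds, 3.5, [0.8])                 # d1
    ds.close()

Running this yields index = { 0, 3, 3, 7 } and the eight 'prop' values
shown above.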
Straightforward to read sequentially (by time), I should say, but if
one wants to jump directly to particle p at timestep t, then a simple
    prop(index(t) + p)
should do the lookup.
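In code that lookup might read as follows (same assumptions as in the
sketches above; 'particles_at' is a hypothetical helper, and the last
timestep falls back to the total record count as noted earlier):

    from netCDF4 import Dataset

    def particles_at(ds, t):
        # All 'prop' values of timestep t: records index(t)..index(t+1)-1,
        # bounded by the total size of 'record' for the last timestep.
        ntime = len(ds.dimensions["time"])
        start = int(ds.variables["index"][t])
        end = (int(ds.variables["index"][t + 1]) if t + 1 < ntime
               else len(ds.dimensions["record"]))
        return ds.variables["prop"][start:end]

    with Dataset("particles.nc") as ds:
        print(particles_at(ds, 2))  # c1..c4
        t, p = 2, 1
        print(ds.variables["prop"][int(ds.variables["index"][t]) + p])  # c2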
> 20K particles x 100 bytes/particle = 2M / time step. So every access to
> a time step would cost you a disk seek = 10 ms.
How do you know this would be a single disk seek if those 100 bytes are
stored in, say, 25 different float variables? There would also be
variables storing values that pertain to the time step as a whole rather
than to any particular particle (e.g. a bounding rectangle for the
entire particle cloud, or things like "total mass").
(This also assumes the file is stored contiguously on the disk itself.)
> So 1000 time steps = 10 seconds to retrieve a trajectory. Fast enough?
Not if there are hundreds of trajectories to process, I guess; at 10
seconds each, a few hundred trajectories already add up to the better
part of an hour... ;-)
--
Regards, -+- Ben Hetland <ben.a.hetland at sintef.no> -+-
Opinions expressed are my own, not necessarily those of my employer.