
[CF-metadata] point observation data in CF 1.4

From: John Caron <caron>
Date: Thu, 04 Nov 2010 09:07:31 -0600

On 11/4/2010 5:50 AM, Ben Hetland wrote:
> Hello List,
>
> Since I'm also dealing a bit with data sets holding "particle
> collections" through time, I'd like to contribute some thoughts
> regarding this. Our primary use here at SINTEF is for inputs to and
> results from oil drift simulations, so we're dealing mostly at the sea
> surface and below, although it appears to me that similar data sets are
> just as applicable up in the atmosphere as well. (Particle collections
> and bounding polygons for the ash cloud from the Eyjafjallajökull
> eruption springs to mind as a fairly recent example.)
>
>
> On 04.11.2010 03:19, John Caron wrote:
>> 1) It seems clear that at each time step, you need to write out the
>> data for whatever particles currently exist.
>
> This is a very fair assessment in our case. One could generalize a bit
> more: We have data organized as a series of time steps (as the primary
> dimension). At each time step we have a number of "data" to store, and
> they are of various sizes and types. Particles are but one of these
> kinds of objects. Most of them can probably be treated similarly to
> particles, though, where a fixed set of properties describing each
> object can simply be represented by a separate netCDF variable per property.
>
> A nastier example could be to represent an oil slick's shape and
> position with a polygon. The number of vertices of that polygon would be
> highly variable through time. (This is a typical GIS-like representation.)
>
>
>> I assume that if you wanted to break up the data for a very
>> long run, you would partition by time, i.e. time steps 1-1000,
>> 1001-2000, etc. would be in separate files.
>
> How one decides to partition can, I think, depend a lot on the
> application. Sometimes splitting by data type can be more
> appropriate. In a recent case I had, the data were to be transferred to
> a client computer over the Internet for viewing locally. In that case
> reducing the content of the file to the absolute minimum set of
> properties (that the client needed in order to visualize) became
> paramount. Even a fast Internet connection does have bandwidth
> limitations... :-)

I'm thinking more of the file as it's written by the model.

But it seems like an interesting use case is to then be able to transform it automatically into some other optimized layout.

>
>
>> 2) Apparently the common read pattern is to retrieve the set of
>> particles at a given time step. If so, that makes things easier.
>
> Yes, often sequentially by time as well.

Sequential access is very fast.
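
For example, a minimal sketch of that read pattern, assuming the netCDF4-python library, a time-major layout, and made-up file and variable names:

from netCDF4 import Dataset

# hypothetical file with variables x(time, particle), y(time, particle)
with Dataset("particles.nc") as nc:
    t = 100                          # the time step of interest
    x = nc.variables["x"][t, :]      # all particle x positions at step t
    y = nc.variables["y"][t, :]
    # With time as the leading (record) dimension, each step's values sit
    # together on disk, so stepping through t in order stays fast.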

>
>
>> 3) I assume that you want to be able to figure out an individual
>> particle's trajectory, even if that doesnt need to be optimized
>> for speed.
>
> Not my primary need, but if an object is "tracked" like that it would
> not be unlikely that the trajectory might need to be accessed
> "interactively", eg. while a user is viewing a visualization of the data
> directly on screen. Does that count as "optimized for speed"?

Well, it's impossible to optimize for both "get me all data for this time step" and "get me all data for this trajectory" unless you write the data twice.

So I'm hearing we should do the former, and make the trajectory visualization as fast as possible. If you really needed to make that really fast, I really would write the data twice (really). That could be done in a post-processing step, so we don't have to let it complicate things too much right now.


>
>
>> 1) is the avg number (Navg) of particles that exist much smaller,
>> or approx the same as the max number (Nmax) that exist at one time
>> step?
>
> This varies a lot. Sometimes it is like you suggest, but sometimes maybe
> only a few. Sometimes there isn't any defined Nmax either (dynamic
> implementations), or such a limit can be difficult to know beforehand.
>
> Even where an Nmax is set, would it be unreasonable to require the
> _same_ value to be used every time if the netCDF dataset was accumulated
> through _multiple_ simulation runs?

Without knowing Nmax, you couldn't use netCDF-3 multidimensional arrays, but you could use the new "ragged array" conventions. Because the grouping is reversed (first time, then station dimension) from the current trajectory feature type (first trajectory, then time), I need to think about how to indicate that so that the software knows how to optimize it.
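
To make the idea concrete, here is a rough sketch (netCDF4-python; names like "obs" and "row_size" are only illustrative, not the finalized convention) of a time-grouped ragged layout: each step's particles are appended along an unlimited "obs" dimension, and a count variable row_size(time) records how many belong to each step.

import numpy as np
from netCDF4 import Dataset

N_STEPS = 1000                                  # assumed known for this run

with Dataset("ragged.nc", "w", format="NETCDF3_CLASSIC") as nc:
    nc.createDimension("obs", None)             # the one unlimited dimension:
                                                # all particle-observations, all steps
    nc.createDimension("time", N_STEPS)         # classic files allow only one
                                                # unlimited dimension, so time is fixed

    time_v = nc.createVariable("time", "f8", ("time",))
    time_v.units = "hours since 2010-11-04 00:00"

    row_size = nc.createVariable("row_size", "i4", ("time",))
    row_size.long_name = "number of particles in each time step"

    x = nc.createVariable("x", "f4", ("obs",))
    y = nc.createVariable("y", "f4", ("obs",))

    start = 0
    for t in range(N_STEPS):
        n = np.random.randint(1, 50)            # stand-in for the step's particle count
        x[start:start + n] = np.random.rand(n)
        y[start:start + n] = np.random.rand(n)
        time_v[t] = t
        row_size[t] = n
        start += n
    # A reader finds step t by summing row_size[0:t] to get the offset into
    # "obs", then reading row_size[t] values -- cheap if it steps through
    # time in order, expensive if it wants one particle across all steps.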

Also, we could explore netCDF-4, which has variable-length arrays, although these have to be read atomically, so it's not obvious what the performance would be for any given read scenario. Definitely worth trying.
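
A quick sketch of what that could look like with netCDF4-python (again, file and variable names are just illustrative):

import numpy as np
from netCDF4 import Dataset

with Dataset("vlen.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("time", None)
    vl_float = nc.createVLType(np.float32, "vl_float")   # ragged element type
    x = nc.createVariable("x", vl_float, ("time",))

    for t in range(10):
        n = np.random.randint(1, 2000)                    # particle count varies per step
        x[t] = np.random.rand(n).astype(np.float32)       # one atomic write per step

    # Each x[t] reads back as a 1-D array whose length is that step's particle
    # count, but a VLEN element cannot be sub-sliced on disk: the whole
    # element is read every time, which is the atomic-read caveat above.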

>
>
>> 2) How much data is associated with each particle at a given
>> time step (just an estimate needed here - 10 bytes? 1000 bytes?)
>
> In our case this varies a lot with type of particle, and how the
> simulation was set up. A quick assessment indicates that some are only
> 16 bytes per particle, while others may currently require up to 824
> bytes. (This does not account for shared info like the time itself,
> which we don't store per particle.) It also wouldn't be very atypical if
> this amount is then to be multiplied by say 20000 particles per time step.

20K particles x 100 bytes/particle = 2 MB per time step. So every access to a time step costs you a disk seek, roughly 10 ms, and pulling one trajectory out of 1000 time steps would take about 10 seconds. Fast enough?
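
The back-of-envelope numbers, spelled out (assuming ~10 ms per random seek):

particles_per_step = 20_000
bytes_per_particle = 100
step_bytes = particles_per_step * bytes_per_particle   # 2,000,000 ~= 2 MB per step
seek_seconds = 0.010
trajectory_time = 1000 * seek_seconds                  # one seek per time step visited
print(step_bytes, trajectory_time)                     # 2000000 10.0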

Otherwise, putting it on an SSD (solid state disk) is an interesting thing to try.

Not sure how long it would take to rewrite a 2 GB file to reverse the dimensions.
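
If we did want to try the "write it twice" post-processing step, it could be as simple as the following sketch (netCDF4-python, hypothetical file and variable names, and a plain (time, particle) 2-D layout assumed):

from netCDF4 import Dataset

with Dataset("by_time.nc") as src, \
     Dataset("by_trajectory.nc", "w", format="NETCDF3_CLASSIC") as dst:
    ntime = len(src.dimensions["time"])
    npart = len(src.dimensions["particle"])
    dst.createDimension("particle", npart)
    dst.createDimension("time", ntime)
    xt = dst.createVariable("x", "f4", ("particle", "time"))

    block = 100                                   # time steps per pass, to bound memory
    for t0 in range(0, ntime, block):
        t1 = min(t0 + block, ntime)
        chunk = src.variables["x"][t0:t1, :]      # contiguous read from the time-major file
        xt[:, t0:t1] = chunk.T                    # scattered write into the transposed layout

After that, "get me all data for this trajectory" becomes one contiguous slice from the second file.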

>
>
> Hope that provides some useful ideas of the real-life needs!
> :-)

Thanks!

