
[CF-metadata] point observation data in CF 1.4

From: Wright, Bruce <bruce.wright>
Date: Tue, 2 Nov 2010 13:09:31 -0000

All,

Sorry for a late follow-up (and once again breaking the thread), but
below is some feedback from our guys running the particle trajectory
models at the Met Office, which I think highlights the difficulties of
storing particle trajectories efficiently.
----
It's good to see this being thought about. Here are some immediate
comments.
In a long (multi-year) air quality or risk assessment run, the total
number of particles followed could be a thousand times the maximum
number existing at any one time (and this factor could change if one
decides to do a longer run than originally intended). That suggests that
padding out arrays to the total number of particles is not a sensible
option. On the other hand, a particle's storage space could be reused at
later times once the particle has vanished, provided it was clearly
identified as a different particle. This is similar to the way particles
are stored within our dispersion model. But it's a bit inelegant as the
structure has redundant features which don't correspond to reality (in
that it links particles arbitrarily according to whether they reuse the
same space).
An alternative is, at each time, to store the particle data and for this
to include a particle id, without attempting to link particles at
different times. However, retrieving a trajectory is then difficult, as
one will have to search each time slice for the required particle id.
Storing start and end times for each particle id would help, but
retrieving a complete trajectory would still be inefficient. One can
think of ways round this: in a computer language one would have an array
for each particle id giving the indices in each time slice corresponding
to the particle (these arrays could be offset relative to the particle
start time so they would not have to be very long), and then an array of
such structures, one for each particle id. Can NetCDF do that?
To make things more difficult, it might also be useful to store
trajectories with different length time-steps for different
trajectories. I don't think this is important for any applications we
have in mind at present, but it would be nice to know whether it could
be done.
For very long runs, one would probably not want to be forced to store
everything in one very large file.
I think it would be acceptable to have more than one format for storing
data with different methods being efficient for different retrieval
types, together with (slow) utilities for converting between these
formats. Indeed that might be preferable if it enables things to be kept
simple conceptually.
----
Regards, 
Bruce 
-- 
Bruce Wright  Expert Strategic Advisor
Met Office FitzRoy Road Exeter EX1 3PB United Kingdom 
Tel: +44 (0)1392 886481 Fax: +44 (0)1392 885681
E-mail: bruce.wright at metoffice.gov.uk http://www.metoffice.gov.uk
-----
Hi Steve
> That's the narrow technical perspective.  The "social" question is
> whether there is a significant benefit to the community from defining
> a single approach that is general enough to handle the necessary
> range of model-generated tracer particle clouds.   How commonly does
> one want to share trajectory files between models?  
How common is it to want to share software for handling trajectory 
output? In my experience, common enough to justify getting this right.
Bryan
> Is it worth the
> effort to reach agreement on a new convention?   I've cc'ed Al
> Hermann here, who comes to mind as another modeler who might be
> interested in discussing this question.
> 
>      - Steve
> 
> ==========================================
> 
> On 10/14/2010 4:22 PM, Christopher Barker wrote:
> > Hi folks,
> > 
> > I just joined the list, so I apologize for breaking the threading,
> > and being out of the loop, but Rich Signell alerted me that you
> > were discussing a format for particle tracking models, and we'd
> > like to be involved.
> > 
> > A couple introductory comments:
> > 
> > We (NOAA Emergency Response Division) have a particle tracking
> > model (GNOME) we use for oil, chemical, and random-other-stuff
> > spills. For the most part, our output is in our own special
> > formats, and doesn't interact well with other tools -- we'd like
> > to change that, and we're going to netcdf for everything else, so
> > we want to use it for this, too.
> > 
> > We use netcdf from C/C++ and Python, so would rather not do
> > anything supported only in the Java libs. We're also on netcdf3 at
> > this point, though we can upgrade to netcdf4 if there is a
> > compelling reason.
> > 
> > Our model, at this point, keeps the number of particles constant
> > (though some may be flagged as "not released", "off map",
> > "evaporated", or what have you). As a result, the "natural" way that
> > we have stored the results (in netcdf and other formats) is in
> > (num_timesteps X num_particles) arrays.
> > 
> > We then have arrays for latitude, longitude, mass, various flags,
> > etc.
> > 
> > As discussed here, other folks' models change the number of
> > particles as time goes on, so such a simple block storage won't
> > work. Our case is a subset of the more general case, so it should
> > be easy to support the simple case anyway (and we may well add
> > variable particle numbers in the future as well).
> > 
> > When looking at the docs for PointObservationConventions, it didn't
> > seem to fit quite right. One key point is that for the most part,
> > we think of the collection of particles as an entity -- we are far
> > more likely to be interested in the whole collection at one point
> > in time, than the path of a particular particle over time. In
> > fact, it's rare that we care at all about the ID of a particular
> > particle -- we simply want to know its properties at a given time
> > -- so it would be nice if the data storage could do that
> > efficiently.
> > 
> > We'd generally want time to be the unlimited dimension, as well, as
> > we tend to run the models and analysis forward in time, and might
> > well want to incrementally output the data.
> > 
> > It seems ragged arrays are called for, though I've never tried to
> > do that in netcdf, so I don't know what the issues are. Are ragged
> > arrays a netcdf4-only feature?
> > 
> > Of course, another option is to allocate the full amount of space
> > required to store the maximum number, and then mask off the invalid
> > ones. With compression, that may not be too bad a way to go.
> > 
> > A few specific comments:
> >> I now write the data with redundant time as
> >> a limited dimension, and records(time, latitude, longitude) and
> >> have mass (record), radius(record) etc.
> >> 
> > > Thanks anyway,
> > > Ute
> > 
> > Do you have an example output file that you could share?
> > 
> >> Clearly there is a need for another Point Convention type to
> >> handle the output from particle tracking models like this.
> > 
> > I think so too -- it really is a different use case.
> > 
> >> 2. I think trajectory is when you follow a set of "things", boats,
> >> a person. But at each time step they are identical, maybe not the
> >> same number because of missing data. I could assume that I have a
> >> trajectory but actually I can't be sure if my particles are the
> >> same as before. Therefore I chose not to take that convention.
> > 
> > hmm -- it sounds like this is similar to what I was talking about
> > above -- the collection of particles at a given time is what's
> > important -- not the path of any given particle.
> > 
> > As a note, we've been working some with CDOG (the deepwater blowout
> > model), and it doesn't keep track of which particles are which as
> > they are added and removed, either -- so it's a pretty common use
> > case.
> > 
> >> There may
> >> be thousands or tens of thousands of particles, so it's not
> >> feasible to write each trajectory into a separate file.
> > 
> > nor does that fit the natural data model -- one file per timestep
> > would make more sense.
> > 
> >> We want a featureType that will allow us to write the entire
> >> collection of particles at each time step into a single file, and
> >> that will allow us to extract all the particles at a single time
> >> step, as well as extract individual particle trajectories by
> >> their ID.
> > 
> > well said.
> > 
> >> whereas as you describe it the time coord is common to all
> >> trajectories
> > 
> > yup.
> > 
> >> To arrange this, an indirection could be used on the time
> >> dimension:
> >>   data(i,o)     x(i,o) y(i,o) z(i,o) t(tindex(i,o))
> >> 
> >> where i is the instance (which of the trajectories), o is the
> >> point along that
> >> trajectory, t is the coordinate vector of common times, and tindex
> >> is an index
> >> to t. For example, we might have these two trajectories (x,t)
> >> (omitting y and
> >> z for simplicity)
> >> 
> >>   (0,10) (1,11) (2,12)
> >>   
> >>          (3,11) (2,12) (1,13) (0,14)
> >> 
> >> Then t would be [10,11,12,13,14] (all the times). For the first
> >> trajectory
> >> 
> >>   x=[0,1,2] tindex=[0,1,2]
> >> 
> >> and for the second
> >> 
> >>   x=[3,2,1,0] tindex=[1,2,3,4]
> >> 
> >> Is that right? Perhaps/probably there's a neater or more natural
> >> way to do it.
> > 
> > I'm having trouble following that -- but yes, it does not seem
> > natural.
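[Editorial note: the tindex indirection quoted above can be sketched in a few lines of numpy, using exactly the values from the example. The shared time coordinate plus per-trajectory index arrays reconstruct each trajectory, but a synoptic slice requires searching every trajectory's tindex:]

```python
import numpy as np

t = np.array([10, 11, 12, 13, 14])           # common time coordinate t(o)

# per-trajectory x values and indices into t, from the example
x1 = np.array([0, 1, 2]); tindex1 = np.array([0, 1, 2])
x2 = np.array([3, 2, 1, 0]); tindex2 = np.array([1, 2, 3, 4])

# recovering a trajectory is a simple indexed lookup
traj1 = list(zip(x1, t[tindex1]))            # (x, t) pairs for trajectory 1
traj2 = list(zip(x2, t[tindex2]))

# a synoptic slice at t = 12 means scanning each trajectory's tindex
i12 = int(np.where(t == 12)[0][0])           # position of time 12 in t
at_12 = [int(x[tindex == i12][0])
         for x, tindex in ((x1, tindex1), (x2, tindex2))]
```

So the scheme is compact, but synoptic access is O(number of trajectories) per time step, which is the asymmetry the thread is wrestling with.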
> > 
> >> The synchronization of coordinates opens a potential door to great
> >> simplicity of representation.
> >> 
> >>     metadata(i)    data(i,o)     x(i,o) y(i,o) z(i,o) t(o)
> >> 
> >> where i is the instance (which of the trajectories), o is simply
> >> the time index.  The possible costs are proliferation in numbers
> >> of ways to represent similar things and file size.   The question
> >> that I'd be inclined to ask of Ute and Rich would be a judgment
> >> call on the cost in file size that would result from filling
> >> missing values at the start/end of each individual trajectory.
> > 
> > If you're going to do that, you could just store it as one big
> > rectangular array, with missing values marked (see above). Which
> > sure would be easy but costly in storage space.
> > 
> > Another problem -- you may not know the maximum number of particles
> > at time zero -- so you can't know how much space to allocate when
> > you start writing the file -- that may kill that approach.
> > 
> >>  Optionally the metadata could include
> >>  
> >>     tstart_index(i)   tend_index(i)
> >> 
> >> This representation seems _the simplest from the standpoint of
> >> application code (reading)_.  Synoptic views are simply projections
> >> at a fixed "o" index; the history of an individual trajectory is
> >> simply a projection at a fixed "i" index.
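[Editorial note: in the padded representation just quoted, both retrieval patterns really are plain array slices. A tiny numpy illustration with invented values, using NaN as the fill:]

```python
import numpy as np

# data(i, o): 2 trajectories (i) padded over 5 common times (o)
fill = np.nan
data = np.array([[0.0, 1.0, 2.0, fill, fill],    # trajectory 0: times 10-12
                 [fill, 3.0, 2.0, 1.0, 0.0]])    # trajectory 1: times 11-14
t = np.array([10.0, 11.0, 12.0, 13.0, 14.0])     # t(o), shared by all trajectories

synoptic = data[:, 2]            # every trajectory at t = 12: fixed "o" index
history = data[1]                # full history of trajectory 1: fixed "i" index
valid = history[~np.isnan(history)]              # drop the padding
```

No index arrays, no searching; the cost is purely the padded storage.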
> > 
> > How do you do this with a ragged array? I'm missing something.
> > 
> >>  Does the
> >> 
> >> saving of space through not padding the trajectories justify the
> >> complexity?
> >> I don't know.
> > 
> > Do you have a choice if you don't know what your maximum number of
> > particles is at the start?
> > 
> > Thanks all for working on this.
> > 
> > -Chris
> 
> _______________________________________________
> CF-metadata mailing list
> CF-metadata at cgd.ucar.edu
> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence
Received on Tue Nov 02 2010 - 07:09:31 GMT

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:41 BST
