Here are some thoughts in response to a number of postings:

In response to Chris Barker (2/Nov): We certainly vary the time step between
particles in our models, but usually as a sub-step. I agree there is no strong
need to be able to store the sub-steps. Trajectories are quite a common need
for us, but usually for a small subset of particles (several hundred at any
one time). These could be output as a separate stream, and the small files
would then make trajectory reconstruction cheap. Also, the "already proposed
standard for trajectories" mentioned by Chris on 4/Nov would presumably be an
option for a small subset of particles (I don't know anything about this,
though).

In response to John Caron (4/Nov): "Those" are indeed reasonable assumptions.
Navg is generally similar to Nmax in our runs (the exception being during
"spin-up times" at the start of runs). In some runs Navg can increase roughly
linearly with time, so Navg ~ Nmax/2. These are runs dominated by spin-up with
just a short period after spin-up. Data per particle is typically in the range
10-300 bytes, depending on what is stored (e.g. we may have masses for ~100
chemical species, plus met data at the particle location). Data might include
the release time and travel time of the particle, which might be best stored
in a "time variable".

In response to John Caron (4/Nov, later post): 20K particles is very small for
us. 4M is not uncommon, and this is likely to increase in the future. However,
as mentioned above, the need for trajectory output (or indeed, usually, the
need for any particle output as opposed to gridded concentrations) is
restricted to a small sample of particles.

In response to Ben Hetland (4/Nov): "if one wants to jump directly to particle
p at timestep t, then a simple prop(index(t)+p) should do the lookup" assumes
that all particles are stored at each time step. In practice, for a long run
with continuously emitting sources, particles are created and destroyed.

Dave

-----Original Message-----
From: cf-metadata-bounces at cgd.ucar.edu [mailto:cf-metadata-bounces at cgd.ucar.edu] On Behalf Of Ben Hetland
Sent: 04 November 2010 19:42
To: cf-metadata at cgd.ucar.edu
Subject: Re: [CF-metadata] point observation data in CF 1.4

On 04.11.2010 16:07, John Caron wrote:
> On 11/4/2010 5:50 AM, Ben Hetland wrote:
[...]
>> In a recent case I had, the data were to be transferred to a client
>> computer over the Internet for viewing locally. In that case reducing
>> the content of the file to the absolute minimum set of properties
>> (that the client needed in order to visualize) became paramount. Even
>> a fast Internet connection does have bandwidth limitations... :-)
>
> I'm thinking more of the file as it's written by the model.

Our model currently uses a proprietary format to hold all this information,
but we are investigating changing that to use netCDF instead, in an attempt to
be a little more "standard compliant" and to facilitate exchange with other
software. In the indicated "Internet use case" we actually extract data from
that native file into a netCDF file, and the netCDF file is then used as the
"exchange format". We still found the current CF 1.4 convention too limiting
for what we would ideally like to represent, and the proposed 1.6 wasn't quite
up to it either.

> but it seems like an interesting use case is to then be able to
> transform it automatically to some other optimized layout.

Yes! However, it is just as interesting for us if we could deal directly with
only that single file containing "everything".
We wouldn't know the "optimized layout" anyway before the end user made a
choice of what and how to view something, so at that point incurring a delay
(for transforming/optimizing) is somewhat less interesting.

>> Yes, often sequentially by time as well.
>
> sequential is very fast.

I'm not familiar with the internals of the netCDF storage, but doesn't this
assume that the values are also _stored_ sequentially by time?

> Well, it's impossible to optimize for both "get me all data for this
> time step" and "get me all data for this trajectory" unless you write
> the data twice.

Why is that impossible? Isn't such a thing normally achieved by suitable file
organization and indexing? It's the retrieval time for the data query that's
critical, not how the data are organized sequentially (or internally, for that
matter).

> So I'm hearing we should do the former, and make the trajectory
> visualization as fast as possible.

Acceptable to me, if there has to be a trade-off.

> without knowing Nmax, you couldn't use netCDF-3 multidimensional arrays,
> but you could use the new "ragged array" conventions.

Yes, I already discovered that limitation. Is that a restriction associated
with netCDF itself, or is it related to the CF conventions only?

> because the grouping is reversed (first time, then station dimension)
> from the current trajectory feature type (first trajectory, then
> time), I need to think how to indicate that so that the software knows
> how to optimize it.

I have one idea of how to do it, but it may conflict with some convention as
it currently stands, or with basic netCDF reading optimization. Anyway, here
goes...

Suppose we consider representing a time series of particle clouds which is
highly variable. I here represent a single particle property by the imaginary
variable 'prop'. If we have several properties (very likely; lat, lon, depth,
mass, just to indicate a few), then we would simply have several such 'prop'
arrays with identical dimensions.

dimensions:
    time = UNLIMITED;
    record = UNLIMITED;   // yes, ideally!
variables:
    double time(time);    // the usual one
    int index(time);
    float prop(record);

In this case, the array 'index' simply holds the starting record index of a
given timestep. The particles are simply stored consecutively from that index
until 'index(time+1) - 1'. The number of particles at timestep 'i' can be
calculated as index(i+1) - index(i). (Optionally the 'index' could be
two-dimensional, holding both a start and a count value.)

Example:
- 4 time steps:
    at time 2.0, 3 particles (values 'a')
    at time 2.5, 0 particles (values 'b')
    at time 3.0, 4 particles (values 'c')
    at time 3.5, 1 particle  (values 'd')
- 8 particles total.

Thus:
    time = 4
    record = 8
then:
    time  = { 2.0, 2.5, 3.0, 3.5 }
    index = { 0, 3, 3, 7 }
    prop  = { a1, a2, a3, c1, c2, c3, c4, d1 }

Straightforward to read sequentially (by time), I should say, but if one wants
to jump directly to particle p at timestep t, then a simple prop(index(t)+p)
should do the lookup.
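A minimal reader sketch of that lookup, assuming Python with the netCDF4
library; the file name 'particles.nc' is made up, and the 'index'/'prop'
variable names just follow the imaginary layout above:

from netCDF4 import Dataset

ds = Dataset("particles.nc")            # hypothetical file name
index = ds.variables["index"][:]        # start record of each timestep
nrec = len(ds.dimensions["record"])     # total number of particle records

def step_range(t):
    """Record range [start, stop) holding the particles of timestep t."""
    start = int(index[t])
    stop = int(index[t + 1]) if t + 1 < len(index) else nrec
    return start, stop

t = 2                                   # e.g. the third output time
start, stop = step_range(t)
cloud = ds.variables["prop"][start:stop]   # all particles at this step
print("particles at step", t, "=", stop - start)   # = index(t+1) - index(t)

p = 1                                   # particle position within the step
one = ds.variables["prop"][start + p]   # the prop(index(t)+p) lookup

The only extra care needed is the last timestep, whose record count comes from
the current size of the 'record' dimension rather than from 'index'.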
> 20K particles x 100 bytes/particle = 2M / time step. So every access
> to a time step would cost you a disk seek = 10 ms.

How do you know this would be a single disk seek if these 100 bytes are stored
in, say, 25 different float variables? There would also be variables storing
values pertaining to the time step as a whole, not to any particular particle
(e.g. a bounding rectangle for the entire particle cloud, or things like
"total mass"). (This also assumes the file was stored contiguously on the disk
itself.)

> So 1000 time steps = 10 seconds to retrieve a trajectory. Fast enough?

Not if there are hundreds of trajectories to process, I guess... ;-)

-- 
Regards,
-+- Ben Hetland <ben.a.hetland at sintef.no> -+-
Opinions expressed are my own, not necessarily those of my employer.
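Following up on Dave's point at the top of the thread, that prop(index(t)+p)
assumes every particle is present at every step: one possible way to rebuild a
trajectory when particles are created and destroyed is to carry a persistent
particle identifier per record. The sketch below again assumes Python with the
netCDF4 library; the 'id' variable and the file name are hypothetical
additions for illustration only, not part of the layout above or of any CF
convention.

import numpy as np
from netCDF4 import Dataset

ds = Dataset("particles.nc")            # hypothetical file name, as above
index = ds.variables["index"][:]        # start record of each timestep
nrec = len(ds.dimensions["record"])
ids = ds.variables["id"][:]             # assumed persistent particle identifier
prop = ds.variables["prop"]

def trajectory(particle_id):
    """Collect (timestep, value) pairs for one particle, which may be
    absent from some steps because particles are created and destroyed."""
    track = []
    for t in range(len(index)):
        start = int(index[t])
        stop = int(index[t + 1]) if t + 1 < len(index) else nrec
        hit = np.where(ids[start:stop] == particle_id)[0]
        if hit.size:                    # particle alive at this step
            track.append((t, float(prop[start + int(hit[0])])))
    return track

print(trajectory(42))                   # e.g. the particle with id 42

This scans every timestep once per trajectory, so it favours the "all data for
this time step" layout, which matches the trade-off discussed above.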