
[CF-metadata] high sample rate (seismic) data conventions

From: Seth McGinnis <mcginnis>
Date: Mon, 10 Apr 2017 11:54:18 -0600

Hi Jonathan,

Oh, climate model outputs are also supposed to have a uniform sample
rate for the whole time series -- emphasis on *SUPPOSED TO*. To my
dismay, I have encountered multiple cases where something went wrong
with the generation of the data files, resulting in missing or repeated
or weirdly-spaced timesteps, and sorting out the resulting problems is
how I came to appreciate the value of the explicit coordinate...

As far as I know, you are correct: CF has no standardized way to
represent a coordinate purely as a formula (start time plus sampling
rate) rather than as an explicit coordinate variable.

However, that doesn't mean you couldn't do it and still have the file be
CF-compliant. As far as I am aware (and somebody correct me if I'm
wrong), coordinate variables are not actually mandatory.

So if, for reasons of feasibility, you found it necessary to do
something like the following, I believe that strictly speaking it would
be not just allowed but fully CF-compliant:

dimensions:
  time = UNLIMITED; // (1892160000 currently)
variables:
  double acceleration(time);
    acceleration:long_name = "ground acceleration";
    acceleration:units = "m s-2";
    acceleration:start_time = "2017-01-01 00:00:00.01667";
    acceleration:sampling_rate = "60 Hz";
data:
    acceleration = 1.324145e-6, ...
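
If it helps, a downstream reader can always rebuild an explicit time
coordinate from those two attributes whenever one is needed. Here is a
rough sketch of what that might look like in Python; the file name, the
netCDF4/numpy calls, and the parsing of those ad-hoc attributes are just
my illustration, nothing standardized by CF:

# Sketch: rebuild an explicit time axis from start_time + sampling_rate.
# Attribute names/formats follow the ad-hoc CDL above; not a CF convention.
from datetime import datetime, timedelta
import numpy as np
from netCDF4 import Dataset

with Dataset("acceleration.nc") as nc:            # hypothetical file name
    acc = nc.variables["acceleration"]
    t0 = datetime.strptime(acc.start_time, "%Y-%m-%d %H:%M:%S.%f")
    rate = float(acc.sampling_rate.split()[0])    # "60 Hz" -> 60.0
    offsets = np.arange(acc.shape[0]) / rate      # seconds since start_time
    t_last = t0 + timedelta(seconds=float(offsets[-1]))  # time of last sample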


I actually have some files without any coordinate variables sitting
around from the intermediate stage of some processing I did; I checked
one with Rosalyn Hatcher's cf-checker, and it didn't complain, so I
think it is technically legal. It's kind of a letter-of-the-law rather
than spirit-of-the-law thing, but it's at least theoretically compliant.
Up to you whether that would count as suitable for your use case.
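
If you want to try the same check on a file like the one above, something
along these lines should work. Again, just a sketch: the file name is made
up, and I'm assuming the checker's command-line entry point is called
cfchecks, so adjust for however you have it installed:

# Sketch: write a coordinate-free file matching the CDL above, then run
# the CF checker over it. The file name and the "cfchecks" entry point
# are my assumptions for illustration only.
import subprocess
import numpy as np
from netCDF4 import Dataset

with Dataset("acceleration.nc", "w") as nc:
    nc.createDimension("time", None)              # UNLIMITED
    acc = nc.createVariable("acceleration", "f8", ("time",))
    acc.long_name = "ground acceleration"
    acc.units = "m s-2"
    acc.start_time = "2017-01-01 00:00:00.01667"
    acc.sampling_rate = "60 Hz"
    samples = 1e-6 * np.random.randn(3600)        # ~1 minute of dummy data
    acc[0:samples.size] = samples                 # grows the unlimited dim

subprocess.run(["cfchecks", "acceleration.nc"])   # assumed CLI name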

Cheers,

--Seth



On 4/10/17 10:54 AM, Maccarthy, Jonathan K wrote:
> Hi Seth,
>
> Thanks for the very helpful response. I can understand the argument for
> explicit coordinates, as opposed to using formulae; I think it solves
> several problems. The assumption of a uniform sample rate for the
> length of a continuous time series is deeply engrained in most seismic
> software, however. Changing that assumption may lead to other problems
> (but maybe not!). Data volumes for a single channel can be 40-100
> 4-byte samples per second, which is something like 5-12 GB per channel
> per year uncompressed. Commonly, dozens of channels are used at once,
> though some of them may share time coordinates. It sounds like this
> use-case is similar in volume to what you've used, and may be worth
> trying out.
>
> Just to be clear, however, would I be correct in saying that CF has no
> accepted way of representing the data as I've described?
>
> Thanks again,
> Jonathan
>
>> On Apr 7, 2017, at 4:43 PM, Seth McGinnis <mcginnis at ucar.edu> wrote:
>>
>> Hi Jonathan,
>>
>> I would interpret the CF stance as being that the value in having
>> explicit coordinate variables and other ancillary data to accompany the
>> data outweighs the cost of increased storage.
>>
>> There are some cases where CF bends away from that for the sake of
>> practicality (see, e.g., the discussion about external file references
>> for cell_bounds in CMIP5), but overall, my sense is that the community
>> feels that it's better to have things explicitly written out in the file
>> than it is to provide them implicitly via a formula to calculate them.
>>
>> Based on my personal experiences, I think this is the right approach.
>> (In fact, I take it even further: I prefer to avoid data compression
>> entirely and to keep like data with like as much as possible, rather
>> than splitting big files into smaller pieces.)
>>
>> I have endured far, far more suffering and toil from (a) trying to
>> figure out what's wrong with a file that violates some implicit
>> assumption (like "there are never gaps in the time coordinate") and (b)
>> dealing with the complications of various tactics for keeping file sizes
>> small than I ever have from storing and working with very large files.
>>
>> YMMV, of course. What are your data volumes like? I'm working at the
>> terabyte scale, and as long as my file sizes stay under a few dozen GB,
>> I don't really even bother thinking about anything that affects the file
>> size by less than an order of magnitude.
>>
>> Cheers,
>>
>> Seth McGinnis
>>
>> ----
>> NARCCAP / NA-CORDEX Data Manager
>> RISC - IMAGe - CISL - NCAR
>> ----
>>
>>
>> On 4/7/17 9:55 AM, Maccarthy, Jonathan K wrote:
>>> Hi all,
>>>
>>> I'm curious about the suitability of CF metadata conventions for
>>> seismic sensor data. I've done a bit of searching, but can't find
>>> any mention of how CF conventions would store high sample-rate
>>> sensor data. I do see descriptions of time series conventions, where
>>> hourly or daily sensor data samples are stored along with their
>>> timestamps, but storing individual timestamps for each sample of a
>>> high sample rate sensor would unnecessarily double the storage.
>>> Seismic formats typically don't store time vectors, but instead just
>>> store vectors of samples with an associated start time and sampling
>>> rate.
>>>
>>> Could someone please point me towards a discussion or existing
>>> conventions on this topic? Any help or suggestion is appreciated.
>>>
>>> Best,
>>> Jon
>>>
>