
[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata? (CZ)

From: Steve Hankin <steven.c.hankin>
Date: Thu, 26 Sep 2013 14:14:00 -0700

On 9/26/2013 8:40 AM, Jim Biard wrote:
> Steve,
>
> The expected users (this is a new effort) are "power users" that
> either wish to do diagnostic-type work or reprocess VIIRS data from
> scratch using their own algorithms (as opposed to the standard
> Suomi-NPP mission processing algorithms). The data will most likely
> be accessed via the netCDF-4 API in custom code in C, C++, Java,
> Python, IDL, Matlab, etc. You can also easily access the file
> contents using the HDF5 API, and display contents using HDFView or
> other general applications.
>
> The data is going to be archived (we are a couple of weeks away from
> start of production) at NOAA NCDC in netCDF-4 format. We will start
> with the current data stream from the VIIRS instrument, and will also
> back-fill with data from the beginning of the mission up until the
> beginning of production. The files will be accessible from the HDSS
> Access System (HAS) at NCDC, and delivered via ftp.
>

Hi Jim,

Thanks for sharing this scenario. An interesting one to think about.

You faced a tough (but common) choice -- weighing the benefits of
interoperability that would come through sticking to a standard (CF)
against the cleaner data structures you could create with unrestricted
use of a file API (netCDF4's groups). From the description it looks
like CF was do-able, but with what you felt was a big "yuk" factor.
Having power users as your target audience is an understandable factor
in tipping the scales.

There are downsides to your choice that it would be nice to mitigate.
So, a few words on "third way" approaches that might allow you to have
your power-user cake and eat it interoperably too.

The downsides:

  * loading the archive center up with yet another mission-specific file
    format. This places the interoperability responsibilities on the
    archive center ... (see the fur flying from the Open Archival
    Information System (OAIS) Reference Model folks, etc.)
  * your files are not readily accessible to communities outside of your
    power users. A programmer's outlook is needed to get information
    from the files.

Two possible ways to mitigate the down sides:

 1. distribute your data as CF files, and for your power users provide a
    utility that generates your netCDF4-grouped version of it;
 2. continue to create your netCDF4-grouped files as-is, but also fund
    the development of an IOSP that allows the file to be THREDDS-served
    and opened by the Java netCDF library as valid CF. (As I
    understand it, NCDC has used this approach very successfully on
    legacy satellite archives.)
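The core of the utility in option 1 is little more than a reversible
mapping between netCDF4 group paths and flat CF variable names. A
minimal sketch in plain Python (the "__" separator and the example
names are assumptions for illustration, not an actual VIIRS
convention):

```python
# Sketch: reversible mapping between netCDF-4 group paths and flat
# CF-style variable names.  The "__" separator and the example names
# are assumptions for illustration, not actual VIIRS conventions.

SEP = "__"  # separator standing in for the group hierarchy in flat names


def flatten(group_path: str, var_name: str) -> str:
    """Map a group path like '/imagery_750m_dual_gain' plus a variable
    name to a single flat CF-style variable name."""
    parts = [p for p in group_path.split("/") if p]
    return SEP.join(parts + [var_name])


def unflatten(flat_name: str) -> tuple[str, str]:
    """Invert flatten(): recover the group path and the variable name."""
    *groups, var_name = flat_name.split(SEP)
    return "/" + "/".join(groups), var_name


if __name__ == "__main__":
    flat = flatten("/imagery_750m_dual_gain", "raw_counts")
    print(flat)             # imagery_750m_dual_gain__raw_counts
    print(unflatten(flat))  # ('/imagery_750m_dual_gain', 'raw_counts')
```

The same table of name pairs drives both directions of the conversion,
which is what keeps the grouped and flat products in sync. (The sketch
assumes variable names themselves never contain the separator.)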

Taking these approaches implies more resources from you or your project,
of course. But with each choice available there is a price that will be
paid by someone ....

     - Steve

> Grace and peace,
>
> Jim
>
> *Jim Biard*
> *Research Scholar*
> Cooperative Institute for Climate and Satellites NC <http://cicsnc.org/>
> North Carolina State University <http://ncsu.edu/>
> NOAA's National Climatic Data Center <http://ncdc.noaa.gov/>
> 151 Patton Ave, Asheville, NC 28801
> e: jim.biard at noaa.gov <mailto:jim.biard at noaa.gov>
> o: +1 828 271 4900
>
>
>
>
> On Sep 26, 2013, at 11:20 AM, Steve Hankin <Steven.C.Hankin at noaa.gov
> <mailto:Steven.C.Hankin at noaa.gov>> wrote:
>
>> Hi Jim,
>>
>> Thanks for the description. An interesting use case. It is clear
>> why netCDF groups add value for you.
>>
>> Can you add a few words about your users? What software do they use
>> when accessing the files that you create? What actions do they take
>> to adapt to (what I gather is) a unique data distribution format? Do
>> they individually write their own code? Is someone supplying and
>> maintaining higher level applications that are shared among a
>> community of users? What about long-term archival? Who handles that
>> and what data format do they use?
>>
>> - Steve
>>
>> ================================================================
>>
>> On 9/26/2013 6:48 AM, Jim Biard wrote:
>>> Hi.
>>>
>>> I am currently building netCDF-4 files that use groups. I'd love it
>>> if CF were modified such that these files would be "mostly"
>>> compliant (which would require nothing more than acceptance of
>>> groups and hierarchical inheritance of 'file-level' attributes). I
>>> am well aware that my use case is significantly different than most
>>> CF use cases, but it might help illuminate the discussion. Here's
>>> what I'm doing and why.
>>>
>>> I am building a data product that is much lower level than most
>>> (NOAA Level 1a) - swaths of raw binary counts accompanied by
>>> coefficients for algorithms that can be used to convert the counts
>>> to calibrated scientific unit measurements. The data contained is
>>> from the Visible Infrared Imaging Radiometer Suite (VIIRS)
>>> instrument on the Suomi-NPP satellite. I store the data from the
>>> VIIRS sensor in 'data' files, and the algorithm coefficients in
> >>> 'supporting data' files. (I separate them this way because the
> >>> contents of one supporting data file apply to many data files.)
>>>
>>> Each data file is on the order of 200 MB in size, and contains 321
>>> variables. Each file contains four VIIRS science Raw Data Record
>>> (RDR) granules (~6 minutes of data). I have groups for:
>>>
>>> * imagery data for the 375 m (nadir) resolution bands
>>> * imagery data for the 750 m single-gain bands
>>> * imagery data for the 750 m dual-gain bands
>>> * imagery data for the day/night band
>>> * engineering data for the instrument
>>> * ephemeris, attitude, and spacecraft state data
>>>
>>>
>>> The image variables in the four groups of imagery bands have
>>> different shapes from one another. The engineering and "ephemeris,
>>> etc" data each have different first dimensions in their shapes.
>>>
>>> The supporting data (coefficients) comes to me as 35 different
>>> 'binary blob' files (C structures written directly to files),
>>> totaling around 5 MB. I break the contents of each binary blob into
>>> its constituent variables, and store the variables from each
>>> incoming file in a separate group. There are 307 variables in each
>>> supporting data file. The supporting data values change, but at a
>>> much lower rate (less than or equal to once per week) than the
>>> science data.
>>>
>>> I chose to use groups because I came to the conclusion that the name
>>> lengths needed to store all of these variables in flat files would
>>> be a detriment to human understanding of the contents and groupings
>>> of the contents. Creating constellations of 41 flat files (one for
>>> each group) also imposed a significant organization and maintenance
>>> burden when compared with the use of groups.
>>>
>>> The data files only have group attributes at the entire file level
>>> and for the "ephemeris, etc" group. The "ephemeris, etc" group has
>>> metadata values that hold for all elements of the group that are
>>> different from the values for the rest of the data. The supporting
>>> data files have few file-level attributes (ACDD and CF), and more
>>> extensive metadata values that are different for each group.
>>>
>>> Love it or hate it, this is what I've got. :) As I said at the
>>> beginning, extending CF to embrace groups and inheritance of group
>>> (file-level) attributes would make these files compliant. (Or at
>>> least mostly compliant. There are no geographic coordinates, for
>>> example.)
>>>
>>> Grace and peace,
>>>
>>> Jim
>>>
>>> *Jim Biard*
>>> *Research Scholar*
>>> Cooperative Institute for Climate and Satellites NC <http://cicsnc.org/>
>>> North Carolina State University <http://ncsu.edu/>
>>> NOAA's National Climatic Data Center <http://ncdc.noaa.gov/>
>>> 151 Patton Ave, Asheville, NC 28801
>>> e: jim.biard at noaa.gov <mailto:jim.biard at noaa.gov>
>>> o: +1 828 271 4900
>>>
>>>
>>>
>>>
>>> On Sep 25, 2013, at 6:35 PM, "Cameron-smith, Philip"
>>> <cameronsmith1 at llnl.gov <mailto:cameronsmith1 at llnl.gov>> wrote:
>>>
>>>> Hi All,
>>>> I think Steve's email (below) is a fair summary of how I see the
>>>> current state of the discussion too.
>>>> In order to move the discussion forward, I have put forward below a
>>>> simple strawman suggestion that is very limited, but which I think
>>>> would capture the most useful piece of hierarchies with minimal
>>>> impact on CF. Note that credit for many of the elements should go
>>>> to other people who have previously proposed them - my main
>>>> contribution is to stick my neck out and try to make the case :-).
>>>> 1) CF file structures stay 'flat'.
>>>> 2) Allow an _optional_ hierarchy attribute for variables.
>>>> 3) CF would define the attribute name and the rules for the
>>>> attribute. I expect it would be something like: 'hierarchy =
>>>> root.trunk.branch.leaf'
>>>> Key comments:
>>>> a) Since the hierarchy attribute is optional, backwards and
>>>> forwards compatibility should be automatic (except, possibly, for
>>>> updating CF checkers), ie no change is necessary for people who
>>>> don't want to.
>>>> b) An external tool could easily parse a CF file, or set of files,
>>>> that contains the hierarchy attributes to generate an external
>>>> hierarchy structure that can then be used to decide how to further
>>>> process the data.
>>>> c) The external hierarchy could easily be regenerated to keep it
>>>> consistent with the underlying data files.
>>>> d) The hierarchy metadata should be human readable.
>>>> e) All variable CF attributes would stay with the variables (as
>>>> currently), ie no inheritance of CF attributes (to maintain
>>>> compatibility). The common attributes that I think inheritance
>>>> would be most useful for are history attributes, and since CF
>>>> doesn't control history attributes (AFAIK) this would be allowed.
>>>> f) So why not let individuals add their own such syntax? Defining
>>>> the syntax of the hierarchy will allow general CF tools to be
>>>> extended (if they want to), and set the stage for further expansion
>>>> into hierarchies if experience shows that a lot of people are using
>>>> the hierarchy syntax and start asking for more.
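Comment (b) above -- parsing per-variable 'hierarchy' attributes into
an external tree -- can be sketched in a few lines. The following is a
hypothetical illustration (the variable names and hierarchy strings
are invented, and it is not part of the original proposal):

```python
# Sketch of comment (b): build an external hierarchy from per-variable
# 'hierarchy' attributes of the form 'root.trunk.branch.leaf'.
# Variable names and hierarchy strings are invented examples.


def build_tree(hierarchies: dict[str, str]) -> dict:
    """Map {variable_name: hierarchy_string} to a nested dict whose
    leaves collect variable names under a '_variables' key."""
    tree: dict = {}
    for var, path in hierarchies.items():
        node = tree
        for part in path.split("."):
            node = node.setdefault(part, {})
        node.setdefault("_variables", []).append(var)
    return tree


if __name__ == "__main__":
    example = {
        "tas_model_a": "ensemble.model_a",
        "tas_model_b": "ensemble.model_b",
    }
    print(build_tree(example))
```

Because the attribute is just a string on each variable, the files
themselves stay flat and fully backward compatible; only tools that
choose to read the attribute see the hierarchy.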
>>>> In my opinion, the benefits of this extension would exceed the
>>>> minimal costs of extending the CF standard.
>>>> Let the slings and arrows fly ;-).
>>>> Best wishes,
>>>> Philip
>>>> -----------------------------------------------------------------------
>>>> Dr Philip Cameron-Smith,pjc at llnl.gov <mailto:pjc at llnl.gov>,
>>>> Lawrence Livermore National Lab.
>>>> -----------------------------------------------------------------------
>>>> *From:* CF-metadata [mailto:cf-metadata-bounces at cgd.ucar.edu] *On Behalf Of* Steve Hankin
>>>> *Sent:* Wednesday, September 25, 2013 12:34 PM
>>>> *To:* Charlie Zender
>>>> *Cc:* cf-metadata at cgd.ucar.edu
>>>> *Subject:* Re: [CF-metadata] Are ensembles a compelling use case for
>>>> "group-aware" metadata? (CZ)
>>>> On 9/24/2013 9:45 PM, Charlie Zender wrote:
>>>>
>>>> It is not my place to determine whether there is a consensus,
>>>> or how close we are, but it's clear to me there is no consensus
>>>> yet. Bryan Lawrence, Steve Hankin, Jonathan Gregory, Karl
>>>> Taylor, and Philip Cameron-Smith are not "on board". I hope
>>>> they will speak-up and say if they concur that maintaining the
>>>> status quo (flat files) is best (period), or whether they do
>>>> wish to extend CF to hierarchies (starting now), or the
>>>> additional information they would need to decide.
>>>>
>>>>
>>>> Hi Charlie et al.,
>>>>
>>>> Since you have asked .... I have heard two points that seemed to
>>>> bolster Bryan's pov that the multi-model use case is "great but not
>>>> compelling". (See a more positive spin at the end.)
>>>>
>>>> 1. file size. Model outputs today are typically too large for even
>>>> a single variable from a single model to be packaged in a
>>>> single file. Addressing a model ensemble multiplies the size
>>>> barrier by the ensemble size, N. Thus the use of groups to
>>>> package a model ensemble applies only for the cases where user
>>>> is interested in quite a small subset of the model domain, or
>>>> perhaps in pre-processed, data-reduced versions of the
>>>> models. A gut estimate is that single-file solutions like
>>>> netCDF4 groups address 25% or less of the stated use case.
>>>> We could argue over that number, but it seems likely to remain
>>>> on the low side of 50%. (Issues of THREDDS-aggregating files
>>>> bearing groups also deserve to be discussed and understood.
>>>> What works? What doesn't?)
>>>> 2. The problems of the "suitcase packing" metaphor were invoked
>>>> time and again, further narrowing the applicability of the use
>>>> case. The sweet spot that was identified is the case of a
>>>> single user desiring a particular subset from a single data
>>>> provider. Essentially a multi-model ensemble encoded using
>>>> netCDF4 groups would offer a standardized "shopping basket"
>>>> with advantages that will be enjoyed by some high powered
>>>> analysis users.
>>>>
>>>> For this narrower use case I couldn't help asking myself how
>>>> the cost/benefit found through the use of netCDF4 groups
>>>> compares with the cost/benefit of simply zip-packaging the
>>>> individual CF model files. There is almost no cost to this
>>>> alternative. Tools to pack and unpack zip files are universal,
>>>> have UIs embedded into common OSes, and offer APIs that permit
>>>> ensemble analysis to be done on the zip file as a unit at
>>>> similar programming effort to the use of netCDF4 groups.
>>>> Comprehension and acceptance of the zip alternative on the
>>>> part of user communities would likely be instantaneous --
>>>> hardly even a point to generate discussion. Zip files do not
>>>> address more specialized use cases, like a desire to view the
>>>> ensemble as a 2-level hierarchy of models each providing
>>>> multiple scenarios, but the "suitcase" metaphor discussions
>>>> have pointed out the diminishing returns that accrue as the
>>>> packing strategy is made more complex.
>>>>
>>>> The tipping point for me is not whether a particular group of users
>>>> would find value in a particular enhancement. It is whether the
>>>> overall cost/benefit considerations -- the expanded complexity, the
>>>> need to enhance applications, the loss of interoperability, etc.
>>>> versus the breadth of users and the benefits they will enjoy --
>>>> clearly motivate a change. My personal vote is that thus far the
>>>> arguments fall well short of this tipping point. But maybe there
>>>> are other use cases to be explored. Perhaps in aggregate they may
>>>> tip the cost/benefit analysis. What about the "group of satellite
>>>> swaths" scenario? -- a feature collection use case. AFAIK CF
>>>> remains weak at addressing this need thus far. (If we pursue this
>>>> line of discussion we should add the 'cf_satellite' list onto the
>>>> thread. That community may have new work on this topic to discuss.)
>>>>
>>>> - Steve
>>>> _______________________________________________
>>>> CF-metadata mailing list
>>>> CF-metadata at cgd.ucar.edu <mailto:CF-metadata at cgd.ucar.edu>
>>>> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
>>>
>>>
>>>
>>> _______________________________________________
>>> CF-metadata mailing list
>>> CF-metadata at cgd.ucar.edu
>>> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
>>
>

Received on Thu Sep 26 2013 - 15:14:00 BST

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:41 BST
