[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata? (CZ) from Steve Hankin on 2013-09-26 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Steve Hankin <steven.c.hankin>
Date: Thu, 26 Sep 2013 08:20:33 -0700

Hi Jim,

Thanks for the description. An interesting use case. It is clear why
netCDF groups add value for you.

Can you add a few words about your users? What software do they use
when accessing the files that you create? What actions do they take to
adapt to (what I gather is) a unique data distribution format? Do they
individually write their own code? Is someone supplying and
maintaining higher level applications that are shared among a community
of users? What about long-term archival? Who handles that and what
data format do they use?

- Steve

================================================================

On 9/26/2013 6:48 AM, Jim Biard wrote:
> Hi.
>
> I am currently building netCDF-4 files that use groups. I'd love it
> if CF were modified such that these files would be "mostly" compliant
> (which would require nothing more than acceptance of groups and
> hierarchical inheritance of 'file-level' attributes). I am well aware
> that my use case is significantly different than most CF use cases,
> but it might help illuminate the discussion. Here's what I'm doing
> and why.
>
> I am building a data product that is much lower level than most (NOAA
> Level 1a) - swaths of raw binary counts accompanied by coefficients
> for algorithms that can be used to convert the counts to calibrated
> scientific unit measurements. The data contained is from the Visible
> Infrared Imaging Radiometer Suite (VIIRS) instrument on the Suomi-NPP
> satellite. I store the data from the VIIRS sensor in 'data' files,
> and the algorithm coefficients in 'supporting data' files. (I
> separate them this way because the contents of one supporting data
> file applies to many data files.)
>
> Each data file is on the order of 200 MB in size, and contains 321
> variables. Each file contains four VIIRS science Raw Data Record
> (RDR) granules (~6 minutes of data). I have groups for:
>
> * imagery data for the 375 m (nadir) resolution bands
> * imagery data for the 750 m single-gain bands
> * imagery data for the 750 m dual-gain bands
> * imagery data for the day/night band
> * engineering data for the instrument
> * ephemeris, attitude, and spacecraft state data
>
>
> The image variables in the four groups of imagery bands have different
> shapes from one another. The engineering and "ephemeris, etc" data
> each have different first dimensions in their shapes.
>
> The supporting data (coefficients) comes to me as 35 different 'binary
> blob' files (C structures written directly to files), totaling around
> 5 MB. I break the contents of each binary blob into its constituent
> variables, and store the variables from each incoming file in a
> separate group. There are 307 variables in each supporting data file.
> The supporting data values change, but at a much lower rate (less
> than or equal to once per week) than the science data.
>
> I chose to use groups because I came to the conclusion that the name
> lengths needed to store all of these variables in flat files would be
> a detriment to human understanding of the contents and groupings of
> the contents. Creating constellations of 41 flat files (one for each
> group) also imposed a significant organization and maintenance burden
> when compared with the use of groups.
>
> The data files only have group attributes at the entire file level and
> for the "ephemeris, etc" group. The "ephemeris, etc" group has
> metadata values that hold for all elements of the group that are
> different from the values for the rest of the data. The supporting
> data files have few file-level attributes (ACDD and CF), and more
> extensive metadata values that are different for each group.
>
> Love it or hate it, this is what I've got. :) As I said at the
> beginning, extending CF to embrace groups and inheritance of group
> (file-level) attributes would make these files compliant. (Or at
> least mostly compliant. There are no geographic coordinates, for
> example.)
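
A minimal sketch of the hierarchical attribute inheritance Jim asks for, assuming the rule is "the nearest enclosing group that defines an attribute wins". The `Group` class is a toy stand-in for a netCDF-4 group, not a real library API, and the group names, attribute names, and values are hypothetical placeholders rather than the contents of his files.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Group:
    """Toy stand-in for a netCDF-4 group: a name, attributes, and a parent."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Group"] = None

    def child(self, name, **attrs):
        """Create a sub-group whose parent is this group."""
        return Group(name=name, attrs=dict(attrs), parent=self)


def resolve_attr(group, key):
    """Walk from the group up to the root; the nearest definition wins."""
    node = group
    while node is not None:
        if key in node.attrs:
            return node.attrs[key]
        node = node.parent
    return None  # not defined at any level


# Hypothetical layout: file-level attributes plus one group that overrides one.
root = Group("/", {"platform": "Suomi-NPP", "sampling": "per-scan"})
imagery = root.child("imagery_375m")
ephemeris = root.child("ephemeris", sampling="per-second")

print(resolve_attr(imagery, "platform"))    # inherited from the file level
print(resolve_attr(ephemeris, "sampling"))  # overridden inside the group
```

Under this assumed rule, a group carries only the attributes that differ from the file level, which matches the description of the "ephemeris, etc" group above.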
>
> Grace and peace,
>
> Jim
>
> *Jim Biard*
> *Research Scholar*
> CICS-NC <http://www.cicsnc.org/>
> Visit us on Facebook <http://www.facebook.com/cicsnc>
> Cooperative Institute for Climate and Satellites NC <http://cicsnc.org/>
> North Carolina State University <http://ncsu.edu/>
> NOAA's National Climatic Data Center <http://ncdc.noaa.gov/>
> 151 Patton Ave, Asheville, NC 28801
> e: jim.biard at noaa.gov <mailto:jim.biard at noaa.gov>
> o: +1 828 271 4900
>
>
>
>
> On Sep 25, 2013, at 6:35 PM, "Cameron-smith, Philip"
> <cameronsmith1 at llnl.gov <mailto:cameronsmith1 at llnl.gov>> wrote:
>
>> Hi All,
>> I think Steve's email (below) is a fair summary of how I see the
>> current state of the discussion too.
>> In order to move the discussion forward, I have put forward below a
>> simple strawman suggestion that is very limited, but which I think
>> would capture the most useful piece of hierarchies with minimal
>> impact on CF. Note that credit for many of the elements should go
>> to other people who have previously proposed them - my main
>> contribution is to stick my neck out and try to make the case :-).
>> 1) CF file structures stay 'flat'.
>> 2) Allow an _optional_ hierarchy attribute for variables.
>> 3) CF would define the attribute name and the rules for the
>> attribute. I expect it would be something like: 'hierarchy =
>> root.trunk.branch.leaf'
>> Key comments:
>> a) Since the hierarchy attribute is optional, backwards and forwards
>> compatibility should be automatic (except, possibly, for updating CF
>> checkers), i.e. no change is necessary for people who don't want one.
>> b) An external tool could easily parse a CF file, or set of files,
>> that contains the hierarchy attributes to generate an external
>> hierarchy structure that can then be used to decide how to further
>> process the data.
>> c) The external hierarchy could easily be regenerated to keep it
>> consistent with the underlying data files.
>> d) The hierarchy metadata should be human readable.
>> e) All variable CF attributes would stay with the variables (as
>> currently), i.e. no inheritance of CF attributes (to maintain
>> compatibility). The common attributes that I think inheritance
>> would be most useful for are history attributes, and since CF doesn't
>> control history attributes (AFAIK) this would be allowed.
>> f) So why not let individuals add their own such syntax? Defining the
>> syntax of the hierarchy will allow general CF tools to be extended
>> (if their developers want to), and set the stage for further expansion
>> into hierarchies if experience shows that a lot of people are using
>> the hierarchy syntax and start asking for more.
>> In my opinion, the benefits of this extension would exceed the
>> minimal costs of extending the CF standard.
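
Philip's point that an external tool could rebuild a hierarchy from per-variable attributes can be sketched as follows. This assumes the dotted syntax from his example (`root.trunk.branch.leaf`); the variable names, the hierarchy strings, and the `_variables` key are hypothetical, and no CF rule for the attribute exists in this thread beyond the strawman.

```python
def build_hierarchy(variables):
    """Build a nested dict from {variable_name: dotted_hierarchy_string}.

    Each path component becomes a nested dict; the variables attached to a
    node are listed under the reserved (hypothetical) key "_variables".
    """
    tree = {}
    for var_name, path in variables.items():
        node = tree
        for part in path.split("."):
            node = node.setdefault(part, {})
        node.setdefault("_variables", []).append(var_name)
    return tree


# Stand-in for the 'hierarchy' attributes read from a flat CF file.
flat_file = {
    "tas_model_a": "ensemble.model_a.surface",
    "pr_model_a": "ensemble.model_a.surface",
    "tas_model_b": "ensemble.model_b.surface",
}

tree = build_hierarchy(flat_file)
print(sorted(tree["ensemble"]))  # the second level of the hierarchy
print(tree["ensemble"]["model_a"]["surface"]["_variables"])
```

Because the tree is derived entirely from the attributes, it can be regenerated whenever the underlying files change, and the files themselves stay flat.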
>> Let the slings and arrows fly ;-).
>> Best wishes,
>> Philip
>> -----------------------------------------------------------------------
>> Dr Philip Cameron-Smith,pjc at llnl.gov <mailto:pjc at llnl.gov>, Lawrence
>> Livermore National Lab.
>> -----------------------------------------------------------------------
>> *From:* CF-metadata [mailto:cf-metadata-bounces at cgd.ucar.edu
>> <mailto:metadata-bounces at cgd.ucar.edu>] *On Behalf Of* Steve Hankin
>> *Sent:* Wednesday, September 25, 2013 12:34 PM
>> *To:* Charlie Zender
>> *Cc:* cf-metadata at cgd.ucar.edu <mailto:cf-metadata at cgd.ucar.edu>
>> *Subject:* Re: [CF-metadata] Are ensembles a compelling use case for
>> "group-aware" metadata? (CZ)
>> On 9/24/2013 9:45 PM, Charlie Zender wrote:
>>
>> It is not my place to determine whether there is a consensus, or
>> how close we are, but it's clear to me there is no consensus yet.
>> Bryan Lawrence, Steve Hankin, Jonathan Gregory, Karl Taylor, and
>> Philip Cameron-Smith are not "on board". I hope they will
>> speak-up and say if they concur that maintaining the status quo
>> (flat files) is best (period), or whether they do wish to extend
>> CF to hierarchies (starting now), or the additional information
>> they would need to decide.
>>
>>
>> Hi Charlie et al.,
>>
>> Since you have asked .... I have heard two points that seemed to
>> bolster Bryan's pov that the multi-model use case is "great but not
>> compelling". (See a more positive spin at the end.)
>>
>> 1. File size. Model outputs today are typically too large for even
>> a single variable from a single model to be packaged in a single
>> file. Addressing a model ensemble multiplies the size barrier by
>> the ensemble size, N. Thus the use of groups to package a model
>> ensemble applies only for the cases where the user is interested in
>> quite a small subset of the model domain, or perhaps in
>> pre-processed, data-reduced versions of the models. A
>> gut-estimate is that single-file solutions, like netCDF4 groups,
>> address 25% or less of the stated use case. We could argue
>> over that number, but it seems likely to remain on the low side
>> of 50%. (Issues of THREDDS-aggregating files bearing groups also
>> deserve to be discussed and understood. What works? What doesn't?)
>> 2. The problems of the "suitcase packing" metaphor were invoked time
>> and again, further narrowing the applicability of the use case.
>> The sweet spot that was identified is the case of a single user
>> desiring a particular subset from a single data provider.
>> Essentially a multi-model ensemble encoded using netCDF4 groups
>> would offer a standardized "shopping basket" with advantages that
>> will be enjoyed by some high powered analysis users.
>>
>> For this narrower use case I couldn't help asking myself how the
>> cost/benefit found through the use of netCDF4 groups compares
>> with the cost/benefit of simply zip-packaging the individual CF
>> model files. There is almost no cost to this alternative.
>> Tools to pack and unpack zip files are universal, have UIs
>> embedded into common OSes, and offer APIs that permit ensemble
>> analysis to be done on the zip file as a unit at similar
>> programming effort to the use of netCDF4 groups. Comprehension
>> and acceptance of the zip alternative on the part of user
>> communities would likely be instantaneous -- hardly even a point
>> to generate discussion. Zip files do not address more
>> specialized use cases, like a desire to view the ensemble as a
>> 2-level hierarchy of models each providing multiple scenarios,
>> but the "suitcase" metaphor discussions have pointed out the
>> diminishing returns that accrue as the packing strategy is made
>> more complex.
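
The zip alternative described above can be sketched with Python's standard `zipfile` module. This is an illustration under stated assumptions, not a recommended packaging scheme: the member names are hypothetical and the payloads are placeholder bytes standing in for real CF netCDF files. `ZIP_STORED` (no compression) is used on the assumption that the member files may already be compressed internally.

```python
import io
import zipfile


def write_ensemble(members):
    """Pack {member_name: file_bytes} into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def iter_ensemble(zip_bytes):
    """Yield (member_name, file_bytes) for each file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            yield info.filename, zf.read(info)


# Placeholder payloads; in practice each would be one model's CF file.
members = {
    "model_a.nc": b"placeholder bytes for model A",
    "model_b.nc": b"placeholder bytes for model B",
}

archive = write_ensemble(members)
unpacked = dict(iter_ensemble(archive))
print(list(unpacked))  # member names, in the order they were packed
```

Each member stays an ordinary flat CF file, so existing CF-aware tools can read it after extraction without knowing anything about the packaging.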
>>
>> The tipping point for me is not whether a particular group of users
>> would find value in a particular enhancement. It is whether the
>> overall cost/benefit considerations -- the expanded complexity, the
>> need to enhance applications, the loss of interoperability, etc. versus
>> the breadth of users and the benefits they will enjoy -- clearly
>> motivate a change. My personal vote is that thus far the arguments
>> fall well short of this tipping point. But maybe there are other use
>> cases to be explored. Perhaps in aggregate they may tip the
>> cost/benefit analysis. What about the "group of satellite swaths"
>> scenario? -- a feature collection use case. AFAIK CF remains weak at
>> addressing this need thus far. (If we pursue this line of discussion
>> we should add the 'cf_satellite' list onto the thread. That
>> community may have new work on this topic to discuss.)
>>
>> - Steve
>> _______________________________________________
>> CF-metadata mailing list
>> CF-metadata at cgd.ucar.edu <mailto:CF-metadata at cgd.ucar.edu>
>> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
>
>
>

Received on Thu Sep 26 2013 - 09:20:33 BST

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:41 BST