
[CF-metadata] Are ensembles a compelling use case for "group-aware" metadata? (CZ)

From: Nan Galbraith <ngalbraith>
Date: Thu, 03 Oct 2013 14:19:44 -0400

> Things are hard to predict, especially in the future. :-)

I agree with your position on groups (as well as on predictions), but
this discussion has also made me think that, to the extent that the
proponents are advocating for a richer structure for attributes, that's
something we should consider.

The data model trac ticket has some interesting points of view about
global vs. variable attributes; I think it shows a need for some
mechanism to allow users to specify relationships like inheritance and
precedence (and concatenation vs. replacement) for attributes in a file.

In my data, there's also a problem with variable attributes applying to
an entire variable, when often different attribute values are needed for
different parts of the record - e.g. depth bins may need different
provenance information. The work-arounds for this are clumsy and
non-standard.

If CF could address some of the issues of 'flat' metadata, with an eye
to an eventual need to support groups, it might make the standard more
useful to more people, without compromising its simplicity much.

Cheers - Nan


On 10/1/13 9:13 AM, john caron wrote:
> Hi all:
>
> A few thoughts from my (possibly limited) POV.
>
> 1. What is the best strategy for storing very large collections of data
> in flat files like netCDF?
>
> - Store coherent chunks of the dataset in each file. A good file size
> these days seems to be 100 MB - 1 GB.
> - Choose "coherence" by deciding on the reader's most likely access
> pattern, assuming "write once - read many".
> - Build an external index of the entire dataset to make access as fast
> as possible (a rough sketch follows below). The index can be rebuilt as
> needed without modifying the data files.
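>
> For concreteness, here is a rough sketch of such an index builder,
> assuming the netCDF4-python module (the directory layout, the choice of
> what to index, and the JSON output file are purely illustrative, not an
> existing convention):
>
>   import glob
>   import json
>   import netCDF4
>
>   index = {}
>   for path in sorted(glob.glob("collection/*.nc")):  # hypothetical layout
>       with netCDF4.Dataset(path) as nc:
>           # Record the variable names and, if present, the time range.
>           entry = {"variables": sorted(nc.variables.keys())}
>           if "time" in nc.variables:
>               t = nc.variables["time"]
>               entry["time_range"] = [float(t[0]), float(t[-1])]
>               entry["time_units"] = getattr(t, "units", None)
>           index[path] = entry
>
>   # The index lives outside the data files and can be rebuilt at any
>   # time without touching them.
>   with open("collection_index.json", "w") as f:
>       json.dump(index, f, indent=2)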
>
> 2. What's the best use for groups in a netCDF file?
>
> I think it's best as a mimic of hierarchical file systems. When there
> are more than a (few?) hundred files in a directory, the listing
> becomes unwieldy and hard to navigate. Also, a file directory is a
> "namespace" that allows the same filename to be used in different
> directories. I think groups should be used in the same way for the
> same reasons. From this POV, an array of groups is, I think, not a
> good idea.
>
> Zip files are interesting in that they have internal directories, and
> tools like 7zip and Windows file explorer transparently let you browse
> the OS file hierarchy and the internal zip hierarchy.
>
> In THREDDS, we have found a good use for groups as follows. Given a
> collection of GRIB files, if there are multiple horizontal grids, we
> create a separate group for each horizontal grid, and put all the
> variables that use that horizontal grid into that group. This creates
> a dataset with multiple groups. Each group is actually self-contained,
> so the TDS, when serving the dataset out, will break the dataset back
> into multiple datasets, one for each group. This simplifies things for
> software that is not group aware.
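>
> For concreteness, a rough sketch of that one-group-per-grid layout,
> using the netCDF4-python module (grid names, sizes, and variable names
> are purely illustrative):
>
>   import numpy as np
>   import netCDF4
>
>   with netCDF4.Dataset("grib_collection.nc", "w", format="NETCDF4") as nc:
>       grids = {"gaussian_256": (256, 512), "latlon_0p5deg": (361, 720)}
>       for grid_name, (nlat, nlon) in grids.items():
>           # One self-contained group per horizontal grid.
>           grp = nc.createGroup(grid_name)
>           grp.createDimension("lat", nlat)
>           grp.createDimension("lon", nlon)
>           lat = grp.createVariable("lat", "f4", ("lat",))
>           lat.units = "degrees_north"
>           lat[:] = np.linspace(-90.0, 90.0, nlat)
>           lon = grp.createVariable("lon", "f4", ("lon",))
>           lon.units = "degrees_east"
>           lon[:] = np.linspace(0.0, 360.0, nlon, endpoint=False)
>           # The same variable name can recur in every group without clashing.
>           tas = grp.createVariable("air_temperature", "f4", ("lat", "lon"))
>           tas.units = "K"
>           tas[:] = 288.0  # placeholder data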
>
> A netCDF file is itself a container that creates a namespace for the
> variables within it. So you don't usually need groups for a file that
> has fewer than N variables in it (N = 50? 200??). The case of multiple
> coordinate systems is a reasonable exception, because often you have
> exactly the same variables in each group, so you need a separate
> namespace.
>
> One could apply this argument to ensemble data, and I think it's
> reasonable. But as others have pointed out, you likely have much more
> data than you want to place into a single file. So then the file
> becomes the namespace and internal groups are less compelling.
>
> Suppose, instead, that we are talking about the logical structure of a
> multi-file collection. One could decide that internal groups are a good
> way to
> keep things clear, even though you perhaps never have more than one
> group in a file. Then, when you "aggregate" the files into a logical
> dataset, the user sees a structure that makes it clear what things
> are. One would also want to be able to specify such a structure in an
> external configuration that describes the collection. No doubt this
> issue will come back when CF works on conventions for multifile
> collections.
>
> 3. What does the TDS currently do with groups?
>
> The TDS data model (aka CDM) handles groups just fine. You can
> aggregate variables in groups across files, as long as they are
> identical. We are waiting for DAP4 to be able to serve these through
> OPeNDAP, but the other services (WMS, WCS, NCSS, cdmremote) work now
> with groups. OTOH, we don't see them used much yet, except in HDF-EOS.
>
> 4. Things are hard to predict, especially in the future.
>
> I think software will evolve to handle the extended data model. I
> think data providers should start to use it, and CF could start
> examining best practices for "extended data model" Conventions
> (perhaps CF-2?). That is to say, providers can't wait for CF
> conventions; they need to try new stuff that CF eventually
> incorporates (or not). Jim Baird's satellite data seems to be a good
> example of this.
>
>
>
> On 9/29/2013 9:46 PM, Charlie Zender wrote:
>> Hi Steve,
>>
>> Thank you for your continued engagement and responses.
>> Looks like CF hasn't the appetite for group hierarchies anytime soon.
>> I'll report the lessons learned in this fruitful discussion to our
>> NASA WG next week. My "concluding" (for now) remarks are interleaved.
>>
>> Best,
>> Charlie
>>
>> On 25/09/2013 12:33, Steve Hankin wrote:
>>> On 9/24/2013 9:45 PM, Charlie Zender wrote:
>>>> It is not my place to determine whether there is a consensus, or how
>>>> close we are, but it's clear to me there is no consensus yet. Bryan
>>>> Lawrence, Steve Hankin, Jonathan Gregory, Karl Taylor, and Philip
>>>> Cameron-Smith are not "on board". I hope they will speak up and say if
>>>> they concur that maintaining the status quo (flat files) is best
>>>> (period), or whether they do wish to extend CF to hierarchies
>>>> (starting now), or the additional information they would need to
>>>> decide.
>>> Hi Charlie et al.,
>>>
>>> Since you have asked .... I have heard two points that seemed to
>>> bolster Bryan's pov that the multi-model use case is "great but not
>>> compelling". (See a more positive spin at the end.)
>>>
>>> 1. File size. Model outputs today are typically too large for even a
>>> single variable from a single model to be packaged in a single file.
>>> Addressing a model ensemble multiplies the size barrier by the
>>> ensemble size, N. Thus the use of groups to package a model ensemble
>>> applies only in cases where the user is interested in quite a small
>>> subset of the model domain, or perhaps in pre-processed, data-reduced
>>> versions of the models. A gut estimate is that single-file solutions
>>> like netCDF4 groups address 25% or less of the stated use case. We
>>> could argue over that number, but it seems likely to remain on the
>>> low side of 50%. (Issues of THREDDS-aggregating files bearing groups
>>> also deserve to be discussed and understood. What works? What
>>> doesn't?)
>> Your remarks seem most applicable to "enterprise datasets" like CMIP5.
>> CMIP5 is fairly unusual: it garners the most press and is well known.
>> There are numerous lesser-known models with smaller output volumes
>> whose outputs strive for, and benefit from, CF compliance.
>>
>> I am unfamiliar with THREDDS support for netCDF4 features.
>> The HDF Group supports an HDF5 handler for Hyrax 1.8.8:
>> http://hdfeos.org/software/hdf5_handler/doc/install.php
>> Someone more knowledgeable, please say whether/how well the TDS
>> integrates group capabilities.
>>
>>> 2. The problems of the "suitcase packing" metaphor were invoked time
>>> and again, further narrowing the applicability of the use case. The
>>> sweet spot that was identified is the case of a single user desiring
>>> a particular subset from a single data provider. Essentially, a
>>> multi-model ensemble encoded using netCDF4 groups would offer a
>>> standardized "shopping basket" with advantages that will be enjoyed
>>> by some high-powered analysis users.
>> My impression of suitcases is perhaps more graduated than others'.
>> Straightforward suitcases work on local machines.
>> Anyone can do it with free software right now:
>> http://nco.sf.net/nco.html#ncecat
>> http://nco.sf.net/nco.html#ncdismember
>> Let us not confound the issue of whether it works with whether
>> people know how to do it. Much resistance to software change stems
>> from abhorrence of reading the ****ing manual. There is little
>> difference between running unzip and ncdismember on a file, other than
>> that people are familiar with the former but not the latter.
>>
>> Unless there are "external metadata" involved, it is trivial (as in
>> already demonstrated, at least to my satisfaction) to pack and unpack
>> a netCDF4 group suitcase with a collection of flat netCDF3 files.
>> ncecat and ncdismember do this without loss of information.
>> So simple suitcases work on local machines.
>> Servers can leverage that, or re-implement, as they see fit.
>> In either case, it's more a logistical than a conceptual barrier.
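>>
>> For illustration only -- this is not ncecat or ncdismember themselves,
>> just the same pack/unpack round trip sketched with the netCDF4-python
>> module; file and group names are hypothetical:
>>
>>   import glob
>>   import os
>>   import netCDF4
>>
>>   def copy_into(dst, src):
>>       # Copy attributes, dimensions, and variables from src into dst.
>>       dst.setncatts({k: src.getncattr(k) for k in src.ncattrs()})
>>       for name, dim in src.dimensions.items():
>>           dst.createDimension(name, None if dim.isunlimited() else len(dim))
>>       for name, var in src.variables.items():
>>           fill = getattr(var, "_FillValue", None)
>>           out = dst.createVariable(name, var.dtype, var.dimensions,
>>                                    fill_value=fill)
>>           out.setncatts({k: var.getncattr(k) for k in var.ncattrs()
>>                          if k != "_FillValue"})
>>           out[...] = var[...]
>>
>>   # Pack: one group per flat input file.
>>   with netCDF4.Dataset("suitcase.nc", "w", format="NETCDF4") as suitcase:
>>       for path in sorted(glob.glob("member_*.nc")):
>>           name = os.path.splitext(os.path.basename(path))[0]
>>           with netCDF4.Dataset(path) as member:
>>               copy_into(suitcase.createGroup(name), member)
>>
>>   # Unpack: one flat file per group.
>>   with netCDF4.Dataset("suitcase.nc") as suitcase:
>>       for name, grp in suitcase.groups.items():
>>           with netCDF4.Dataset("unpacked_%s.nc" % name, "w",
>>                                format="NETCDF4_CLASSIC") as flat:
>>               copy_into(flat, grp)
>>
>> With NCO installed, ncecat's group aggregation and the ncdismember
>> script perform these same operations; see the two manual links above.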
>>
>> More complex information, like "ensemble axes", requires conventions.
>> Conventions could enable use cases well beyond simple suitcases.
>> In the meantime, it would be helpful if CF reserved "ensemble" and
>> "group" in attribute names for future use pertaining to netCDF4 groups,
>> rather than some other kind of "group" based on flat files.
>> Doing so might help reduce conflicts with informal ensemble conventions
>> that NCO will test.
>>
>>> For this narrower use case I couldn't help asking myself how the
>>> cost/benefit found through the use of netCDF4 groups compares with the
>>> cost/benefit of simply zip-packaging the individual CF model files.
>>> There is almost no cost to this alternative. Tools to pack and unpack
>>> zip files are universal, have UIs embedded into common OSes, and offer
>>> APIs that permit ensemble analysis to be done on the zip file as a
>>> unit at similar programming effort to the use of netCDF4 groups.
>>> Comprehension and acceptance of the zip alternative on the part of
>>> user communities would likely be instantaneous -- hardly even a point
>>> to generate discussion. Zip files do not address more specialized use
>>> cases, like a desire to view the ensemble as a 2-level hierarchy of
>>> models each providing multiple scenarios, but the "suitcase" metaphor
>>> discussions have pointed out the diminishing returns that accrue as
>>> the packing strategy is made more complex.
>> Yes, zipped files are a trusty standby. No argument there.
>> Users avoid installing and learning new software. Until they don't :)
>> My sense is that users will goad the providers to the "tipping point".
>> Once users see enough well-documented examples of suitcases, they will
>> ask data providers to give them suitcases, especially once they realize
>> that many suitcases need never be unpacked.
>>
>>> The tipping point for me is not whether a particular group of users
>>> would find value in a particular enhancement. It is whether the
>>> overall cost/benefit considerations -- the expanded complexity, the
>>> need to enhance applications, the loss of interoperability, etc.,
>>> versus the breadth of users and the benefits they will enjoy --
>>> clearly motivate a change. My personal vote is that thus far the
>>> arguments fall well short of this tipping point. But maybe there are
>>> other use cases to be explored. Perhaps in aggregate they may tip the
>>> cost/benefit analysis. What about the "group of satellite swaths"
>>> scenario? -- a feature collection use case. AFAIK CF remains weak at
>>> addressing this need thus far. (If we pursue this line of discussion
>>> we should add the 'cf_satellite' list onto the thread. That community
>>> may have new work on this topic to discuss.)
>> Yes, CF has little "market share" for storage of satellite swaths,
>> granules, or so-called object/region references, all HDF-EOS mainstays.
>> Some well-documented solutions exist:
>> http://wiki.esipfed.org/index.php/NetCDF-CF_File_Examples_for_Satellite_Swath_Data
>>
>> Our NASA WG will try to identify such netCDF-API options for storing
>> multiple satellite granules that are a compromise of convenience
>> (multiple granules/groups per file) and CF-compliance.
>> Our users deserve no less.
>>
>> cz
>>
>>> - Steve
>>
>
> _______________________________________________
> CF-metadata mailing list
> CF-metadata at cgd.ucar.edu
> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
>


-- 
*******************************************************
* Nan Galbraith        Information Systems Specialist *
* Upper Ocean Processes Group            Mail Stop 29 *
* Woods Hole Oceanographic Institution                *
* Woods Hole, MA 02543                 (508) 289-2444 *
*******************************************************