
[CF-metadata] Towards recognizing and exploiting hierarchical groups

From: Charlie Zender <zender>
Date: Sun, 15 Sep 2013 18:53:29 -0700

NASA has recently convened an Earth Science Data System Working Group to
explore existing conventions for data and products stored in HDF and to
make recommendations for future developments. The CF Conventions are an
important element in this work, as many scientists and users are
interested in data products that comply with CF. Many members of the
working group are familiar with CF and have been involved in attempts to
apply the CF Conventions to a variety of Earth Science data products.

We have identified a persistent barrier to NASA's greater adoption of
CF: the lack of protocols for exploiting software-defined group
hierarchies for data structures. HDF datasets traditionally collected
and stewarded by NASA often utilize hierarchical (the "H" in HDF)
groups. A chief advantage of netCDF4 over netCDF3 is that it supports a
group API compatible with HDF. Here we outline an approach to
incorporating groups into CF as a step towards recognizing and,
eventually, exploiting groups.

Some aspects of CF (especially netCDF conventions such as _FillValue and
valid_min) can apply unambiguously to HDF files that use groups, but
other aspects of CF conventions have room for ambiguity when applied to
such HDF files. Clarifying that ambiguity is one role of conventions, so
we would like to start a discussion with the aim of obtaining feedback,
gathering consensus, and eventually, possibly, embedding
"group-awareness" into CF. Unidata's white paper on Conventions for
netCDF4
(http://www.unidata.ucar.edu/software/netcdf/papers/nc4_conventions.html) began
the discussion of potential "group-aware" CF capabilities. Some previous
discussion of "group-aware" CF metadata is contained or referenced in
CF-Metadata Trac tickets 79 (Handling and formatting of vector
quantities in CF) and 90 (Collection of CF enhancements for
interoperable applications) yet the "big discussion" on how/whether CF
should exploit the hierarchical group capabilities of netCDF4 is
unfinished. Below we propose a standard scheme for interpreting metadata
scope in hierarchical (group) files, and suggest one or two new Group
Attributes which we could turn into concrete proposals if interest warrants.

Perhaps the most obvious place to start a discussion on making CF
"group-aware" is the notion of attribute scope: How ought metadata in
one group apply, if at all, to other groups? CF metadata attributes may
be applied at the group level (netCDF4 allows this) yet what should that
mean? Whereas the current CF Convention speaks only of Global Attributes
and Variable Attributes, a "group-aware" CF must explicitly define the
properties of a third category of attributes, Group Attributes. Global
Attributes are a special case of Group Attributes and should share their
properties.

The key technical definition we propose is that Group Attributes shall
apply to the group where they are defined and to its descendants, but
not to that group's ancestors or siblings. Group Attributes apply to all
a group's descendants recursively with one exception: Any group may
redefine an attribute defined in an ancestor group, and that
child-group's definition applies to all its descendants. Thus in cases
where multiple ancestor groups define the same attribute, attribute
values are inherited from the nearest ancestor. Note that these are the
same scoping properties as netCDF4 dimensions.
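
The scoping rule above can be illustrated with a minimal sketch. It does
not use the netCDF4 or HDF5 libraries; the Group class, the resolve_attr
function, and the group and attribute names are hypothetical stand-ins
chosen only to show "nearest ancestor wins" lookup.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Group:
    """Hypothetical in-memory stand-in for a netCDF4/HDF5 group."""
    name: str
    attrs: Dict[str, Any] = field(default_factory=dict)
    parent: Optional["Group"] = None

    def child(self, name: str, **attrs: Any) -> "Group":
        """Create a child group whose parent is this group."""
        return Group(name=name, attrs=dict(attrs), parent=self)


def resolve_attr(group: Group, key: str) -> Any:
    """Return the value of `key` from the group itself or, failing that,
    from its nearest ancestor that defines it. Siblings and descendants
    are never consulted."""
    node: Optional[Group] = group
    while node is not None:
        if key in node.attrs:
            return node.attrs[key]
        node = node.parent
    raise KeyError(key)


# Assumed example hierarchy: root attributes act as Global Attributes.
root = Group("/", {"institution": "NASA", "source": "observations"})
cmip = root.child("cmip5", source="model")   # redefines "source"
ccsm = cmip.child("ccsm")                    # inherits from cmip5
ceres = root.child("ceres")                  # sibling of cmip5

print(resolve_attr(ccsm, "source"))       # value redefined in /cmip5
print(resolve_attr(ceres, "source"))      # value from the root group
print(resolve_attr(ccsm, "institution"))  # value from the root group
```

In this sketch the redefinition in cmip5 affects ccsm but not the sibling
group ceres, which still sees the root value.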

Our understanding is that this proposal is consistent and
backwards-compatible with CF. However, it would extend the current usage
of CF to files with arbitrary hierarchies of groups. Moreover, it might
be helpful to specifically disallow (or mark as having undefined
consequences) the use of Group Attributes to store metadata that should
always be attached directly to variables. Group Attributes such as
_FillValue, scale_factor, and valid_min might sometimes seem tempting yet
might create more problems than they would solve. Some attributes (e.g.,
Conventions) may be useful only as Global Attributes, and not as Group
Attributes for other groups.
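
As a rough sketch of how a checker might flag such usage, assume group
attributes are available as a mapping from group path to attribute
dictionary. The VARIABLE_ONLY set holds only the three names mentioned
above and is not a list endorsed by CF; the function name and the example
data are hypothetical.

```python
# Attribute names that, under this assumption, belong on variables only.
VARIABLE_ONLY = {"_FillValue", "scale_factor", "valid_min"}


def misplaced_group_attrs(group_attrs):
    """Map each group path to the sorted list of its attributes that
    appear in VARIABLE_ONLY. Groups with no such attributes are omitted."""
    report = {}
    for path, attrs in group_attrs.items():
        bad = sorted(VARIABLE_ONLY.intersection(attrs))
        if bad:
            report[path] = bad
    return report


# Assumed example: the /ceres group carries a _FillValue it should not.
example = {
    "/": {"Conventions": "CF-1.6"},
    "/ceres": {"_FillValue": -999.0, "title": "CERES OLR"},
}
report = misplaced_group_attrs(example)
print(report)
```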

What would a "group-aware" CF Convention mean in practice? It is
important to preserve CF backwards compatibility. The metadata
annotation of flat files (e.g., all netCDF3 files) need not be affected
by any "group-aware" CF Convention extensions.

Files with group hierarchies would continue to have Global Attributes
(i.e., Group Attributes at the root group level). Global Attributes are
almost always useful because they apply to the entire file except where
superseded by an attribute of the same name at a lower level. Where
group-oriented attribute conventions would help, we believe, is in
extending the power of CF unambiguously to nested groups.

Imagine a group file in which each top level group holds model results
from a distinct CMIP5 simulation (CCSM, ECMWF, GISS, etc.). Or where
each top level group holds a different satellite-retrieved value of the
same field (ERBE OLR, CERES OLR, etc.), or a different channel from the
same multi-spectral radiometer. It may be helpful to know the relation
of groups to other groups, so that users and tools can learn which are
(or aren't) intercomparable or aggregable. Analysis tools (such as NCO)
would benefit from learning certain properties of ensembles stored as
groups in an automated way, for example: Which groups contain the other
ensemble realizations? Which groups hold other channels of a
multi-spectral instrument? Knowing this information would help users and
analysis tools infer how best to create ensemble statistics, and could
significantly reduce the overall number of files confronting users.
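
One way a tool might answer such questions is sketched below, assuming a
hypothetical Group Attribute named "ensemble" whose value labels the
ensemble a group belongs to. The attribute name, its values, and the
group paths are illustrative assumptions, not part of any existing
convention.

```python
from collections import defaultdict

# Assumed layout: group path -> Group Attributes defined on that group.
groups = {
    "/ccsm": {"ensemble": "model_runs"},
    "/giss": {"ensemble": "model_runs"},
    "/ceres": {"ensemble": "olr_retrievals"},
    "/erbe": {"ensemble": "olr_retrievals"},
    "/notes": {},  # no "ensemble" attribute, so not part of any ensemble
}


def ensemble_members(groups):
    """Collect group paths by the value of their "ensemble" attribute."""
    members = defaultdict(list)
    for path, attrs in sorted(groups.items()):
        label = attrs.get("ensemble")
        if label is not None:
            members[label].append(path)
    return dict(members)


members = ensemble_members(groups)
for label, paths in members.items():
    print(label, paths)
```

A tool could then compute ensemble statistics over the paths listed under
each label without the user naming every group by hand.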

Finally, groups allow containerization of information which can be
useful in avoiding repetition. Some would like to define metadata-only
groups that could then be logically attached to apply to some or all
other groups in a file. Is it desirable for CF to define a standard way
to indicate this?

As the previous examples illustrate, there are at least two levels to a
discussion about "group-aware" CF. The first is scope, i.e., how
attribute meanings are inherited in hierarchies. The second is the more
pragmatic issue of what new CF attributes would allow us to exploit
group hierarchies in a systematic way. We proposed an answer to the
scope issue to kickstart the discussion. We illustrated how a new
attribute (call it "ensemble" for now) might be useful. At this stage we
wish to learn whether CF users/developers are interested in pursuing
"group-aware" CF extensions at all before we develop more
details/wording for specific conventions. Perhaps there are others
working on similar issues, or perhaps the CF maintainers prefer to
receive specific wording of proposals rather than more diffuse
"invitations to discuss" like this. If you have an opinion, then please
let us know.

Until the CF (or some other) Convention tackles the issues of scoping
and Group Attributes, such annotations will be ad hoc. Our goal is to
increase interoperability, and we are eager to hear responses from the
CF community on the direction of "group-aware" extensions to CF.

On behalf of the NASA ESDS HDF5 WG,
Charlie Zender, Ted Habermann, and Peter Leonard
-- 
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine 949-891-2429 )'(
Received on Sun Sep 15 2013 - 19:53:29 BST

