[CF-metadata] Multiple file datasets

From: Steve Hankin <Steven.C.Hankin>
Date: Fri, 20 Nov 2009 09:44:54 -0800

(My apologies for a rushed answer ... proposal deadlines today.)

The CF group discussed this topic at length in the context of how to
infer membership in a model ensemble: how can CF make it evident that
one model run is a close cousin of another's? The basic strategy that
emerged from those discussions was to embed the necessary semantics for
associating files into the global attributes of the files, rather than
to embed specific linkages into the files. Only one special global
attribute would be defined and rigidly standardized by name. It would in
turn list the names of the other global attributes that should be
consulted to determine ensemble membership. A match of values for all of
those attributes would indicate ensemble membership. For example:

    :ensemble_membership = "institution, model, run_date";
    :institution = "my_institution";
    :model = "my_model";
    :run_date = "my_run_date";

(or whatever -- I just pulled this from the air for illustration).
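
As a rough illustration of how an application might apply the matching
rule above, here is a minimal Python sketch. It is not part of any CF
proposal. It assumes the global attributes of each file have already
been read into a plain dict (for instance with a netCDF library), and
the function names, file names, and the "title" attribute are
hypothetical; only "ensemble_membership" and the three attributes it
lists come from the example above.

```python
from collections import defaultdict

# Name of the one standardized attribute, taken from the example above.
MEMBERSHIP_ATTR = "ensemble_membership"


def membership_key(global_attrs):
    """Return a hashable ensemble key for one file, or None.

    The key pairs each attribute name listed in `ensemble_membership`
    with that file's value for it. None means the file does not declare
    membership, or one of the listed attributes is missing.
    """
    spec = global_attrs.get(MEMBERSHIP_ATTR)
    if spec is None:
        return None
    names = [name.strip() for name in spec.split(",") if name.strip()]
    try:
        return tuple((name, global_attrs[name]) for name in names)
    except KeyError:
        return None


def group_by_ensemble(files):
    """Group file names whose listed attributes all have matching values.

    `files` maps a file name to a dict of its global attributes.
    """
    groups = defaultdict(list)
    for path, attrs in files.items():
        key = membership_key(attrs)
        if key is not None:
            groups[key].append(path)
    return dict(groups)


# Hypothetical scan results: file name -> global attributes.
common = {
    "ensemble_membership": "institution, model, run_date",
    "institution": "my_institution",
    "model": "my_model",
    "run_date": "my_run_date",
}
files = {
    "run_a.nc": {**common, "title": "member A"},
    "run_b.nc": {**common, "title": "member B"},
    "other.nc": {**common, "model": "another_model"},
    "plain.nc": {"title": "no membership attribute"},
}

groups = group_by_ensemble(files)
for key, members in groups.items():
    print(dict(key), members)
```

Because the key is built only from attribute values, the grouping does
not depend on where the files live or in what order they are scanned.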

I believe this is a powerful and general strategy -- applicable to
ensembles, gridspec, and I suspect the swath problem. On the DOWN side,
it means that file linkages are implicit rather than explicit -- i.e.
they must be inferred from the files' contents. On the UP side the
solution is

    * simple
    * general
    * human readable (in fact, friendly)
    * machine readable (CF awareness in an application would mean
      knowing the name(s) of the standardized global attributes)
    * stable (has no dependencies on file locations or the order of file
      creations)
    * robust (linkages between files can be recreated at any time)

This strategy could fit elegantly with things like the ncML "scan"
directives; smart scanners that are pointed to collections of files
(either local or remote THREDDS/OPeNDAP accessible) can build the
associations as needed. Ensemble membership, time-aggregation
membership, forecast series membership, gridspec membership -- all can
be properly ordered and sequenced (in principle) through intelligent
file scans based upon CF standardized contents.

    - Steve

================================

V. Balaji wrote:
> The gridspec indeed had a proposal about this. Clearly it was a bit
> off-topic, but some mechanism of referring to other files was needed. It
> consists of an attribute called a link_spec, which has attributes of a
> baseURL, a relative pathname, and a checksum for verifying whether the
> external file being referenced is indeed the one you're looking for.
> There wasn't a special var@link syntax, but I don't see why it couldn't
> have had one.
>
> CMIP5 is proposing a simplified variant on the link_spec. A file
> can have a global attribute "associated_files" whose entries are also
> formed out of a baseURL and relative pathnames. The only permitted
> associated_files are gridspec, and cell areas and volumes that may
> be used in cell_methods.
>
> Other approaches have been proposed in this forum, most notably on Trac
> #24 and #27, the common_concept thread and Benno's namespace thread.
>
> SAFE has been explained already in this thread.
>
> I agree with John, it would be good to consider this problem in
> isolation, without the baggage of gridspecs or common concepts or
> namespaces.
>
> John Caron writes:
>
>> This topic deserves its own heading, so here it is.
>>
>> Perhaps we should gather current practices and ideas. I think
>> Balaji's gridspec has a proposal about this. Can anyone summarize
>> what SAFE does?
>>
>> I'm imagining how this is actually used, e.g.:
>>
>> float data(y,x);
>> data:coordinates = "lat@file1 lon@file2";
>>
>> ????
>>
>>
>>
>> John Graybeal wrote:
>>> I like Bryan's recommendation for a UUID or similar.
>>>
>>> Now I'm going to be annoying and suggest the UUID *could* be a URI,
>>> or these days, an IRI (International ..).
>>>
>>> And I think the way of 'locating' the file should be neither in
>>> packaging nor in local resolution; it should be in global namespace
>>> resolution. This is the way of the future, and is already more
>>> 'permanent' than either packaging or local resolution, IMHO.
>>>
>>> There is one form of URI in particular that is already resolvable: a
>>> URL. OK, that's an old song, but I'm gonna stick to it for a while
>>> longer. That form meets all the other requirements: it can be
>>> registered in a resolver, it can be guaranteed unique (to the same
>>> authority level as a UUID, anyway), and it is a unique string that
>>> can be used to validate the link. And it has the obvious benefit of
>>> being resolvable right now, for as long as the domain is held and
>>> properly maintained (Good URLs don't die).
>>>
>>> Since the last paragraph risks starting another unique identifier
>>> war, I promise not to re-engage unless someone asks me to.
>>> Meanwhile, I like
>>>
>>> John
>>>
>>>
>>> On Nov 19, 2009, at 22:23, Bryan Lawrence wrote:
>>>
>>>> On Thursday 19 November 2009 19:40:08 Jonathan Gregory wrote:
>>>>>> ... In some cases, referencing attributes such as
>>>>>> "coordinates" and "ancillary_variables" would, ideally,
>>>>>> point to a
>>>>>> variable in a different dataset.
>>>>>
>>>>> This is a general problem to which CF doesn't have a solution
>>>>> because it was
>>>>> conceived as a convention for single netCDF files. However we need
>>>>> a solution
>>>>> as often several files should be treated as a single dataset.
>>>>>
>>>>> If the files don't overlap i.e. their contents are complementary,
>>>>> I think it
>>>>> should be satisfactory to allow variables in one file to be
>>>>> pointed to by name
>>>>> from another file, with no other mechanism being required within
>>>>> the file. I
>>>>> don't like the idea of naming one file within another file, as
>>>>> that would be
>>>>> very fragile. Instead, I think the file aggregation should be
>>>>> implied by
>>>>> simply defining the group of files which are to be treated as one
>>>>> file e.g.
>>>>> by putting them in one directory.
>>>>
>>>> It's the old ones that are the best ones :-) :-) this issue keeps
>>>> on coming back ... :-) :-) and we keep trying to ignore it ...
>>>>
>>>> I think we agree that an actual physical filename including path is
>>>> useless. We need both a relative link which relies on the
>>>> preservation of a group of files in a particular arrangement ...
>>>> AND an internal identifier so more robust linking mechanisms can be
>>>> used when (if) the data ends up in a managed environment.
>>>>
>>>> I think it's crucial in this situation to ensure that each file has
>>>> a unique identifier within it (created, for example, with uuid),
>>>> because all solutions which rely on packaging are fragile (SAFE is
>>>> probably better than most), but the bottom line is that users move
>>>> files around ... and we need some way of ensuring that we/they can
>>>> validate the links that are in place are the ones that were
>>>> originally intended.
>>>>
>>>> So relative links would also include the identifier of the intended
>>>> target as well as the relative path in operating system agnostic
>>>> terms.
>>>>
>>>> That identifier can be used in two ways: a) to validate the link (my
>>>> software can always check that the variable that I just opened
>>>> following a link from another one is the one that was expected by
>>>> checking the container identifier), and b) to produce an identifier
>>>> resolver service for the situation where the packaging has had to
>>>> be broken (which might occur for performance reasons or ...)
>>>>
>>>> CF could recommend something like this ...
>>>>
>>>> Bryan
>>>>
>>>> --
>>>> Bryan Lawrence
>>>> Director of Environmental Archival and Associated Research
>>>> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
>>>> STFC, Rutherford Appleton Laboratory
>>>> Phone +44 1235 445012; Fax ... 5848;
>>>> Web: home.badc.rl.ac.uk/lawrence
>>>> _______________________________________________
>>>> CF-metadata mailing list
>>>> CF-metadata at cgd.ucar.edu
>>>> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
>>>
>>>
>>> --------------
>>> I have my new work email address: jgraybeal at ucsd.edu
>>> --------------
>>>
>>> John Graybeal <mailto:jgraybeal at ucsd.edu>
>>> phone: 858-534-2162
>>> Development Manager
>>> Ocean Observatories Initiative Cyberinfrastructure Project:
>>> http://ci.oceanobservatories.org
>>> Marine Metadata Interoperability Project: http://marinemetadata.org
>>>
>>
>
Received on Fri Nov 20 2009 - 10:44:54 GMT