[CF-metadata] Towards recognizing and exploiting hierarchical groups (Charlie Zender - Steve Hankin - Richard Signell) from Charlie Zender on 2013-09-21 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Charlie Zender <zender>
Date: Fri, 20 Sep 2013 21:14:53 -0700

Hi Philip,

Please read my response to Bryan Lawrence's post on another thread.
It pertains to many of your points. And I may duplicate parts below...

> Hi All,
>
> I like Steve Hankin's point (below) about 'powerful' versus
> 'interoperable' . I hadn't thought about it quite that way before :-).
>
> From my point of view, I do see value in including hierarchical
> information. The most useful case I have seen mentioned so far involves
> putting datasets from different sources (eg different models and
> observations) into a single file. And I can see that there will be
> times when the choice of how to organize the hierarchy is sufficiently
> clear that it would be helpful. Hence, I think this is a valid
> discussion to be having :-).

Yes, groups are great for storing a collection of datasets of all
different ranks and sizes. Instead of juggling numerous files of
station data and model hyperslabs, intrepid researchers could keep
everything in one or a few hierarchical files.

We could start a whole thread on the use case of "return all the data
sets relevant to this bounding box". Practically impossible with
flat files because of all the grids that could be involved. Yet quite
plausible with groups. However, this case does not interest me as much
as MME (multi-model ensembles). Please be aware of the general issue of
file proliferation, though. Groups tend to reduce this problem.

> What I have not seen mentioned so far is the impact on file sizes. Our
> output simulations generate large datasets and it is impracticable to
> put all the data into a single file. Even if the operating system can
> handle Terabyte, or even Petabyte files, one will have problems
> transferring them and reading them into memory. Hence, for the datasets
> I deal with, we normally work with files each containing one variable
> from one source (and the hierarchy within a file of only one variable
> isn't very interesting ;-).

No one claims groups are always useful.
Some datasets are too big in raw form to make use of hierarchies.
After enough processing though, it may be convenient to merge them
into hierarchical form.
No one wants a suitcase bigger than they can carry.

> Hence, the best use case for using hierarchical structures _inside_ a
> file that I have seen so far is limited to situations where all the
> following are true:
>
> a) there are several different datasets which people would like to
> intercompare, and
>
> b) there is a clear and obvious way to organize the hierarchy, and
>
> c) the datasets are fairly small.

File sizes of 1 GB are not uncommon. That is roughly
321 years of monthly mean GCM data for one 1x1 degree surface field.
There's wiggle room in that for meaningful hierarchies.
For many users. Not all.
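
As a rough check of the 1 GB figure, here is a back-of-envelope sketch.
The 360 x 180 global grid, the 4-byte (single-precision) values, and the
absence of compression are assumptions, not details given in the post.

```python
# Back-of-envelope size of one surface field stored as monthly means.
# Assumptions (not stated in the post): a global 1x1 degree grid of
# 360 x 180 cells, 4-byte values, and no compression or metadata overhead.
N_LON = 360
N_LAT = 180
YEARS = 321
MONTHS_PER_YEAR = 12
BYTES_PER_VALUE = 4


def field_size_bytes(n_lon, n_lat, n_steps, bytes_per_value):
    """Raw size in bytes of an (n_steps, n_lat, n_lon) array."""
    return n_lon * n_lat * n_steps * bytes_per_value


size = field_size_bytes(N_LON, N_LAT, YEARS * MONTHS_PER_YEAR, BYTES_PER_VALUE)
print(f"{size} bytes = {size / 1e9:.3f} GB")
# prints "998438400 bytes = 0.998 GB"
```

Under those assumptions a single field comes in just under 1 GB, which is
consistent with the figure quoted above.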

> I think the case for putting the hierarchical information _outside_ the
> files is stronger. There is clearly no file size problem, and in fact
> might help by reducing the need to access large files. It would also be
> easier to update. There would still be a challenge to make sure that
> externally stored information stays synchronized with the actual
> datafiles, but I don't see that this should be a show stopper.

Thank you for addressing Point 3 of the "compelling ensemble" post.
Here we disagree on the inside vs. outside file storage paradigm.
When information needed for an analysis is not in the file, then
I think that the file is incomplete.
Self-describing datasets are great.
Self-describing and complete datasets are even better.
I prefer solutions that allow for completely self-describing datasets
in a single (netCDF API-accessible) format. I think many would agree.

How can we address Point 3 with flat files while respecting Bryan's
concerns about metadata getting out of sync, and still keep everything
in the netCDF-accessible file?

It's a hard task. I think group hierarchies and inheritable group
metadata provide the most natural solution.
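
To make "inheritable group metadata" concrete, here is a minimal sketch
of one possible lookup rule: an attribute missing on a group is taken
from its nearest ancestor. This is plain Python, not the netCDF API and
not an adopted CF convention. The Group class, the lookup method, and all
group and attribute names are hypothetical.

```python
# Illustrative only: a toy group tree with nearest-ancestor attribute
# lookup. Neither the class nor the rule comes from netCDF or CF.
class Group:
    def __init__(self, name, parent=None, **attrs):
        self.name = name
        self.parent = parent
        self.attrs = dict(attrs)

    def lookup(self, key):
        """Return the attribute from this group or its nearest ancestor."""
        node = self
        while node is not None:
            if key in node.attrs:
                return node.attrs[key]
            node = node.parent
        raise KeyError(key)


# Hypothetical ensemble layout: root -> model -> ensemble member.
root = Group("/", experiment="historical")
model = Group("model_a", parent=root, institution="Example Lab")
member = Group("member_1", parent=model, experiment="rcp85")

print(member.lookup("institution"))  # inherited from model_a
print(member.lookup("experiment"))   # overridden on member_1
print(model.lookup("experiment"))    # inherited from the root group
```

With a rule like this, metadata shared by a whole ensemble is written
once at the top of the hierarchy, so there is only one copy that could
go out of sync.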

Best,
cz

> In summary, I am not yet convinced that the value of allowing hierarchy
> inside files is worth it. I do see greater value in storing hierarchy
> information externally or allowing it to be generated from something
> like the 'dot-appending' system suggested by Steve Hankin (in his Sept
> 16 email).
>
> Best wishes,
>
> Philip

-- 
Charlie Zender, Earth System Sci. & Computer Sci.
University of California, Irvine 949-891-2429 )'(

Received on Fri Sep 20 2013 - 22:14:53 BST
