[CF-metadata] Multiple file datasets

From: Jonathan Gregory <j.m.gregory>
Date: Sun, 22 Nov 2009 21:20:10 +0000

Dear all

Using metadata to describe a file internally is much more robust than encoding
filenames in a file, and therefore seems preferable to me. I am concerned that
the gridspec proposal suggests using relative filenames in an attribute in the
file (associated_files). If I understand correctly, that would mean it would be
broken if you chose to give your local copies of CMIP5 files different names
or store them in a different directory arrangement from the one they have in
the archive. (I don't keep my CMIP3 data in the same arrangement as it is held
in PCMDI.)

I assume that CMIP5 CF-netCDF files will have other metadata in them
identifying the institution and so on, as global attributes, along the lines
that Steve mentions. But to make life easier for software which wants to verify
the relationship of the files, a UUID or a URL would be useful, wouldn't it?
I would suggest that those responsible for archiving and distributing CMIP5
data (PCMDI and others) could assign a unique identifier for each model-
scenario-ensemble_member as the data is made available by the modelling centre.
These are data-spaces within which CF metadata will distinguish every datum,
but the data-spaces themselves need to be distinguished. A unique identifier
would be more robust than a combination of more descriptive metadata, as well
as easier to process, I would say.

Tagged with a unique identifier (UUID or URL), there should be no need for
a checksum on the file. There should be only one file having the right gridspec
details for a given UUID. References within the file to variables in other
files can be just by variable name, without any extended @ syntax. But it is
too restrictive to insist that variable names can't be repeated within the
group of files sharing a UUID, since often they will be: if the files are
organised by time ranges, for instance, there will be many time coord variables
and data variables with part of the data, and it would be a great nuisance to
have to give them all unique variable names. I suggest that the rule should be,
when a variable is referred to by name,

* if a variable of that name exists within the same file, it is the one meant.

* if it doesn't exist in the file, there must be only one variable with this
name anywhere in the set of files.
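
To illustrate, the two-part rule above could be sketched in Python roughly as
follows. This is only an illustration under stated assumptions: each file is
modelled as a plain set of variable names rather than a real netCDF file, and
the function name resolve_variable and the file labels are hypothetical, not
part of CF or gridspec.

```python
def resolve_variable(name, current_file, files):
    """Return the label of the file that holds the variable `name`.

    `files` maps a file label to the set of variable names in that file
    (a stand-in for real netCDF files; the labels are hypothetical).
    """
    # Rule 1: a variable of that name in the referring file is the one meant.
    if name in files[current_file]:
        return current_file
    # Rule 2: otherwise exactly one file in the rest of the set may define it.
    holders = [label for label, names in files.items()
               if label != current_file and name in names]
    if len(holders) == 1:
        return holders[0]
    if not holders:
        raise LookupError(f"variable {name!r} not found in any file")
    raise LookupError(f"variable {name!r} is ambiguous: {sorted(holders)}")


# Files organised by time range repeat "time" and "tas"; the grid file is shared.
files = {
    "tas_1850-1899": {"time", "tas"},
    "tas_1900-1949": {"time", "tas"},
    "grid": {"areacella"},
}
local = resolve_variable("time", "tas_1900-1949", files)        # same-file match
remote = resolve_variable("areacella", "tas_1850-1899", files)  # unique elsewhere
```

A reference to "tas" from the grid file would raise an error here, since two
other files define a variable of that name and neither is the referring file.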

However, I also think that using unique identifiers to define groups will
sometimes be too restrictive. I would like it to be possible to treat any
set of files I choose as a single dataset. For example, since a file with
cell measures like area and volume, or a gridspec file, will often be the same
for many or all experiments using a given model, I may wish to have only one
copy of it, and use it for all the different datasets, even though they have
different unique identifiers. Another example is that I might wish to treat
data from different ensemble members of an experiment as part of the same
dataset, so I can aggregate it with an ensemble axis and then compute stats
by collapsing that axis. This flexibility can easily be achieved simply by
allowing the option of ignoring the unique identifier, but still allowing
references from one file to another by variable name alone. In this more
flexible case, how the group of files is identified depends on the software
being used, of course. This is the principle of the cdms cdscan tool, for
example, which will treat any arbitrary group of files as a single dataset.

To summarise, therefore, I suggest that
* a CF attribute should be defined to store a unique identifier for a dataset
spread across an arbitrary set of files. Within the dataset, other CF metadata
must be sufficient to identify each datum uniquely.
* CMIP5 should assign such an identifier for each run submitted and CMOR should
be able to record it in the files generated. I think these unique IDs would
actually be quite handy for keeping track of CMIP5 data.
* gridspec could refer to variables by name alone, with no associated_files
attribute or checksum being required, as the unique id will serve those
purposes.
* CF should permit variables to be referred to by name alone in another file,
subject to rules (such as proposed above). It is a decision of the data user
whether the associated files should be required to have the same unique ID.
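
As a rough sketch of how software might apply these suggestions, the Python
below groups files by a unique identifier held in a global attribute, with an
option for the user to ignore the identifier and treat all the files supplied
as one dataset. The attribute name dataset_id, the function name group_files
and the file labels are assumptions made for illustration; no such CF
attribute has been defined, and files are modelled as plain dictionaries of
global attributes rather than real netCDF files.

```python
from collections import defaultdict


def group_files(file_attrs, id_attr="dataset_id", ignore_id=False):
    """Group file labels into datasets.

    `file_attrs` maps a hypothetical file label to a dict of that file's
    global attributes. `id_attr` is an assumed attribute name.
    """
    if ignore_id:
        # The user chooses to treat every file supplied as one dataset.
        return {None: sorted(file_attrs)}
    groups = defaultdict(list)
    for label in sorted(file_attrs):
        # Files lacking the attribute are collected under the key None.
        groups[file_attrs[label].get(id_attr)].append(label)
    return dict(groups)


file_attrs = {
    "run1_tas": {"dataset_id": "id-A"},
    "run1_pr": {"dataset_id": "id-A"},
    "run2_tas": {"dataset_id": "id-B"},
    "grid": {},  # shared grid file carrying no identifier
}
by_id = group_files(file_attrs)
everything = group_files(file_attrs, ignore_id=True)
```

With ignore_id=True the shared grid file and both runs fall into a single
group, which is the more flexible behaviour described above; the choice is
left to the data user.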

Best wishes

Jonathan
Received on Sun Nov 22 2009 - 14:20:10 GMT

