Draft CF data model

Proposed version 0.7

In CF trac ticket 88, proposed by Mark Hedley and accepted on 5th August 2012, it has been decided that CF should adopt a data model. The data model will be a logical abstraction of the concepts of CF data and metadata, and the relationships that exist between these concepts, but will not define an application programming interface (API) for CF. Adopting a data model is believed to offer the following benefits:

The present document proposes a data model corresponding to the CF metadata standard (version 1.5). The data model avoids prescribing more than is needed for interpreting CF as it stands, in order to avoid inconsistency with future developments of CF. This document is illustrated by the accompanying UML diagram of the data model.

As well as describing the CF data model, this document comments on how it is implemented in netCDF. Since the CF data model could be implemented in file formats other than netCDF, it would be logically better to put the information about CF-netCDF in a separate document. However, when introducing the data model for the first time, we feel that this document would be harder to understand if it omitted references to netCDF. We propose that these functions should be separated in a later version of the data model. Some parts of the CF standard arise specifically from the requirements or restrictions of the netCDF file format, or are concerned with efficient ways of storing data on disk; these parts are not logically part of the data model and are only briefly mentioned in this document.

In this document, we use the word "construct" because we feel it to be a more language-neutral term than "object" or "structure". The constructs of this data model might correspond to objects in an object-oriented (OO) language.

Field construct

The central concept of the data model is a field construct. In a dataset contained in a single netCDF file, each data variable usually corresponds to a field construct, but a field construct might be a combination of several data variables. In a dataset comprising several netCDF files, a field construct may span data variables in more than one file, for instance from different ranges of a time coordinate (to be introduced by Gridspec in CF version 1.7). Rules for aggregating data variables from one or several files into a single field construct are needed but are not defined by CF version 1.5; such rules are regarded as the concern of data processing software.

This data model makes a central assumption that each field construct is independent. Data variables stored in CF-netCDF files are often not independent, because they share coordinate variables. However, we view this solely as a means of saving disk space, and we assume that software will be able to alter any field construct in memory without affecting other field constructs. For instance, if the coordinates of one field construct are modified, it will not affect any other field construct. Explicit tests of equality will be required to establish whether two data variables have the same coordinates. Such tests are necessary in general if CF is applied to a dataset comprising more than one file, because different variables may then reside in different files, with their own coordinate variables. In a netCDF file, tests for the equality of coordinates between different data variables may be simplified if the data variables refer to the same coordinate variable.
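The independence assumption and the explicit equality test described above can be sketched in a few lines of Python. This is only an illustration: the classes `Coordinate` and `Field` and the functions `read_fields` and `same_coordinates` are hypothetical names invented here, not part of CF or of any CF software.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Coordinate:
    """Hypothetical stand-in for a dimension coordinate construct."""
    name: str
    values: list
    units: str = ""


@dataclass
class Field:
    """Hypothetical stand-in for a field construct."""
    name: str
    data: list
    coordinates: dict = field(default_factory=dict)


def read_fields(file_coordinate):
    """Build two fields from data variables that share one coordinate variable.

    Each field receives its own deep copy, so the sharing in the file does
    not carry over into memory.
    """
    temperature = Field("air_temperature", [280.0, 285.0, 290.0],
                        {"lat": copy.deepcopy(file_coordinate)})
    pressure = Field("air_pressure", [1000.0, 1005.0, 1010.0],
                     {"lat": copy.deepcopy(file_coordinate)})
    return temperature, pressure


def same_coordinates(a, b):
    """Explicit equality test: compare coordinates by value, not identity."""
    return a.coordinates == b.coordinates


lat = Coordinate("lat", [-45.0, 0.0, 45.0], "degrees_north")
temperature, pressure = read_fields(lat)
equal_before = same_coordinates(temperature, pressure)

# Modifying one field's coordinate leaves the other field untouched.
temperature.coordinates["lat"].values[0] = -60.0
equal_after = same_coordinates(temperature, pressure)
```

Before the modification the two fields compare equal by value even though they hold separate copies; afterwards they no longer do, and the second field's coordinate is unchanged.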

Each field construct may have

All the components of the field construct bar the data array are optional.

Collectively, the domain axis, dimension coordinate, auxiliary coordinate, cell measure and cell method constructs describe the domain in which the data resides. Thus a field construct can be regarded as a domain with data in that domain.

The CF-netCDF formula_terms (see also Transform constructs) and ancillary_variables attributes make links between field constructs. These links are fragile. If a field construct is written to a file, it is not required that any other field constructs to which it is linked are also written to the file. If an operation alters one field construct in a way which could invalidate a relationship with another field construct, the link should be broken. The user of software will have to be aware of these relationships and remake them if applicable and useful.

Domain axis construct

A domain axis construct must contain

Dimension coordinate construct

A dimension coordinate construct indicates the physical meaning and locations of the cells for a unique domain axis of the field.

A dimension coordinate construct may contain

In this data model we permit a domain axis not to have a coordinate array if there is no appropriate numeric monotonic coordinate. That is the case for a dimension that runs over ocean basins or area types, for example, or for a domain axis that indexes timeseries at scattered points. Such domain axes do not correspond to a continuous physical quantity. (They will be called index dimensions in CF version 1.6.)

Auxiliary coordinate construct

An auxiliary coordinate construct provides auxiliary information for interpreting the cells of an ordered list of one or more domain axes of the field.

An auxiliary coordinate construct must contain

and may also contain

Auxiliary coordinate constructs correspond to auxiliary coordinate variables named by the coordinates attribute of a data variable in a CF-netCDF file. CF recommends that there be auxiliary coordinate constructs of latitude and longitude if there is two-dimensional horizontal variation but the horizontal coordinates are not latitude and longitude. As for dimension coordinate constructs, auxiliary coordinate constructs of different field constructs are independent in the data model.

Cell measure construct

A cell measure construct provides information about the size, shape or location of the cells defined by an ordered list of one or more domain axes of the field.

A cell measure construct may contain

and must contain

In CF-netCDF files, cell measure constructs correspond to variables named by the cell_measures attribute of the data variable. As for dimension coordinate constructs, cell measure constructs of different field constructs are independent in the data model.

Cell methods construct

The cell methods construct describes how the data values represent variation of the quantity within cells. It corresponds to the cell_methods attribute of the data variable in CF-netCDF files. It is an ordered list, because the methods specified are not necessarily commutative. Each entry of the list specifies either one or more dimensions, or a CF standard name (to describe variation with respect to a quantity which is not recorded as a dimension of the field), and a method, e.g. mean (CF Appendix E). Special methods indicate climatological time processing.
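As an illustration of the ordered-list structure, the following Python sketch parses a simplified cell_methods string into a list of entries. It assumes only the basic "name: method" form, with several names allowed before one method; the "where", "over", "within" and parenthesised-comment forms of the attribute are deliberately not handled. The names `CellMethod` and `parse_cell_methods` are hypothetical.

```python
from typing import NamedTuple, Tuple


class CellMethod(NamedTuple):
    """One entry of the ordered list: the names it applies to and the method."""
    names: Tuple[str, ...]
    method: str


def parse_cell_methods(text):
    """Parse the simplified form "name: [name: ...] method ..." in order."""
    entries = []
    names = []
    for token in text.split():
        if token.endswith(":"):
            # A dimension name or standard name; the method follows later.
            names.append(token[:-1])
        else:
            if not names:
                raise ValueError("method %r has no preceding name" % token)
            entries.append(CellMethod(tuple(names), token))
            names = []
    if names:
        raise ValueError("names %r have no method" % names)
    return entries


first = parse_cell_methods("lat: lon: mean time: maximum")
second = parse_cell_methods("time: maximum lat: lon: mean")
```

Because a Python list preserves order, `first` and `second` compare unequal even though they contain the same two entries, which mirrors the point that the methods are not necessarily commutative.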

Transform constructs

A transform construct defines a formula for transforming one group of dimension or auxiliary coordinates into another, consistent group of dimension or auxiliary coordinates for the same domain.

Either of these groups of coordinates may not exist, in which case it may be created by applying the transformation, inverting the formula if necessary.

A transform construct contains

Transform constructs correspond to the functions of the CF-netCDF attributes formula_terms, which describes how to compute a vertical coordinate variable from components (CF Appendix D), and grid_mapping, which describes how to transform between longitude-latitude coordinates and the horizontal coordinates of the field construct (CF Appendix F). The transform name is the standard_name of a vertical coordinate variable with formula_terms, and the grid_mapping_name of a grid_mapping variable. The scalar parameters are scalar data variables (which should have units if dimensional) named by formula_terms, and attributes of grid_mapping variables (for which the units are specified by the transform construct). The role of each term in the formulae of the transform construct is identified by its keyword in a formula_terms attribute, or its attribute name in a grid_mapping variable.
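A vertical transform of this kind can be sketched in Python using the atmosphere sigma coordinate of CF Appendix D, whose formula is p = ptop + sigma * (ps - ptop). The sketch reduces the problem to a single column with a scalar surface pressure, which is a simplification made here; the function names `parse_formula_terms` and `sigma_to_pressure` are hypothetical.

```python
def parse_formula_terms(text):
    """Map each keyword of a formula_terms string to its variable name."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("formula_terms must be 'keyword: variable' pairs")
    terms = {}
    for keyword, variable in zip(tokens[0::2], tokens[1::2]):
        if not keyword.endswith(":"):
            raise ValueError("expected 'keyword:' but found %r" % keyword)
        terms[keyword[:-1]] = variable
    return terms


def sigma_to_pressure(sigma, ps, ptop):
    """Apply p = ptop + sigma * (ps - ptop) to every level of one column."""
    return [ptop + s * (ps - ptop) for s in sigma]


# The keyword identifies the role of each term; the value names the variable
# that supplies it (the variable names here are illustrative).
terms = parse_formula_terms("sigma: lev ps: surface_pressure ptop: model_top")

# Illustrative values in Pa for a single column.
variables = {"lev": [0.0, 0.5, 1.0],
             "surface_pressure": 100000.0,
             "model_top": 1000.0}
pressure = sigma_to_pressure(variables[terms["sigma"]],
                             variables[terms["ps"]],
                             variables[terms["ptop"]])
```

The keywords of formula_terms, rather than the variable names, determine which value plays which role in the formula, as described above.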

Other properties

The other properties recognised by this CF data model correspond to attributes listed in CF Appendix A. For field constructs, the allowed properties are comment, history, institution, long_name, references, source, standard_error_multiplier, standard_name, title, units. Some of these can be global attributes in a CF-netCDF file. In this data model, it is assumed that any relevant global attribute is also an attribute of every data variable, although it is superseded if the data variable has its own attribute. Each field construct in the model has its own independent set of properties.

For dimension and auxiliary coordinate constructs, the allowed properties are axis, calendar, leap_month, leap_year, long_name, month_lengths, positive, standard_name, units. Coordinate constructs of time are optionally climatological; this property is indicated by the presence of the climatology attribute. In any field, any given value of the axis attribute can occur no more than once among all the dimension and auxiliary coordinates of that field.

The CF data model allows field, dimension and auxiliary coordinate constructs to have other properties not defined by CF, provided they do not conflict with CF, but since they are not part of the CF standard, the data model does not provide any interpretation of them.
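Two of the rules above lend themselves to a short Python sketch: a global attribute is inherited by every field unless the data variable supplies its own value, and a given axis value may appear at most once among the coordinates of a field. The set `GLOBAL_CAPABLE` is an assumption made for this example about which field properties may appear as global attributes; the function names are hypothetical.

```python
from collections import Counter

# Assumption for this sketch: the field properties that may be global.
GLOBAL_CAPABLE = {"comment", "history", "institution",
                  "references", "source", "title"}


def field_properties(global_attrs, variable_attrs):
    """Combine attributes; the data variable's own value supersedes the global."""
    properties = {name: value for name, value in global_attrs.items()
                  if name in GLOBAL_CAPABLE}
    properties.update(variable_attrs)
    return properties


def duplicated_axis_values(coordinates):
    """Return the axis values used by more than one coordinate of a field."""
    counts = Counter(c["axis"] for c in coordinates if "axis" in c)
    return sorted(axis for axis, n in counts.items() if n > 1)


properties = field_properties(
    {"title": "Example run", "institution": "Centre A", "Conventions": "CF-1.5"},
    {"institution": "Centre B", "units": "K"},
)

duplicates = duplicated_axis_values([
    {"name": "lat", "axis": "Y"},
    {"name": "lat2d", "axis": "Y"},   # violates the uniqueness rule
    {"name": "time", "axis": "T"},
    {"name": "basin"},                # no axis attribute at all
])
```

Each call to `field_properties` returns a new dictionary, so each field keeps its own independent set of properties.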

The attributes valid_max, valid_min and valid_range of data variables and coordinate variables are checks on the validity of the values, which could be verified on input and written on output. In this CF data model we assume they do not constrain any manipulations which might be done on the data in memory, and they are not part of the data model.

The attributes _FillValue and missing_value of data variables specify how missing data is indicated in the data array. This data model supports the idea of missing data, but does not depend on any particular method of indicating it, so these attributes are not part of the model.

The attributes add_offset, compress, flag_masks, flag_meanings, flag_values and scale_factor are all used in methods of compressing the data to save space in CF-netCDF files, with or without loss of information. They are not part of this data model because these operations do not logically alter the data, except that the compress attribute implies two alternative interpretations of coordinates (compressed or uncompressed).

The "feature type" attribute and associated new conventions, to be introduced in CF version 1.6, will provide a way of packing multiple fields of the same kind of discrete sampling geometry (timeseries, trajectories, etc.) into a single CF-netCDF data variable, in order to save space, since a multidimensional representation with common coordinate variables is typically very wasteful in such cases. This is a kind of compression. The data model would regard each instance of the feature type as an independent field construct. However, the "feature type" attribute itself is also a metadata property that would be a property of the field construct and part of the data model.
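To illustrate why the packing and missing-data attributes sit outside the data model, the following Python sketch unpacks stored values with the CF-netCDF relation unpacked = packed * scale_factor + add_offset. Representing missing data in memory as `None` is a choice made only for this example; the function name `unpack` and its defaults are hypothetical.

```python
def unpack(packed, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Recover in-memory values from the packed values stored in a file."""
    unpacked = []
    for value in packed:
        if fill_value is not None and value == fill_value:
            # Missing data; the in-memory marker is independent of the file's.
            unpacked.append(None)
        else:
            unpacked.append(value * scale_factor + add_offset)
    return unpacked


# Packed integers as they might be stored, with -999 marking missing data.
values = unpack([0, 10, -999, 20],
                scale_factor=0.5, add_offset=273.0, fill_value=-999)
```

Once the values are unpacked, nothing in memory depends on scale_factor, add_offset or the fill value used on disk, which is the sense in which these attributes are not part of the data model.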

The attributes bounds, cell_measures, cell_methods, climatology, Conventions, coordinates, formula_terms and grid_mapping have various special or structural functions in the CF-netCDF file format. Their functions and the relationships they indicate are reflected in the structure of this data model, and these attributes do not correspond directly to properties in the data model.

17th December 2012
Version 0.6 of 12th December 2012
Version 0.5 of 16th October 2012
Version 0.4 of 5th August 2012
Version 0.3 of 6th February 2012
Version 0.2 of 1st August 2011
Original version 0.1 of 10th January 2011

Jonathan Gregory, David Hassell and Mark Hedley