The individual features within a collection need not contain the same number of elements. For instance, observed in situ time series will commonly contain different numbers of time points, reflecting different deployment dates of the instruments. Other data sources, such as the output of numerical models, commonly generate features of identical size. CF offers multiple representations to allow the storage to be optimized for the character of the data. Four types of representation are utilized in this chapter:
two multidimensional array representations, in which each feature instance is allocated the identical amount of storage space. In these representations the instance dimension and the element dimension(s) are distinct CF coordinate axes (typical of coordinate axes discussed in chapter 4); and
two ragged array representations, in which each feature is provided with the minimum amount of space that it requires. In these representations the instances of the individual features are stacked sequentially along the same array dimension as the elements of the features; we refer to this combined dimension as the sample dimension.
In the multidimensional array representations, data variables have both an instance dimension and an element dimension. The dimensions may be given in any order. If there is a need for either the instance or an element dimension to be the netCDF unlimited dimension (so that more features or more elements can be appended), then that dimension must be the outer dimension of the data variable i.e. the leading dimension in CDL.
In the ragged array representations, the instance dimension (i), which sequences the individual features within the collection, and the element dimension, which sequences the data elements of each feature (o and p), both occupy the same dimension (the sample dimension). If the sample dimension is the netCDF unlimited dimension, new data can be appended to the file.
In all representations, the instance dimension (which is also the sample dimension in ragged representations) may be set initially to a size that is arbitrarily larger than what is required for the features which are available at the time that the file is created. Allocating unused array space in this way (pre-filled with missing values; see also section 9.6, Missing data) can be useful as a means to reserve space for features that will be added at a later time.
The orthogonal multidimensional array representation, the simplest representation, can be used if each feature instance in the collection has identical coordinates along the element axis of the features. For example, for a collection of timeSeries that share a common set of times, or a collection of profiles that share a common set of vertical levels, this is likely to be the natural representation to use. In both examples, there will be longitude and latitude coordinate variables, x(i), y(i), that are one-dimensional and defined along the instance dimension.
Table 9.2 illustrates the storage of a data variable using the orthogonal multidimensional array representation. The data variable holds a collection of 4 features. The individual features, distinguished by color, are sequenced along the horizontal axis by the instance dimension indices, i1, i2, i3, i4. Each instance contains three elements, sequenced along the vertical with element dimension indices, o1, o2, o3. The i and o subscripts would be interchanged (i.e. Table 9.2 would be transposed) if the element dimension were the netCDF unlimited dimension.
| (i1, o1) | (i2, o1) | (i3, o1) | (i4, o1) |
| (i1, o2) | (i2, o2) | (i3, o2) | (i4, o2) |
| (i1, o3) | (i2, o3) | (i3, o3) | (i4, o3) |
Table 9.2. The storage of a data variable using the orthogonal multidimensional array representation (subscripts in CDL order).
The instance variables of a dataset corresponding to Table 9.2 will be one-dimensional with size 4 (for example, the latitude locations of timeSeries):
| lat(i1) | lat(i2) | lat(i3) | lat(i4) |
The element coordinate axis will be one-dimensional with size 3 (for example, the time coordinates that are shared by all of the timeSeries):
| time(o1) | time(o2) | time(o3) |
This representation is consistent with the multidimensional fields described in chapter 5; the characteristic that distinguishes it from the typical chapter 5 case (though it is not incompatible with it) is that the instance dimension is a discrete axis (see section 4.5).
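The layout in Table 9.2 can be sketched in plain Python as a nested list. This is only an illustration under stated assumptions: the coordinate values, the data values, and the helper names `feature` and `transpose` are hypothetical, and the instance dimension is assumed to be the leading dimension.

```python
# Hypothetical values; shapes mirror Table 9.2 (4 instances, 3 elements).
lat = [10.0, 20.0, 30.0, 40.0]   # instance variable, lat(i)
time = [0.0, 6.0, 12.0]          # shared element coordinate, time(o)

# Data variable with the instance dimension leading: data[i][o].
# The value 100*(i+1) + (o+1) just encodes the (i, o) position.
data = [[100 * (i + 1) + (o + 1) for o in range(len(time))]
        for i in range(len(lat))]


def feature(data, i):
    """Return every element of feature i (one row of data[i][o])."""
    return data[i]


def transpose(data):
    """Swap the instance and element dimensions, giving data[o][i]."""
    return [list(column) for column in zip(*data)]
```

Because every feature shares the same element coordinate, `time` is stored once and `feature(data, i)` pairs element by element with it; `transpose` shows the interchanged ordering that would apply if the element dimension were the unlimited one.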
The incomplete multidimensional array representation can be used if the features within a collection do not all have the same number of elements, but sufficient storage space is available to allocate the number of elements required by the longest feature to all features. That is, features that are shorter than the longest feature must be padded with missing values to bring all instances to the same storage size. This representation sacrifices storage space to achieve simplicity for reading and writing.
Table 9.3 illustrates the storage of a data variable using the incomplete multidimensional array representation. The data variable holds a collection of 4 features. The individual features, distinguished by color, are sequenced by the instance dimension indices, i1, i2, i3, i4. The instances contain respectively 2, 4, 3 and 6 elements, sequenced by the element dimension index with values of o1, o2, o3, .... The i and o subscripts would be interchanged (i.e. Table 9.3 would be transposed) if the element dimension were the netCDF unlimited dimension.
| (i1, o1) | (i2, o1) | (i3, o1) | (i4, o1) |
| (i1, o2) | (i2, o2) | (i3, o2) | (i4, o2) |
|          | (i2, o3) | (i3, o3) | (i4, o3) |
|          | (i2, o4) |          | (i4, o4) |
|          |          |          | (i4, o5) |
|          |          |          | (i4, o6) |
Table 9.3. The storage of data using the incomplete multidimensional array representation (subscripts in CDL order).
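The padding step described for the incomplete multidimensional array representation can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the fill value `-9999.0` and the function names `pad_features` and `unpad` are assumptions made for the example, and the feature lengths (2, 4, 3, 6) follow Table 9.3.

```python
FILL = -9999.0  # hypothetical missing-value marker, chosen for illustration


def pad_features(features, fill=FILL):
    """Pad every feature to the length of the longest one using `fill`."""
    longest = max((len(f) for f in features), default=0)
    return [list(f) + [fill] * (longest - len(f)) for f in features]


def unpad(row, fill=FILL):
    """Recover the original elements of one padded feature."""
    return [value for value in row if value != fill]


# Four features with 2, 4, 3 and 6 elements, as in Table 9.3.
features = [
    [1.0, 2.0],
    [3.0, 4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
]
padded = pad_features(features)  # rectangular: 4 instances x 6 elements
```

The padded array holds 24 cells for 15 real values, which is the storage cost this representation accepts in exchange for simple rectangular indexing.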
The contiguous ragged array representation can be used only if the size of each feature is known at the time that it is created. In this representation the data for each feature will be contiguous on disk, as shown in Table 9.4.
(i1, o1) |
(i1, o2) |
(i2, o1) |
(i2, o2) |
(i2, o3) |
(i2, o4) |
(i3, o1) |
(i3, o2) |
(i3, o3) |
(i4, o1) |
(i4, o2) |
(i4, o3) |
(i4, o4) |
(i4, o5) |
(i4, o6) |
Table 9.4. The storage of data using the contiguous ragged representation (subscripts in CDL order).
In this representation, the file contains a count variable, which must be of type integer and must have the instance dimension as its sole dimension. The count variable contains the number of elements that each feature has. For the collection shown in Table 9.4, its values are:
| count(i1) | count(i2) | count(i3) | count(i4) |
| 2 | 4 | 3 | 6 |
This representation and its count variable are identifiable by the presence of an attribute, sample_dimension, found on the count variable, which names the sample dimension being counted. For indices that correspond to features whose data have not yet been written, the count variable should have a value of zero or a missing value.
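A reader of the contiguous ragged array representation can locate each feature by accumulating the count variable. The sketch below illustrates that arithmetic in plain Python; the function name `split_contiguous` is hypothetical, and representing a missing count as `None` is an assumption made only for this example.

```python
from itertools import accumulate


def split_contiguous(sample_values, count):
    """Split values stored along the sample dimension into features.

    `count[i]` is the number of elements of feature i. A missing count
    (represented here as None, by assumption) is treated as zero.
    """
    sizes = [c or 0 for c in count]
    ends = list(accumulate(sizes))   # exclusive end offset of each feature
    starts = [0] + ends[:-1]         # start offset of each feature
    return [sample_values[s:e] for s, e in zip(starts, ends)]


# Counts from the table above; the sample values are just positions 0..14.
count = [2, 4, 3, 6]
sample_values = list(range(sum(count)))
features = split_contiguous(sample_values, count)
```

Each feature occupies one contiguous slice of the sample dimension, which is why the size of a feature has to be known when it is written.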
The indexed ragged array representation stores the features interleaved along the sample dimension in the data variable, as shown in Table 9.5. The canonical use case for this representation is the storage of real-time data streams that contain reports from many sources; the data can be written as it arrives.
| (i1, o1) | 0 |
| (i2, o1) | 1 |
| (i3, o1) | 2 |
| (i4, o1) | 3 |
| (i4, o2) | 3 |
| (i2, o2) | 1 |
| (i4, o3) | 3 |
| (i4, o4) | 3 |
| (i1, o2) | 0 |
| (i2, o3) | 1 |
| (i3, o2) | 2 |
| (i4, o5) | 3 |
| (i3, o3) | 2 |
| (i2, o4) | 1 |
| (i4, o6) | 3 |
Table 9.5. The storage of data using the indexed ragged representation (subscripts in CDL order). The left hand side of the table illustrates a data variable; the right hand side of the table contains the values of the index variable.
In this representation, the file contains an index variable, which must be of type integer, and must have the sample dimension as its single dimension. The index variable contains the zero-based index of the feature to which each element belongs. This representation is identifiable by the presence of an attribute, instance_dimension, on the index variable, which names the dimension of the instance variables. For those indices of the sample dimension into which data have not yet been written, the index variable should be pre-filled with missing values.
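Reassembling features from the indexed ragged array representation amounts to grouping the sample values by the index variable. The following is a minimal sketch in plain Python; the function name `group_by_index` and the use of `-1` as the missing-value marker are assumptions for illustration, and the index values are those of the indexed ragged table above.

```python
def group_by_index(sample_values, index, n_instances, missing=-1):
    """Collect interleaved sample values into one list per feature.

    `index[k]` is the zero-based feature that sample k belongs to.
    Samples whose index equals `missing` (a hypothetical fill value)
    have not been written yet and are skipped.
    """
    features = [[] for _ in range(n_instances)]
    for value, i in zip(sample_values, index):
        if i == missing:
            continue
        features[i].append(value)
    return features


# Index variable values from the indexed ragged table.
index = [0, 1, 2, 3, 3, 1, 3, 3, 0, 1, 2, 3, 2, 1, 3]
sample_values = list(range(len(index)))  # positions along the sample dimension
features = group_by_index(sample_values, index, n_instances=4)
```

Within each feature the elements keep the order in which they were written, so a writer can append reports as they arrive without knowing the final size of any feature.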