Opened 10 years ago

Closed 10 years ago

#63 closed enhancement (fixed)

GRIDSPEC: aggregation of block-structured and time dependent data

Reported by: pletzer Owned by: cf-conventions@…
Priority: medium Milestone:
Component: cf-conventions Version:
Keywords: gridspec, aggregation, mosaic, tile Cc: libcf Development <libcf-devel@…>, V.Balaji@…, Jonathan Gregory <j.m.gregory@…>

Description

1. Title

GRIDSPEC to aggregate block-structured and time dependent data

2. Moderator

Balaji (v.balaji@…)

3. Requirement

Current conventions support logical rectangular data stored in a single file. The proposed extension will allow users to aggregate data stored in multiple files, whether the data represent different time slices or are associated with different logically rectangular grids as in the case of the cubed sphere mosaic for instance.

4. Initial Statement of Technical Proposal

GRIDSPEC comprises two parts: M-SPEC for representing the grid connectivity in mosaics and F-SPEC for time and mosaic data aggregation.

For M-SPEC, a mosaic file contains the list of files where grid information can be extracted with inter-tile connectivity information provided as a map between the set of indices on one tile to indices on the neighbouring tile. This covers both surfacial and volumetric coupling. The former arises for instance when two three-dimensional grids share a surface while the latter arises when the grids are overlapping.

The F-SPEC describes the format of the host file, which acts as a single entry point to the aggregation. Variable data scattered among many files appear as one logical entity when viewed through the host file. Brittleness, in other words the fragility of the aggregation resulting from inadvertent file corruption, file movement and other file operations, is minimized by consolidating all the file names in the host file, which can be straightforwardly re-generated by scanning directories and inspecting the global attributes of the files residing therein.
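As a rough illustration of the re-generation step described above, the sketch below walks a directory tree and groups data files by tile. It is not taken from the proposal or from LibCF: the attribute name `gridspec_tile_name`, the `.nc` suffix filter and the `read_global_attrs` callable (a stand-in for whatever netCDF reader is used) are all assumptions made for the example.

```python
import os


def build_host_listing(root, read_global_attrs):
    """Group the data files found under `root` by tile name.

    `read_global_attrs` is a hypothetical callable that takes a file path
    and returns that file's global attributes as a dict. The attribute
    name "gridspec_tile_name" is assumed for illustration.
    """
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".nc"):
                continue  # ignore anything that is not a netCDF file
            path = os.path.join(dirpath, filename)
            tile = read_global_attrs(path).get("gridspec_tile_name")
            if tile is None:
                continue  # file does not declare a tile, so skip it
            listing.setdefault(tile, []).append(path)
    # Sort each list so that the result does not depend on directory order.
    return {tile: sorted(paths) for tile, paths in listing.items()}
```

Because the listing is derived entirely from the files themselves, it can be rebuilt whenever files are moved, which is the property the paragraph above relies on.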

5. Benefits

Data producers will be able to store data in multiple files while allowing data consumers to access the data as if all the data were stored in a single file.

A number of codes are moving away from longitude-latitude grids because of the singularity of such grids at the pole, which can severely limit the maximum time step that models can take. M-SPEC will allow atmospheric and ocean models to store data on their native grids, without incurring the inaccuracies of lon-lat regridding.

6. Detailed Proposal

Because of the length of this proposal, a separate web page was created to explain our proposed enhancements in more detail. The web page contains many examples and is supported by illustrations.

Change History (13)

comment:1 follow-up: Changed 10 years ago by jonathan

Dear Alex

Thank you for making this proposal and for the email discussions we have already had about it, which have cleared up several questions I had. I'd like to raise a few remaining issues.

  1. I think the attribute name cf_type_name for identifying the purpose of variables containing gridspec information is unnecessarily general. Unless you foresee a very closely related application for this attribute, I would suggest you use something less general and hence more informative, such as gridspec_type_name.
  2. Please could you clarify whether spaces are allowed or not surrounding : in the contact lists? As you know, I would favour allowing space but ignoring it, in order to reduce the number of syntax errors which may occur. I don't think it would make much difference to legibility or parsing.
  3. I think that it should be allowed to contain the grid and/or data from more than one tile in a given file. (In general I think we should be as flexible as possible regarding how things are distributed among files; files are only containers for variables and, as Steve Hankin has pointed out, files are fictional when data is not being served directly from a file store.) To enable that, the tile_name would have to be allowed as a variable attribute as an alternative to a global attribute.
  4. I do not see the need to separate static from time-dependent variables. It is easy to tell whether a variable is time-dependent, by examining its coordinates.

Best wishes

Jonathan

comment:2 Changed 10 years ago by stevehankin

This comment is to follow up on the discussion of the merits of "cf_type_name" versus "gridspec_type_name". In a separate discussion of the CF Discrete Geometries conventions a very similar need has arisen for an attribute that captures the role that a variable plays within a (complex) CF dataset. Similar issues have also arisen in the CF satellite conventions and the conventions for unstructured grids (meshes).

John Caron has suggested the simpler attribute name, "cf_role" in the context of the Discrete Geometries discussions. If CF includes a well-managed controlled vocabulary of string values for this attribute, it can satisfy the needs of all of these different conventions, leading to a CF standard that is arguably leaner, cleaner and more easily extensible without the risk of attribute name collisions.

In the case of gridspec I propose that "cf_role" be used in place of "gridspec_type_name", leading to the following controlled vocabulary entries:

o cf_role = "gridspec_coordinate_names"

o cf_role = "gridspec_tile_contacts"

o cf_role = "gridspec_contact_map"

o cf_role = "gridspec_mosaic_filename"

o cf_role = "gridspec_static_data_filename"

o cf_role = "gridspec_time_data_filename"

o cf_role = "gridspec_tile_names"
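A minimal sketch of how a reader might use such a controlled vocabulary to locate gridspec variables. The role strings are copied from the list above; the `variables` mapping (variable name to attribute dict) is a hypothetical stand-in for a real dataset object, and the helper name is invented for the example.

```python
# Role strings as proposed in this comment; any final CF text may differ.
GRIDSPEC_CF_ROLES = frozenset({
    "gridspec_coordinate_names",
    "gridspec_tile_contacts",
    "gridspec_contact_map",
    "gridspec_mosaic_filename",
    "gridspec_static_data_filename",
    "gridspec_time_data_filename",
    "gridspec_tile_names",
})


def find_by_role(variables, role):
    """Return the sorted names of variables whose cf_role equals `role`.

    `variables` is a hypothetical mapping from variable name to a dict of
    that variable's attributes.
    """
    if role not in GRIDSPEC_CF_ROLES:
        raise ValueError(f"unknown gridspec cf_role: {role!r}")
    return sorted(
        name for name, attrs in variables.items()
        if attrs.get("cf_role") == role
    )
```

Rejecting unknown role strings early is one way to catch the spelling mistakes that a free-form attribute would otherwise let through.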

comment:3 in reply to: ↑ 1 ; follow-ups: Changed 10 years ago by pletzer

Hi Jonathan,

Replying to jonathan:

Dear Alex

Thank you for making this proposal and for the email discussions we have already had about it, which have cleared up several questions I had. I'd like to raise a few remaining issues.

  1. I think the attribute name cf_type_name for identifying the purpose of variables containing gridspec information is unnecessarily general. Unless you foresee a very closely related application for this attribute, I would suggest you use something less general and hence more informative, such as gridspec_type_name.

Although I have changed cf_type_name to gridspec_type_name, it appears other people (see comment by S Hankin) have a need for a more general cf_role type attribute. I propose to use cf_role in place of gridspec_type_name. We will still namespace the *value* of cf_role with gridspec_ (e.g. gridspec_coordinate_names) to indicate that this particular cf_role is specific to gridspec. I will make the change if that is OK with you.

  1. Please could you clarify whether spaces are allowed or not surrounding : in the contact lists? As you know, I would favour allowing space but ignoring it, in order to reduce the number of syntax errors which may occur. I don't think it would make much difference to legibility or parsing.

Yes, allowed and now clarified.

  1. I think that it should be allowed to contain the grid and/or data from more than one tile in a given file. (In general I think we should be as flexible as possible regarding how things are distributed among files; files are only containers for variables and, as Steve Hankin has pointed out, files are fictional when data is not being served directly from a file store.) To enable that, the tile_name would have to be allowed as a variable attribute as an alternative to a global attribute.

The assumption of the present design is that the variables which are logically the same (i.e. the same variable but just distributed across tile and time files) share the same name (e.g. "ta"). To allow multiple tile data to be stored in the same file would require assigning different names to the variables. How would you infer that two variables contribute to the same logical variable without adding new attributes other than tile_name?

  1. I do not see the need to separate static from time-dependent variables. It is easy to tell whether a variable is time-dependent, by examining its coordinates.

In the host file, the file names are classified into grid file names, static file names, and time dependent file names because the rank and dimensions for each are different. Each collection of file names is an array of strings. The rank (dimensionality) of the static file name list is nvars_static x ntiles x nstring whereas the dimensionality of the time dependent file name list is ntimes x nvars_timedep x ntiles x nstring.
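To make the two shapes concrete, here is a small sketch that builds both file name lists as nested Python lists of fixed-width strings. The file naming pattern and the reading of ntimes as the number of time-partitioned files are assumptions made for the example, not part of the proposal.

```python
def padded(name, nstring):
    """Right-pad a file name with spaces to the fixed length nstring."""
    if len(name) > nstring:
        raise ValueError(f"{name!r} is longer than nstring={nstring}")
    return name.ljust(nstring)


def static_file_names(variables, tiles, nstring):
    """Nested list of shape nvars_static x ntiles (x nstring characters)."""
    return [[padded(f"{v}_{t}.nc", nstring) for t in tiles] for v in variables]


def time_file_names(ntimes, variables, tiles, nstring):
    """Nested list of shape ntimes x nvars_timedep x ntiles (x nstring)."""
    return [
        [[padded(f"{v}_{t}_{k}.nc", nstring) for t in tiles] for v in variables]
        for k in range(ntimes)
    ]


def shape(nested):
    """Shape of a rectangular nested list whose leaves are strings."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    dims.append(len(nested))  # the string length is the last dimension
    return tuple(dims)
```

With two static variables, one time-dependent variable, six tiles, three time files and 32-character names, the two lists have shapes (2, 6, 32) and (3, 1, 6, 32), which is why they cannot share a single array.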

Thanks for your feedback!

--Alex

comment:4 in reply to: ↑ 3 Changed 10 years ago by pletzer

Steve and Jonathan,

I would feel comfortable with this change.

--Alex

comment:5 in reply to: ↑ 3 ; follow-ups: Changed 10 years ago by jonathan

Dear Alex

Although I have changed cf_type_name to gridspec_type_name, it appears other people (see comment by S Hankin) have a need for a more general cf_role type attribute. I propose to use cf_role in place of gridspec_type_name.

I'm still not entirely happy about this, but I do appreciate the arguments in favour of it, so I accept the majority view. My concern is that we must be careful not to use cf_role as a general-purpose dustbin/trashcan attribute in future when we could define something more specific and informative instead.

  1. Please could you clarify whether spaces are allowed or not surrounding : in the contact lists?

Yes, allowed and now clarified.

Thank you.

  1. I think that it should be allowed to contain the grid and/or data from more than one tile in a given file. ...

The assumption of the present design is that the variables which are logically the same (i.e. the same variable but just distributed across tile and time files) share the same name (e.g. "ta"). To allow multiple tile data to be stored in the same file would require assigning different names to the variables. How would you infer that two variables contribute to the same logical variable without adding new attributes other than tile_name?

You would use other metadata, such as standard_name. CF explicitly does not standardise variable names (section 2.5). You aren't doing that either, but by depending on the names of variables in this way, gridspec is assigning more meaning to them than CF generally does, though your convention here is rather like the Unidata convention for identifying coordinate variables. I think this restriction is OK, but it could be stated up-front in F-SPEC that the tiles must be in separate files and that variables in those files which contribute to the same logical variable must have the same name. You probably state that - but maybe not prominently enough?

  1. I do not see the need to separate static from time-dependent variables. It is easy to tell whether a variable is time-dependent, by examining its coordinates.

In the host file, the file names are classified into grid file names, static file names, and time dependent file names because the rank and dimensions for each are different. Each collection of file names is an array of strings. The rank (dimensionality) of the static file name list is nvars_static x ntiles x nstring whereas the dimensionality of the time dependent file name list is ntimes x nvars_timedep x ntiles x nstring.

Since this can be inferred by examining dimensions and coordinates, it's a restriction which could be avoided. However, I accept you are not trying to solve the most general problem.
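The inference mentioned here could look roughly like the sketch below, which treats a variable as time-dependent when one of its dimensions has a coordinate variable that looks like a CF time axis. The plain-dict inputs are hypothetical stand-ins for a dataset, and the three checks (axis "T", standard_name "time", units containing " since ") are a simplification of the CF rules chosen for the example.

```python
def is_time_dependent(var_dims, coord_attrs):
    """Return True if any dimension of the variable is a time axis.

    var_dims:    tuple of dimension names of the data variable.
    coord_attrs: hypothetical mapping from coordinate variable name to a
                 dict of that coordinate's attributes.
    """
    for dim in var_dims:
        attrs = coord_attrs.get(dim, {})
        if attrs.get("axis") == "T":
            return True
        if attrs.get("standard_name") == "time":
            return True
        # CF time units carry a reference date, e.g. "days since 1970-01-01".
        if " since " in attrs.get("units", ""):
            return True
    return False
```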

One other comment: Please could you append to the document a listing of the additions and changes which should be made to the CF conformance document http://cf-pcmdi.llnl.gov/conformance/requirements-and-recommendations/ to support the gridspec proposal?

Best wishes

Jonathan

comment:6 in reply to: ↑ 5 Changed 10 years ago by pletzer

Hi Jonathan,

Replying to jonathan:

  1. I think that it should be allowed to contain the grid and/or data from more than one tile in a given file. ...

The assumption of the present design is that the variables which are logically the same (i.e. the same variable but just distributed across tile and time files) share the same name (e.g. "ta"). To allow multiple tile data to be stored in the same file would require assigning different names to the variables. How would you infer that two variables contribute to the same logical variable without adding new attributes other than tile_name?

You would use other metadata, such as standard_name. CF explicitly does not standardise variable names (section 2.5). You aren't doing that either, but by depending on the names of variables in this way, gridspec is assigning more meaning to them than CF generally does, though your convention here is rather like the Unidata convention for identifying coordinate variables. I think this restriction is OK, but it could be stated up-front in F-SPEC that the tiles must be in separate files and that variables in those files which contribute to the same logical variable must have the same name. You probably state that - but maybe not prominently enough?

Thanks Jonathan. I agree, that was not clearly expressed. We had

Restriction:

  • Time steps must be partitioned into data files in a consistent manner across all tiles.
  • Variables must be partitioned into data files in a consistent manner across all tiles.

Now we have:

Restriction:

  • Time steps must be partitioned into data files in a consistent manner across all files. The same partitioning in time is assumed for all tiles and for all time dependent variables.
  • Variables must be partitioned into data files in a consistent manner across all files. The variable names used in the different tile data files and time data files must be the same as there is no other mechanism to identify a collection of variables as belonging to the same logical entity.

The above restrictions allow one to represent the aggregation in array form suitable for rapid access and time slice indexing.
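As an illustration of the time slice indexing that this array form permits, the sketch below maps a global time index to a file and a local index within that file. It assumes only that the number of time steps per file is known and, per the restriction above, is the same for every tile and variable; the function name is invented for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_time_slice(steps_per_file, global_index):
    """Map a global time index to (file_index, local_index).

    steps_per_file: number of time steps stored in each time file, in
                    order. The same list applies to every tile and every
                    time-dependent variable.
    """
    total = sum(steps_per_file)
    if not 0 <= global_index < total:
        raise IndexError(f"time index {global_index} out of range 0..{total - 1}")
    ends = list(accumulate(steps_per_file))  # cumulative end offsets
    file_index = bisect_right(ends, global_index)
    start = ends[file_index - 1] if file_index else 0
    return file_index, global_index - start
```

The resulting file index is the leading index into the ntimes x nvars_timedep x ntiles file name array, so one lookup serves all tiles of a variable.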

Thanks again for the feedback.

Best,

--Alex

comment:7 follow-up: Changed 10 years ago by caron

Hi Alex and all, a few comments on this.

First, great job on moving this forward, very important and cool.

Now for the nitpicking:

  1. attribute naming (esp 10.2.2)

I think prefixing all attribute names (global and variable) consistently (e.g. with "gridspec_") would be good. Then you don't need to prefix attribute values (e.g. file_type values).

  1. terminology (10.3)

You use "structured grid" without defining it, I think. Is this a synonym for "logical rectangular grid"?

  1. curvilinear grid definition (10.3)

"coordinates must be supplied as D-Dimensional objects where D is the number of space dimensions". perhaps "variables" is better than "objects" ? Also, this seems to imply one needs (x,y,z) but a common case is (x,y) x z.

  1. (10.5) Restrictions, 5th paragraph

"multiples of 90 deg in index space" doesn't make much sense to me. Indices are just numbers, so I am not sure what "90 deg" refers to.

  1. (10.5) encoding of contact ranges

Encoding a series of numbers in ascii seems like a bad idea. why not use

int contact_range(ncontacts, 4) or something?

  1. (10.5) encoding of contact ranges

Is it necessary to allow ranges to go negative? If not, why not just one encoding, presumably the positive one? I don't have any real problem with [start, end+1), but I would think [start, end] might be more natural.

  1. (10.6.1) file name dependent on OS

I would highly recommend not allowing OS dependencies in your encoding if possible. I would suggest standardizing on "/".

  1. (10.6.1) file name absolute vs relative

You probably should give an example of relative file resolution. How does one tell if it's an absolute or a relative file path? Can CFlib supply a routine which does this, perhaps also converting between canonical and OS form?

It would be very helpful to have sample files to look at also.

thanks again,

John

comment:8 in reply to: ↑ 7 ; follow-ups: Changed 10 years ago by pletzer

Replying to caron:

Hi Alex and all, a few comments on this.

First, great job on moving this forward, very important and cool.

Thanks.

Now for the nitpicking:

  1. attribute naming (esp 10.2.2)

I think prefixing all attribute names (global and variable) consistently (e.g. with "gridspec_") would be good. Then you don't need to prefix attribute values (e.g. file_type values).

After consultation between authors we agree with this change and I have made the change accordingly. All the attributes, be they variable or global, now have the gridspec_ prefix with gridspec_ having been removed from the global attribute values. Thanks for suggesting this improvement.

  1. terminology (10.3)

You use "structured grid" without defining it, I think. Is this a synonym for "logical rectangular grid"?

Yes, it is synonymous with logical rectangular grid. Added definition.

  1. curvilinear grid definition (10.3)

"coordinates must be supplied as D-Dimensional objects where D is the number of space dimensions". perhaps "variables" is better than "objects" ? Also, this seems to imply one needs (x,y,z) but a common case is (x,y) x z.

Very good point. I think in this case it is sufficient to concentrate on the (x, y) part of the grid since the vertical axis acts as an independent dimension. I will need to add something regarding this case. In the meantime I added the definition of mixed rectilinear-curvilinear grid.

  1. (10.5) Restrictions, 5th paragraph

"multiples of 90 deg in index space" doesn't make much sense to me. Indices are just numbers, so I am not sure what "90 deg" refers to.

Thanks, clarified.

Two tiles that share a contact must have parallel index axes. This means that contacts can only involve foldings that are multiples of 90 deg. in index space, or equivalently rotation matrices whose elements are either 0, -1, or 1. For instance, the i index on one tile can map to the j index on the neighboring tile but not to a mixture of i and j indices. See section 10.5.2 for an example of rotation in index space that is not a multiple of 90 deg.
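One way to read this restriction is sketched below: an index-space transformation is accepted only if every matrix element is 0, -1 or 1. The additional requirement of exactly one nonzero entry per row and per column is an assumption added for the example, so that each index on one tile maps to a single index (possibly reversed) on the neighbour; the function name is hypothetical.

```python
def is_axis_aligned(matrix):
    """Return True if the matrix maps index axes onto index axes.

    The matrix must be square, contain only 0, -1 or 1, and have exactly
    one nonzero entry in each row and each column (the last condition is
    an assumption of this sketch).
    """
    n = len(matrix)
    if n == 0 or any(len(row) != n for row in matrix):
        return False
    if any(value not in (-1, 0, 1) for row in matrix for value in row):
        return False
    rows_ok = all(sum(abs(v) for v in row) == 1 for row in matrix)
    cols_ok = all(sum(abs(row[c]) for row in matrix) == 1 for c in range(n))
    return rows_ok and cols_ok
```

For example, [[0, -1], [1, 0]] (i maps to a decreasing j) passes, whereas [[1, 1], [0, 1]] mixes i and j and is rejected.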

  1. (10.5) encoding of contact ranges

Encoding a series of numbers in ascii seems like a bad idea. why not use

int contact_range(ncontacts, 4) or something?

A valid point. The idea of using strings is that it would allow us to extend the syntax easily over time, e.g. by dropping the start or end indices.

  1. (10.5) encoding of contact ranges

Is it necessary to allow ranges to go negative? If not, why not just one encoding, presumably the positive one?

Yes, it is required because of the folding between tiles such that i maps to a decreasing j, for instance.

I don't have any real problem with [start, end+1), but I would think [start, end] might be more natural.

It could be more natural in the sense that the reverse iterator [end, start] would appear symmetric to [start, end]. Let's put this to the vote.

  1. (10.6.1) file name dependent on OS

I would highly recommend not allowing OS dependencies in your encoding if possible. I would suggest standardizing on "/".

Sure, would simplify the doc but complicate the API. Let me think about this...

  1. (10.6.1) file name absolute vs relative

You probably should give an example of relative file resolution. How does one tell if it's an absolute or a relative file path? Can CFlib supply a routine which does this, perhaps also converting between canonical and OS form?

Would certainly be possible.
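A sketch of what such a routine might do, written in Python rather than as a LibCF call. The rule used here, that a relative entry is resolved against the directory containing the host file, is an assumption of the example and is not stated in the proposal; paths are treated as "/"-separated throughout.

```python
import posixpath


def resolve(host_file, entry):
    """Return a normalised path for a file name listed in the host file.

    Absolute entries are returned as they are (normalised). Relative
    entries are resolved against the directory of `host_file`, which is
    an assumed rule for this sketch.
    """
    if posixpath.isabs(entry):
        return posixpath.normpath(entry)
    base = posixpath.dirname(host_file)
    return posixpath.normpath(posixpath.join(base, entry))
```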

It would be very helpful to have sample files to look at also.

We have some unit tests that run within libcf and which exercise mosaic data aggregation. The wiki contains some examples, which are admittedly short but complete.

Thanks John for careful reading and for your suggestions.

--Alex

thanks again,

John

comment:9 in reply to: ↑ 5 Changed 10 years ago by pletzer

Hi Jonathan,

Replying to jonathan:

One other comment: Please could you append to the document a listing of the additions and changes which should be made to the CF conformance document http://cf-pcmdi.llnl.gov/conformance/requirements-and-recommendations/ to support the gridspec proposal?

Done, see paragraph 10.7.

Best regards,

--Alex

Best wishes

Jonathan

comment:10 in reply to: ↑ 8 Changed 10 years ago by pletzer

Replying to pletzer:

Replying to caron:

Hi John,

Following your recommendation, the syntax for expressing ranges has been changed so that the end indices are now inclusive, i.e. containing the last valid index. The change has been propagated to the wiki and to LibCF. Thanks for your suggestion.
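For illustration, an inclusive range with optional whitespace around the colon could be expanded as in the sketch below. The "start:end" spelling and the function name are assumptions made for the example; the authoritative syntax is the one on the proposal web page. A start larger than the end is taken to mean a decreasing range, as needed for folded contacts.

```python
def expand_range(text):
    """Expand "start:end" into the list of indices, both ends included.

    Whitespace around the colon is ignored. If start > end, the indices
    run downwards.
    """
    start_text, end_text = text.split(":")
    start, end = int(start_text), int(end_text)  # int() ignores spaces
    step = 1 if end >= start else -1
    return list(range(start, end + step, step))
```

With inclusive ends, a range and its reverse list the same indices in opposite order, which is the symmetry discussed earlier in the thread.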

--Alex

comment:11 in reply to: ↑ 8 Changed 10 years ago by pletzer

Hi John,

  1. (10.6.1) file name dependent on OS

I would highly recommend not allowing OS dependencies in your encoding if possible. I would suggest standardizing on "/".

The need for '/' exists only in the host file, the only file allowed to refer explicitly to a file name. The host file is not meant to be ported to another platform. F-SPEC allows for a cheap (re-)generation of the host file, should the files be transferred to another platform.

Only in cases where a file system is made globally accessible through NFS or some other mechanism would it make sense to write the data in the host file in a way that is portable across Unix and Windows. In such a case, it should be sufficient to run the application from a cygwin window to convert '/' to '\'.

The issue of converting '/' automatically to '\' on Windows sounds easy but there are subtle issues related to file path portability that cannot be addressed simply. What should be done with "C:\" or with the case sensitivity of some file systems (incl. Mac OS X)? What if the host file lists Foo.nc and foo.nc? Perfectly valid on Unix but perhaps not on Windows.
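The Foo.nc versus foo.nc problem can at least be detected by the application that writes the host file. The sketch below reports groups of listed names that differ only by case; it is an illustration, not part of F-SPEC or LibCF, and the function name is invented.

```python
from collections import defaultdict


def case_collisions(paths):
    """Return groups of distinct paths that are equal when case is ignored."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [
        sorted(set(group))
        for _key, group in sorted(groups.items())
        if len(set(group)) > 1
    ]
```

An empty result means the list is safe to move to a file system that does not distinguish case; a non-empty one names the files that would clash.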

For the above reasons, we prefer for the time being to leave the responsibility for encoding the file paths correctly to the application that generates the host file.

Thanks for your suggestions.

Best,

--Alex

comment:12 Changed 10 years ago by jonathan

There haven't been any further objections or comments for more than three weeks, so the moderator of the ticket (Balaji) could declare this change to be accepted according to the rules.

Jonathan

comment:13 Changed 10 years ago by balaji

  • Resolution set to fixed
  • Status changed from new to closed

There have been no comments on the Trac over the approval period. All changes on the ticket have been incorporated into the CF Gridspec proposal web page. As moderator I propose that this change be accepted and incorporated into the next CF release.
