Opened 4 years ago

Closed 4 years ago

#159 closed enhancement (wontfix)

charset attribute

Reported by: bob.simons
Owned by: cf-conventions@…
Priority: medium
Milestone:
Component: cf-conventions
Version:
Keywords:
Cc: bob.simons@…

Description (last modified by bob.simons)

In order to specify the character set of char and string variables, I propose that we append these two paragraphs to the end of CF section 2.2:

Each char array variable that is to be interpreted as an array of individual characters (not string(s)) must have a "charset" attribute which clarifies that the variable is to be interpreted as individual characters (not string(s)) and specifies the 8-bit character set used by the chars. Values for "charset" are case-insensitive. See http://www.iana.org/assignments/character-sets/character-sets.xhtml . Currently, the only values allowed for "charset" are "ISO-8859-1" and "ISO-8859-15". A scalar char variable may also use the "charset" attribute, which defaults to "ISO-8859-15" if it is not specified.

A string or string array variable (including a char array variable that is to be interpreted as a string or array of strings) may have an "_Encoding" attribute. Alternatively, a file may have a global "_Encoding" attribute which applies to all strings (scalar and array) in the file. Values for "_Encoding" are case-insensitive. See http://www.iana.org/assignments/character-sets/character-sets.xhtml . Currently, the only values allowed for "_Encoding" are "ISO-8859-1", "ISO-8859-15" and "UTF-8". A missing "_Encoding" attribute defaults to "UTF-8".

(This 2017-03-02b version is the consensus revised proposal from Chris Barker, Heiko Klein, and Bob Simons, with further changes requested by Jonathon Gregory.)
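
As an illustration only, the lookup rules in the two proposed paragraphs could be sketched as follows. This is a minimal sketch, not part of the proposal: `resolve_codec`, the `scalar_char` flag, and the use of plain dicts in place of netCDF attributes are all hypothetical names introduced here.

```python
# Hypothetical helper: plain dicts stand in for netCDF variable/global attributes.
CHARSET_VALUES = {"iso-8859-1", "iso-8859-15"}
ENCODING_VALUES = {"iso-8859-1", "iso-8859-15", "utf-8"}


def resolve_codec(var_attrs, global_attrs=None, scalar_char=False):
    """Return the lower-cased character set name that applies to a variable."""
    global_attrs = global_attrs or {}
    if "charset" in var_attrs or scalar_char:
        # Individual characters: use "charset", defaulting to ISO-8859-15.
        name = var_attrs.get("charset", "ISO-8859-15")
        allowed = CHARSET_VALUES
    else:
        # Strings: variable "_Encoding", then global "_Encoding", then UTF-8.
        name = var_attrs.get("_Encoding", global_attrs.get("_Encoding", "UTF-8"))
        allowed = ENCODING_VALUES
    key = name.lower()  # attribute values are case-insensitive
    if key not in allowed:
        raise ValueError(f"unsupported character set: {name!r}")
    return key


print(resolve_codec({"charset": "ISO-8859-1"}))         # char array of individual chars
print(resolve_codec({}, {"_Encoding": "iso-8859-15"}))  # string, global attribute applies
print(resolve_codec({}))                                # string, no attribute at all
```

The returned names happen to be valid Python codec names, so the result could be passed straight to `bytes.decode`.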

Change History (7)

comment:1 Changed 4 years ago by heiko.klein

I very much appreciate the clarification of the character set for string and char variables, but I would like to modify your approach to harmonize with the NUG (NetCDF Users Guide), where this is handled differently. UTF-8 has been the default for over 10 years (this change came with netCDF-4), and the attribute to specify the character set is named _Encoding rather than 'charset'.

From: http://www.unidata.ucar.edu/software/netcdf/netcdf-4/reqs_new.html

  • Strings are stored in UTF-8 Unicode.
  • String data is stored without being interpreted by the library, but an encoding for Unicode strings may be specified with a separate attribute (e.g. "_Encoding"). A global or group attribute could be used to specify the encoding of all strings in a file or group.

I propose therefore the following modification:

All char and string variables may include a '_Encoding' attribute to identify the character set (encoding) used by the variable. The value of the attribute must be the "Preferred MIME Name" or "Name" listed at http://www.iana.org/assignments/character-sets/character-sets.xhtml . Charset names are case-insensitive. The recommended charset names are "ISO-8859-15" and "UTF-8". A missing _Encoding attribute defaults to UTF-8.
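
A reader of such a file might apply the single-attribute rule above roughly like this. It is a minimal sketch under stated assumptions: `decode_text` is a hypothetical helper, the attributes are modelled as a plain dict, and Python's codec registry is used as a stand-in for the IANA list (it accepts many, but not all, IANA names).

```python
import codecs


def decode_text(raw, attrs):
    """Decode raw bytes using the variable's _Encoding attribute (default UTF-8)."""
    name = attrs.get("_Encoding", "UTF-8")
    # codecs.lookup is case-insensitive and raises LookupError for unknown names.
    codec = codecs.lookup(name)
    return raw.decode(codec.name)


print(decode_text(b"\xe2\x82\xac", {}))                     # no attribute: UTF-8
print(decode_text(b"\xa4", {"_Encoding": "iso-8859-15"}))   # lower-case name is accepted
```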

I have omitted the restriction to 8-bit encodings here, since I don't really see the point. It is technically possible to use two chars for one UTF-16 character, but that is not recommended.
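
To make the point about chars per character concrete, here is a small sketch that counts how many bytes (netCDF chars) one character occupies in each encoding. `encoded_length` is a hypothetical helper name, and "utf-16-le" is chosen here only to avoid counting a byte-order mark.

```python
def encoded_length(text, encoding):
    """Number of bytes (i.e. netCDF chars) needed to store text in an encoding."""
    return len(text.encode(encoding))


for enc in ("iso-8859-15", "utf-8", "utf-16-le"):
    # Compare a plain ASCII letter with the euro sign.
    print(enc, encoded_length("A", enc), encoded_length("\u20ac", enc))
```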

Both UTF-8 and ISO-8859-15 are backwards compatible with 7-bit ASCII characters, so I dropped the comment about backward compatibility.

I use ISO-8859-15 instead of ISO-8859-1 because -15 is the updated (1999) version, with the major change of including the € sign.
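
Both remarks above (ASCII compatibility, and the € sign in ISO-8859-15) can be checked with Python's built-in codecs. This is only an illustrative sketch; the variable names are introduced here.

```python
import unicodedata

# All 128 ASCII bytes decode to the same text under each of these encodings.
ascii_bytes = bytes(range(128))
decoded = {
    enc: ascii_bytes.decode(enc)
    for enc in ("ascii", "utf-8", "iso-8859-1", "iso-8859-15")
}
ascii_compatible = len(set(decoded.values())) == 1
print("ASCII-compatible:", ascii_compatible)

# Byte 0xA4 is where the euro sign was introduced in ISO-8859-15.
latin1_a4 = b"\xa4".decode("iso-8859-1")
latin9_a4 = b"\xa4".decode("iso-8859-15")
print(unicodedata.name(latin1_a4))
print(unicodedata.name(latin9_a4))
```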

I prefer a strict default over ambiguity, and the UTF-8 default aligns with the NUG.

comment:2 Changed 4 years ago by jonathan

Dear Bob and Heiko

Thanks for these contributions. I support this change in Heiko's version.

Best wishes

Jonathan

comment:3 Changed 4 years ago by bob.simons

  • Description modified (diff)

comment:4 Changed 4 years ago by jonathan

Dear Bob

This seems more complicated than Heiko's version. I think he was proposing just one paragraph - isn't that right? His version makes other useful points too: the encoding name is case-insensitive, and it would be good to include the IANA link for information about what these charset designations mean.

The _Encoding attribute should be added to Appendix A.

Best wishes and thanks

Jonathan

comment:5 Changed 4 years ago by bob.simons

  • Description modified (diff)

Well, this has more information than Heiko's version. The debate leads me to write like a lawyer. ;-)

Yes, Heiko's version was one paragraph, but there are two attributes which cover two situations and I think deserve two paragraphs for clarity.

I have brought back the "case-insensitive" sentence and the IANA link from my original version.

If approved, "charset" should be added to Appendix A, too.

comment:6 Changed 4 years ago by jonathan

Dear Bob et al.

I think _Encoding is good. I've just consulted the netCDF user guide, and I see they don't include _Encoding as one of their attribute conventions there. Yet the use of the underscore should imply it means something special to the netCDF library, according to their conventions (like _FillValue). Does it have a function in Unidata software?

I see you've combined your two tickets, and the choice between charset or _Encoding indicates whether it's char or string data. I'm still not convinced that we need this distinction. On the email list we are discussing Example H4. To my mind this example shows that there is a problem with the convention as it stands - and thanks for drawing attention to it. But supposing it's legal, so that the cf_role variable identifies a single timeseries location to which the file refers, there is then no ambiguity, is there? Would you use an array of chars (rather than a string) to identify a timeseries location? If you did, what really is the difference in meaning between a single 1D char array and a single string? They're practically equivalent, I would have thought. Are there any other cases where CF is ambiguous about whether a variable is a char array or a string?

Best wishes and thanks

Jonathan

comment:7 Changed 4 years ago by bob.simons

  • Resolution set to wontfix
  • Status changed from new to closed