[CF-metadata] Pre-proposal for "charset" from Bob Simons - NOAA Federal on 2017-02-08 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Bob Simons - NOAA Federal <bob.simons>
Date: Wed, 8 Feb 2017 10:00:32 -0800

I think my original pre-proposal has a significant flaw and needs to be
revised.
The problem is: charset needs to be specifiable for all char arrays,
regardless of whether the values should be interpreted as Strings or
individual chars.

I see two basic solutions:

1) Two attributes, but a given variable would only use one of them. The
first part of the attribute name specifies the data type:
  char_charset = "ISO-8859-1"; //identifies a char variable using ISO-8859-1
or
  string_charset = "ISO-8859-1"; //identifies a String variable using ISO-8859-1

2) Two attributes that would both be specified for every char/String
variable, e.g.,
  charset = "ISO-8859-1";
  data_type = "String"; //or "char"
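
To make the difference between (1) and (2) concrete, here is a rough Python
sketch of how a reader might decide what a char variable holds. It is only an
illustration: the function name interpret and the plain dict of attributes are
hypothetical, and neither attribute scheme is part of CF at this point.

```python
def interpret(attrs):
    """Return (kind, charset) for a char variable's attributes.

    kind is "String", "char", or None when the attributes say nothing.
    """
    # Option 1: the attribute name itself carries the data type.
    if "string_charset" in attrs:
        return "String", attrs["string_charset"]
    if "char_charset" in attrs:
        return "char", attrs["char_charset"]
    # Option 2: charset and data_type are two separate attributes.
    if "charset" in attrs and "data_type" in attrs:
        return attrs["data_type"], attrs["charset"]
    # Neither scheme is present: the interpretation stays ambiguous.
    return None, None


print(interpret({"char_charset": "ISO-8859-1"}))
print(interpret({"charset": "ISO-8859-1", "data_type": "String"}))
```

With option 1 a reader has to check two attribute names; with option 2 it
reads two attributes that always have the same names.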

In either case, the charsets allowed for char (not String) data must be
restricted to single-byte encodings (e.g., "ISO-8859-1") because other
encodings (e.g., "UTF-8") need multiple bytes for some characters.
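
A quick Python check of why a multi-byte encoding does not work for
individual chars. This is just a sketch; the helper name one_byte_per_char is
made up for illustration.

```python
def one_byte_per_char(text, charset):
    """True if every character of text encodes to exactly one byte."""
    return all(len(ch.encode(charset)) == 1 for ch in text)


sample = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
print(len(sample.encode("ISO-8859-1")))  # 1
print(len(sample.encode("UTF-8")))       # 2
print(one_byte_per_char(sample, "ISO-8859-1"))  # True
print(one_byte_per_char(sample, "UTF-8"))       # False
```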

---
I have a slight preference for (2), because it is cleaner and might be better
in the future (I don't know the implications for nc4 and CF2).
Thoughts? Votes?

On Mon, Feb 6, 2017 at 3:08 PM, Bob Simons - NOAA Federal <bob.simons at noaa.gov> wrote:
> Before I make a formal CF proposal for a "charset" attribute, I would like
> to get comments and suggestions from all of you.
>
> This is a proposal to solve the problem of distinguishing strings from
> arrays of characters and the problem of identifying the string's character
> encoding. Presumably, it would be appended to section 2.2.
>
> An example of actual need is: Many/most current uses of multidimensional
> char arrays are intended to be interpreted as Strings. But some files,
> e.g., Argo profile float profiles, have single char data that are stored in
> char arrays.
>
> Another example: while most nc files just use 7-bit ASCII characters in
> strings, some use 8-bit characters. Some such files appear to use
> charset=Windows-1252, others use Mac OS Roman, others use ISO-8859-1, but
> the charset is not specified and there is currently no official CF way
> to specify it.
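
A small Python illustration of why the unspecified charset matters: the same
8-bit value decodes to a different character under each of the three charsets
named above. Note that "cp1252" and "mac_roman" are Python's codec names for
Windows-1252 and Mac OS Roman, not names taken from any file.

```python
raw = bytes([0x80])  # one 8-bit char value, outside 7-bit ASCII

decoded = {
    name: raw.decode(name)
    for name in ("cp1252", "mac_roman", "iso-8859-1")
}

for name, ch in decoded.items():
    # Print the code point so the C1 control character is visible too.
    print(f"{name}: U+{ord(ch):04X}")
# cp1252: U+20AC      (euro sign)
# mac_roman: U+00C4   (A with diaeresis)
# iso-8859-1: U+0080  (a control character)
```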
>
> Another advantage of this proposal is that it provides a way to support
> Unicode (and thus all of the world's languages) via the UTF-8 encoding,
> which is useful as we increasingly work with people from non-US,
> non-European countries.
>
> A possible extension of this is to allow a few special additional
> pseudo-charset names:
> * "HTML" - the chars are to be interpreted as an array of Strings with
> HTML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
> be encoded using the &#d; format where d is the decimal number of a Unicode
> character.
> * "XML" - the chars are to be interpreted as an array of Strings with
> XML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
> be encoded using the &#d; format where d is the decimal number of a Unicode
> character.
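
The &#d; convention can be sketched in Python with the standard library. The
two helper names are hypothetical. One caveat on the decoding side:
html.unescape also expands named entities such as &amp;, so it is closer to
the "HTML" case than to a strict "XML" reader.

```python
import html


def to_charref_bytes(text):
    """Encode as ISO-8859-1, writing other characters as &#d; references."""
    return text.encode("ISO-8859-1", errors="xmlcharrefreplace")


def from_charref_bytes(raw):
    """Decode ISO-8859-1 bytes and expand character references."""
    return html.unescape(raw.decode("ISO-8859-1"))


stored = to_charref_bytes("5 \u20ac")  # "5 " followed by the euro sign
print(stored)                          # b'5 &#8364;'
print(from_charref_bytes(stored) == "5 \u20ac")  # True
```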
>
> Thank you for considering this.
>
>
> --- The Actual Pre-Proposal
> Use the "charset" attribute to indicate that a multidimensional
> char array should be interpreted as an array of Strings,
> not an array of individual characters.
> The value of "charset" also serves to specify the character set
> used to encode the strings
> and must be the name of one of the 8-bit encodings
> (since CF chars are 8 bits) listed at
> http://www.iana.org/assignments/character-sets/character-sets.xhtml .
> Charset names are case-insensitive.
> The only charsets which are recommended are "ISO-8859-1" and "UTF-8".
> For backwards compatibility, if "charset" is not defined,
> it remains ambiguous whether a char array should be interpreted as
> holding an array of individual characters or an array of Strings.
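
One way software might honor "charset names are case-insensitive" is to
normalize the name before comparing. The sketch below leans on Python's codec
registry, which is an assumption of convenience: it is not the IANA list, and
the helper name same_charset is hypothetical.

```python
import codecs


def same_charset(a, b):
    """True if two charset names resolve to the same Python codec."""
    return codecs.lookup(a).name == codecs.lookup(b).name


print(same_charset("ISO-8859-1", "iso-8859-1"))  # True
print(same_charset("UTF-8", "utf-8"))            # True
print(same_charset("UTF-8", "ISO-8859-1"))       # False
```

codecs.lookup raises LookupError for a name Python does not know, so a reader
would still need its own policy for unrecognized charset values.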
>
>
> --- An Example: Encoding three Strings: "It", "Book", and "5 €".
> The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
> which is 8364 (in decimal).
> The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in hexadecimal).
> So a file would store these strings in a char array as:
>   dimensions
>     words = 3;
>     strLen = 5;
>   char myWords[words][strLen] = "It[0][0][0]", "Book[0]", "5 [E2][82][AC]";
>     charset = "UTF-8";
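
The layout above can be reproduced with a short Python sketch that pads each
encoded string with zero bytes up to strLen. The function names pack and
unpack are hypothetical, and a real file would be written through a netCDF
library rather than as a list of bytes objects.

```python
def pack(strings, str_len, charset):
    """Encode each string and pad it with NUL bytes to str_len bytes."""
    rows = []
    for s in strings:
        raw = s.encode(charset)
        if len(raw) > str_len:
            raise ValueError(f"{s!r} needs {len(raw)} bytes, limit is {str_len}")
        rows.append(raw.ljust(str_len, b"\x00"))
    return rows


def unpack(rows, charset):
    """Strip the NUL padding from each row and decode it."""
    return [row.rstrip(b"\x00").decode(charset) for row in rows]


words = ["It", "Book", "5 \u20ac"]  # the last string ends with the euro sign
rows = pack(words, 5, "UTF-8")
print(rows)  # [b'It\x00\x00\x00', b'Book\x00', b'5 \xe2\x82\xac']
print(unpack(rows, "UTF-8") == words)  # True
```

Note that strLen counts bytes, not characters: the third string has three
characters but fills all five bytes.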
>
>
> --
> Sincerely,
>
> Bob Simons
> IT Specialist
> Environmental Research Division
> NOAA Southwest Fisheries Science Center
> 99 Pacific St., Suite 255A      (New!)
> Monterey, CA 93940               (New!)
> Phone: (831)333-9878            (New!)
> Fax:   (831)648-8440
> Email: bob.simons at noaa.gov
>
> The contents of this message are mine personally and
> do not necessarily reflect any position of the
> Government or the National Oceanic and Atmospheric Administration.
> <>< <>< <>< <>< <>< <>< <>< <>< <><
>
>

Received on Wed Feb 08 2017 - 11:00:32 GMT

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:42 BST