[CF-metadata] Pre-proposal for "charset" from Bob Simons - NOAA Federal on 2017-02-06 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Bob Simons - NOAA Federal <bob.simons>
Date: Mon, 6 Feb 2017 15:08:17 -0800

Before I make a formal CF proposal for a "charset" attribute, I would like
to get comments and suggestions from all of you.

This is a proposal to solve the problem of distinguishing strings from
arrays of characters and the problem of identifying the string's character
encoding. Presumably, it would be appended to section 2.2.

An example of actual need: many (perhaps most) current uses of
multidimensional char arrays are intended to be interpreted as Strings.
But some files, e.g., Argo float profile files, store single-character
data in char arrays.

Another example: while most nc files just use 7-bit ASCII characters in
strings, some use 8-bit characters. Some such files appear to use
charset=Windows-1252, others use Mac OS Roman, others use ISO-8859-1, but
the charset is not specified and there is currently no official CF way
to specify it.
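
To illustrate why an unspecified charset is a problem, here is a minimal
Python sketch (not part of the proposal itself) that decodes the same
single byte under the three encodings named above. The byte value 0x80 and
the variable names are chosen only for illustration.

```python
# One byte above the 7-bit ASCII range, as it might sit in a char variable.
raw = bytes([0x80])

# Python's codec names for the three charsets mentioned above.
candidates = {
    "Windows-1252": "cp1252",
    "Mac OS Roman": "mac_roman",
    "ISO-8859-1": "iso-8859-1",
}

# Decode the same byte once per candidate charset.
decoded = {label: raw.decode(codec) for label, codec in candidates.items()}

for label, text in decoded.items():
    # repr() makes the non-printing ISO-8859-1 control character visible.
    print(f"{label:13s} -> {text!r}")
```

The same stored byte yields three different characters, so a reader
cannot recover the intended text without knowing the charset.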

Another advantage of this proposal is that it provides a way to support
Unicode (and thus all of the world's languages) via the UTF-8 encoding,
which is useful as we increasingly work with people from non-US,
non-European countries.

A possible extension of this is to allow a few special additional
pseudo-charset names:
* "HTML" - the chars are to be interpreted as an array of Strings with HTML
content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must be
encoded using the &#d; format where d is the decimal number of a Unicode
character.
* "XML" - the chars are to be interpreted as an array of Strings with
XML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
be encoded using the &#d; format where d is the decimal number of a Unicode
character.
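
As a rough sketch of the &#d; convention described above, the following
Python uses the standard library's "xmlcharrefreplace" error handler to
write characters outside ISO-8859-1 as decimal character references. The
function names to_char_refs and from_char_refs are hypothetical, and the
sketch does not escape markup characters such as & or <, which real HTML
or XML content would also need.

```python
import html


def to_char_refs(text: str) -> bytes:
    """Encode text as ISO-8859-1, writing other characters as &#d; references."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def from_char_refs(raw: bytes) -> str:
    """Decode ISO-8859-1 bytes and expand character references."""
    # html.unescape also expands named HTML entities such as &amp;,
    # which is broader than what XML alone defines.
    return html.unescape(raw.decode("iso-8859-1"))


stored = to_char_refs("5 \u20ac")  # "5 " followed by the Euro sign
restored = from_char_refs(stored)
print(stored, len(stored), restored)
```

Note that the stored form is longer than the original text (9 bytes here
instead of 3 characters), so the string-length dimension would have to
allow for the expanded references.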

Thank you for considering this.

--- The Actual Pre-Proposal
Use the "charset" attribute to indicate that a multidimensional
char array should be interpreted as an array of Strings,
not an array of individual characters.
The value of "charset" also serves to specify the character set
used to encode the strings
and must be the name of one of the 8-bit encodings
(since CF chars are 8 bits) listed at
http://www.iana.org/assignments/character-sets/character-sets.xhtml .
Charset names are case-insensitive.
The only charsets which are recommended are "ISO-8859-1" and "UTF-8".
For backwards compatibility, if "charset" is not defined,
it remains ambiguous whether a char array should be interpreted as
holding an array of individual characters or an array of Strings.
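
One way a reader might apply this rule is sketched below in Python. This
is only an illustration under stated assumptions: the function name
interpret_char_rows is hypothetical, each row of the char array is assumed
to arrive as a bytes object, and unused trailing positions are assumed to
be NUL-padded.

```python
from typing import List, Optional, Union


def interpret_char_rows(
    rows: List[bytes], charset: Optional[str] = None
) -> Union[List[str], List[bytes]]:
    """Interpret the rows of a 2-D char array.

    With a charset, each row is decoded as one String (trailing NUL
    padding removed). Without one, the intent is ambiguous, so the raw
    bytes are returned unchanged and the caller must decide.
    """
    if charset is None:
        return rows
    # Python's codec lookup ignores case, matching the proposal's
    # case-insensitive charset names.
    return [row.rstrip(b"\x00").decode(charset) for row in rows]


rows = [b"It\x00\x00\x00", b"Book\x00"]
print(interpret_char_rows(rows, "UTF-8"))
print(interpret_char_rows(rows))
```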

--- An Example: Encoding three Strings: "It", "Book", and "5 €".
The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
which is 8364 (in decimal).
The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in hexadecimal).
So a file would store these strings in a char array as:
  dimensions
    words = 3;
    strLen = 5;
  char myWords[words][strLen] = "It[0][0][0]", "Book[0]", "5 [E2][82][AC]";
    charset = "UTF-8";
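
The byte layout above can be reproduced with a short Python sketch. The
helper name pack_strings is hypothetical and NUL padding is assumed, as in
the example; the point is that strLen counts encoded bytes, not characters.

```python
def pack_strings(strings, charset="utf-8"):
    """Encode strings and NUL-pad them to a common byte length."""
    encoded = [s.encode(charset) for s in strings]
    # strLen counts bytes, not characters: the Euro sign takes 3 bytes in UTF-8.
    str_len = max(len(b) for b in encoded)
    rows = [b.ljust(str_len, b"\x00") for b in encoded]
    return str_len, rows


str_len, rows = pack_strings(["It", "Book", "5 \u20ac"])
for row in rows:
    # Print each row as hexadecimal byte values.
    print(" ".join(f"{byte:02X}" for byte in row))
```

Here "5 €" is three characters long but occupies five bytes, which is why
strLen is 5 in the example.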

-- 
Sincerely,
Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
99 Pacific St., Suite 255A      (New!)
Monterey, CA 93940               (New!)
Phone: (831)333-9878            (New!)
Fax:   (831)648-8440
Email: bob.simons at noaa.gov
The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><

Received on Mon Feb 06 2017 - 16:08:17 GMT
