[CF-metadata] Pre-proposal for "charset" from Chris Barker on 2017-02-22 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Chris Barker <chris.barker>
Date: Wed, 22 Feb 2017 10:03:48 -0800

**
NOTE: this looks like it got tacked on to another thread -- please start a
new thread for a new topic. (or gmail messed up...)
**

Sorry for being dense here, but I'm confused. I see in the netCDF(4) spec:

"""
The atomic external types supported by the netCDF interface are:
...
NC_CHAR 8-bit character byte
...
NC_STRING variable length character string *
...
"""

So shouldn't one use a 2-D (or higher dim) array of NC_CHAR type if that's
indeed what you have?

Or is this about supporting netcdf3, which doesn't (I don't think) have a
string type?

It does have a BYTE type, which I would be inclined to use for a CHAR. But
then I suppose you'd need to tell readers that it was intended to be a
character...

Other notes:

Do folks want/need to support full Unicode characters? If so I think you'd
need a 4 byte type -- call it NC_UCHAR? -- and anything else would be
variable-length, which would kind of kill the whole point of a character
type...

Small note: I'd prefer "encoding" to "charset" -- at least if you want to
support "full" unicode, rather than only one-byte-per-char encodings.

> The only charsets which are recommended are "ISO-8859-1" and "UTF-8".

UTF-8 is problematic because it uses a variable number of bytes per
character (codepoint?).

If we want to support proper Unicode, then we need to either:

use a variable-length string type (the netcdf 4 NC_STRING type?)

or

Use 4 bytes per char.
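
To make the trade-off concrete, here is a quick sketch using nothing but Python's built-in codecs (no netCDF library involved; `bytes_per_char` is just a name made up for this illustration). It shows that UTF-8 needs a different number of bytes for different characters, while a 4-byte encoding such as UTF-32 is fixed-width:

```python
# Sketch only: compares per-character byte counts under two encodings.
# bytes_per_char is a hypothetical helper, not part of any netCDF API.

def bytes_per_char(text, encoding):
    """Return the encoded size in bytes of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]

sample = "A\u00e9\u20ac"  # 'A', e-acute, Euro sign

# UTF-8: the size depends on the code point.
utf8_sizes = bytes_per_char(sample, "utf-8")

# UTF-32 (little-endian variant, so no byte-order mark is prepended):
# every code point takes exactly 4 bytes.
utf32_sizes = bytes_per_char(sample, "utf-32-le")

print(utf8_sizes)   # [1, 2, 3]
print(utf32_sizes)  # [4, 4, 4]
```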

Since UTF-8 is a superset of ASCII, it can be dangerous -- folks can say
"this is UTF-8", and if they only happen to use the ASCII subset, all works
fine, and then someone goes and tries to put a weird high-codepoint
character in there, and all goes to heck.
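
A small illustration of that failure mode, again plain Python and only a sketch: ASCII-only bytes decode to the same text whichever of the three labels is used, so a wrong label goes unnoticed until a non-ASCII character shows up.

```python
# Sketch only: why a mislabeled encoding can go unnoticed with ASCII data.

ascii_bytes = "Book".encode("utf-8")

# The same bytes decode identically under all three labels.
decoded_variants = {
    ascii_bytes.decode(enc) for enc in ("ascii", "iso-8859-1", "utf-8")
}

# Add a Euro sign and the labels stop being interchangeable.
euro_bytes = "5 \u20ac".encode("utf-8")        # 5 bytes: 35 20 E2 82 AC
as_utf8 = euro_bytes.decode("utf-8")           # 3 characters, as intended
as_latin1 = euro_bytes.decode("iso-8859-1")    # 5 characters of mojibake

print(len(decoded_variants))        # 1
print(len(as_utf8), len(as_latin1)) # 3 5
```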

I see that netcdf4 supports UTF-8 for names within the file (variable
names, dimension names, etc), but that works because the number of bytes is
known and constant once created.

Again, I'm maybe speaking from ignorance, I haven't dug into Unicode and CF
and netcdf in any depth at all.

> --- An Example: Encoding three Strings: "It", "Book", and "5 €".
>
> The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
> which is 8364 (in decimal).
> The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in
> hexadecimal).
> So a file would store these strings in a char array as:
> dimensions
> words = 3;
> strLen = 5;
> char myWords[words][strLen] = "It[0][0][0]", "Book[0]", "5 [E2][82][AC]";
> charset = "UTF-8";

this is tough -- how do you know what strLen should be? You could get UTF-8
characters chopped off if it was too short.

Though I suppose that's a problem for the file writer to figure out.
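
For what it's worth, here is roughly what the writer would have to do, sketched in plain Python (no netCDF calls; the variable names are mine): encode first, take the longest byte length as strLen, and pad the rest with nulls. The last part shows what happens if strLen is picked one byte too short.

```python
# Sketch only: sizing strLen from encoded byte lengths, not character counts.

words = ["It", "Book", "5 \u20ac"]
encoded = [w.encode("utf-8") for w in words]

# strLen must cover the longest *encoded* string (5 bytes for "5 €").
str_len = max(len(b) for b in encoded)

# Null-pad every row to the fixed width, as in the quoted example.
rows = [b.ljust(str_len, b"\x00") for b in encoded]

# If strLen were 4 instead, the Euro sign's 3-byte sequence gets cut
# after 2 bytes and the row is no longer valid UTF-8.
chopped = encoded[2][:4]
try:
    chopped.decode("utf-8")
    chopped_is_valid = True
except UnicodeDecodeError:
    chopped_is_valid = False

print(str_len)           # 5
print(rows[0])           # b'It\x00\x00\x00'
print(chopped_is_valid)  # False
```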

-CHB

-- 
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker at noaa.gov

Received on Wed Feb 22 2017 - 11:03:48 GMT
