
[CF-metadata] Pre-proposal for "charset"

From: Bob Simons - NOAA Federal <bob.simons>
Date: Wed, 22 Feb 2017 10:38:38 -0800

As for needing a different subject for the email: I'm lumping together two
related new attribute names, "charset=..." and "data_type=string|char", so
that the information stored in char variables in netCDF-3 files can be
easily and unambiguously interpreted.

You are correct. My proposal is for netCDF-3 files, since they support only
chars, not true strings.

As for "encoding" vs "charset", I'm open to different names. I chose
"charset" because that is the name used in HTML and is widely used in other
places. Yes, XML uses "encoding". To me, the word "charset" seems
preferable because it is more specific than "encoding" (which also has a
more general-purpose meaning).

As for full Unicode support via UTF-8 vs UTF-16: since netCDF-3 only
supports 8-bit chars, the 16-bit UTF-16 is not an option. Yes, in UTF-8,
characters are encoded with differing numbers of bytes. Yes, this would
take pre-planning on the file writer's part (precompute the max byte length
needed). Yes, the writer and the reader would have to handle this
correctly. But UTF-8 is the only way I know of to support full Unicode
using only 8-bit chars for the underlying storage. It is very widely used.
Every modern piece of software that can read or write text files supports
it. It is the default for both XML and HTML5.
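
The pre-planning amounts to a byte-length scan over the encoded strings.
A minimal sketch in Python (the sample strings match the example later in
this thread; the variable names are illustrative, not part of the proposal):

```python
# Precompute the fixed char-array width ("strLen") needed to hold each
# string once encoded as UTF-8: one character may become several bytes.
words = ["It", "Book", "5 \u20ac"]  # "5 €"; the Euro sign is 3 bytes in UTF-8

encoded = [w.encode("utf-8") for w in words]
str_len = max(len(b) for b in encoded)  # 5 bytes, driven by "5 €"

# Pad each encoded string with NUL bytes to the fixed width, as it
# would be laid out in a netCDF-3 char array of shape [words][strLen].
padded = [b.ljust(str_len, b"\x00") for b in encoded]
```

Reading back is the reverse: strip trailing NULs, then decode with the
declared charset.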

Full Unicode support seems very useful. Other charsets support only a
subset of at most 256 characters of full Unicode. So UTF-8 is the only way
to support a wide range of characters in a given string variable.

If the file writer doesn't need full Unicode, they can use "ISO-8859-1"
(which is compatible with 7-bit ASCII) or some other charset in which each
character has a fixed size (one 8-bit char).
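
With a fixed-size charset the byte count equals the character count, so no
per-string measurement is needed. A short sketch in Python contrasting the
two (the sample string is illustrative):

```python
# In ISO-8859-1 every encodable character is exactly one byte, so the
# char-array width is simply the character count of the longest string.
text = "café"  # "é" is the single byte 0xE9 in ISO-8859-1

latin1 = text.encode("iso-8859-1")  # 4 bytes for 4 characters
utf8 = text.encode("utf-8")         # 5 bytes: "é" needs 2 bytes in UTF-8

# Round-tripping through the declared charset recovers the original string.
assert latin1.decode("iso-8859-1") == text
```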




On Wed, Feb 22, 2017 at 10:03 AM, Chris Barker <chris.barker at noaa.gov>
wrote:

> **
> NOTE: this looks like it got tacked onto another thread -- please start a
> new thread for a new topic. (Or Gmail messed up...)
> **
>
> Sorry for being dense here, but I'm confused. I see in the netCDF(4) spec:
>
> """
> The atomic external types supported by the netCDF interface are:
> ...
> NC_CHAR 8-bit character byte
> ...
> NC_STRING variable length character string *
> ...
> """
>
> So shouldn't one use a 2-D (or higher dim) array of NC_CHAR type if that's
> indeed what you have?
>
> Or is this about supporting netCDF-3, which doesn't (I don't think) have a
> string type?
>
> It does have a BYTE type, which I would be inclined to use for a CHAR. But
> then I suppose you'd need to tell readers that it was intended to be a
> character...
>
> Other notes:
>
> Do folks want/need to support full Unicode characters? If so I think you'd
> need a 4-byte type -- call it NC_UCHAR? -- and anything else would be
> variable-length, which would kind of kill the whole point of a character
> type...
>
> Small note: I'd prefer "encoding" to "charset" -- at least if you want to
> support "full" Unicode, rather than only one-byte-per-char encodings.
>
> > The only charsets which are recommended are "ISO-8859-1" and "UTF-8".
> UTF-8 is problematic because it uses a variable number of bytes per
> character (codepoint?).
>
> If we want to support proper Unicode, then we need to either:
>
> use a variable-length string type (the netcdf 4 NC_STRING type?)
>
> or
>
> Use 4 bytes per char.
>
> Since UTF-8 is a superset of ASCII, it can be dangerous -- folks can say
> "this is UTF-8", and if they only happen to use the ASCII subset, all works
> fine, and then someone goes and tries to put a weird high-codepoint
> character in there, and all goes to heck.
>
> I see that netCDF-4 supports UTF-8 for names within the file (variable
> names, dimension names, etc.), but that works because the number of bytes is
> known and constant once created.
>
> Again, I'm maybe speaking from ignorance; I haven't dug into Unicode and
> CF and netCDF in any depth at all.
>
> > --- An Example: Encoding three Strings: "It", "Book", and "5 €".
> >
> > The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
> > which is 8364 (in decimal).
> > The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in hexadecimal).
> > So a file would store these strings in a char array as:
> >
> >   dimensions:
> >     words = 3;
> >     strLen = 5;
> >   char myWords[words][strLen] = "It[0][0][0]", "Book[0]", "5 [E2][82][AC]";
> >   charset = "UTF-8";
> this is tough -- how do you know what strLen should be? You could get
> UTF-8 characters chopped off if it was too short.
>
> Though I suppose that's a problem for the file writer to figure out.
>
> -CHB
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception
>
> Chris.Barker at noaa.gov
>



-- 
Sincerely,
Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
99 Pacific St., Suite 255A      (New!)
Monterey, CA 93940               (New!)
Phone: (831)333-9878            (New!)
Fax:   (831)648-8440
Email: bob.simons at noaa.gov
The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><
Received on Wed Feb 22 2017 - 11:38:38 GMT

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:42 BST
