[CF-metadata] Pre-proposal for "charset" from Chris Barker on 2017-02-28 (Archive of CF discussions from 2002 to 2019 on the cf-metadata mailing list)

From: Chris Barker <chris.barker>
Date: Mon, 27 Feb 2017 17:07:32 -0800

On Wed, Feb 22, 2017 at 12:08 PM, Bob Simons - NOAA Federal <
bob.simons at noaa.gov> wrote:

> I do like ISO-8859-1, because
> * It is compatible with ASCII for chars 0-127, which is all that ASCII
> specifies.
> * Any variable that has just 7bit ASCII chars can be labelled
> "charset=ISO-8859-1".
> * It is the most commonly used single-page 8bit charset for supporting the
> European languages.
> * It is widely used and supported.
>

All good. And I don't know if this is only the Python implementation, but
at least in Python, 8859-1 can read ANY binary data, and it round-trips
through a "proper" unicode object to get the same bytes back.

i.e. if the data are not 8859-1 or are malformed for some reason, the
8859-1 decoder will not error out on any input, and if you re-encode it,
you'll get back the same bytes you started with. Really nice property.
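
A minimal sketch of that round-trip property, assuming Python 3's built-in
"iso-8859-1" codec (the variable names are only for illustration):

```python
# Every possible byte value, including sequences that are not valid UTF-8.
raw = bytes(range(256))

# Decoding as ISO-8859-1 never raises: each byte maps to exactly one
# code point in the range U+0000..U+00FF.
text = raw.decode("iso-8859-1")

# Re-encoding gives back exactly the bytes we started with.
round_tripped = text.encode("iso-8859-1")

# By contrast, a strict UTF-8 decoder rejects arbitrary binary data.
try:
    raw.decode("utf-8")
    utf8_accepts_anything = True
except UnicodeDecodeError:
    utf8_accepts_anything = False
```

Note that this only shows the bytes survive the trip; it says nothing about
whether the decoded text is what the file's author meant.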

> I do like UTF-8 because it is the only charset that supports full Unicode
> (all UTF-16/UCS-4/UTF-32 characters) in an 8bit encoding (since that is all
> we have for characters in netcdf-3 files: 8bit chars).
>

Again, I think this is a non-issue -- UTF-32 uses 4 bytes per character,
i.e. 4 chars per code point. There's no reason you couldn't put UTF-32
encoded data in a char array (C programmers do it all the time :-) )
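
As a rough illustration, assuming Python 3 and an arbitrary sample string,
UTF-32 data is just a sequence of bytes, 4 per code point, so it fits in an
array of 8-bit chars like any other encoding:

```python
# Sample text: ASCII, one Latin-1 letter (U+00EF) and one symbol (U+2211).
s = "na\u00efve \u2211"

# Explicit little-endian byte order, so no byte-order mark is prepended.
utf32_bytes = s.encode("utf-32-le")
utf8_bytes = s.encode("utf-8")

# UTF-32 is fixed width: always 4 bytes per code point.
bytes_per_code_point = len(utf32_bytes) // len(s)

# The byte sequence decodes back to the original string.
decoded = utf32_bytes.decode("utf-32-le")
```

The cost is size: here the UTF-32 form takes 28 bytes where UTF-8 takes 10.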

> And it is incredibly widely used and supported in software.

All the rest of your reasons are good -- UTF-8 is the best choice.

> So my proposal is: charset can specify any single-page (8bit) character
> set, but the two recommended charsets would be "ISO-8859-1" (for most
> simple cases) and "UTF-8" (for harder cases / full Unicode).
>

Sounds good, though part of me wants to say that "ISO-8859-1" and "UTF-8"
should be the only options!

(darn those legacy files!)

Also -- I don't think you can call UTF-8 an 8bit character set.
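
A small sketch of why, assuming Python 3 (the sample characters are
arbitrary): UTF-8 uses 8-bit code units, but a single character takes
anywhere from 1 to 4 of them, whereas ISO-8859-1 is always exactly one byte
per character and cannot represent anything above U+00FF.

```python
# One character from each UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

# Number of bytes each single character needs in UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# ISO-8859-1 handles the first two in one byte each...
latin1_lengths = [len(ch.encode("iso-8859-1")) for ch in samples[:2]]

# ...but has no byte at all for the euro sign.
try:
    "\u20ac".encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
```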

I'd also like the word "encoding" to be used instead of character set
wherever possible. "charset" comes from, and still implies, a 1-byte per
character system.

But that's really a nitpick.

-CHB

-- 
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker at noaa.gov

Received on Mon Feb 27 2017 - 18:07:32 GMT

This archive was generated by hypermail 2.3.0 : Tue Sep 13 2022 - 23:02:42 BST