
[CF-metadata] Pre-proposal for "charset"

From: Chris Barker <chris.barker>
Date: Wed, 22 Feb 2017 11:06:03 -0800

On Wed, Feb 22, 2017 at 10:38 AM, Bob Simons - NOAA Federal <
bob.simons at noaa.gov> wrote:

> As for needing a different subject for the email: I'm lumping together 2
> new related attribute names: "charset=..." and "data_type=string|char" so
> that the information stored in char variables in netcdf-3 files can be
> easily and unambiguously interpreted.
>

Somehow it got smashed in with the thread about geometries; maybe that was
my email client. But anyway, away we go!


> You are correct. My proposal is for netcdf-3 files since they only support
> chars, not true strings.
>

so maybe make it clear that for netcdf4, one should use strings? I'm not
sure if there is anything in CF now that is 3 vs 4 specific...


> As for "encoding" vs "charset", I'm open to different names. I chose
> "charset" because that is the name used in HTML and is widely used in other
> places. Yes, XML uses "encoding". To me, the word "charset" seems
> preferable because it is more specific than "encoding" (which also has a
> more general purpose meaning).
>

not a biggie -- +0 for encoding from me.


> As for full Unicode support via UTF-8 vs UTF-16:
>

well, UTF-16 is the worst option -- let's never use that! UCS-4 is the way
to go if you want full Unicode support and a constant number of bytes per
character, though it "wastes" space.
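
A minimal Python sketch of the size trade-off, assuming Python's "utf-32-le"
codec as a stand-in for UCS-4 (the two are equivalent for valid Unicode). The
helper name encoded_sizes and the sample characters are illustrative only:

```python
# Sample characters: ASCII, a Latin-1 letter, a CJK character,
# and an emoji that lies outside the Basic Multilingual Plane.
samples = ["A", "é", "水", "😀"]

def encoded_sizes(ch):
    """Return the number of bytes one character occupies in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variants skip the byte-order mark so only the
        # character itself is counted.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

for ch in samples:
    print(repr(ch), encoded_sizes(ch))
```

UTF-32 is always 4 bytes per character, while UTF-16 is variable width (2 or
4 bytes) and so does not give a constant size either.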


> Since netcdf-3 only supports 8bit chars, the 16bit UTF-16 is not an option.
>

well, sure, but at the binary level a CHAR is simply an unsigned 8-bit
integer -- so you could stuff any encoding into an array of CHAR.
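
A rough sketch of that idea in plain Python, with no netCDF library involved.
The helpers to_char_array and from_char_array are hypothetical names, and NUL
padding to a fixed width is an assumption about how the char dimension would
be filled:

```python
def to_char_array(text, width, encoding="utf-8"):
    """Encode text and NUL-pad it to a fixed number of 8-bit values."""
    raw = text.encode(encoding)
    if len(raw) > width:
        raise ValueError(f"{len(raw)} encoded bytes do not fit in width {width}")
    return raw.ljust(width, b"\x00")

def from_char_array(raw, encoding="utf-8"):
    """Strip the NUL padding and decode with the declared encoding."""
    # Stripping trailing NULs is only safe for encodings that never emit
    # a NUL byte for real text, such as UTF-8 or ISO-8859-1.
    return raw.rstrip(b"\x00").decode(encoding)

stored = to_char_array("Æther 水", 16)
print(list(stored))             # every element is an integer in 0..255
print(from_char_array(stored))  # round-trips to the original text
```

The reader has to know which encoding was used to get the text back, which
is the information the proposed attribute would carry.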

> But UTF-8 is the only way I know of to support full Unicode using only
> 8bit chars for the underlying storage.
>

see above, but:


> It is very widely used. Every modern piece of software that can read or
> write text files supports it. It is the default for both XML and HTML 5.
>

yeah, it really is the best compromise -- and becoming the universal form
for data interchange.


> If the file writer doesn't need full Unicode, they can use "ISO-8859-1"
> (which is compatible with 7bit ASCII)
>

I'd vote for ASCII and ISO-8859-1 as the only options (Or the HIGHLY
RECOMMENDED options, at least).
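
To illustrate the compatibility point, here is a small Python sketch. The
sample strings and the helper same_bytes_everywhere are made up for the
example:

```python
def same_bytes_everywhere(text):
    """True if text encodes to identical bytes in ASCII, ISO-8859-1 and UTF-8."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return ascii_bytes == text.encode("iso-8859-1") == text.encode("utf-8")

# Pure 7-bit ASCII text is byte-for-byte identical in all three encodings.
print(same_bytes_everywhere("sea_water_temperature"))

# Anything beyond ASCII differs, so a reader must know the encoding.
latin1_bytes = "Málaga".encode("iso-8859-1")  # 6 bytes
utf8_bytes = "Málaga".encode("utf-8")         # 7 bytes
print(latin1_bytes, utf8_bytes)

# Decoding UTF-8 bytes as ISO-8859-1 silently yields the wrong text.
print(utf8_bytes.decode("iso-8859-1"))
```

So ASCII-only data is safe under either declaration; the attribute only
matters once a file contains bytes above 127.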

-CHB

-- 
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker at noaa.gov
Received on Wed Feb 22 2017 - 12:06:03 GMT
