
[CF-metadata] Pre-proposal for "charset"

From: Bob Simons - NOAA Federal <bob.simons>
Date: Wed, 22 Feb 2017 12:08:24 -0800

I don't like "ASCII" because it only applies to 7 bits even though chars
have 8 bits. So specifying "ASCII" still leaves ambiguity if any of the
chars have the 8th bit set. The file writer may know the variable will only
have 7 bit values, but it is safer for the reader to read the variable with
a decoder that handles 8 bit values. "ASCII" is trouble, so there is no
reason to encourage it, especially when there are compatible alternatives
like ISO-8859-1.

I do like ISO-8859-1, because
* It is compatible with ASCII for chars 0-127, which is all that ASCII
specifies.
* Any variable that has just 7bit ASCII chars can be labelled
"charset=ISO-8859-1".
* It is the most commonly used single-page 8bit charset for supporting the
European languages.
* It is widely used and supported.
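
A minimal sketch of the points above, using Python's built-in codecs. The byte strings are made-up sample values standing in for the contents of a netcdf-3 char variable, not data from any real file:

```python
# Made-up sample bytes standing in for the contents of a char variable.
seven_bit = b"temperature"   # every byte is in the range 0-127
eight_bit = b"caf\xe9"       # 0xE9 has the 8th bit set

# For pure 7-bit data, ASCII and ISO-8859-1 decode to the same text.
same_for_7bit = seven_bit.decode("ascii") == seven_bit.decode("iso-8859-1")

# A strict ASCII decoder rejects a byte with the 8th bit set...
try:
    eight_bit.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ...while ISO-8859-1 assigns a character to every byte value 0-255.
latin1_text = eight_bit.decode("iso-8859-1")
all_bytes_decodable = len(bytes(range(256)).decode("iso-8859-1")) == 256

print(same_for_7bit, ascii_ok, latin1_text, all_bytes_decodable)
```

So a reader that always decodes with ISO-8859-1 handles 7-bit data identically to an ASCII reader and does not fail on 8-bit values.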

I do like UTF-8 because it is the only charset that supports full Unicode
(all UTF-16/UCS-4/UTF-32 characters) in an 8bit encoding (since that is all
we have for characters in netcdf-3 files: 8bit chars). And it is incredibly
widely used and supported in software.

UTF-16/UTF-32/UCS-4 are not possible options because netcdf-3 files only
have an 8bit char data type, not 16 or 32bit chars. If we want to support
more than 256 different characters in a given char variable, UTF-8 is
really the only option (which is fine because it is a good option).
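
To illustrate the storage argument, here is a small sketch in Python. The sample string is my own choice for illustration; it mixes ASCII, non-Latin-1 characters, and one character outside the 16-bit range:

```python
import struct

# Illustrative sample: Greek pi, "almost equal" sign, ASCII digits,
# and U+1F30A, which lies outside the 16-bit range.
text = "\u03c0 \u2248 3.14 \U0001F30A"

# UTF-8 yields a sequence of 8-bit values, so it fits a char array,
# and decoding restores the original text exactly.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
n_chars = len(text)         # number of Unicode characters
n_bytes = len(utf8_bytes)   # number of 8-bit chars needed to store them

# UTF-16 is defined in terms of 16-bit code units; their values do not
# fit in a single 8-bit char.
utf16_bytes = text.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(utf16_bytes) // 2), utf16_bytes)
max_unit = max(units)

print(n_chars, n_bytes, hex(max_unit))
```

One consequence for file writers: the char dimension has to be sized for the number of UTF-8 bytes, which can be larger than the number of characters.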

So my proposal is: charset can specify any single-page (8bit) character
set, but the two recommended charsets would be "ISO-8859-1" (for most
simple cases) and "UTF-8" (for harder cases / full Unicode).
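
As a rough sketch of how a reader might act on such an attribute: the helper names `decode_chars` and `is_recommended` below are hypothetical, invented for this illustration, and are not part of CF or of any netCDF library. The sketch assumes the raw 8-bit chars and the attribute value have already been read from the file by other means.

```python
import codecs

# The two charsets recommended in the proposal above.
RECOMMENDED = ("ISO-8859-1", "UTF-8")

def decode_chars(raw, charset):
    """Hypothetical helper: decode raw 8-bit chars using the value of
    the variable's charset attribute."""
    codec = codecs.lookup(charset)   # raises LookupError for unknown names
    text, _consumed = codec.decode(raw)
    return text

def is_recommended(charset):
    """Hypothetical helper: True if charset names one of the two
    recommended charsets, allowing for aliases and letter case."""
    canonical = {codecs.lookup(name).name for name in RECOMMENDED}
    return codecs.lookup(charset).name in canonical

# The same word stored under each recommended charset.
print(decode_chars(b"caf\xe9", "ISO-8859-1"))
print(decode_chars(b"caf\xc3\xa9", "UTF-8"))
print(is_recommended("utf-8"), is_recommended("cp1252"))
```

Any other single-page charset that the decoding library knows would pass through `decode_chars` the same way; `is_recommended` only reports whether it is one of the two suggested names.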




On Wed, Feb 22, 2017 at 11:06 AM, Chris Barker <chris.barker at noaa.gov>
wrote:

> On Wed, Feb 22, 2017 at 10:38 AM, Bob Simons - NOAA Federal <
> bob.simons at noaa.gov> wrote:
>
>> As for needing a different subject for the email: I'm lumping together 2
>> new related attribute names: "charset=..." and "data_type=string|char" so
>> that the information stored in char variables in netcdf-3 files can be
>> easily and unambiguously interpreted.
>>
>
> somehow it got smashed in with the thread about geometries.. maybe that
> was my email client. But anyway, away we go!
>
>
>> You are correct. My proposal is for netcdf-3 files since they only
>> support chars, not true strings.
>>
>
> so maybe make it clear that for netcdf4, one should use strings? I'm not
> sure if there is anything in CF now that is 3 vs 4 specific...
>
>
>> As for "encoding" vs "charset", I'm open to different names. I chose
>> "charset" because that is the name used in HTML and is widely used in other
>> places. Yes, XML uses "encoding". To me, the word "charset" seems
>> preferable because it is more specific than "encoding" (which also has a
>> more general purpose meaning).
>>
>
> not a biggie -- +0 for encoding from me.
>
>
>> As for full Unicode support via UTF-8 vs UTF-16:
>>
>
> well, UTF-16 is the worst option -- let's never use that! UCS-4 is the way
> to go if you want full Unicode support and constant bytes per character,
> though it "wastes" space.
>
>
>> Since netcdf-3 only supports 8bit chars, the 16bit UTF-16 is not an
>> option.
>>
>
> well, sure, but at the binary level a CHAR is simply an unsigned 8-bit
> integer -- so you could stuff any encoding into an array of CHAR.
>
>> But UTF-8 is the only way I know of to support full Unicode using only
>> 8bit chars for the underlying storage.
>>
>
> see above, but:
>
>
>> It is very widely used. Every modern piece of software that can read or
>> write text files supports it. It is the default for both XML and HTML 5.
>>
>
> yeah, it really is the best compromise -- and becoming the universal form
> for data interchange.
>
>
>> If the file writer doesn't need full Unicode, they can use "ISO-8859-1"
>> (which is compatible with 7bit ASCII)
>>
>
> I'd vote for ASCII and ISO-8859-1 as the only options (Or the HIGHLY
> RECOMMENDED options, at least).
>
> -CHB
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception
>
> Chris.Barker at noaa.gov
>



-- 
Sincerely,
Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
99 Pacific St., Suite 255A      (New!)
Monterey, CA 93940               (New!)
Phone: (831)333-9878            (New!)
Fax:   (831)648-8440
Email: bob.simons at noaa.gov
The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><
Received on Wed Feb 22 2017 - 13:08:24 GMT
