Eizi TOYODA wrote:
> Hi all,
>
> Nobody here is trying to discourage the use of the micro sign. But
> what character encoding are you going to use?
>
> The micro sign (http://www.fileformat.info/info/unicode/char/00b5/index.htm)
> is the single byte 0xB5 in Latin-1 (aka ISO-8859-1) but becomes the
> two-byte sequence 0xC2 0xB5 in UTF-8. There is also the easily confused
> Greek small letter mu (http://www.fileformat.info/info/unicode/char/03bc/index.htm),
> which is 0xCE 0xBC in UTF-8. In short, this letter is bad for computer
> processing if we don't have a mechanism to specify the character encoding.
>
> The UDUNITS 2 API has an "encoding" argument, and users can choose
> ASCII, Latin-1, or UTF-8. Accordingly, the "udunits2" command has options
> -A, -L, and -U. For a library, it is enough that users have control and
> responsibility. But CF is a metadata standard, and metadata is exchanged
> among people, so it has to avoid confusion.
>
> The CF community has several options. I'd like to hear the community's views:
>
> (1) Create a global attribute to specify character encoding (like XML)
>     I believe this won't work.
> (2) Declare that CF uses UTF-8
>     Probably many people will simply ignore that and write a single
>     0xB5 as the micro sign.
> (3) Recommend only US-ASCII letters in the "units" attribute
>     Very conservative, but consistent with allowing only English in
>     standardized attributes.
> (4) Do nothing
>     I would have to warn programmers to anticipate any of the byte
>     patterns above. That would work as long as the micro sign is the
>     only extension beyond ASCII.
>
> Best Regards,
>
Strings stored in netCDF (e.g., variable names, attributes, String data in
netCDF-4) are interpreted as UTF-8, and there's no standard way to
indicate a different encoding. CF could add such a mechanism, but unless
it does, by default "CF uses UTF-8". It would probably be worth stating
this explicitly in the CF doc; I would advocate sticking with UTF-8.
Requiring US-ASCII for attributes that CF defines is also reasonable.
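
To make the byte patterns above concrete, here is a small Python sketch.
It only uses the standard library; the helper names hex_bytes and
is_ascii_units are made up for illustration and are not part of
udunits-2 or any CF tool. The ASCII check is just one possible way a
writer could test option (3).

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # U+00B5
GREEK_MU = "\u03bc"    # U+03BC

def hex_bytes(text, encoding):
    """Return the encoded form of text as space-separated hex bytes."""
    return " ".join(f"0x{b:02X}" for b in text.encode(encoding))

def is_ascii_units(units):
    """Return True if the units string contains only US-ASCII characters."""
    return units.isascii()

if __name__ == "__main__":
    # Official Unicode names of the two look-alike characters.
    print(unicodedata.name(MICRO_SIGN), "/", unicodedata.name(GREEK_MU))
    # Same character, different bytes depending on the encoding.
    print("micro sign, Latin-1:", hex_bytes(MICRO_SIGN, "latin-1"))
    print("micro sign, UTF-8:  ", hex_bytes(MICRO_SIGN, "utf-8"))
    # Greek mu has no Latin-1 encoding at all, so only UTF-8 is shown.
    print("Greek mu,   UTF-8:  ", hex_bytes(GREEK_MU, "utf-8"))
    # A spelled-out unit passes the ASCII check; a micro-sign prefix does not.
    print(is_ascii_units("micrometer"), is_ascii_units(MICRO_SIGN + "m"))
```
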
As always, there's a tension between CF creating "best practices" for
new file writers vs trying to define conformance (and staying backwards
compatible). I think spelling out the unit should be best practice, but
keeping a "backwards compatible" version of udunits-2 seems necessary.
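
For readers that have to cope with existing files, a tolerant decoder
along the following lines might be one way to handle the byte patterns
listed in option (4). This is only a sketch under my own assumptions: the
function name decode_units is hypothetical, it tries UTF-8 first and
falls back to Latin-1, and it arbitrarily picks the micro sign (U+00B5)
as the canonical form. It is not how udunits-2 itself behaves.

```python
MICRO_SIGN = "\u00b5"  # U+00B5 MICRO SIGN
GREEK_MU = "\u03bc"    # U+03BC GREEK SMALL LETTER MU

def decode_units(raw):
    """Decode the raw bytes of a units attribute whose encoding is unknown.

    Assumption: anything that is not valid UTF-8 is treated as Latin-1.
    A lone 0xB5 is not valid UTF-8, so it takes the fallback path.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        text = raw.decode("latin-1")
    # Fold the look-alike Greek mu into the micro sign (arbitrary choice).
    return text.replace(GREEK_MU, MICRO_SIGN)

if __name__ == "__main__":
    for raw in (b"\xb5m", b"\xc2\xb5m", b"\xce\xbcm", b"m s-1"):
        print(raw, "->", decode_units(raw))
```

The fallback is a heuristic: a Latin-1 string whose bytes happen to form
valid UTF-8 would be decoded as UTF-8, which is one more argument for
stating the encoding explicitly or spelling out the unit.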
Received on Sat Mar 27 2010 - 08:40:20 GMT