--- I have a slight preference for (2), because it is cleaner and might be better in the future (I don't know the implications for nc4 and CF2).

Thoughts? Votes?

On Mon, Feb 6, 2017 at 3:08 PM, Bob Simons - NOAA Federal <bob.simons at noaa.gov> wrote:

> Before I make a formal CF proposal for a "charset" attribute, I would like
> to get comments and suggestions from all of you.
>
> This is a proposal to solve the problem of distinguishing strings from
> arrays of characters and the problem of identifying the string's character
> encoding. Presumably, it would be appended to section 2.2.
>
> An example of actual need is: Many/most current uses of multidimensional
> char arrays are intended to be interpreted as Strings. But some files,
> e.g., Argo profile float profiles, have single char data that are stored in
> char arrays.
>
> Another example: while most nc files just use 7-bit ASCII characters in
> strings, some use 8-bit characters. Some such files appear to use
> charset=Windows-1252, others use Mac OS Roman, others use ISO-8859-1, but
> the charset is not specified and there is currently no official CF way
> to specify it.
>
> Another advantage of this proposal is that it provides a way to support
> Unicode (and thus all of the world's languages) via the UTF-8 encoding,
> which is useful as we increasingly work with people from non-US,
> non-European countries.
>
> A possible extension of this is to allow a few special additional
> pseudo-charset names:
> * "HTML" - the chars are to be interpreted as an array of Strings with
>   HTML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
>   be encoded using the &#d; format where d is the decimal number of a
>   Unicode character.
> * "XML" - the chars are to be interpreted as an array of Strings with
>   XML content, using the ISO-8859-1 charset. Non-ISO-8859-1 characters must
>   be encoded using the &#d; format where d is the decimal number of a
>   Unicode character.
>
> Thank you for considering this.
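To illustrate the ambiguity described above, here is a minimal Python sketch. It assumes a hypothetical char array holding the bytes `5 \x80` with no declared charset; the variable and function names are made up for illustration and are not part of CF or any netCDF library. The three codec names are Python's names for Windows-1252, Mac OS Roman, and ISO-8859-1.

```python
# Hypothetical bytes read from a char variable that has no "charset" attribute.
raw = b"5 \x80"

def decode_candidates(data, charsets=("cp1252", "mac_roman", "latin-1")):
    """Decode the same bytes under each candidate 8-bit charset."""
    return {name: data.decode(name) for name in charsets}

candidates = decode_candidates(raw)
for name, text in candidates.items():
    # The same byte 0x80 yields a different character under each charset.
    print(f"{name:10s} -> {text!r}")
```

Each charset gives a different string for the same stored bytes, which is why a reader cannot recover the intended text without knowing the encoding.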
>
>
> --- The Actual Pre-Proposal
> Use the "charset" attribute to indicate that a multidimensional
> char array should be interpreted as an array of Strings,
> not an array of individual characters.
> The value of "charset" also serves to specify the character set
> used to encode the strings
> and must be the name of one of the 8-bit encodings
> (since CF chars are 8-bits) listed at
> http://www.iana.org/assignments/character-sets/character-sets.xhtml .
> Charset names are case-insensitive.
> The only charsets which are recommended are "ISO-8859-1" and "UTF-8".
> For backwards compatibility, if "charset" is not defined,
> it remains ambiguous whether a char array should be interpreted as
> holding an array of individual characters or an array of Strings.
>
>
> --- An Example: Encoding three Strings: "It", "Book", and "5 €".
> The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
> which is 8364 (in decimal).
> The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in hexadecimal).
> So a file would store these strings in a char array as:
> dimensions
>   words = 3;
>   strLen = 5;
> char myWords[words][strLen] = "It[0][0][0]", "Book[0]", "5 [E2][82][AC]";
>   charset = "UTF-8";
>
>
> --
> Sincerely,
>
> Bob Simons
> IT Specialist
> Environmental Research Division
> NOAA Southwest Fisheries Science Center
> 99 Pacific St., Suite 255A (New!)
> Monterey, CA 93940 (New!)
> Phone: (831)333-9878 (New!)
> Fax: (831)648-8440
> Email: bob.simons at noaa.gov
>
> The contents of this message are mine personally and
> do not necessarily reflect any position of the
> Government or the National Oceanic and Atmospheric Administration.
> <>< <>< <>< <>< <>< <>< <>< <>< <><
>
>

--
Sincerely,

Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
99 Pacific St., Suite 255A (New!)
Monterey, CA 93940 (New!)
Phone: (831)333-9878 (New!)
Fax: (831)648-8440
Email: bob.simons at noaa.gov

The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><

Received on Wed Feb 08 2017 - 11:00:32 GMT
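The storage layout in the pre-proposal's example ("It", "Book", and "5 €" in a 3 x 5 char array with charset = "UTF-8") can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `pack_strings` and `unpack_strings` are hypothetical, NUL bytes are assumed as padding (matching the `[0]` entries in the example), and no netCDF library is involved.

```python
def pack_strings(strings, str_len, charset="utf-8"):
    """Encode each string and NUL-pad it to str_len bytes (hypothetical helper)."""
    rows = []
    for s in strings:
        encoded = s.encode(charset)
        if len(encoded) > str_len:
            # str_len counts bytes, not characters, so multi-byte
            # characters such as the Euro sign use up several slots.
            raise ValueError(f"{s!r} needs {len(encoded)} bytes, limit is {str_len}")
        rows.append(encoded.ljust(str_len, b"\x00"))
    return rows

def unpack_strings(rows, charset="utf-8"):
    """Strip trailing NUL padding and decode each row back to a string."""
    return [row.rstrip(b"\x00").decode(charset) for row in rows]

words = ["It", "Book", "5 \u20ac"]      # \u20ac is the Euro sign
rows = pack_strings(words, str_len=5)   # strLen = 5 as in the example
for row in rows:
    print(row)
```

Note that "5 €" is three characters but five bytes in UTF-8, so it fills the row with no padding; a dimension sized by character count would be too small.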