
[CF-metadata] Pre-proposal for "charset"

From: Chris Barker <chris.barker>
Date: Mon, 6 Mar 2017 09:41:44 -0800

Hi all,

I tried to post this note last week, in response to the TRAC ticket, but it
doesn't seem to have gone through. Sorry if this is a repeat.

Note that it seems Bob has lost momentum on this one, and closed the
ticket. However, the fact that the OP is dropping out doesn't mean it's
not still a good idea. For my part, I think it IS a good idea, though I'm
also not motivated enough to push it through. Hopefully someone is
motivated enough to iron out the last details -- I think we are close.

TL;DR:

It is clear (to me, anyway) that an array of chars and an array of strings
are different things, so it makes enormous sense for CF to have a way to
clearly specify the distinction.

However, it is not clear whether there are enough use-cases in the wild (or
future?) that use arrays of chars as arrays of char, rather than as
strings. If there are not, then there isn't much point to this proposal
(though not much of a downside, either....)

What I intended to post last week:

I'm not sure if I can comment on a TRAC ticket (I don't seem to be able to)
so I'm putting this note here.

> I think _Encoding is good. I've just consulted the netCDF user guide, and
> I see they don't include _Encoding as one of their attribute conventions
> there. Yet the use of the underscore should imply it means something
> special to the netCDF library, according to their conventions


Ahh! I was wondering about that. Some searching has revealed:

"""
Note on char data: Although the characters used in netCDF names must be
encoded as UTF-8, character data may use other encodings. The variable
attribute "_Encoding" is reserved for this purpose in future
implementations.
"""

in:

http://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf/Classic-Format-Spec.html

So I think yes, _Encoding is special to netcdf, and thus the correct
spelling.

> I see you've combined your two tickets, and the choice between charset or
> _Encoding indicates whether it's char or string data. I'm not convinced
> still that we need this distinction. On the email list we are discussing
> Example H4.


IIUC, that was simply an example (and I have not been following the
discussion) of an ambiguous case -- not the only driver behind this idea.


> Are there any other cases
> where CF is ambiguous about whether a variable is a char array or a
> string?
>

It seems patently obvious to me that if a CHAR is a data type, then an ND
array of char type is a perfectly reasonable entity to use.

And if there is no STRING data type then a file reader has no idea whether
a 2D array of char is actually a 2D array of scalar char types or a 1D
array of strings, and yes, they would be read and used differently.

This gets particularly tricky if you want to convert from netcdf3 (no
string type) to netcdf4 (string type) -- do you convert the char array to a
string type?

This is not a specious example -- if I read a netcdf file with an
intelligent reader, I will likely convert an ND char array to an (N-1)D array
of strings in the "native" format (say, a numpy array of strings).

Then if I write that array out to a netcdf4 file, it would get written as a
String array. If the char array was intended to be an array of Strings,
this is great. If it was intended to be an array of individual chars, then
I will have just inadvertently changed the semantics of the data.

There is also the issue of specifying an encoding -- if you want to specify
an encoding for those chars without turning them into strings -- what do
you do?

All this being said -- the key question remains:

Are there any files out in the wild that DO use ND arrays of NC_CHAR that
are not intended to be interpreted as an (N-1)D array of Strings?

Prior art:

I just whipped up a little Python script to play with this (see enclosed).
If I create a netcdf3 file with a (6,4) array of char, then ncdump it, it
looks like this:

$ ncdump char_test.nc

netcdf char_test {
dimensions:
    first = 6 ;
    second = 4 ;

variables:
    char a_char_array(first, second) ;

data:
 a_char_array =
  "ABCD",
  "EFGH",
  "IJKL",
  "MNOP",
  "QRST",
  "UVWX" ;
}

So ncdump (i.e. CDL) is pretty much interpreting it as an array of strings.
Actually, not quite -- it is a 2D array of CHAR that you "write" as a 1D
array of strings...

However, if you load it via the Python netCDF4 library, you get a 2D array
of individual characters:

Size of array is: (6, 4)
datatype of array is: |S1
contents of array is:

[['A' 'B' 'C' 'D']
 ['E' 'F' 'G' 'H']
 ['I' 'J' 'K' 'L']
 ['M' 'N' 'O' 'P']
 ['Q' 'R' 'S' 'T']
 ['U' 'V' 'W' 'X']]

(Note that numpy does not have a char type, so each element is represented
as a string of length 1, i.e. a one-byte-per-char string.)

You can convert it into an array of strings if you want:

arr = np.fromstring(arr.tostring(), dtype='S%i' % arr.shape[1])

size of array is: (6,)
datatype of array is: |S4
contents of array is:

['ABCD' 'EFGH' 'IJKL' 'MNOP' 'QRST' 'UVWX']

But you'd need to know the intent of the data, to know that you need to do
that.
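
For anyone trying this on a current NumPy: `tostring()` and the binary mode of `np.fromstring()` used above have since been deprecated, so here is a minimal sketch of the same conversion using `tobytes()` and `np.frombuffer()`. The array is built by hand here as a stand-in for what the netCDF reader returns, and the variable names are only illustrative.

```python
import numpy as np

# A (6, 4) array of single bytes, standing in for the array read from the file.
rows = ["ABCD", "EFGH", "IJKL", "MNOP", "QRST", "UVWX"]
chars = np.array([list(row) for row in rows], dtype="S1")

# Collapse the last dimension: (6, 4) of S1 -> (6,) of S4.
width = chars.shape[1]
strings = np.frombuffer(chars.tobytes(), dtype="S%i" % width)

# And back again: (6,) of S4 -> (6, 4) of S1.
chars_again = np.frombuffer(strings.tobytes(), dtype="S1").reshape(-1, width)

print(strings.shape, strings.dtype)
print(chars_again.shape, chars_again.dtype)
```

Both directions are purely mechanical, which is the point: nothing in the bytes themselves says which of the two shapes the data producer had in mind.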

I haven't looked to see what a CF-aware library (like Iris or maybe
netcdf-Java) does with 2D arrays of characters.

In the end, in the Python world, we say that "explicit is better than
implicit" -- so while we probably need to say the default is for an ND array
of CHAR to be interpreted as an (N-1)D array of strings, having a way to say "I
really want this to be a char array" seems like a good idea to me -- what's
the downside?
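
To make the "explicit" option concrete, here is a rough sketch of how a reader might act on such a flag. It assumes my reading of the ticket, namely that a `charset` attribute marks true char data and `_Encoding` marks string data; that mapping, the `interpret_char_array` helper, the plain-dict `attrs`, and the ASCII fallback are all hypothetical, not anything CF or netCDF has adopted.

```python
import numpy as np

def interpret_char_array(chars, attrs):
    """Return the char array unchanged, or as an (N-1)D array of str.

    Assumption (illustrative only): 'charset' means "these really are
    individual chars"; otherwise the last dimension is treated as string
    length and decoded with '_Encoding' (ASCII if absent).
    """
    if "charset" in attrs:
        return chars  # explicit request: keep the ND char array as-is
    encoding = attrs.get("_Encoding", "ascii")  # fallback is an assumption
    width = chars.shape[-1]
    packed = np.frombuffer(chars.tobytes(), dtype="S%i" % width)
    # Decode each fixed-width byte string, then restore the leading dimensions.
    return np.char.decode(packed, encoding).reshape(chars.shape[:-1])

rows = ["ABCD", "EFGH", "IJKL", "MNOP", "QRST", "UVWX"]
chars = np.array([list(row) for row in rows], dtype="S1")

as_chars = interpret_char_array(chars, {"charset": "ascii"})
as_strings = interpret_char_array(chars, {"_Encoding": "utf-8"})
by_default = interpret_char_array(chars, {})

print(as_chars.shape, as_strings.shape, by_default.shape)
```

With something like this, the default stays "(N-1)D array of strings", and a file that really means chars just has to say so.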

-CHB




-- 
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker at noaa.gov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_char.py
Type: text/x-python-script
Size: 1028 bytes
Desc: not available
URL: <http://mailman.cgd.ucar.edu/pipermail/cf-metadata/attachments/20170306/dcacaab6/attachment.bin>
Received on Mon Mar 06 2017 - 10:41:44 GMT
