
[CF-metadata] Pre-proposal for "charset"

From: Jonathan Gregory <j.m.gregory>
Date: Mon, 6 Mar 2017 17:47:31 +0000

Dear Chris

Yes, we can reopen the ticket. I think the _Encoding for char is a good idea,
especially if it's an NUG convention.
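
To illustrate why an encoding attribute matters for char data, here is a
minimal sketch in plain Python. It does not use any netCDF library, and the
byte values are made up for the example; it only shows that the same stored
bytes read differently (or not at all) depending on the assumed encoding.

```python
# Hypothetical raw bytes as they might sit in a char variable:
# "caf" followed by the single byte 0xE9.
raw = bytes([0x63, 0x61, 0x66, 0xE9])

# Interpreted as Latin-1, every byte maps to exactly one character.
as_latin1 = raw.decode("latin-1")

# Interpreted as UTF-8, the lone 0xE9 byte is not a valid sequence,
# so decoding fails.
try:
    as_utf8 = raw.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None

print(repr(as_latin1))
print(as_utf8)
```

Without something like _Encoding recorded in the file, a reader can only
guess which of these interpretations the writer had in mind.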

> Are there any files out in the wild that DO use ND arrays of NC_CHAR that
> are not intended to be interpreted as a (N-1)D array of Strings?

That is the question. In particular, since this is the CF convention we're
talking about, are there any char arrays which are part of CF, where the
intent is not clear?
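
For what it's worth, the ambiguity can be sketched in plain Python, without
netCDF or numpy. The helper names below (chars_to_strings, strings_to_chars)
are hypothetical, chosen only for this illustration; the data mirrors the
a_char_array example in the forwarded message.

```python
# A (6, 4) "char array": six rows of four one-byte characters.
chars = [[bytes([65 + 4 * i + j]) for j in range(4)] for i in range(6)]


def chars_to_strings(rows):
    """Collapse the last dimension: (N, M) chars -> (N,) strings."""
    return [b"".join(row) for row in rows]


def strings_to_chars(strings):
    """Expand each string back into a row of one-byte characters."""
    return [[s[k:k + 1] for k in range(len(s))] for s in strings]


strings = chars_to_strings(chars)
shape_before = (len(chars), len(chars[0]))
shape_after = (len(strings),)
round_trip = strings_to_chars(strings)

print(shape_before, "->", shape_after)
print(strings)
print(round_trip == chars)
```

The bytes survive the round trip, but the dimensionality does not: nothing in
the data itself says whether (6, 4) chars or (6,) strings was intended, which
is why the intent has to be stated somewhere else.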

Cheers

Jonathan

----- Forwarded message from Chris Barker <chris.barker at noaa.gov> -----

> Date: Mon, 6 Mar 2017 09:41:44 -0800
> From: Chris Barker <chris.barker at noaa.gov>
> To: "cf-metadata at cgd.ucar.edu" <cf-metadata at cgd.ucar.edu>
> Subject: Re: [CF-metadata] Pre-proposal for "charset"
>
> Hi all,
>
> I tried to post this note last week, in response to the TRAC ticket, but it
> doesn't seem to have gone through. Sorry if this is a repeat.
>
> Note that it seems Bob has lost momentum on this one, and closed the
> ticket. However, the fact that the OP is dropping out doesn't mean it has
> stopped being a good idea. For my part, I think it IS a good idea, though I'm
> also not motivated enough to push it through. Hopefully someone is
> motivated enough to iron out the last details -- I think we are close.
>
> TL;DR:
>
> It is clear (to me, anyway) that an array of chars and an array of strings
> are different things, so it makes enormous sense for CF to have a way to
> clearly specify the distinction.
>
> However, it is not clear whether there are enough use-cases in the wild (or
> future?) that use arrays of chars as arrays of char, rather than as
> strings. If there are not, then there isn't much point to this proposal
> (though not much of a downside, either....)
>
> What I intended to post last week:
>
> I'm not sure if I can comment on a TRAC ticket (I don't seem to be able to)
> so I'm putting this note here.
>
> > I think _Encoding is good. I've just consulted the netCDF user guide, and
> > I see they don't include _Encoding as one of their attribute conventions
> > there. Yet the use of the underscore should imply it means something
> > special to the netCDF library, according to their conventions
>
>
> Ahh! I was wondering about that. Some searching has revealed:
>
> """
> Note on char data: Although the characters used in netCDF names must be
> encoded as UTF-8, character data may use other encodings. The variable
> attribute "_Encoding" is reserved for this purpose in future
> implementations.
> """
>
> in:
>
> http://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf/Classic-Format-Spec.html
>
> So I think yes, _Encoding is special to netcdf, and thus the correct
> spelling.
>
> > I see you've combined your two tickets, and the choice between charset or
> > _Encoding indicates whether it's char or string data. I'm not convinced
> > still that we need this distinction. On the email list we are discussing
> > Example H4.
>
>
> IIUC, that was simply an example ( and I have not been following the
> discussion ) of an ambiguous case -- not the only driver behind this idea.
>
>
> > Are there any other cases
> > where CF is ambiguous about whether a variable is a char array or a
> > string?
> >
>
> It seems patently obvious to me that if a CHAR is a data type, then an ND
> array of char type is a perfectly reasonable entity to use.
>
> And if there is no STRING data type then a file reader has no idea whether
> a 2D array of char is actually a 2D array of scalar char types or a 1D
> array of strings, and yes, they would be read and used differently.
>
> This gets particularly tricky if you want to convert from netcdf3 (no
> string type) to netcdf4(string type) -- do you convert the char array to a
> string type?
>
> This is not a specious example -- if I read a netcdf file with an
> intelligent reader, I will likely convert a ND char array to a (N-1)D array
> of strings in the "native" format. (say a numpy array of strings).
>
> Then if I write that array out to a netcdf4 file, it would get written as a
> String array. If the char array was intended to be an array of Strings,
> this is great. If it was intended to be an array of individual chars, then
> I will have just inadvertently changed the semantics of the data.
>
> There is also the issue of specifying an encoding -- if you want to specify
> an encoding for those chars without turning them into strings -- what do
> you do?
>
> All this being said -- the key question remains:
>
> Are there any files out in the wild that DO use ND arrays of NC_CHAR that
> are not intended to be interpreted as a (N-1)D array of Strings?
>
> Prior art:
>
> I just whipped up a little Python script to play with this (see enclosed).
> If I create a netcdf3 file with a (6,4) array of char, then ncdump it, it
> looks like this:
>
> $ ncdump char_test.nc
>
> netcdf char_test {
> dimensions:
> first = 6 ;
> second = 4 ;
>
> variables:
> char a_char_array(first, second) ;
>
> data:
> a_char_array =
> "ABCD",
> "EFGH",
> "IJKL",
> "MNOP",
> "QRST",
> "UVWX" ;
> }
>
> So ncdump (i.e. CDL) is pretty much interpreting it as an array of strings.
> Actually, not quite -- it is a 2D array of CHAR that you "write" as a 1D
> array of strings...
>
> However, if you load it via the Python netCDF4 library, you get a 2D array
> of individual characters.
>
> Size of array is: (6, 4)
> datatype of array is: |S1
> contents of array is:
>
> [['A' 'B' 'C' 'D']
> ['E' 'F' 'G' 'H']
> ['I' 'J' 'K' 'L']
> ['M' 'N' 'O' 'P']
> ['Q' 'R' 'S' 'T']
> ['U' 'V' 'W' 'X']]
>
> (Note that numpy does not have a char type, so it is represented as a
> string of length 1, i.e. a one-byte-per-char string.)
>
> You can convert it into an array of strings if you want:
>
> arr = np.frombuffer(arr.tobytes(), dtype='S%i' % arr.shape[1])
>
> size of array is: (6,)
> datatype of array is: |S4
> contents of array is:
>
> ['ABCD' 'EFGH' 'IJKL' 'MNOP' 'QRST' 'UVWX']
>
> But you'd need to know the intent of the data, to know that you need to do
> that.
>
> I haven't looked to see what a CF-aware library (Like Iris or maybe
> netcdf-Java) does with 2D arrays of characters.
>
> In the end, in the Python world, we say that "explicit is better than
> implicit" -- so while we probably need to say the default is for a ND array
> of CHAR to be interpreted as a (N-1)D array of strings, having a way to say "I
> really want this to be a char array" seems like a good idea to me -- what's
> the downside?
>
> -CHB
>
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception
>
> Chris.Barker at noaa.gov

> #!/usr/bin/env python
>
> # code to test char data in netcdf
>
> import numpy as np
> import netCDF4
>
> # create an example netCDF-3 file holding a (6, 4) char variable
> ds = netCDF4.Dataset("char_test.nc", 'w', format='NETCDF3_CLASSIC')
> first = ds.createDimension("first", 6)
> second = ds.createDimension("second", 4)
>
> char_var = ds.createVariable("a_char_array", 'c', ('first', 'second'))
>
> # fill with 'A'..'X', one character per element
> for i in range(len(first)):
>     for j in range(len(second)):
>         char_var[i, j] = chr(65 + (i * len(second)) + j)
>
> ds.close()
>
> # read it:
> ds = netCDF4.Dataset("char_test.nc")
>
> var = ds.variables['a_char_array']
> print("the netCDF4 variable object:")
> print(var)
>
> arr = var[:]
> print("Now a 2D array of characters")
> print("size of array is:", arr.shape)
> print("datatype of array is:", arr.dtype)
> print("contents of array is:")
> print(arr)
>
> # convert to array of strings: reinterpret each row of N one-byte
> # chars as a single N-byte string
> arr = np.frombuffer(arr.tobytes(), dtype='S%i' % arr.shape[1])
> print()
> print("Now a 1D array of strings")
> print("size of array is:", arr.shape)
> print("datatype of array is:", arr.dtype)
> print("contents of array is:")
> print(arr)
>



----- End forwarded message -----
Received on Mon Mar 06 2017 - 10:47:31 GMT

