Opened 8 years ago
Last modified 2 years ago
#99 new enhancement
Taxon Names and Identifiers
Reported by: | lowry | Owned by: | cf-conventions@… |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | cf-conventions | Version: | |
Keywords: | Cc: |
Description
New section to be added to the Convention
6.1.2 Taxon Names and Identifiers

A taxon is a named level within a biological classification, such as a class, genus or species. Within the marine environment there are at least half a million taxa. However, CF isn't confined to the marine environment, so the number runs into millions, even billions. When a variable in CF describes a property of a taxon, such as its numeric concentration or abundance, one approach would be to incorporate the taxon name into the Standard Name. However, experience with other parameter vocabularies has shown that this can quickly become unsustainable. Consequently, taxonomic names are handled in a similar manner to geographic names, using a generic Standard Name for the data variable plus co-ordinate variables to carry the label text. The data variable is labelled using Standard Names of the form 'property_of_taxon_in_medium'. For example, taxon abundance in a water body would be described by the Standard Name 'number_concentration_of_taxon_in_sea_water'. The labelling co-ordinate variables have the Standard Names 'taxon_name' and 'taxon_identifier'.

The taxon name included in the data must be taken from a recognised source. Currently, these are the World Register of Marine Species or WoRMS (http://www.marinespecies.org/), which is the preferred resource for the marine environment, and the Integrated Taxonomic Information System or ITIS (http://www.itis.gov/) for terrestrial flora and fauna. Note that the only requirement for CF is that the name used is registered in at least one of the named resources. It does not have to be designated as 'valid'.

The taxon_identifier from either WoRMS (the aphia ID) or ITIS (the taxonomic serial number or TSN) needs to include a namespace string, either 'aphia:' or 'tsn:'. For example, Calanus finmarchicus is encoded as either 'aphia:104464' or 'tsn:85272'. For the marine domain WoRMS has more complete coverage and so aphia IDs are preferred.
Example 6.3

This example shows how the taxonomic information would be encoded for a simple time series of abundance for two taxa. For clarity, a lot of information, such as the time variable, has been omitted.

dimensions:
  time=1000;
  string80=80;
  taxon=2;
variables:
  float abundance(time,taxon);
    abundance:standard_name="number_concentration_of_taxon_in_sea_water";
    abundance:coordinates="taxon_identifier taxon_name";
  char taxon_name(taxon,string80);
    taxon_name:standard_name="taxon_name";
  char taxon_identifier(taxon,string80);
    taxon_identifier:standard_name="taxon_identifier";
data:
  taxon_name = "Calanus finmarchicus", "Calanus helgolandicus";
  taxon_identifier = "aphia:104464", "aphia:104466";
Consequences for Standard Names
The following new Standard Names are required to describe the label variables and to support the bacterial data request that inspired the creation of this ticket. One more has been included in support of the above example.

taxon_name
The human-readable label for the taxon such as Calanus finmarchicus. The label should be registered in either WoRMS or ITIS and spelled exactly as registered.

taxon_identifier
The machine-readable identifier for the taxon registration in either WoRMS (the aphia ID) or ITIS (the taxonomic serial number or TSN), including namespace. The namespace strings are 'aphia:' or 'tsn:'. For example, Calanus finmarchicus is encoded as either 'aphia:104464' or 'tsn:85272'. For the marine domain WoRMS has more complete coverage and so aphia IDs are preferred.

colony_forming_unit_number_concentration_of_taxon_in_sea_water
"Colony Forming Unit" means an estimate of the viable bacterial or fungal numbers determined by counting colonies grown from a sample. "Number concentration" means the number of particles or other specified objects per unit volume. "Taxon" means an organism named in the taxon_name and taxon_identifier variables.

number_concentration_of_taxon_in_sea_water
"Number concentration" means the number of particles or other specified objects per unit volume. "Taxon" means an organism named in the taxon_name and taxon_identifier variables.
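The namespaced identifier form described in this proposal ('aphia:104464', 'tsn:85272') can be split mechanically by software reading a file. The following is a minimal Python sketch and not part of the proposal: the function name parse_taxon_identifier and the KNOWN_NAMESPACES table are hypothetical, and it assumes the local part is always an unsigned integer, as in the two examples given.

```python
# Hypothetical helper: splits a namespaced taxon_identifier value.
# The two namespaces are the ones named in the proposal text.
KNOWN_NAMESPACES = {"aphia": "WoRMS", "tsn": "ITIS"}


def parse_taxon_identifier(value):
    """Return (namespace, numeric_id) for strings such as 'aphia:104464'."""
    namespace, sep, local_id = value.strip().partition(":")
    if not sep or namespace not in KNOWN_NAMESPACES:
        raise ValueError(f"missing or unknown namespace in {value!r}")
    # Assumption: aphia IDs and TSNs are plain unsigned integers.
    if not (local_id.isascii() and local_id.isdigit()):
        raise ValueError(f"non-numeric identifier in {value!r}")
    return namespace, int(local_id)


namespace, number = parse_taxon_identifier("aphia:104464")
print(KNOWN_NAMESPACES[namespace], number)
```

Keeping the namespace inside the string is what lets a single taxon_identifier variable hold entries from either source.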
Change History (26)
comment:1 Changed 8 years ago by jonathan
comment:2 Changed 8 years ago by graybeal
A few suggestions on this, which I love to see proposed.
The use of the term 'label' with respect to standard names is a bit confusing, since 'label' is something I'd expect the long name to do. An example: "using a generic Standard Name for the data variable plus co-ordinate variables to carry the label text. The data variable is labelled using Standard Names of the form 'property_of_taxon_in_medium'." The terms used in the body of the standard when speaking of the standard name are always either 'identify' or 'describe', which is more appropriate for a standard name.
I had to reread Jonathan's suggestion, and I think it amounts to this: "Both taxon name and taxon identifier are required, to maximize understanding and interoperability. They may be obtained, as a corresponding pair, from either WORMS or ITIS."
First, are we sure we want to specify these are the only two acceptable sources, just because they are the two most prominent/recognizable/acceptable sources at this time? Since the identifier effectively identifies the source, I'm inclined to accept any source the user deems acceptable, perhaps strongly recommending these two. But your judgment works for me here.
I'm OK with requiring both name and ID. I'm a little confused by Jonathan's last paragraph, particularly "missing data can be given for any taxon which doesn't have an identifier". If they are in ITIS or WORMS, they have an identifier. If they aren't -- and this is partly why I suggested not constraining to ITIS and WORMS -- they need to have an identifier from _somewhere_, or they aren't really a taxon. I was rather hoping the identifier could be a URL, or at least a URN, but it appears ITIS doesn't provide such a thing. (!?!)
In any case, the suggested handling of a missing ID made me wonder if we are talking about two possible usage scenarios. (1) The user variable being described is simply a number (number_concentration, for example), with many measurements being taken, and they all have the same taxon info. (2) There are 3 variables being described; the measurement number, and one or two variables that describe the taxon for that particular measurement number. In this case the additional variables' values define the meaning of the primary variable, which could be different for each 'row'.
Are we proposing this solution for case (1), case (2), or both?
Regarding the conformance of the name to the ID, I think we should stipulate that one of these two values is authoritative, and the other is informative. Since the ID is truly what uniquely specifies the taxon in the database, I think it should be considered authoritative; the other is explanatory text. The name for that ID may even change over time (if those DBs work as I believe them to), wacky as that seems; but I don't think this represents an intolerable conflict.
Finally, in the original proposal ' "Taxon" means an organism named in the taxon_name and taxon_identifier variables.' appears twice.
comment:3 Changed 8 years ago by jonathan
Dear John
I am sure you and Roy know more about the available taxonomic databases. If CF isn't going to provide its own, I think we should be explicit about which ones should be used, and it should be as few as possible. That is because, in the limiting case that every data provider used a different taxonomic database, the datasets would no longer be comparable. You wouldn't know whether Graybeal species number 94308 was the same as Lowry species number 612095, even if they did have the same species name, since the names are not regarded as reliable. So I don't think we ought to leave it open to the data writer to use any database they deem to be acceptable.
Ideally we would have only one external authority, but Roy says that is not sufficient, and suggests there are two. To maximise portability of data, I therefore suggested that it be recommended for both to be used. However, in some cases the species concerned will be in one but not the other. That is when there will be missing data in one of the auxiliary coordinates. For instance:
variables:
  int aphiaID(taxa);
    aphiaID:_FillValue=-1;
    aphiaID:standard_name="taxon_identifier";
  int tsn(taxa);
    tsn:_FillValue=0;
    tsn:standard_name="taxonomic_serial_number";
data:
  taxon_name="Homo sapiens", "Fraxinus excelsior", "Struthio camelus";
  aphiaID=1,32768,-1;
  tsn=42,0,7776;
In this entirely made-up example, F. excelsior appears in WoRMS but not ITIS, while S. camelus is in ITIS but not WoRMS, so there are missing data elements in the auxiliary coordinate variables.
I think if both are provided, as recommended, they should be consistent and it is an error if they are not. For example, TSN 42 might actually be Pan troglodytes rather than H. sapiens. This would be an error. If we just said, "let WoRMS take precedence", the purpose of providing TSN as well would be undermined. If we provide both and they are consistent, software with a preference for one of them can use that one. If they are not guaranteed to be consistent, you would get different results depending on which identifier you use.
Best wishes
Jonathan
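The fill-value convention in the made-up example of this comment can be read back by software as follows. This is only an illustrative Python sketch using plain lists in place of netCDF variables; the function name identifiers_present is hypothetical, and the fill values (-1 for aphiaID, 0 for tsn) and data are those of the example. Whether two identifiers given for the same taxon are actually consistent needs a lookup against WoRMS and ITIS, which this sketch does not attempt.

```python
def identifiers_present(names, aphia_ids, tsns, aphia_fill=-1, tsn_fill=0):
    """Map each taxon name to the identifiers that are not missing data."""
    if not (len(names) == len(aphia_ids) == len(tsns)):
        raise ValueError("all arrays must share the taxa dimension")
    result = {}
    for name, aphia, tsn in zip(names, aphia_ids, tsns):
        found = {}
        if aphia != aphia_fill:  # _FillValue marks "not in WoRMS"
            found["aphiaID"] = aphia
        if tsn != tsn_fill:  # _FillValue marks "not in ITIS"
            found["tsn"] = tsn
        result[name] = found
    return result


# Values copied from the entirely made-up example in this comment.
names = ["Homo sapiens", "Fraxinus excelsior", "Struthio camelus"]
table = identifiers_present(names, [1, 32768, -1], [42, 0, 7776])
for name, ids in table.items():
    print(name, ids)
```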
comment:4 Changed 8 years ago by graybeal
Thank you for the clarification. I think this is a critical detail worth additional attention.
The premise of semantic interoperability, based on extensive real-world experience, is that interoperability can not be achieved through constraining everyone to use one vocabulary (or two). There are any number of good reasons that the chosen vocabulary(ies) may not be sufficient. With semantics, this weakness is easily overcome by creating relations between vocabularies. In some cases those relations are precise and homomorphic; in other cases they are descriptive instead, but still powerful, and extensible with time.
So if you give me *one* taxonomic ID, I don't need a second one, or a name, or a verification of their relationship; I will have, external to CF, the tools and relations that tell me how those are related. (WORMS is a 'best practices' case of this; you can look up a matching entry in ITIS directly from the WORMS entry.) Trying to replicate this functionality makes CF more complex, at no value to CF, because the linked open data and semantic communities will take care of it much more robustly, and much less expensively for everyone.
Roy has said these are the two vocabularies, yes, and I am sure he knows more about them than I do. But when I asked a practicing biologist about ITIS, this was the answer I got: "Species 2000 was an umbrella group that combined ITIS and other sources. SP2000 provides LSIDs (including for ITIS names)." When I looked at Species 2000, it indicated records are harvested from 3 ITIS databases, a large number of WORMS databases, and about 100 others. From this I conclude that ITIS and WORMS are indeed very valuable and credible, and there are many other sources of taxonomic data that are valuable and credible.
This illustrates a clear choice. We can attempt to collect and summarize the combined wisdom of the biological-technical community to keep abreast over time about which of these taxonomic databases are necessary and sufficient for CF. Or we can defer that judgment to the ongoing efforts in the community, which seem likely to continue to be ongoing. Certainly if we want to encourage interoperability, recommending the most prominent (WORMS, ITIS, SP2000, ?) as a good source of taxa and their unique identifiers seems sensible.
comment:5 Changed 8 years ago by lowry
Thanks Jonathan for your constructive feedback, with which I have no issues. I did initially consider a variable for each id, but felt my suggestion in the example would be more acceptable.
As to John's point I strongly feel that a name given should be from an authoritative source and that source should provide an identifier that does something useful. I also strongly feel that acceptable sources should be named and that WoRMS and ITIS should be included as both of these offer a governance that provides a mechanism for verifying and adding proposals for new names. However, I don't believe that the list of two should be prescriptive. Should the community require it then obviously the list could be extended.
Whether S2000 should be added depends on whether LSIDs are in active use in the community likely to use CF. In my community (SeaDataNet) they aren't - we use aphiaIDs, but common sense has prevailed and the aphiaID has been incorporated into the LSID. For example, my favourite copepod C. finmarchicus has an LSID of urn:lsid:marinespecies.org:taxname:104464 and an aphiaID of 104464. So, the list could be extended as and when required, but at the moment I don't feel anything is gained by adding S2000.
comment:6 Changed 8 years ago by painter1
I'm sorry to bother you - this is just a test. Google seems to be rejecting messages as spam when they originate in the CF Trac system.
- Jeff Painter
comment:7 Changed 8 years ago by painter1
Once again I apologize for having to send a test message.
- Jeff Painter
comment:8 Changed 8 years ago by painter1
This is another test message, hopefully the last one.
- Jeff Painter
comment:9 Changed 7 years ago by graybeal
I see this ticket, on Taxon Names and Identifiers, has not been addressed since the original discussion over a year ago.
I think it is most important that the ticket move forward. Though Roy's team may have moved on, this problem will need to be addressed in CF sooner or later. While only Roy, Jonathan, and I have discussed it, I suspect many CF lurkers have need for this capability.
The following issues seem acceptably resolved:
- promoting 6.1.1 on "Geographic regions" to 6.3 (i.e. remove it from 6.1), and adding Roy's as 6.4. Then 6.1 and 6.2 will describe mechanisms in CF, and 6.3 and 6.4 applications of these mechanisms.
- Initial text rewording by Jonathan: "A taxon is a named level within a biological classification, such as a class, genus and species. Quantities dependent on taxa have generic standard_names containing the word taxon, and the taxa are identified by auxiliary coordinate variables."
- Requiring name and identifier is reasonable (to make the description self-contained).
The following questions are open:
- How many identifier/sources if multiple are available? Roy suggested 1, Jonathan recommends 2, John suggests user's choice.
- How many sources? Roy suggested 2 (extensible), John says CF should not limit (and if it does, the 2 suggested are not the best 2).
- What kind of identifier? Roy suggested namespace + ':' + local text ID; Jonathan proposed (agreeable to Roy) separate int variables for WORMS aphia ID vs ITIS taxonomic serial number; and John prefers globally unique identifiers, LSIDs being the common practice (not offered directly by ITIS, only indirectly through Catalog of Life). In Jonathan's scheme each ID type would have a separate int variable, dimensioned to the number of taxa being defined.
(Incidentally, http://www.jbiomedsem.com/content/2/1/7 provides a detailed analysis of the Catalog of Life identifier approach, which integrates the data from ITIS, WORMS, and Species 2000, among many others, and includes thoughts of why the CoL approach wasn't more widely adopted (at that time anyway). Another extended discussion at http://soyouthinkyoucandigitize.wordpress.com/2013/01/28/what-gets-linked-to-global-unique-identifiers-guids-in-natural-history-collection-digitization/. The point is that while going round and round is definitely possible, I want to cleanly account for more than what a specific part of the CF community does today, if we can.)
Looking for a common path, the following seems pretty close:
- Support multiple identifier sources; specifying those to be provided _if available_
- if it isn't available in ITIS or WORMS, it should still be citable
- if the user always uses WORMS, we should not force them to translate to ITIS, and vice versa
- While I happen to think Catalog of Life is more suitable than ITIS, I'll forego the argument as long as we aren't exclusive
- Use Jonathan's proposed approach for WORMS and ITIS, but allow extension to other sources (e.g., CoL) through other globally unique identifiers. Any globally unique identifier would be given the standard name taxon_global_identifier, and could be text (which most will be) or int (for UUIDs, for example)
- The comparability of identifiers A to B to C etc. will inevitably be done at a domain-specific application level, well beyond the concern of CF (but readily achievable by domain experts)
- It won't be necessary to define unique identifier types for each source, since globally unique identifiers are by their nature distinguishable and uniquely relatable to their source
- If we accept this adjustment, we don't have to argue on the merits whether Catalog of Life is better than ITIS (not so much because of LSIDs, but because it includes many more sources than just ITIS).
So this might give us the following example:
variables:
  int aphiaID(taxa);
    aphiaID:_FillValue=-1;
    aphiaID:standard_name="taxon_identifier";
  int tsn(taxa);
    tsn:_FillValue=0;
    tsn:standard_name="taxonomic_serial_number";
  string col(taxa);
    col:_FillValue="null";
    col:standard_name="taxon_global_identifier";
    col:comment="LSID from Catalog of Life";
data:
  taxon_name="Homo sapiens", "Fraxinus excelsior", "Struthio camelus";
  aphiaID=1,32768,-1;
  tsn=42,0,7776;
  col="urn:lsid:catalogueoflife.org:taxon:f33e0fe1-ac8e-11e3-805d-020044200006:col20140401",
      "urn:lsid:catalogueoflife.org:taxon:0ad7462a-ac8f-11e3-805d-020044200006:col20140401",
      "urn:lsid:catalogueoflife.org:taxon:ebff2886-ac8e-11e3-805d-020044200006:col20140401";
comment:10 Changed 7 years ago by jonathan
Dear John
Thank you for moving this forward. I am not an expert and defer to you and Roy, but what you propose seems fine to me. I think the text for the standard would need to be more specific about the content of a variable containing taxon global identifiers. Wouldn't it have to be string data, in order to be able to say (as a URN) what sort of identifier it is, as in your example? Can it be required to be a URN? If the col variable is an array of strings, following the netCDF classic model (as we do so far in CF), it should be a 2D char array.
I note with sadness that F. excelsior is currently threatened by a nasty disease in England.
Jonathan
comment:11 follow-up: ↓ 12 Changed 7 years ago by graybeal
This from Roy via the CF list 2014.04.23:
Dear John and Jonathan,
Resurrecting Trac ticket 99 has been on my Todo list for some time - thanks John. I'm not in the office until Tuesday and my Trac login credentials are on my work PC. Hence this reply via the normal list.
Since the last correspondence on this ticket, SeaDataNet have adopted an LSID URI syntax incorporating the AphiaID, driven by standards from the OBIS community for identifying taxa. It would make a lot of sense to adopt this in CF. I don't have the details to hand, but if anyone from VLIZ or ICES is watching this thread maybe they could provide the precise syntax of this URI. If not, I'll dig it out next week. As Jonathan says, this will need to be a string array.
John - are you able to write draft text for inclusion in the CF documentation?
Cheers, Roy.
comment:12 in reply to: ↑ 11 Changed 7 years ago by graybeal
Replying to graybeal:
This from Roy via the CF list 2014.04.23: John - are you able to write draft text for inclusion in the CF documentation?
In theory, yes. I am having trouble making the time right now but will try to get to it.
Meanwhile, we got another request on the list that involved chemical names and organism parts. So far CF tends to include chemical names in the standard name, but if the number of amount_of_<substance>_in_<body part> combinations is high I wonder if that is not another candidate for the same treatment. Something I'll keep in the back of my mind if I start writing a draft.
comment:13 Changed 3 years ago by martin.juckes
Hello, I've come to this for the first time today, so apologies if I have missed some points already discussed. This contribution is motivated by an email by Roy to the CF list. The general approach looks good, but I'd like to propose a few modifications.
(1) The implied constraint relating taxon_name and taxon_identifier in the current proposal looks untidy to me (the relationship is implied by the fact that they both occur as coordinates of abundance). I suggest a slight modification to place the identifier as a coordinate of the taxon_name array, making the taxon_name array a self-contained structure.
(2) Taxonomy is a broader science, but "taxon" is a term only used widely in the biological sciences. The proposed term is well defined, but I think it would be clearer if the biological nature of the taxonomy was explicitly stated in the standard name: number_concentration_of_biological_taxon_in_sea_water
(3) There should be some reference to the classification system being used, and perhaps that should be restricted to a small collection of approved systems. This would then be another coordinate with a standard name such as "reference_classification":
float abundance(time,taxon);
  abundance:standard_name="number_concentration_of_biological_taxon_in_sea_water";
  abundance:coordinates="name";
char name(taxon,string80);
  name:standard_name="taxon_name";
  name:coordinates="identifier classification";
char classification(string80);
  classification:standard_name="reference_classification";
char identifier(taxon,string80);
  identifier:standard_name="taxon_identifier";
cheers, Martin
comment:14 Changed 3 years ago by lowry
Dear Martin,
This work is coming back to life after lying fallow for four years. I hope to finish the job this time.
Regarding your comments:
1) Any improvements on the simple idea of two arrays of dimension taxon that are co-ordinates of abundance are most welcome. I don't see any problem with your suggestion. If anybody else sees any issues please holler.
2) I plan to put in a Standard Names proposal in a week or so. Before then I propose to consult with some people in the OBIS community as to whether 'taxon' is the appropriate term to include in the Standard Name and if so how it should be defined. I suggest we defer this discussion until I put in the proposal including that input.
3) LSIDs are governed URNs with the format:
urn:lsid:<Authority>:<Namespace>:<ObjectID>[:<Version>]
This includes the reference classification in the <Authority> element and these are restricted by the LSID governance, which is well respected. Your suggestion restricts a given abundance data set to a single reference classification, but some datasets could require the coverage of both WoRMS and ITIS. Using LSIDs permits this. Consequently, I disagree with you on this point.
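To illustrate how the <Authority> element can be recovered from an LSID of the form given above, here is a small Python sketch. It is illustrative only: the names LSID and parse_lsid are hypothetical, the parser simply splits on colons and checks the "urn:lsid" prefix, and it does not contact any resolver or check the authority against LSID governance. The sample identifiers are ones quoted earlier in this ticket.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str] = None


def parse_lsid(urn):
    """Split urn:lsid:<Authority>:<Namespace>:<ObjectID>[:<Version>]."""
    parts = urn.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"wrong number of fields in {urn!r}")
    if [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID URN: {urn!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty field in {urn!r}")
    version = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], version)


worms = parse_lsid("urn:lsid:marinespecies.org:taxname:104464")
col = parse_lsid(
    "urn:lsid:catalogueoflife.org:taxon:"
    "f33e0fe1-ac8e-11e3-805d-020044200006:col20140401"
)
print(worms.authority, worms.object_id)
print(col.authority, col.version)
```

Because the authority travels inside each identifier, one coordinate variable can mix WoRMS and ITIS entries without a separate classification variable.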
comment:15 Changed 3 years ago by martin.juckes
Dear Roy,
(1) Thanks (2) OK (3) Sorry, I hadn't understood all the implications of the LSID discussion. I'll wait to see your revised proposal,
regards, Martin
comment:16 Changed 3 years ago by jonathan
Dear Roy and Martin
Regarding point (1) above, I think the initial proposal is consistent with CF. taxon_identifier and taxon_name are both string-valued auxiliary coordinate variables of the data variable of abundance, with the same single dimension taxon, which implicitly associates them. Martin's suggestion has the coordinates attribute on name, which is an auxiliary coordinate variable. That isn't consistent with CF, since only data variables have this attribute, so it would require a new convention and implies a change to the CF data model. I think that's a problem and would favour the initial proposal.
Cheers
Jonathan
comment:17 Changed 3 years ago by martin.juckes
Dear Jonathan,
I don't want to hold up the discussion on this point, but I still believe that an explicit link between taxon_identifier and taxon_name would be an improvement on the implicit link which the current proposal provides. The CF convention states that a "coordinate may be represented as a scalar variable (i.e. a data variable which has no netCDF dimensions)", which implies that a data variable may be a coordinate variable -- I could not find anything in the convention text relating to the assertion that a coordinate variable cannot be a data variable, but this may be due to the fact that I have not looked in the right place. Is there anything in the text of the convention to say that a coordinate variable is not a data variable?
Cheers, Martin
comment:18 Changed 3 years ago by jonathan
Dear Martin
You're right, a coordinate variable of one data variable could be a data variable in its own right. However the CF convention is centred on the idea of "data variable", and as far as Roy's data variable is concerned the variables named by its coordinates attribute are coordinate variables. Software based on the CF convention would not expect them to have a coordinates attribute of their own, so it would not follow it to find more coordinate variables. A coordinates attribute of a coordinate variable has no defined meaning in CF - it's not illegal, but not part of the CF convention.
Another point is that in my mind taxon_identifier and taxon_name are of equivalent status, so it seems to me they should be linked to the data variable in the same way.
Best wishes
Jonathan
comment:19 Changed 3 years ago by lowry
Draft text for CF Conventions to complete this ticket
Taxon names and identifiers
A taxon is a named level within a biological classification, such as a class, genus and species. Quantities dependent on taxa have generic standard_names containing the phrase biological_taxon, and the taxa are identified by auxiliary coordinate variables. The taxon co-ordinate variables consist of a plain language name (biological_taxon_name) plus one or more identifiers referring to internet resources from sources agreed by discussion on the CF list. The currently accepted identifiers and their Standard Names are:
- Life Science Identifier (LSID): biological_taxon_lsid. This is a URN with the syntax urn:lsid:<Authority>:<Namespace>:<ObjectID>[:<Version>]. This includes the reference classification in the <Authority> element and these are restricted by the LSID governance. It is strongly recommended in CF that the authority chosen is World Register of Marine Species (WoRMS) for oceanographic data and Integrated Taxonomic Information System (ITIS) for freshwater and terrestrial data. WoRMS LSIDs are built from the AphiaID such as urn:lsid:marinespecies.org:taxname:104464 for AphiaID 104464. This may be converted to a URL by adding prefixes such as http://lsid.twg.org/. ITIS LSIDs are built from the TSN, such as urn:lsid:itis.gov:itis_tsn:180543.
It is an error if the biological_taxon_name does not agree with the name resolved from the biological_taxon_lsid or other accepted identifier. Missing data can be given for any taxon which doesn't have an identifier.
A skeleton example for taxonomic abundance time series is:
time=100; string80=80; taxon=2;
variables:
float time (time) time:standard_name="time"
float abundance(time,taxon); abundance:standard_name="number_concentration_of_biological_taxon_in_sea_water"; abundance:coordinates="taxon_lsid taxon_name";
char taxon_name(taxon,string80); taxon_name:standard_name="biological_taxon_name";
char taxon_lsid(taxon,string80); taxon_lsid:standard_name="biological_taxon_lsid";
data;
time = ……100 values
abundance = ….200 values
taxon_name = "Calanus finmarchicus", "Calanus helgolandicus"
taxon_lsid = "urn:lsid:marinespecies.org:taxname:104464", "urn:lsid:marinespecies.org:taxname:104466"
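The draft text says WoRMS LSIDs are built from the AphiaID and ITIS LSIDs from the TSN. A minimal Python sketch of that construction follows; the function names worms_lsid and itis_lsid are hypothetical, and the URN patterns are taken from the two examples quoted in the draft rather than from the WoRMS or ITIS documentation.

```python
def _positive_int(value, label):
    """Reject anything that is not a positive whole number."""
    number = int(value)
    if number <= 0:
        raise ValueError(f"{label} must be positive, got {value!r}")
    return number


def worms_lsid(aphia_id):
    # Pattern assumed from urn:lsid:marinespecies.org:taxname:104464
    return f"urn:lsid:marinespecies.org:taxname:{_positive_int(aphia_id, 'AphiaID')}"


def itis_lsid(tsn):
    # Pattern assumed from urn:lsid:itis.gov:itis_tsn:180543
    return f"urn:lsid:itis.gov:itis_tsn:{_positive_int(tsn, 'TSN')}"


print(worms_lsid(104464))
print(itis_lsid(180543))
```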
comment:20 Changed 3 years ago by jonathan
Dear Roy
Thanks for the text.
- What is the proposed number of this new section?
- In the email list, I suggest that organisms_from_biological_taxon would be a better phrase than biological_taxon in the standard names.
- Since there is only one bullet, I think it would be better not to have a list. If another kind of identifier is allowed later, the text can be modified to introduce a list. At the moment, I would suggest
The taxon auxiliary coordinate variables are string-valued. The plain-language name of the taxon may be contained in a variable with standard_name of biological_taxon_name. A Life Science Identifier may be contained in a variable with standard_name of biological_taxon_lsid. This is a URN ..."
- I'm not clear whether one or both of these must be present. We should clarify this.
- The example doesn't appear to be in correct CDL - at least, not the dialect we use in other examples, which have only one statement per line, and every line ends with ;.
Best wishes
Jonathan
comment:21 Changed 2 years ago by lowry
Hopefully the final rewrite of ticket 99 to address Jonathan’s comments. Only issue where there is still a difference of opinion is the positioning of the section in the document.
Jonathan Gregory proposed that we should tidy the CF document by promoting 6.1.1 on "Geographic regions" to 6.3 (i.e. remove it from 6.1), and adding yours as 6.4. Then 6.1 and 6.2 will describe mechanisms in CF, and 6.3 and 6.4 applications of these mechanisms.
My interpretation is that this section is an application of the ‘Alternative co-ordinates’ CF mechanism (Section 6.2) in the same way as ‘Geographic Regions’ are an application of the ‘Labels’ CF mechanism (Section 6.1). Example 6.3. (Model level numbers) is also an application of ‘Alternative co-ordinates’. Consequently, I would make Example 6.3 Section 6.1.1 and the following ‘Taxon names and identifiers’ text Section 6.1.2.'
6.1.2. Taxon names and identifiers
A taxon is a named level within a biological classification, such as a class, genus and species. Quantities dependent on taxa have generic standard_names containing the phrase organisms_in_taxon, and the taxa are identified by auxiliary coordinate variables.
The taxon auxiliary coordinate variables are string-valued. The plain-language name of the taxon must be contained in a variable with standard_name of biological_taxon_name. A Life Science Identifier may be contained in a variable with standard_name of biological_taxon_lsid. This is a URN with the syntax urn:lsid:<Authority>:<Namespace>:<ObjectID>[:<Version>]. This includes the reference classification in the <Authority> element and these are restricted by the LSID governance. It is strongly recommended in CF that the authority chosen is World Register of Marine Species (WoRMS) for oceanographic data and Integrated Taxonomic Information System (ITIS) for freshwater and terrestrial data. WoRMS LSIDs are built from the AphiaID such as urn:lsid:marinespecies.org:taxname:104464 for AphiaID 104464. This may be converted to a URL by adding prefixes such as http://lsid.twg.org/. ITIS LSIDs are built from the TSN, such as urn:lsid:itis.gov:itis_tsn:180543.
The biological_taxon_name co-ordinate included for human readability is mandatory. The biological_taxon_lsid co-ordinate included for software agent readability is optional, but strongly recommended. If both are present the biological_taxon_name must match the name resolved from the biological_taxon_lsid exactly. If LSIDs are available for some taxa in a dataset then the biological_taxon_lsid co-ordinate should be included and missing data given for those taxa that do not have an identifier.
A skeleton example for taxonomic abundance time series is:
dimensions:
  time=100;
  string80=80;
  taxon=2;
variables:
  float time(time);
    time:standard_name="time";
  float abundance(time,taxon);
    abundance:standard_name="number_concentration_of_organisms_in_taxon_in_sea_water";
    abundance:coordinates="taxon_lsid taxon_name";
  char taxon_name(taxon,string80);
    taxon_name:standard_name="biological_taxon_name";
  char taxon_lsid(taxon,string80);
    taxon_lsid:standard_name="biological_taxon_lsid";
data:
  time = ……100 values;
  abundance = ….200 values;
  taxon_name = "Calanus finmarchicus", "Calanus helgolandicus";
  taxon_lsid = "urn:lsid:marinespecies.org:taxname:104464", "urn:lsid:marinespecies.org:taxname:104466";
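The presence rules stated in this draft (biological_taxon_name mandatory, biological_taxon_lsid optional, missing data allowed for taxa without an identifier) could be checked by a file validator roughly as follows. This Python sketch is illustrative only: the function name check_taxon_coordinates is hypothetical, the coordinate values are passed as plain lists of strings, and it assumes that missing data in a char variable reads back as an empty string. It does not resolve LSIDs, so the requirement that the name matches the resolved name exactly is not tested.

```python
def check_taxon_coordinates(names, lsids=None):
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    for i, name in enumerate(names):
        if not name.strip():
            problems.append(f"taxon {i}: biological_taxon_name is mandatory")
    if lsids is None:  # the LSID coordinate is optional
        return problems
    if len(lsids) != len(names):
        problems.append("biological_taxon_lsid does not span the taxon dimension")
        return problems
    for i, lsid in enumerate(lsids):
        text = lsid.strip()
        if text == "":  # assumption: empty string stands for missing data
            continue
        if not text.lower().startswith("urn:lsid:"):
            problems.append(f"taxon {i}: {lsid!r} is not an LSID URN")
    return problems


names = ["Calanus finmarchicus", "Calanus helgolandicus"]
lsids = [
    "urn:lsid:marinespecies.org:taxname:104464",
    "urn:lsid:marinespecies.org:taxname:104466",
]
print(check_taxon_coordinates(names, lsids))
```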
comment:22 Changed 2 years ago by jonathan
Dear Roy
Thanks. That looks fine to me! Am I right that you propose the order of sections to be
6 Labels and alternative coordinates
6.1 Labels
6.1.1 Geographic regions
6.1.2 Taxon names and identifiers
6.2 Alternative coordinates
Best wishes
Jonathan
comment:23 Changed 2 years ago by lowry
Thanks Jonathan,
Sorting me out yet again - I knew I'd get a typo (or two) in there somewhere!!!
What I intended to suggest is:
6 Labels and alternative coordinates
6.1 Labels
6.1.1 Geographic regions
6.2 Alternative co-ordinates
6.2.1 Model level numbers
6.2.2 Taxon names and identifiers
Hope that now makes more sense.
comment:24 Changed 2 years ago by jonathan
Dear Roy
Ah, I see. Thanks!
I've never liked 6.1 having one subsection. It makes me uncomfortable. Although it's nothing to do with this proposal, I thought we might take advantage of the opportunity to change it. Since labels and alternative coordinates are very similar concepts, would it look OK to take the para which introduces the current 6.2 "In some situations a dimension may have alternative sets ..." and put it as a third para to the preamble of the current 6.1. Then we could eliminate one of the levels:
6 Labels and alternative coordinates [with preamble introducing both of them in three paras]
6.1 Geographic regions
6.2 Model level numbers
6.3 Taxon names and identifiers
How would that be? If you think this would derail the conclusion of this ticket, I'll delay it to another time.
Jonathan
comment:25 Changed 2 years ago by lowry
I don't have strong feelings either way, but feel that others might. I would be much happier getting the new section added and closing this ticket to get clear of Trac. How about making your suggested restructuring a GitHub ticket once this is all done and dusted?
comment:26 Changed 2 years ago by jonathan
Yes, that's perfectly fine. We will leave it as you propose in this ticket. Jonathan
Dear Roy
Thanks for making this proposal. You provide good arguments in support of doing it this way. I think that quite a lot of the above text is actually the arguments in support of making the change, and we will not need to include it all in the CF standard document. I would suggest that the text might be
Then go on to describe the conventions for names and IDs, and give the example(s).
You compare this proposal to geographic regions, and I agree with that, but it's a bit more complicated because of the alternative sets of labels. I propose that we should tidy the CF document by promoting 6.1.1 on "Geographic regions" to 6.3 (i.e. remove it from 6.1), and adding yours as 6.4. Then 6.1 and 6.2 will describe mechanisms in CF, and 6.3 and 6.4 applications of these mechanisms.
I am still concerned about the possibility for confusion in identification of taxa, but I accept that we have to work with the best there is! I'd like to suggest something a bit more demanding, however:
Thus, I would put the two sorts of identifier into separate auxiliary coordinate variables. Since they have different standard names, they wouldn't need the namespace identifier, and the values can be integers, rather than strings. Missing data can be given for any taxon which doesn't have an identifier.
What do you think?
Best wishes
Jonathan