On January 8, 2020, Plazi had the good luck and honor to be the mediator of GBIF's 50,000th dataset. This was predictable to some degree: Plazi has deposited over 28K of the 50K datasets on GBIF, so the chances were over 56%.
This dataset is representative of a rapidly growing corpus of datasets that are not based on collections in the original sense, that is, specimens in a natural history museum. Instead, they consist of materials citations liberated from taxonomic treatments, a treatment being the part of a taxonomic publication that covers a specific taxon. In other words, they are not lists of collected physical specimens but rather, like image-based observations, digital representations of physical specimens. A materials citation is a reference to a physical specimen, much like a bibliographic reference to an article. As such, materials citations are the logical, albeit mostly missing, link between a physical specimen and the facts a scientist has discovered about it. These can include textual facts, tables, figures and implicit references to external facts, physical specimens or other treatments. Even today, almost all treatments are an artifact of print media, in which all links are implicit and range from very cryptic to verbose. A holotype citation, for example, refers to a holotype by its label data, not by a persistent identifier that would allow linking, as a digital object identifier (DOI) does for an article.
Having the materials citations in GBIF is a great resource. Each one links not only to a copy of the article, all too often closed access, but also to the fully open taxonomic treatment data provided by the respective author. The citations are most valuable when they are linked to the records of the physical specimens they cite. The number of such records is growing rapidly through large-scale initiatives like the European DiSSCo, the US-based iDigBio or Global Plants, complemented by digitization activities at natural history museums. Increasingly, GBIF holds both the specimen occurrence and the materials citation, deposited independently, so only the connection between the two (or, increasingly, among many more records) is missing. The persistent identifier of the specimen will play a decisive role in this endeavor.
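To illustrate why the persistent identifier matters, here is a minimal sketch of linking materials citations to specimen records. It is not Plazi's or GBIF's actual data model: the class names, field names and the `link_citations` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SpecimenOccurrence:
    """A digitized museum specimen record (hypothetical, simplified fields)."""
    specimen_id: str       # persistent identifier of the physical specimen
    institution: str
    scientific_name: str


@dataclass(frozen=True)
class MaterialsCitation:
    """A specimen citation extracted from a treatment (hypothetical fields)."""
    treatment_id: str      # identifier of the treatment citing the specimen
    cited_name: str
    label_text: str        # label data as printed in the publication
    specimen_id: Optional[str] = None  # usually absent in legacy literature


def link_citations(
    citations: List[MaterialsCitation],
    specimens: List[SpecimenOccurrence],
) -> Tuple[List[Tuple[MaterialsCitation, SpecimenOccurrence]], List[MaterialsCitation]]:
    """Pair each citation with its specimen record when the identifiers match.

    Returns (linked, unlinked). Citations without an identifier, or whose
    identifier matches no specimen record, end up in `unlinked`.
    """
    by_id = {s.specimen_id: s for s in specimens}
    linked, unlinked = [], []
    for citation in citations:
        specimen = by_id.get(citation.specimen_id) if citation.specimen_id else None
        if specimen is not None:
            linked.append((citation, specimen))
        else:
            unlinked.append(citation)
    return linked, unlinked
```

A citation that carries only label text lands in the unlinked list, which is the situation for most legacy literature: without an identifier, the match has to be made by hand from label data.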
This dataset is a very nice showcase: it lends itself to telling the story behind it and showing how it forms part of the research data life cycle. What follows is the story of how it came about.
Jeremy Miller from Naturalis and Torsten Dikow from the Smithsonian, along with their teams, decided to create another cybercatalog similar to the cybercatalog of Afrotropical apiocerid flies by Dikow and Agosti, this time covering the lestoid damselflies. They were inspired by the Smithsonian's odonate collection, one of the world's largest, complemented by the one at Naturalis. The term "cybercatalog" is used here to denote a taxonomic catalog that points the reader to taxon-specific, open-access information on the World Wide Web.
How has this been done?
GBIF imports the dataset and uses the data to create various views of it, from an overview of the entire dataset, through subsets of accepted names or occurrences, down to individual accepted names and occurrences.
Together with GBIF, the services provided give different stakeholders access to this data. All the data is open access, can be cited in different ways, can be downloaded in various formats (such as HTML, valid XHTML, generic RDF, JSON and Plazi generic XML) and has appropriate licenses attached to it for legal clarity.
With this article in hand, Jeremy and colleagues have one more piece of the puzzle in place toward creating their envisioned cybercatalog. In addition, GBIF gains new names (e.g. Lestes wallacei) for its taxonomic backbone, as eventually will the COL, which currently does not include the original combination.
What have we learned?
Taxonomic publications are very rich in citations. Almost all of them, however, are implicit and cryptic, and thus become available and actionable only with considerable manual effort, domain knowledge and access to libraries and online subscriptions. No wonder that we do not know what we know about biodiversity.
Looking at this article specifically, the original name Lestes wallacei does not appear in any of the databases (GBIF, COL, NHM); the species is listed only as Orolestes wallacei.
From a cataloging point of view, we still cannot trace a path from Orolestes wallacei to the original description via the treatment in which the name change was introduced.
The NHM London has not yet published the specimen data for the type to GBIF, although the record is available under Orolestes wallacei in its own system.
This and the other 28K datasets in GBIF that are based on taxonomic publications show the potential, but also the current limitations, of this process.
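A name check like the one described above for Lestes wallacei can be repeated against GBIF's public species-match service. The following is a minimal sketch, not the procedure used for this article. The helper names `build_match_url`, `match_name` and `summarize` are hypothetical, and the response fields `matchType`, `usageKey`, `scientificName` and `status` are assumed from the GBIF API's documented response format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GBIF endpoint that matches a name against the taxonomic backbone.
GBIF_MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def build_match_url(name: str) -> str:
    """Build the species-match URL for a scientific name."""
    return f"{GBIF_MATCH_ENDPOINT}?{urlencode({'name': name})}"


def match_name(name: str, timeout: float = 10.0) -> dict:
    """Query GBIF and return the decoded JSON response (needs network access)."""
    with urlopen(build_match_url(name), timeout=timeout) as response:
        return json.load(response)


def summarize(match: dict) -> str:
    """Turn a match response into a one-line summary.

    Assumption: a response without `usageKey`, or with matchType "NONE",
    means the backbone has no entry for the name.
    """
    if match.get("matchType") == "NONE" or "usageKey" not in match:
        return "no match in the backbone"
    return (
        f"{match.get('scientificName')} "
        f"({match.get('status')}, {match.get('matchType')})"
    )
```

Calling `summarize(match_name("Lestes wallacei"))` and `summarize(match_name("Orolestes wallacei"))` would compare the two names. Since the backbone is updated over time, the results may no longer match what is reported here.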
What should be done?
The primary and most important lesson is that we need to change the way we publish taxonomic works, moving to a workflow that creates immediately actionable data, as proven by its reuse in GBIF. This is already implemented by the Biodiversity Data Journal, enabled by its use of TaxPub/JATS.
The second lesson is that we need to create and enhance workflows, as automated as possible, for finding, converting and linking the data imprisoned in legacy publications and making it actionable. As is obvious from the above description, this is a real community effort and a daunting task, given the rapidly growing, largely untapped corpus of taxonomic literature, estimated at 500 million published pages.