26.01.2020 08:07

50,000th GBIF dataset: a brief history


The 50,000th dataset (data) published at GBIF, including the data liberated from the publication by Kirby from 1889 with the description of Lestes wallacei from Sarawak, in honor of its collector Alfred Russel Wallace.

On January 8, 2020, Plazi had the good luck and honor to be the mediator of GBIF's 50,000th dataset. This was predictable to some degree, since the chances were over 56%: Plazi has deposited more than 28K of the 50K datasets on GBIF.

This dataset is representative of a rapidly growing corpus of datasets that are not based on collections in the original sense, that is, specimens in a natural history museum. Instead, they are materials citations liberated from taxonomic treatments, the parts of a taxonomic publication that each cover a specific taxon. In other words, they are not lists of collected physical specimens but rather, like image-based observations, digital representations of physical specimens. A materials citation is a reference to a physical specimen, much like a bibliographic reference to an article. As such, materials citations are the logical, albeit mostly missing, link between a physical specimen and the facts a scientist has discovered about it. These facts can include text, tables, figures and implicit references to external facts, physical specimens or other treatments. Even today, almost all treatments are an artifact of print media, in which all links are implicit and range from very cryptic to verbose. A holotype citation, for example, refers to a holotype by its label data, not by a persistent identifier that allows linking, such as the digital object identifier (DOI) of an article.
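As a minimal sketch of what a materials citation might look like as structured data, the Python snippet below models the Lestes wallacei example. The field names scientificName, locality and recordedBy follow Darwin Core terms; citedIn and specimenIdentifier are hypothetical names chosen for this illustration and are not taken from TreatmentBank or GBIF.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class MaterialsCitation:
    """A reference to a physical specimen, as cited in a taxonomic treatment."""

    scientificName: str  # Darwin Core term
    locality: str        # Darwin Core term
    recordedBy: str      # Darwin Core term: the collector
    citedIn: str         # hypothetical field: the publication citing the specimen
    # Hypothetical field: persistent identifier of the specimen, once it is known.
    specimenIdentifier: Optional[str] = None

    def is_linked(self) -> bool:
        """Return True once the citation points to a specimen by identifier."""
        return self.specimenIdentifier is not None


# Values taken from the example in the text.
citation = MaterialsCitation(
    scientificName="Lestes wallacei",
    locality="Sarawak",
    recordedBy="Alfred Russel Wallace",
    citedIn="Kirby, 1889",
)

print(asdict(citation))
print(citation.is_linked())  # False: no persistent specimen identifier yet
```

The empty specimenIdentifier stands for the missing link described above: the citation describes the specimen but cannot yet point to it.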

Having the materials citations in GBIF is a great resource. They link not only to a copy of the article, all too often closed access, but also to the fully open taxonomic treatment data provided by the respective author. They are most valuable when linked to records of the physical specimens they cite. The number of such records is growing rapidly through large-scale initiatives like the European DiSSCo, the US-based iDigBio or Global Plants, complemented by digitization activities at natural history museums. Increasingly, GBIF holds both the specimen occurrence and the materials citation, deposited independently, so that only the connection between the two (or, increasingly, many others) is missing. The persistent identifier of the specimen will play a decisive role in this endeavor.


This dataset is a very nice showcase that lends itself to telling the story behind it and how it is part of the research data life cycle. The following is the story of how it came about.


Jeremy Miller from Naturalis and Torsten Dikow from the Smithsonian, along with their teams, decided to create another cybercatalog similar to the cybercatalog of Afrotropical apiocerid flies by Dikow and Agosti, this time covering the lestoid damselflies. They were inspired by the Smithsonian's odonate collection, one of the world's largest, complemented by the one at Naturalis. The term "cybercatalog" is used here to denote a taxonomic catalog pointing the reader to taxon-specific, open-access information on the World Wide Web.

How has this been done?

  • Identifying starting points. For damselflies (Odonata), catalogue rudiments already exist in the Catalogue of Life (COL) and other publications, and they provide seed references. In the case of the COL reference database cited for Odonata, the link has not worked since 2017, and neither does its snapshot in the Wayback Machine at the Internet Archive.
  • Deciphering and completing the often very rudimentary bibliographic citations, for example expanding "Kirby, 1889" to "Kirby, W. F., 1889. Descriptions of new genera and species of Odonata in the collection of the British Museum, chiefly from Africa. Proceedings of the Zoological Society of London: 1889: 297-303".
  • Finding the respective publications, or at least digital copies of the journal, in this case in the Biodiversity Heritage Library.
  • Getting the article and, if it is based on scans of the respective pages, running optical character recognition (OCR) on it to get a clean, improved PDF that can be processed as completely as possible.
  • Decoding and analyzing the PDF to create a copy including words, figures and text stream.
  • Liberating the data and semantically enhancing it.
  • Uploading the article to TreatmentBank.
  • Creating deposits of the treatments and figures, that is, making them open, FAIR data via the Biodiversity Literature Repository (BLR).
  • Running a quality control check and fixing errors using pre-defined criteria.
  • Manually finding the type specimen and annotating the materials citation by searching the respective specimen database at the Natural History Museum London, where the Wallace collection is housed.
  • Alerting GBIF to the upload to TreatmentBank so that GBIF can import it as a new Darwin Core Archive. GBIF re-imports it after changes to the article have been made.
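The citation-deciphering step in the list above can be illustrated with a minimal Python sketch. It assumes that short citations follow an "Author, Year" pattern and that a table of seed references is available. The names SEED_REFERENCES, parse_short_citation and expand_citation are hypothetical and do not describe Plazi's actual tooling; the single table entry is the example from the text.

```python
import re
from typing import Optional, Tuple

# Hypothetical seed table keyed by (author, year); the entry is the example above.
SEED_REFERENCES = {
    ("Kirby", 1889): (
        "Kirby, W. F., 1889. Descriptions of new genera and species of Odonata "
        "in the collection of the British Museum, chiefly from Africa. "
        "Proceedings of the Zoological Society of London: 1889: 297-303"
    ),
}

# Matches rudimentary citations such as "Kirby, 1889" or "Kirby 1889".
SHORT_CITATION = re.compile(r"^\s*(?P<author>[^,]+?),?\s+(?P<year>\d{4})\s*$")


def parse_short_citation(text: str) -> Optional[Tuple[str, int]]:
    """Split a short citation into (author, year), or return None if it does not fit."""
    match = SHORT_CITATION.match(text)
    if match is None:
        return None
    return match.group("author").strip(), int(match.group("year"))


def expand_citation(text: str) -> Optional[str]:
    """Look up the full reference for a short citation in the seed table."""
    key = parse_short_citation(text)
    if key is None:
        return None
    return SEED_REFERENCES.get(key)


print(parse_short_citation("Kirby, 1889"))
print(expand_citation("Kirby, 1889"))
```

In practice the lookup is the hard part: a citation that is missing from the seed table returns None here and has to be resolved by hand, which is where the manual effort and domain knowledge come in.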

 

GBIF imports the dataset and uses the data to create various displays of it, from an overview of the entire dataset, through a subset of accepted names or occurrences, to individual accepted names and occurrences.

 

Different stakeholders can access this data through services provided together with GBIF. All the data is open access, can be cited in different ways and downloaded in various formats (such as HTML, valid XHTML, generic RDF, JSON, and Plazi generic XML), and has appropriate licenses attached to it for legal clarity.
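As one illustration of such access, the following Python sketch builds request URLs for the public GBIF REST API, assuming its species match and occurrence search endpoints under api.gbif.org/v1. The helper names are hypothetical, and the sketch only constructs the URLs; it does not send requests or make any claim about what they return.

```python
from urllib.parse import urlencode

# Base URL of the public GBIF REST API (assumed here; not named in the text).
GBIF_API = "https://api.gbif.org/v1"


def species_match_url(name: str) -> str:
    """URL that asks the GBIF taxonomic backbone to match a scientific name."""
    return f"{GBIF_API}/species/match?{urlencode({'name': name})}"


def occurrence_search_url(scientific_name: str, limit: int = 20) -> str:
    """URL that searches occurrence records by scientific name."""
    query = urlencode({"scientificName": scientific_name, "limit": limit})
    return f"{GBIF_API}/occurrence/search?{query}"


print(species_match_url("Lestes wallacei"))
print(occurrence_search_url("Orolestes wallacei", limit=5))
```

Each URL can be opened in a browser or fetched with any HTTP client; the responses are JSON.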


With this article in hand, Jeremy and colleagues have one more piece of the puzzle in place toward creating their envisioned cybercatalog. In addition, GBIF gains further names (e.g. Lestes wallacei) for its taxonomic backbone, as eventually will the COL, which currently does not include the original combination.

What have we learned?

Taxonomic publications are very rich in citations. Almost all of them, however, are implicit and cryptic, and thus only available and actionable with considerable manual effort, domain knowledge and access to libraries and online subscriptions. No wonder that we do not know what we know regarding biodiversity.

Looking at this article specifically, the original name Lestes wallacei does not exist in any of the databases (GBIF, COL, NHM); the species is listed only as Orolestes wallacei.


From a cataloging point of view, we still cannot follow the trail from Orolestes wallacei to the original description via the treatment in which the name change was made.

The NHM London has not yet published the specimen data on the type in GBIF, although it is available under Orolestes wallacei on its system.

This dataset and the other 28K datasets in GBIF based on taxonomic publications show the potential, but also the current limitations, of this process.


What should be done?


The primary and most important lesson is to change the way we publish taxonomic works to a workflow that creates immediately actionable data, as proven by its reuse in GBIF. This is already implemented by the Biodiversity Data Journal, enabled by TaxPub/JATS.

The second lesson is that we need to create and enhance workflows that allow finding, converting and linking the data imprisoned in legacy publications, making it actionable with as much automation as possible. As is obvious from the above description, this is a real community effort and a daunting task, given a rapidly growing, largely untapped corpus of taxonomic literature estimated at 500 million published pages.