18.02.2021 10:16

A new GBIF Plazi data issue feedback loop

The new GBIF-Plazi alert system makes use of the GitHub issue tracker

Plazi’s goal is to discover known biodiversity and make it widely available using the tools of the digital age. This means liberating data hidden in libraries and, more recently and increasingly, locked in the PDF prison, behind paywalls and in unstructured text.

The grand challenge facing us is an estimated 500 million published pages of scholarly biodiversity-related literature, plus 17,000 newly discovered and described species every year, along with many times that number of annotations on already known species.

This clearly cannot be done by any one institution, despite the gains in computing technology and in the computing power available to individuals. It needs collaboration and a strategy. With its rapidly growing corpus of liberated taxonomic treatments and its technology, Plazi hopes to inspire others to join in this endeavour. While the greatest interest and richness of the data lie in the details (who collected which species, when and where), that richness can only be realized if the context, too, is liberated and made available as FAIR data.

Our current strategy is to make one million taxonomic treatments, including taxonomic names, openly accessible and, by leveraging treatment citations, to build the catalogue of life with each name linked to its scientific argument and data. We are well on our way.

To tackle this, we have developed the necessary tools and infrastructure, and will develop more as needed. They allow converting PDF documents into a text and multimedia stream that can be further processed by adding tags that give meaning to anything from single words to entire sections (semantic enhancement), so that both humans and machines can understand the content. This is quite a technical challenge.

Materials citations attract great attention. They reference, in various formats, the specimens used in the research process leading to the research results, that is, the published taxonomic treatments. However, they are a by-product of our current production. We try to discover them and make them accessible via the Global Biodiversity Information Facility (GBIF) as occurrences. We routinely perform quality control and check whether holotype citations are properly extracted, but we do not check the remainder unless dedicated resources are available.

We believe that, despite this raw form, we provide a useful service by drawing attention to specimens in collections and to species that are not recorded in GBIF. We believe this will eventually lead to better data. It will raise awareness among authors and publishers of the value of publishing specimen data in a standardized way, as has been proposed by EJT and Pensoft.

With your input, we can make this data even better and the material citations fit for more uses.

The launch of the new feedback button in GBIF reflects the commitment of GBIF and Plazi to data quality. A user can send Plazi a request to review specific data, allowing its quality control team to study and resolve the issue and update the GBIF record. Analyzing these requests can also make us aware of the causes of the underlying issues and help us develop measures to mitigate them in the future.

We hope this will contribute to better ways for scientists to publish their research results.