Expanding the biodiversity knowledge graph

Species discovery is based on specimens and communicated through taxonomic treatments, which are published as parts of scientific publications. In an ideal world, this would allow us to ask questions such as “what is known about this specimen?” or “what does a gene sequence of a cited physical specimen look like?”, and the results would appear immediately on our computer screens in a format suitable for further analysis.

This ideal world of instant insights into biodiversity data is not here yet, but we have taken big steps towards it. Through international initiatives such as DiSSCo and iDigBio, and national ones such as SwissCollNet, tens of millions of specimens are being digitized. These digital specimens are aggregated through the Global Biodiversity Information Facility (GBIF) along with data from other sources, such as genomic and citizen science projects. In fact, these various sources often refer to the same specimens, thus enriching our knowledge of them.

A material citation is the citation of a specimen in a taxonomic publication; it points to the source of what the scientist discovered through the combined analysis of the specimen and its congeners. Although the relationship between a material citation and its specimen looks simple, it is complex from a technical point of view. The links can be predicted only to some degree of accuracy, for example by a clustering algorithm from GBIF or by a matching algorithm used by Plazi. In either case, the predicted links provide a unique starting point for further curation.

A major issue in digitization is how and what data are collected, as well as the reliability and quality of the conversion process. Since most of the data are not originally digitized with the intention of linking one data point to another, creating matches algorithmically is not simple. Human curation is therefore required to accept or reject a proposed match. For this reason, a matching service has been developed that allows users to curate the links. Once a link is accepted, an identifier of the linked specimen can be inserted into the material citation, thereby expanding the knowledge graph one link at a time.

Matching service user interface. The proposed match is between a material citation and possible specimens in GBIF. Each field shows a matching score, with green indicating the highest, and an overall score is also given.
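
To make the idea of per-field scores and an overall score more concrete, here is a minimal sketch in Python. It is not the algorithm used by Plazi or GBIF, whose details are not described here. The field names are borrowed from Darwin Core, while the weights, the string-similarity measure, the `linkedSpecimenId` key and all record values are assumptions made up purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field weights: the fields and weighting used by the real
# matching service are not given in the text.
FIELD_WEIGHTS = {
    "catalogNumber": 3.0,
    "recordedBy": 2.0,
    "eventDate": 2.0,
    "country": 1.0,
}


def field_score(a, b):
    """Return a string similarity in [0, 1]; 0.0 if either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def score_match(citation, specimen):
    """Return (per-field scores, weighted overall score) for one candidate."""
    scores = {
        field: field_score(citation.get(field), specimen.get(field))
        for field in FIELD_WEIGHTS
    }
    total_weight = sum(FIELD_WEIGHTS.values())
    overall = sum(FIELD_WEIGHTS[f] * s for f, s in scores.items()) / total_weight
    return scores, overall


def accept_link(citation, specimen):
    """Return a copy of the citation carrying the accepted specimen identifier."""
    linked = dict(citation)
    linked["linkedSpecimenId"] = specimen["id"]  # hypothetical key name
    return linked


# Invented example records, not real data.
citation = {
    "catalogNumber": "NMBE 12345",
    "recordedBy": "A. Muster",
    "eventDate": "1998-07-14",
    "country": "Switzerland",
}
candidates = [
    {"id": "specimen-B", "catalogNumber": "NMBE 99999", "recordedBy": "B. Other",
     "eventDate": "2001-01-01", "country": "Austria"},
    {"id": "specimen-A", "catalogNumber": "NMBE 12345", "recordedBy": "A. Muster",
     "eventDate": "1998-07-14", "country": "Switzerland"},
]

# Rank candidates by overall score so the curator sees the best proposal first.
ranked = sorted(candidates, key=lambda s: score_match(citation, s)[1], reverse=True)
best = ranked[0]

# The accept/reject decision stays with a human; here we simply assume acceptance.
linked_citation = accept_link(citation, best)
```

In this sketch the scores only rank the proposals. The decision to call `accept_link` is left to the curator, which mirrors the division of labour between algorithm and human described above.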

Plazi, in collaboration with SIBiLS and COST Mobilise, will offer a training course for interested persons on 27-28 February 2023. Participants will learn the underlying concepts used in digitizing taxonomic publications and occurrences in GBIF, how to operate the matching service, and how to decide whether a match is acceptable.

This digitization service, the development of the learning materials and the training course are a collaboration between SIBiLS, Plazi and the Natural History Museum of Bern, Switzerland, supported by the Swissuniversities eBioDiv project, Arcadia, and the Horizon Europe funded BiCIKL and COST Mobilise action projects.