One of the big challenges in biology (how many species there are on planet Earth, where they live and what they do) is increasingly relevant in the age of the biodiversity crisis. In the digital age this becomes awkward, because it is obvious that we do not even know what we know about biodiversity. Whilst libraries were the undisputed source of knowledge in the analogue age, hard copies in libraries no longer serve our needs. Neither does the Portable Document Format (PDF), currently the most widespread publishing format and the one required by the Codes of nomenclature. Neither allows machines to access the facts in the estimated 500 million pages of legacy publications. Cataloguing this content takes a huge amount of scientists’ time, for example to build the Catalogue of Life, a goal pursued at least since the Rio Earth Summit in 1992. But a catalogue alone is not enough, because it does not allow exploration of the many implicit links between the taxonomic treatments, traits and specimens used to describe new species or to enhance the knowledge of existing species with additional data.
In fact, in the digital age, an expert's identification of a specimen should include a reference to the cited treatments, analogous to depositing a voucher specimen in genomic or ecological studies. Without such evidence, an expert opinion is a cul-de-sac.
The missing digital access is all the more annoying because scientists have been building, albeit so far implicitly, a huge citation network: they cite specimens, previous treatments of taxa and publications, and they use a shared, domain-specific standard vocabulary, such as Latin binomens for species names or morphological terms, that allows different data sets to be compared.
Finally, the rapid growth of technology that allows unprecedented analyses and visualization, or makes the network navigable, is widening the gap between the way we operate and what is possible.
To know known biodiversity is a complex endeavour. It starts with convincing scientists, publishers and funders that publications are not mere tools for career enhancement, but that their data contribute to building a knowledge graph in which the scientist no longer primarily consumes one publication after another but analyses the data from within many to very many publications. It includes getting access to the literature and converting it into a machine-readable format, in the most extreme case from scanning the source and text conversion to modelling knowledge and creating and applying domain-specific vocabularies, in order to build new, sustainable infrastructures. It also requires changes in the sciences, from funding new infrastructures to measuring scientific output with alternative metrics based on data beyond the articles per se.
A combination of highly automated tools and tools based on human interaction is needed to quality-control the large amount of converted and annotated data that contributes to building the biodiversity knowledge graph. These controls range from the proper extraction of text streams to complex citations such as treatment citations, the building blocks of the Catalogue of Life, which include elements such as taxonomic names and bibliographic reference citations that can only be checked with the support of automated processes.
The sheer number of facts and links embedded in a single publication, not to mention the annual production or the backlog waiting in the libraries, can only be processed using automated workflows such as our TreatmentBank service and its interaction with the Biodiversity Literature Repository at Zenodo.
Thanks to sophisticated tools and collaboration with partners, quality control can be delivered. For example, thanks to our Synospecies service and its underlying AllegroGraph database, we noticed that several defining treatments appeared to be missing, and further analysis with SPARQL helped us quickly find the reason for the problem.
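The kind of check described here can be illustrated with a minimal sketch. It is not the Synospecies schema, not SPARQL, and not the analysis actually run: it assumes a hypothetical in-memory list of (treatment, relation, taxon name) triples with made-up relation labels `defines` and `augments`, and it reports taxon names that are cited by augmenting treatments but for which no defining treatment is present.

```python
from collections import defaultdict

# Hypothetical triples (treatment, relation, taxon name); identifiers,
# relation labels and names are invented for illustration only.
TRIPLES = [
    ("treatment:A", "defines", "Aus bus"),
    ("treatment:B", "augments", "Aus bus"),
    ("treatment:C", "augments", "Aus cus"),
    ("treatment:D", "augments", "Aus cus"),
    ("treatment:E", "defines", "Aus dus"),
]


def taxa_missing_defining_treatment(triples):
    """Map each taxon that is augmented but never defined to its treatments."""
    defined = set()
    augmented_by = defaultdict(set)
    for treatment, relation, taxon in triples:
        if relation == "defines":
            defined.add(taxon)
        elif relation == "augments":
            augmented_by[taxon].add(treatment)
    # Keep only taxa for which no defining treatment was seen.
    return {
        taxon: sorted(treatments)
        for taxon, treatments in augmented_by.items()
        if taxon not in defined
    }


if __name__ == "__main__":
    for taxon, treatments in taxa_missing_defining_treatment(TRIPLES).items():
        print(f"{taxon}: no defining treatment; cited by {', '.join(treatments)}")
```

In a triple store the same question would typically be posed as a query with a negated pattern; the dictionary-based version above only shows the logic of the check.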
Together, this will help us not only to provide data for reuse in the Biodiversity Literature Repository to unravel and understand known biodiversity, but also to provide data that is fit for use by other services such as the Global Biodiversity Information Facility (GBIF), SIBiLS and LINDAS, and thus to disseminate this data, the lack of access to which is recognized by the Convention on Biological Diversity as underlying the Global Taxonomic Impediment that hampers the conservation of biodiversity.