12.01.2017 12:06

2017 needs to bring better access to taxonomic data

2016 showed progress in providing access to taxonomic data in near real time. But much more is needed.

The current way of publishing, cataloguing and providing access to taxonomic data is, to put it mildly, an almost complete failure.


  1. We (our taxonomic community) have an estimated 10% of our literature in digital form, mainly through the efforts of BHL. The many silos of PDFs accumulated by individual scientists don't count, since they do not belong to the community, i.e. they are not accessible to everybody.
  2. We do not have a catalogue of life for all living species, nor are we able to build one without an additional huge effort that is not on the horizon.
  3. We do not know what has been published in 2016.
  4. Traditional publications are made for human consumption, which makes data extraction extremely cumbersome.
  5. We do not fulfill the role of taxonomy as a service to the widest community, namely to deliver the reference system for sharing data about species. We thus risk making taxonomy even more obsolete, which need not happen given the tools currently at our fingertips.


After working for over 14 years on modeling and extracting taxonomic content from publications, we (at Plazi) see no silver lining for dealing with the huge backlog of literature, and only a slight one for ongoing publishing.

With great effort, we can now automatically extract content from scientific publications. Together with the content easily imported from TaxPub/XML articles published by Pensoft, this covers an estimated 25% of the newly described species for 2016, including the metadata, the taxonomic treatments, the illustrations and in many cases the type material, including the collection code and specimen code. Today we have over 60,000 tagged images on BLR, complementing the ca. 30,000 images on Flickr through BHL.

For the articles, the treatments and the illustrations, we have now added the respective metadata and persistent identifiers. These identifiers are also included in the metadata whenever one item cites another.

This workflow mainly works for born-digital articles. Handling scanned articles at the same level is an order of magnitude more complex, which makes it even less likely that it will be done in the near future.

For 2016, we at Plazi extracted 4 new families, 376 new genera, 4,684 new species and 42,207 taxonomic treatments of 40,870 unique names from 60 different journals. The data is accessible at Plazi and the Biodiversity Literature Repository.

Plazi data is automatically imported into GBIF, where it is one of the major name contributors and one of the few sources providing treatments. This allows linking a name usage to the respective treatment, and from there to the original article and illustrations. From a nomenclatural point of view, this makes it possible to check, besides the exact publishing date, everything needed to understand whether a name is available. It also makes it possible to begin to understand the scientific basis for new names, which is all too often very thin, i.e. descriptions based on a single specimen (see e.g. Miller et al., 2016, or consult the new taxa feature on Plazi).

Our chance as taxonomists is that we have available one of the most advanced publication systems in the entire scientific publishing world. In fact, it was developed for the taxonomic world thanks to a collaboration with Pensoft, who implemented it, and a collaboration with Plazi and the US National Institutes of Health, which for this reason also allowed taxonomic articles to be included in PubMed Central.

The above approach has another advantage: everything is open access and thus available to anybody anywhere in the world. In fact, the implementation of the Open Biodiversity Knowledge Management System (OBKMS) will make all the data published at Pensoft and extracted by Plazi available in the Linked Open Data Cloud. In other words, our really important data will become the first-class citizen we want it to be.

But it needs a community effort to make it happen, that is, to provide not a fraction but all the data of our discoveries right at the moment we publish. Please join this effort!