10 years of Plazi. On expedition to discover the known biodiversity

This is the view of Donat Agosti, cofounder of Plazi.

In 1992, in the wake of the Rio Earth Summit, I wrote an article for the Swiss newspaper Neue Zürcher Zeitung titled “Brauchen wir zu wissen, wie viele Arten es gibt?” (“Do we need to know how many species there are?”). Now, in 2018, we are still discussing this question, and I am still convinced we should know. Should, because we still don’t know the answer. But must, because of the rapidly increasing, unprecedented loss of biodiversity.

Our world has changed dramatically since 1992. I can’t tell you what I argued in that newspaper article, because I have only a hard copy somewhere, I can’t find it online, and I have forgotten the details [1]. This is annoying in a world where everything seems to have a digital fingerprint. If this affected only one personal piece of work, it might be acceptable. But it is the case for probably 90% of all the printed scientific publications describing the world’s biodiversity. And this is the reason we don’t know how many species we know, not to speak of how many there are.

The story doesn’t end here. We continue to publish, no longer on paper but digitally. Most of these new research results are closed access and not registered, and we have no idea what data an article includes. Months to years later, people enter the data by hand into the respective databases, which might link to the article, but access to the deep content, the data itself, is still not possible. Each article includes on average 7 images: highly important illustrations that follow well-established standards, with the goal (at least in mind) of creating a seamless corpus of illustrations depicting the Earth’s species. The same holds true for the taxonomic treatments, which include anything from descriptions, summaries of distribution and behavior, and references to the observed specimens, to synonymy.

This is in stark contrast to what is happening elsewhere. The genomics community is developing ever faster and more efficient methods to collect DNA sequences and is building up its own systems to study the world’s diversity. Citizen scientists have incredible tools in place to collect data on their objects of interest, producing millions of observation records every month with very precise geodata and sophisticated quality control to ensure that each record is correctly identified. This data is now the main staple of the Global Biodiversity Information Facility (GBIF) and probably the only dataset for studying changes in (bird) biodiversity that is sufficient for the monitoring called for back in 1992 in the Convention on Biological Diversity. These are all alternative ways to discover and chart the world’s biodiversity, each with its own constraints.

Back in 2003, with the help of the US NSF and the Deutsche Forschungsgemeinschaft, we started a complementary approach, which I like to think of as the “Second wave of biodiversity discovery”: discovering what we should know or, in other words, what we have been publishing and in fact continue to publish. The idea is simple: if we had access to the data in all publications, even if we could not extract all of it, we could link it to what Tim Berners-Lee called the Knowledge Graph. This would work better if we modeled the taxonomic domain and discovered the respective elements in the published literature, and better still if we explicitly identified (i.e., tagged) and linked them upfront in the publishing process, an alternative that few are seriously willing to discuss, let alone implement.

The user groups we organized at the American Museum of Natural History informed a research team, which led to a first model laid down in the TaxonX schema, an attempt to cover all the elements characteristic of taxonomic treatments. The fortunate coming together of a very diverse team of scientists, from library to computer to biological sciences, also meant that we had many connections beyond taxonomy itself and a good overview of what was happening in the various domains. One of the fruitful connections has been with the US National Center for Biotechnology Information of the National Library of Medicine and the team that maintains the Journal Article Tag Suite (JATS), which convinced us to use the lessons learned from TaxonX to create a taxonomy-specific extension of JATS, which ultimately became TaxPub.

In 2008, at the Linnaean 250-year celebration in Paris, the Bulgarian publisher Pensoft agreed not only to consider the relevance of taxonomic treatments as the core element of taxonomic publishing, one that ought to be citable and retrievable from each respective taxonomic name, but also to change its publishing workflow to one based on JATS/TaxPub. At the same time, this opened the door to Pensoft’s submission of taxonomic works to PubMed, another first.

In parallel, we increasingly ran into the issue of copyright. Our activities had been on the radar of Kew Botanical Gardens, who invited us to participate in a meeting about access. Together, we questioned the argument that copyright should be used to protect the incredible and potentially highly valuable work done at the Garden. On the way home, writing a constitution became possible. Within a very short time this converged with the insight that, to be more efficient, we had to “incorporate” and “brand” us, a loosely formed group of specialists with the same mission. This led to the founding Skype call on March 14, 2008, at which the Plazi Association was born.

Plazi’s mission is to foster open access to taxonomic work and make this knowledge an integral part of the science infrastructure. Our collaboration with Zenodo at CERN, with whom we have actively collaborated from its very beginning, is a very important element in our virtual expedition. It provides our community with a stable, state-of-the-art and, for the time being, unlimited repository, along with shared interests such as making each data object citable using DataCite DOIs, enhanced with links to related items, all with usage statistics and fully automated upload and annotation. With this we (together with Pensoft) provide the community with a repository that is independent of Plazi’s fate and that allows comparison with others in the fledgling DiSSCo. Our collaboration with Zenodo taught us that what we consider huge data is in fact dwarfed by large-scale science projects like the Large Hadron Collider at CERN, and that we need not consider storage a limiting factor again.

Dealing with a rapidly growing number of images became another challenge. How can we best make use of them in a repository that has been built for single but very large datasets, as opposed to many small images? How can we make use of image analysis to contribute to the automated identification of specimens? Why not build the world’s index of scientific taxonomic illustrations and make it another gateway to finding out what we know about our species and in which publications? Suddenly having access to so many liberated images, which nobody ever had before, is one of the most stunning results of our expedition. It also shows that discoveries cannot all be planned: we originally focused on textual objects and only later realized that illustrations are scientific data as well.

Today, at our tenth anniversary, we recognize that we are probably still far from being swept up in the truly large-scale expedition needed to discover the known biodiversity. With increasing experience, we have learned about the challenges. But we are even more convinced that making this knowledge an integral part of the body of global knowledge, readily findable and citable, so that work in related fields (genomics, citizen science and museum collection digitization) can be linked to it and can in turn make use of the existing knowledge, is decisive for ultimately conserving the world’s biodiversity.

We are proud to have developed highly automated ways to open up scientific publications, provide long-term, sustainable access to all the data therein, and foster the debate on Open Access.

We are aware of many shortcomings, but they help us to look to the future, find solutions and stay continually engaged in a grand, exciting area of discovery.

All this development would not have been possible without continued support from the US NSF and DFG (Collaborative Research: Development of New Digital Library Applications), the European Union Framework Program 7 (FP 7: ViBRant, pro-iBiosphere, EU BON), Horizon 2020 (ICEDIG), Zenodo, the University of Massachusetts (Boston) and a very prolific collaboration with our partner Pensoft. Last but not least, the incredible dedication and voluntary work invested, our sympathetic partners, and the in-kind contributions made over the last ten years have been, and still are, a main pillar of Plazi’s drive to uncover known biodiversity.

Future news items will cover specific aspects of our experiences in following our vision to “discover the known biodiversity”, as well as other team members’ views.

  1. Thanks to Neue Zürcher Zeitung I got a copy of the article, and thanks to Zenodo it will be accessible from now on (DOI: 10.5281/zenodo.1198575).