TreatmentBank API

What is a treatment?

The Plazi TreatmentBank [1] deals with published scientific biosystematic literature: the literature documenting and describing all of the world’s ca. 1.9 million known species, in an estimated corpus of over 500 million published pages. The publications cited in Plazi are all available at the Biodiversity Literature Repository [2] at Zenodo/CERN.

Treatments are well-defined parts of articles that define the particular usage of a scientific name by an author at a given time (the publication) [3]. In other words, each scientific name has one to several treatments, depending on whether only an original description of the species exists or whether there are subsequent re-descriptions. Similar to bibliographic references, treatments can be cited, and subsequent usages of names cite earlier treatments.

Treatments are a synthesis of the knowledge of a given species at a given time. They can be very rich in data, explicitly or implicitly, detailed or summarized, and include many references to external data sources, such as scientific names, collection codes, DNA-codes.

The data can be semantically enhanced and linked. Because treatments are parts of publications, they need to be extracted. Most recently, treatments are tagged in electronic publications with the TaxPub extension of the National Library of Medicine’s Journal Article Tag Suite (JATS) [3], which allows automatic extraction. Still, the majority of the ca. 2000 journals and books publishing treatments use the PDF format at best. Plazi has tools to extract treatments, enhance the embedded data, and import it into its SRS Treatment Search Portal for public online access.

The data, that is, treatments and observation data, can be viewed as HTML, XML, or RDF, or harvested with the protocols described below. For harvesting, the data is provided as Darwin Core Archives.

What is a DarwinCore Archive?

The Darwin Core Archive format is a simple and extensible schema for sharing biodiversity data, especially catalogue data, based on the ratified Darwin Core terms and the Darwin Core text guidelines [4]. Darwin Core is a standard for describing sample data in the Biodiversity Informatics community. The archive format has been developed by the Global Biodiversity Information Facility (GBIF). Darwin Core Archives use a table-based, “spreadsheet-style” format that is more comfortable and familiar to biologists. The format uses plain text files, but it is tied to processes that support consistency and stability.

Fig. Schematic representation of a Darwin Core Archive and its components [4]

The GBIF GNA format consists of a set of files in which one (or more) files represent the ‘core’ taxonomic data, where a single row represents a single taxon reference. The DarwinCore Taxon class provides the majority of concepts supported in the format that enable taxonomic and nomenclatural semantics and syntax (classification, taxonomic and nomenclatural synonymy, status, etc.) to be expressed.

Other files represent “extensions” to this core table and allow additional data elements to be linked to a taxon in the core table with a many to one relationship. The overall topology of one or more of these extensions to the core table is referred to as a “star schema” and provides a compromise between an overly simple flat-file representation of data and more complex multi-related files. In addition to these files, an additional descriptor file named “meta.xml” serves as a key to the other files. Collectively, these files can be further zipped into a single compressed archive file for portability. This compressed file is known as a Darwin Core Archive (DwCA) file [4].
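The role of meta.xml as a key to the other files can be illustrated with a short Python sketch. It assumes that meta.xml follows the Darwin Core text guidelines, with a core element and zero or more extension elements in the namespace http://rs.tdwg.org/dwc/text/, each naming its data file in files/location. The function name list_archive_tables is made up for this illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace assumed for meta.xml (Darwin Core text guidelines).
DWC_TEXT_NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def list_archive_tables(archive_bytes):
    """Return (core_file, [extension_files]) as declared in meta.xml."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))

    def location(table):
        # Each table element names its data file in files/location.
        return table.findtext("dwc:files/dwc:location", namespaces=DWC_TEXT_NS)

    core = location(root.find("dwc:core", DWC_TEXT_NS))
    extensions = [
        location(extension)
        for extension in root.findall("dwc:extension", DWC_TEXT_NS)
    ]
    return core, extensions
```

Called with the bytes of a downloaded archive, the function returns the core file name and the list of extension file names, which is the star schema described above in miniature.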

The Darwin Core Archive used by Plazi

There is one archive per article stored in Plazi, containing the data from all the treatments in the article. Archives contain nine files:

  • meta.xml: description of columns in data files
  • eml.xml: archive meta data, i.e., bibliographic citation of article, etc.
  • taxa.txt: the archive core file, containing one row per taxon in the nomenclature section of a treatment, thus one or multiple rows per treatment, with any after the first for each treatment handling synonymizations
  • occurrences.txt: occurrence data, containing one row per materials citation, with an ID reference to taxa.txt
  • description.txt: description data, containing one row per descriptive treatment section, with an ID reference to taxa.txt
  • distribution.txt: general distribution data, one row per distribution statement, with an ID reference to taxa.txt
  • media.txt: full text treatments with HTML markup with additional meta data like a bibliographic citation, one row per treatment, with an ID reference to taxa.txt
  • references.txt: bibliographic references to individual treatments, one row per treatment, with an ID reference to taxa.txt
  • vernaculars.txt: vernacular names of treatment taxa, currently empty, as we do not have or mark this kind of data

For a detailed description of the content of each file see Appendix: Darwin Core Archive Content

Treatment Data representation in Plazi

The treatment data is stored in the Treatment Search Portal in native, generic XML included in tagged original publications. The tagged elements are (a) additionally stored in dedicated index structures to support search and (b) extracted and exported in several formats, including DwCA.

A treatment document includes two main elements, the header including the metadata based on the Metadata Object Description Schema (MODS) and the body.

  tax:taxonx
    tax:taxonxHeader
    tax:taxonxBody

The data XML can be converted via XSLT into HTML, TaxonX XML (a schema developed to model biosystematics legacy literature), and RDF:

HTML: http://treatment.plazi.org/id/31F96F41-E3E0-02BD-8898-5A4F3A20E45A (this is also the persistent httpURI used as identifier for treatments)

Plain XML: http://tb.plazi.org/GgServer/xslt/31F96F41E3E002BD88985A4F3A20E45A

TaxonX XML: http://tb.plazi.org/GgServer/taxonx/31F96F41E3E002BD88985A4F3A20E45A

RDF: http://tb.plazi.org/GgServer/rdf/31F96F41E3E002BD88985A4F3A20E45A or http://treatment.plazi.org/id/31F96F41-E3E0-02BD-8898-5A4F3A20E45A.rdf
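The example URLs above use the same treatment identifier in two spellings: with dashes under treatment.plazi.org/id/ and without dashes under tb.plazi.org/GgServer/. A minimal Python sketch of the conversion follows, assuming the dashed form always uses the 8-4-4-4-12 grouping seen in the example; the function names dashed_uuid and representation_urls are hypothetical.

```python
import re


def dashed_uuid(compact):
    """Insert dashes into a 32-digit hex UUID (assumed 8-4-4-4-12 grouping)."""
    if not re.fullmatch(r"[0-9A-Fa-f]{32}", compact):
        raise ValueError(f"not a 32-digit hex UUID: {compact!r}")
    parts = (compact[0:8], compact[8:12], compact[12:16],
             compact[16:20], compact[20:32])
    return "-".join(parts)


def representation_urls(compact):
    """Build the URLs listed above for one treatment UUID (no dashes)."""
    server = "http://tb.plazi.org/GgServer"
    return {
        "html": "http://treatment.plazi.org/id/" + dashed_uuid(compact),
        "xml": f"{server}/xslt/{compact}",
        "taxonx": f"{server}/taxonx/{compact}",
        "rdf": f"{server}/rdf/{compact}",
    }
```

For the UUID 31F96F41E3E002BD88985A4F3A20E45A the sketch reproduces the HTML, plain XML, TaxonX, and first RDF URL given in the list.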

The terms used in TaxonX and RDF are either imported from existing schemas (such as Darwin Core for observation records, MODS for bibliographic data) or are, if not available, defined in schemas (TaxonX) or ontologies (RDF: in development)

Plazi API

Treatment data is open access and can be accessed via HTTP GET as described in detail below. The treatment data is provided in HTML, various XML flavors, and RDF.

Obtaining a list of all the treatments available from Plazi

HTTP GET http://tb.plazi.org/GgServer/xml.rss.xml
Response (RSS, in Atom XML, encoded in UTF-8)

Entries of interest

  • channel/item/link: the link to the XML treatment
  • channel/item/title: the taxon name and authority
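The two entries of interest can be read from the feed with the Python standard library. The sketch below assumes that the response has a root element containing channel/item elements without XML namespaces, as the paths above suggest; the function name parse_treatment_feed is made up for this illustration, and the feed text is expected to have been downloaded beforehand.

```python
import xml.etree.ElementTree as ET


def parse_treatment_feed(feed_xml):
    """Return a list of (title, link) pairs, one per channel/item."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.findall("channel/item"):
        # title: taxon name and authority; link: link to the XML treatment
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        entries.append((title, link))
    return entries
```

If the actual feed declares namespaces on these elements, the paths passed to findall and findtext would need the corresponding prefixes.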

Accessing a particular DwC-Archive

HTTP GET http://tb.plazi.org/GgServer/dwca/.zip
Insert a dataset UUID between dwca/ and .zip; any UUID from the GBIF-provided listing (see below) can be used. It is also possible to use the endpoint URL from that listing directly.

Example:

http://tb.plazi.org/GgServer/dwca/23A1465DDF212F7DA589F41341B83FCC.zip
Response (ZIP Archive, containing XML and tab separated TXT files, all encoded in UTF-8)

Entries of interest:

  • eml.xml: an XML file containing the meta data of the publication, in MODS format
  • taxa.txt: a tab separated TXT file listing the taxa and treatments the DwC-Archive contains, plus higher taxonomy; the Identifier column takes the form treatment UUID + .taxon, and the treatment UUID can be used to access the treatment on the Plazi servers (see below)
  • occurrences.txt: a tab separated TXT file containing occurrence data; the TaxonID column references the Identifier column in taxa.txt, the data column headers are DwC terms
  • media.txt: a tab separated TXT file containing HTML versions of the treatments; the TaxonID column references the Identifier column in taxa.txt, the HTML treatments are located in the Description column
  • references.txt: a tab separated TXT file containing bibliographic references to the treatments

A detailed description of the contents can be found at http://github.com/plazi/Plazi-Communications/wiki/GBIF#darwin-core-archive
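To get from taxa.txt to the treatments themselves, the treatment UUIDs have to be cut out of the identifiers. The Python sketch below takes the text of taxa.txt and assumes that the identifier sits in the first column and ends in .taxon (or .syn for new junior synonyms, as described in the appendix). Both the column position and the function name treatment_uuids are assumptions for this illustration; rows whose first field has no such suffix, such as a possible header row, are skipped.

```python
import csv
import io


def treatment_uuids(taxa_text, id_column=0):
    """Return the distinct treatment UUIDs found in taxa.txt, in file order."""
    uuids = []
    reader = csv.reader(io.StringIO(taxa_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= id_column:
            continue  # blank or short line
        uuid, _, suffix = row[id_column].partition(".")
        # Keep only identifiers of the form <UUID>.taxon or <UUID>.syn.
        if suffix in ("taxon", "syn") and uuid not in uuids:
            uuids.append(uuid)
    return uuids
```

Each returned UUID can then be appended to the treatment URLs described in the following section.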

Accessing a particular treatment on the Plazi servers

HTTP GET http://tb.plazi.org/GgServer/html/<treatmentUUID>
Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives

Example:

http://tb.plazi.org/GgServer/html/8C4CE845A6DEE6FDFD1600A70D5BC71B
Response (HTML, encoded in UTF-8): a web page displaying the treatment

HTTP GET http://tb.plazi.org/GgServer/xml/<treatmentUUID>
Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives

Example:

http://tb.plazi.org/GgServer/xml/8C4CE845A6DEE6FDFD1600A70D5BC71B
Response (XML, encoded in UTF-8): the raw, generic XML version of the treatment, which all other representations are generated from

HTTP GET http://tb.plazi.org/GgServer/taxonx/<treatmentUUID>
Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives

Example:

http://tb.plazi.org/GgServer/taxonx/8C4CE845A6DEE6FDFD1600A70D5BC71B
Response (XML, encoded in UTF-8): a TaxonX XML version of the treatment
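The three requests above differ only in one path segment, so a client can build them from a single helper. The following Python sketch uses only the standard library; the names treatment_url and fetch_treatment and the 30 second timeout are choices made for this illustration, not part of the Plazi API.

```python
from urllib.request import urlopen

SERVER = "http://tb.plazi.org/GgServer"
FORMATS = ("html", "xml", "taxonx")  # path segments documented above


def treatment_url(treatment_uuid, fmt="xml"):
    """Build the GET URL for one treatment in the requested representation."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}, expected one of {FORMATS}")
    return f"{SERVER}/{fmt}/{treatment_uuid}"


def fetch_treatment(treatment_uuid, fmt="xml", timeout=30):
    """Download one treatment and decode it as UTF-8 text."""
    with urlopen(treatment_url(treatment_uuid, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, treatment_url("8C4CE845A6DEE6FDFD1600A70D5BC71B", "taxonx") yields the TaxonX example URL shown above; fetch_treatment performs the actual request and therefore needs network access.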

List of Plazi’s available DwC-Archives from GBIF API

GBIF is a regular harvester of Plazi data and can be used as an alternative site.

HTTP GET https://api.gbif.org/v1/organization/7ce8aef0-9e92-11dc-8738-b8a03c50a862/publishedDataset?limit=20&offset=<20k>
Replace <20k> with any multiple of 20 (including 0) to page through the list. It is also possible to use a limit other than 20, with the offset then being a multiple of that other limit.

Example (first 20 datasets):

http://api.gbif.org/v1/organization/7ce8aef0-9e92-11dc-8738-b8a03c50a862/publishedDataset?limit=20&offset=0

Response (JSON)

{
    "offset": 0,
    "limit": 1, 
    "endOfRecords": false, 
    "count": 1129, 
    "results": [
        { 
            "key": "3e8b196b-c482-47f1-9574-772141310c40", 
            "installationKey": "7ce8aef1-9e92-11dc-8740-b8a03c50a999", 
            "publishingOrganizationKey": "7ce8aef0-9e92-11dc-8738-b8a03c50a862", 
            "external": false, "numConstituents": 0, 
            "type": "CHECKLIST", 
            "title": "Revision of the ant genus Myrmoteras in the Malay Archipelago (Hymenoptera, Formicidae).", 
            "description": "UNAVAILABLE", 
            "language": "eng", 
            "homepage": "http://tb.plazi.org/GgServer/summary/23A1465DDF212F7DA589F41341B83FCC", 
            "citation": { 
                "text": "Plazi.org taxonomic treatments database: Revision of the ant genus Myrmoteras in the Malay Archipelago (Hymenoptera, Formicidae)." 
            }, 
            "rights": "No known copyright restrictions apply. See Agosti, D., Egloff, W., 2009. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2009, 2:53 for further explanation.", 
            "lockedForAutoUpdate": false, 
            "createdBy": "plazi", 
            "modifiedBy": "crawler.gbif.org", 
            "created": "2014-06-28T12:55:54.089+0000", 
            "modified": "2014-11-25T13:29:20.716+0000", 
            "contacts": [...], 
            "endpoints": [{ "key": 45389, 
            "type": "DWC_ARCHIVE", 
            "url": "http://plazi.cs.umb.edu/GgServer/dwca/23A1465DDF212F7DA589F41341B83FCC.zip", 
            "createdBy": "plazi", 
            "modifiedBy": "plazi", 
            "created": "2014-06-28T12:55:54.604+0000", 
            "modified": "2014-06-28T12:55:54.604+0000", 
            "machineTags": [] }], 
            "machineTags": [...], "tags": [], 
            "identifiers": [{ "key": 23594, 
            "type": "UUID", 
            "identifier": "23A1465DDF212F7DA589F41341B83FCC", 
            "createdBy": "plazi", 
            "created": "2014-06-28T12:55:54.334+0000" }], 
            "comments": [], 
            "bibliographicCitations": [], 
            "curatorialUnits": [], 
            "taxonomicCoverages": [], 
            "geographicCoverages": [], 
            "temporalCoverages": [], 
            "keywordCollections": [], 
            "countryCoverage": [], 
            "collections": [], 
            "dataDescriptions": [] 
        }
    ] 
} 

Entries of interest:

  • endOfRecords: if false, increasing offset will return further datasets
  • count: total number of available Plazi datasets
  • results.endpoints.url: the URL of the DwC-Archive containing the data of the dataset
  • results.identifiers.identifier: the UUID of the dataset
  • results.homepage: the URL of an HTML page listing the taxonomic treatments whose data is contained in the DwC-Archive
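Paging through the listing and picking out the entries of interest can be sketched in Python as follows. The sketch relies only on the fields shown in the sample response (results, endOfRecords, endpoints, identifiers, homepage). The function names and the fetch parameter, which lets a caller substitute a stand-in for the network request, are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DATASETS_URL = ("http://api.gbif.org/v1/organization/"
                "7ce8aef0-9e92-11dc-8738-b8a03c50a862/publishedDataset")


def fetch_page(offset, limit):
    """Request one page of the dataset listing and parse the JSON body."""
    url = DATASETS_URL + "?" + urlencode({"limit": limit, "offset": offset})
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_datasets(limit=20, fetch=fetch_page):
    """Yield every dataset, raising offset by limit until endOfRecords."""
    offset = 0
    while True:
        page = fetch(offset, limit)
        yield from page["results"]
        # Also stop on an empty page so a bad response cannot loop forever.
        if page["endOfRecords"] or not page["results"]:
            break
        offset += limit


def archive_info(dataset):
    """Pick the entries of interest out of one element of results."""
    return {
        "archive_urls": [e["url"] for e in dataset.get("endpoints", [])
                         if e.get("type") == "DWC_ARCHIVE"],
        "uuids": [i["identifier"] for i in dataset.get("identifiers", [])
                  if i.get("type") == "UUID"],
        "homepage": dataset.get("homepage"),
    }
```

A harvester would typically loop over iter_datasets(), call archive_info on each dataset, and download every URL in archive_urls.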

References

  1. Plazi http://plazi.org
  2. Biodiversity Literature Repository. https://zenodo.org/collection/user-biosyslit
  3. Catapano T. 2010. TaxPub: An Extension of the NLM/NCBI Journal Publishing DTD for Taxonomic Descriptions. Proceedings of the Journal Article Tag Suite Conference 2010
  4. Darwin Core Archive

Appendix: Darwin Core Archive Content

taxa.txt

http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + .taxon for taxon, treatment ID + .syn for new junior synonyms
http://rs.tdwg.org/dwc/terms/namePublishedIn: reference string of original description
http://rs.tdwg.org/dwc/terms/acceptedNameUsageID: blank, except for new junior synonyms
http://rs.tdwg.org/dwc/terms/parentNameUsageID: blank
http://rs.tdwg.org/dwc/terms/originalNameUsageID: blank
http://rs.tdwg.org/dwc/terms/kingdom: taxon@kingdom
http://rs.tdwg.org/dwc/terms/phylum: taxon@phylum
http://rs.tdwg.org/dwc/terms/class: taxon@class
http://rs.tdwg.org/dwc/terms/order: taxon@order
http://rs.tdwg.org/dwc/terms/family: taxon@family
http://rs.tdwg.org/dwc/terms/genus: taxon@genus
http://rs.tdwg.org/dwc/terms/taxonRank: taxon@rank
http://rs.tdwg.org/dwc/terms/scientificName: taxon name
http://rs.tdwg.org/dwc/terms/taxonomicStatus: blank, except for new junior synonyms, which get "synonym", or "homotypicSynonym" if we have a syntype
http://rs.tdwg.org/dwc/terms/nomenclaturalStatus: blank
http://purl.org/dc/terms/references: HTTP URI of treatment

occurrences.txt

http://rs.tdwg.org/dwc/terms/occurrenceID: treatment UUID + ".mc." + materials citation ID
http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + ".taxon", referencing taxa.txt
http://rs.tdwg.org/dwc/terms/catalogNumber: mc@specimenCode (explode to one record per specimen code if possible)
http://rs.tdwg.org/dwc/terms/collectionCode: mc@collectionCode (explode to one record per collection code if possible)
http://rs.tdwg.org/dwc/terms/institutionCode: blank
http://rs.tdwg.org/dwc/terms/typeStatus: mc@typeStatus (blank if none given)
http://rs.gbif.org/terms/1.0/verbatimLabel: mc text
http://rs.tdwg.org/dwc/terms/sex: mc@sex (also other specimen types like "queen", "worker", etc.)
http://rs.tdwg.org/dwc/terms/individualCount: mc@specimenCount (explode things like "5 workers, 2 females" to one record per typified specimen count if possible)
http://rs.tdwg.org/dwc/terms/eventDate: mc@collectingDate
http://rs.tdwg.org/dwc/terms/recordedBy: mc@collectorName
http://rs.tdwg.org/dwc/terms/recordNumber: blank
http://rs.tdwg.org/dwc/terms/decimalLatitude: mc@latitude
http://rs.tdwg.org/dwc/terms/decimalLongitude: mc@longitude
http://rs.tdwg.org/dwc/terms/minimumElevationInMeters: mc@elevation, or mc@elevationMin if given
http://rs.tdwg.org/dwc/terms/maximumElevationInMeters: mc@elevationMax if given
http://rs.tdwg.org/dwc/terms/country: mc@collectingCountry
http://rs.tdwg.org/dwc/terms/stateProvince: mc@stateProvince or mc@collectingRegion
http://rs.tdwg.org/dwc/terms/municipality: mc@collectingMunicipality
http://rs.tdwg.org/dwc/terms/locality: mc@location
http://purl.org/dc/terms/references: HTTP URI of treatment
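Because occurrenceID and taxonID are both built from the treatment UUID, an occurrence row can be traced back to its treatment and its core row without a lookup table. A small Python sketch follows, assuming that the literal separator .mc. appears in the identifier as described above; the function names and the shape of the materials citation ID are hypothetical.

```python
def split_occurrence_id(occurrence_id):
    """Split '<treatment UUID>.mc.<materials citation ID>' into its two parts."""
    treatment_uuid, separator, citation_id = occurrence_id.partition(".mc.")
    if not (treatment_uuid and separator and citation_id):
        raise ValueError(f"unexpected occurrenceID: {occurrence_id!r}")
    return treatment_uuid, citation_id


def taxon_id_for(occurrence_id):
    """Return the taxonID (treatment UUID + '.taxon') the occurrence refers to."""
    treatment_uuid, _ = split_occurrence_id(occurrence_id)
    return treatment_uuid + ".taxon"
```

The value returned by taxon_id_for should equal the taxonID column of the same occurrences.txt row, which makes it usable as a consistency check when joining the extension to taxa.txt.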

description.txt

http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + ".taxon", referencing taxa.txt
http://purl.org/dc/terms/type: subSubSection@type
http://purl.org/dc/terms/description: subSubSection text
http://purl.org/dc/terms/language: blank (except if we have language detection (might be reusable from spell checker))
http://purl.org/dc/terms/source: article citation

distribution.txt

http://rs.tdwg.org/dwc/terms/locationID: treatment UUID + "." + location UUID
http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + .taxon, referencing taxa.txt
http://rs.tdwg.org/dwc/terms/country: mc@collectingCountry
http://rs.tdwg.org/dwc/terms/locality: mc@location
http://rs.tdwg.org/dwc/terms/occurrenceStatus: mc@typeStatus

media.txt

http://purl.org/dc/terms/identifier: treatment UUID + .text
http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + .taxon, referencing taxa.txt
http://purl.org/dc/terms/type: http://purl.org/dc/dcmitype/Text
http://iptc.org/std/Iptc4xmpExt/1.0/xmlns/CVterm: "http://rs.tdwg.org/ontology/voc/SPMInfoItems#GeneralDescription"
http://purl.org/dc/terms/format: text/html
http://purl.org/dc/terms/title: taxon + author + year
http://purl.org/dc/terms/description: treatment HTML
http://rs.tdwg.org/dwc/terms/additionalInformationURL: treatment HTTP URI
http://ns.adobe.com/xap/1.0/rights/UsageTerms: Public Domain
http://purl.org/dc/terms/rights: No known copyright restrictions apply. See Agosti, D., Egloff, W., 2009. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2009, 2:53 for further explanation.
http://ns.adobe.com/xap/1.0/rights/Owner: blank
http://purl.org/dc/terms/contributor: ((Pensoft|Zootaxa) via )?Plazi
http://purl.org/dc/terms/creator: author list, semicolon separated
http://purl.org/dc/terms/bibliographicCitation: bibliographic reference string

references.txt

http://purl.org/dc/terms/identifier: treatment UUID + .ref for article (treatment) reference, cited treatment ID (from treatmentCitation@httpUri) + .ref for original description reference
http://rs.tdwg.org/dwc/terms/taxonID: treatment ID + .taxon, referencing taxa.txt
http://eol.org/schema/reference/publicationType: bibRef@type
http://eol.org/schema/reference/full_reference: reference text
http://eol.org/schema/reference/primaryTitle: bibRef@title
http://purl.org/dc/terms/title: bibRef@journal or bibRef@volumeTitle
http://purl.org/ontology/bibo/pages: blank
http://purl.org/ontology/bibo/pageStart: treatment first page
http://purl.org/ontology/bibo/pageEnd: treatment last page
http://purl.org/ontology/bibo/journal: bibRef@journal
http://purl.org/ontology/bibo/volume: bibRef@part
http://purl.org/dc/terms/publisher: bibRef@publisher
http://purl.org/ontology/bibo/authorList: bibRef@author, semicolon separated
http://purl.org/ontology/bibo/editorList: bibRef@editor, semicolon separated
http://purl.org/dc/terms/created: bibRef@year
http://purl.org/dc/terms/language: blank
http://purl.org/ontology/bibo/uri: bibRef@URL, if available
http://purl.org/ontology/bibo/doi: bibRef@DOI, if available

vernaculars.txt

http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + .taxon, referencing taxa.txt
http://purl.org/dc/terms/language: en
http://rs.tdwg.org/dwc/terms/vernacularName: vernacular name

Notes

  • Plazi background documents
  • Download the description as PDF
  • Support and Questions: Please contact our support with any questions
  • Version: 20150223