<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>DSpace Community: Plazi.org Archive</title>
    <link>http://hdl.handle.net/10199/15386</link>
    <description>Archive of Documents Produced by Plazi.org</description>
    <image>
      <title>The Channel Image</title>
      <url>http://plazi.org:8080/dspace/retrieve/16673</url>
      <link>http://hdl.handle.net/10199/15386</link>
    </image>
    <textInput>
      <title>The Community's search engine</title>
      <description>Search the Channel</description>
      <name>search</name>
      <link>http://plazi.org:8080/dspace/simple-search</link>
    </textInput>
    <item>
      <title>A New Approach towards Bibliographic Reference Identification, Parsing and Inline Citation Matching</title>
      <link>http://hdl.handle.net/10199/19094</link>
      <description>Title: A New Approach towards Bibliographic Reference Identification, Parsing and Inline Citation Matching
&lt;br/&gt;
&lt;br/&gt;Authors: Gupta, Deepank; Morris, Bob; Catapano, Terry; Sautter, Guido
&lt;br/&gt;
&lt;br/&gt;Abstract: A number of algorithms and approaches have been proposed towards the problem of scanning and digitizing research papers. We can classify work done in the past into three major approaches: regular-expression-based heuristics, learning-based algorithms and knowledge-based systems. Our findings point to the inadequacy of existing open-source solutions such as Paracite for papers with “micro-citations” in various European languages. This paper describes the work done as part of the Google Summer of Code 2008 using a combination of regular-expression-based heuristics and knowledge-based systems to develop a system which matches inline citations to their corresponding bibliographic references and identifies and extracts metadata from references. The description, implementation and results of our approach are presented here. Our approach enhances the accuracy and provides better recognition rates.</description>
      <pubDate>Sat, 18 Jul 2009 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Creating digital resources from legacy documents – an experience report from the biosystematics domain.</title>
      <link>http://hdl.handle.net/10199/19093</link>
      <description>Title: Creating digital resources from legacy documents – an experience report from the biosystematics domain.
&lt;br/&gt;
&lt;br/&gt;Authors: Sautter, Guido; Agosti, Donat; Böhm, Klemens; Klingenberg, Christiana
&lt;br/&gt;
&lt;br/&gt;Abstract: Digitized legacy documents marked up with XML can be used in many ways, e.g., to generate RDF statements about the world described. A prerequisite for doing so is that the document markup is of sufficient quality. Since fully automated markup-generation methods cannot ensure this, manual corrections and cleaning are indispensable. In this paper, we report on our experiences from a digitization and markup project for a large corpus of legacy documents from the biosystematics domain, with a focus on the use of modern tools. The markup created covers both document structure and semantic details. In contrast to previous markup projects reported in the literature, our corpus consists of large publications that comprise many different semantic units, and the documents contain OCR noise and layout artifacts. A core insight is that digitization and automated markup on the one hand and manual cleaning and correction on the other hand should be tightly interleaved, and that tools supporting this integration yield a significant improvement.</description>
      <pubDate>Sat, 30 May 2009 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Taxonomic information exchange and copyright: the Plazi approach</title>
      <link>http://hdl.handle.net/10199/19092</link>
      <description>Title: Taxonomic information exchange and copyright: the Plazi approach
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat; Egloff, Willi
&lt;br/&gt;
&lt;br/&gt;Abstract: Background: A large part of our knowledge on the world's species is recorded in the corpus of
biodiversity literature, with well over a hundred million pages, and is represented in natural history
collections estimated at 2 – 3 billion specimens. But this body of knowledge is almost entirely in
paper-print form and is not directly accessible through the Internet. For the digitization of this
literature, new territories have to be charted in the fields of technical, legal and social issues that
presently impede its advance. The taxonomic literature seems especially destined for such a
transformation.
Discussion: Plazi was founded as an association with the primary goal of transforming both the
printed and, more recently, "born-digital" taxonomic literature into semantically enabled, enhanced
documents. This includes the creation of a test body of literature, an XML schema modeling its logical
content (TaxonX), the development of a mark-up editor (GoldenGATE) that also allows the
enhancement of documents with links to external resources via Life Science Identifiers (LSID), a
repository for publications and issuance of bibliographic identifiers, a dedicated server to serve the
marked-up content (the Plazi Search and Retrieval Server, SRS) and semantic tools to mine
information. Plazi's workflow is designed to respect copyright protection and achieves extraction
by observing exceptions and limitations existing in international copyright law.
Conclusion: The information found in Plazi's databases – taxonomic treatments as well as the
metadata of the publications – is in the public domain and can therefore be used for further
scientific research without any restriction, whether or not contained in copyrighted publications.</description>
      <pubDate>Sun, 29 Mar 2009 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Semantic Web, Scientific Publications and Machine Generated Hypotheses: Will Machines do Part of Our Job in the Future?</title>
      <link>http://hdl.handle.net/10199/19084</link>
      <description>Title: Semantic Web, Scientific Publications and Machine Generated Hypotheses: Will Machines do Part of Our Job in the Future?
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat
&lt;br/&gt;
&lt;br/&gt;Abstract: An introductory note on the potential of semantic, enhanced publications, what it means, and what is needed to get a respective body and research based on this new tool.</description>
      <pubDate>Tue, 02 Dec 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>XML: the gateway to state of the art taxonomic communication</title>
      <link>http://hdl.handle.net/10199/19083</link>
      <description>Title: XML: the gateway to state of the art taxonomic communication
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat; Egloff, Willi
&lt;br/&gt;
&lt;br/&gt;Abstract: The description of the estimated known 1.8M species on planet Earth spans well
over 250 years in a corpus of over 200M printed pages, scattered in a tremendous
number of libraries around the world. There is one little caveat: it is not really known
what's in this corpus. There is no bibliography or list of the taxa described, nor is it
known where respective copies are housed. Today PDFs begin to complement printed
papers naming the well over 20,000 species described every year, but this alone is not an
adequate response, given the advancement of the Web, which increasingly allows
machines to do the initial data accumulation and analysis.
XML (eXtensible Markup Language) is one way to produce documents whose logical
content can be understood by machines, and represented as text that does not depend
on specific software to be opened. It can be enhanced with links to external sources,
normalized to conform to domain-specific ontologies and can easily be converted to
print or PDF versions.
XML is flexible. It can be adapted to particular domains by creating a body of defined
elements (a schema) that models the logical content, such as those for taxonomic
names, treatments, or materials citations. Once a document is converted into XML, the
content can be stored in databases, and services can be built to mine the database or
the entirety of individual documents, or to search for and extract these documents.
Since machines can do this work, potentially the entire corpus of taxonomic literature
could turn into a huge database, even supporting XML-based annotations that capture
the current state of taxonomic knowledge as it evolved since the original publication.
Plazi is an organization that developed a first system that includes the entire production
from scanning to mark-up, enhanced to expose the logical content of taxonomic
publications. Paradoxically, the very existence of such a system has made it more
obvious that mark-up of legacy publications is cumbersome, to say the least.
The future will be in creating XML scientific documents upfront from respective
databases, including all the external links, and in its most advanced form these documents
will constitute complex digital objects in which all the sources are embedded, and from
which the scientific knowledge is more rapidly and easily extracted for analysis,
synthesis, and commentary.</description>
      <pubDate>Mon, 01 Dec 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Plazi: Providing access to taxonomic literature through semantic enhancements and exposure of treatments. An overview</title>
      <link>http://hdl.handle.net/10199/19079</link>
      <description>Title: Plazi: Providing access to taxonomic literature through semantic enhancements and exposure of treatments. An overview
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat; Catapano, Terry; Sautter, Guido</description>
      <pubDate>Tue, 28 Oct 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Plazi: Access to taxonomic literature – steps into the future of communicating and sharing taxonomic knowledge</title>
      <link>http://hdl.handle.net/10199/19078</link>
      <description>Title: Plazi: Access to taxonomic literature – steps into the future of communicating and sharing taxonomic knowledge
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat; Catapano, Terry; Sautter, Guido</description>
      <pubDate>Wed, 29 Oct 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Access To Taxonomic Descriptions: Protologs Are Not Protected By Copyright.</title>
      <link>http://hdl.handle.net/10199/19077</link>
      <description>Title: Access To Taxonomic Descriptions: Protologs Are Not Protected By Copyright.
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat; Egloff, Willi
&lt;br/&gt;
&lt;br/&gt;Abstract: Verbatim descriptions of new taxa (protologs) are an integral part of the formal descriptions of new taxa and requested by the Code. Descriptions are in a very specific standardized language, in a specific standardized form, with the objective of specifying the recognition of a taxon and separating it from others. Descriptions are part of a well established tradition of what characters have to be described and are based on a listing of facts, whose conformity is often reinforced by peer-review. Furthermore, protologs are part of a much larger body of re-descriptions, a body of literature that might be at least 10 times larger and often includes much more detailed, and by nature more recent re-descriptions. Descriptions are not unique and not special in the sense of individuality needed to qualify as work in the legal sense, and thus cannot be protected by copyright law, in the sense of the Berne Convention. Copyright legislation is national but is based on the Berne Convention for the Protection of Literary and Artistic Works (1) which defines a minimal standard. This international copyright standard does not require the recognition of descriptions as works; it is therefore not an obstacle to open access to descriptions of new taxa.</description>
      <pubDate>Sun, 24 Aug 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Plazi: Access to Taxonomic Literature</title>
      <link>http://hdl.handle.net/10199/16667</link>
      <description>Title: Plazi: Access to Taxonomic Literature
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat
&lt;br/&gt;
&lt;br/&gt;Abstract: Presentation to be given by Brian Fisher at the Global Ant Project launch at Harvard, May 28, 2008</description>
      <pubDate>Tue, 27 May 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Plazi.org: A service to provide open access to the content of the published taxonomic literature</title>
      <link>http://hdl.handle.net/10199/16664</link>
      <description>Title: Plazi.org: A service to provide open access to the content of the published taxonomic literature
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat; Catapano, Terry; Klingenberg, Christiana; Sautter, Guido; Egloff, Willi</description>
      <pubDate>Fri, 16 May 2008 17:36:21 GMT</pubDate>
    </item>
    <item>
      <title>Plazi.org: A service to provide access to the content of the published taxonomic literature</title>
      <link>http://hdl.handle.net/10199/16663</link>
      <description>Title: Plazi.org: A service to provide access to the content of the published taxonomic literature
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat; Catapano, Terry; Klingenberg, Christiana; Sautter, Guido; Egloff, Willi</description>
      <pubDate>Mon, 12 May 2008 13:17:13 GMT</pubDate>
    </item>
    <item>
      <title>Taxongrab: Extracting taxonomic names from text</title>
      <link>http://hdl.handle.net/10199/15444</link>
      <description>Title: Taxongrab: Extracting taxonomic names from text
&lt;br/&gt;
&lt;br/&gt;Authors: Koning, Drew; Sarkar, Indra Neil; Moritz, Thomas
&lt;br/&gt;
&lt;br/&gt;Abstract: Identification of organism names in biological texts is essential for the management of archival resources to
facilitate comparative biological investigation. Because organism nomenclature conforms closely to prescribed rules,
automated techniques may be useful for identifying organism names from existing documents, and may also support the
completion of comprehensive indices of taxonomic names; such comprehensive lists are not yet available. Using a
combination of contextual rules and a language lexicon, we have developed a set of simple computational techniques for
extracting taxonomic names from biological text. Our proposed method consistently performs at greater than 96% precision
and 94% recall, and at a much higher speed than manual extraction techniques. An implementation of the described method
is available as a Web-based tool written in PHP. Additionally, the PHP source code is available from SourceForge:
http://sourceforge.net/projects/taxongrab, and the project website is http://research.amnh.org/informatics/taxlit/apps/.</description>
      <pubDate>Fri, 29 Oct 2004 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>A combining approach to find all taxon names (FAT) in legacy biosystematics literature.</title>
      <link>http://hdl.handle.net/10199/15443</link>
      <description>Title: A combining approach to find all taxon names (FAT) in legacy biosystematics literature.
&lt;br/&gt;
&lt;br/&gt;Authors: Sautter, Guido; Böhm, Klemens; Agosti, Donat
&lt;br/&gt;
&lt;br/&gt;Abstract: Most of the literature on natural history is hidden in millions of pages stacked up in our
libraries. Various initiatives now aim at making these publications digitally accessible and
searchable, applying XML markup technologies. The unique biological names play a crucial role in
linking content related to a particular taxon. Thus discovering and marking them up is extremely
important. Since their manual extraction and markup is cumbersome and time-intensive, it needs
to be automated. In this paper, we present computational linguistics techniques and evaluate how
they can help to extract taxonomic names automatically. We build on an existing approach for
extraction of such names (Koning et al. 2005) and combine it with several other learning
techniques. We apply them to the texts sequentially so that each technique can use the results from
the preceding ones. In particular, we use structural rules, dynamic lexica with fuzzy lookups, and
word-level language recognition. We use legacy documents from different sources and times as
a test bed for our evaluation. The experimental results for our combining approach (FAT) show
greater than 99% precision and recall. They reveal the potential of computational linguistics
techniques towards an automated markup of biosystematics publications.</description>
      <pubDate>Sat, 29 Oct 2005 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>A Quantitative Comparison of XML Schemas for Taxonomic Publications.</title>
      <link>http://hdl.handle.net/10199/15442</link>
      <description>Title: A Quantitative Comparison of XML Schemas for Taxonomic Publications.
&lt;br/&gt;
&lt;br/&gt;Authors: Sautter, Guido; Böhm, Klemens; Agosti, Donat
&lt;br/&gt;
&lt;br/&gt;Abstract: Large numbers of legacy taxonomic publications are currently being digitized to make
them available online and ready for full-text search. The documents are being marked up with XML for
two purposes: to preserve the document structure, and to facilitate access via standard query languages
like XQuery. With regard to the second aspect, the choice of an appropriate XML schema is crucial. It
affects both query performance and the correctness of query results. Over the last few years, several
different XML schemas have been proposed as markup standards for taxonomic publications. In this
paper, we report on a thorough evaluation and comparison of these schemas. We have examined whether they
facilitate formulation and correct processing of queries that are common when it comes to taxonomic
literature. We also compare the performance of these queries on documents that are marked up with the
different schemas. Finally, we propose extensions to the schemas that enhance correctness of query
results.</description>
      <pubDate>Sun, 29 Oct 2006 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Why not let the computer save you time by reading the taxonomic paper for you?</title>
      <link>http://hdl.handle.net/10199/15441</link>
      <description>Title: Why not let the computer save you time by reading the taxonomic paper for you?
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat; Klingenberg, Christiana; Sautter, Guido; Johnson, Norman F.; Stephenson, Christie; Catapano, Terry
&lt;br/&gt;
&lt;br/&gt;Abstract: That computers can read and analyze our systematic publications is rapidly becoming reality. Plazi.org is one of the first integrated systems. As a pilot system, it not only fosters the development of novel tools to read and understand publications (name-searching algorithms like FAT (SAUTTER et al., 2006), the GoldenGATE editor), but also shows the enormous amount of work needed to convert a very heterogeneous body of published work. Clearly, it could be argued that part of the time gained by having our publications marked up and available on dedicated servers would be worth a collaboration with the users, who would spend a fraction of this gained time on the conversion process.
More importantly, the expertise and development in the workflow of taxonomists point out that in future we need to add the mark-up during the preparation of the manuscript, and most likely make the publications as such an integral part of the construction of global datasets, such as character data matrices, ZooBank, and specimen databases.</description>
      <pubDate>Wed, 28 Nov 2007 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Encyclopedia of life: should species description equal gene sequence?</title>
      <link>http://hdl.handle.net/10199/15440</link>
      <description>Title: Encyclopedia of life: should species description equal gene sequence?
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat</description>
      <pubDate>Thu, 29 May 2003 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>The future of taxonomic discovery and information exchange: Access to data and information</title>
      <link>http://hdl.handle.net/10199/15439</link>
      <description>Title: The future of taxonomic discovery and information exchange: Access to data and information
&lt;br/&gt;
&lt;br/&gt;Authors: Agosti, Donat
&lt;br/&gt;
&lt;br/&gt;Abstract: Current, near-future and long-term scientific communication (publications) in taxonomy is discussed. plazi.org is used to demonstrate the conversion of legacy publications, and to illustrate existing barriers in the technical, social and legal domains. Ongoing work with publishers and NLM to create upfront marked-up publications is mentioned.</description>
      <pubDate>Tue, 19 Feb 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Plazi.org: Using DSpace as a Repository of Species Descriptions</title>
      <link>http://hdl.handle.net/10199/15388</link>
      <description>Title: Plazi.org: Using DSpace as a Repository of Species Descriptions
&lt;br/&gt;
&lt;br/&gt;Authors: Catapano, Terry; Agosti, Donat; Sautter, Guido
&lt;br/&gt;
&lt;br/&gt;Abstract: The goal of the DSpace installation at plazi.org is to demonstrate how the corpus of texts covering the descriptions of the world's species can be assembled into a digital repository for stable, long-term access. In this presentation we will focus on our deployment of DSpace working in combination with a community-based text mark-up tool (resulting in an XML-encoded version of the original scanned or electronically published document) as well as a web service allowing the extraction of individual descriptions from within the body of publications.

The published record of biological systematics, including the descriptions of the world's 1.8 million species, has some unique characteristics. The scientific naming of species is regulated by Codes and thus the publications are quasi-legal documents. Descriptions remain relevant for a very long time, even if they are complemented by more comprehensive ones. Additionally, access to existing descriptions is vital for understanding not only the 1.8 million known species, but also the yet-to-be-described 20+ million. Valid treatments for animals, for example, date back to 1758 and include perhaps more than 10M pages, almost all of which are available only in hard copy. Taxonomic treatments are also highly structured documents and very rich in data. A wealth of important morphological descriptions and data, geographic distribution data, bibliographical references, and more resides latent in the taxonomic literature.

Items in the repository are made up of several files. A PDF is usually available, but in many (given enough time and resources, all) cases another representation of the publication, encoded in the XML schema TaxonX, is provided. The encoding opens up the treatments, exposing the data contained within to extraction, data mining, analysis, and a variety of other purposes. Since the mark-up process is slow and expensive and requires knowledge of the systematics domain, a community mark-up server is added, so that interested parties can not only upload new PDF documents, but also download and enhance the documents in discrete, well-defined steps towards valid TaxonX documents. Similarly, other applications can build upon the foundation provided by the DSpace repository, such as a search/retrieval interface oriented towards the needs of the systematics domain, and integrate into the wider and growing systematics, conservation, and biodiversity cyberinfrastructure.</description>
      <pubDate>Thu, 18 Oct 2007 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>

