Processing the literature requires several independent steps. Each step requires its own software package; most of these packages are freeware.
First step: from the PDF to the HTML document
Processing the PDF requires optical character recognition (OCR) software. We use ABBYY FineReader, version 8 or later. This is commercial software and very well documented. When saving the OCR result, keep in mind that we need an HTML document. In the saving options, choose to save only the text and select compatibility with all browsers. Keep the line breaks and insert a line for page breaks. Discard the images.
For more information on the OCR process see also antbase.org.
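The saving options above can be spot-checked on the exported file before moving on. The following is a minimal sketch in Python, not part of the original workflow. It assumes that the exported HTML marks page breaks with <hr> tags and kept line breaks with <br> tags, which may differ depending on the FineReader version and settings; the function name check_ocr_export is hypothetical.

```python
from html.parser import HTMLParser


class OcrExportCheck(HTMLParser):
    """Count the tags that reflect the OCR saving options."""

    def __init__(self):
        super().__init__()
        self.images = 0       # <img> tags; expected to be 0 if images were discarded
        self.page_breaks = 0  # <hr> tags; ASSUMED to be the page-break line
        self.line_breaks = 0  # <br> tags; ASSUMED to be the kept line breaks

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names and also routes <br/> through here.
        if tag == "img":
            self.images += 1
        elif tag == "hr":
            self.page_breaks += 1
        elif tag == "br":
            self.line_breaks += 1


def check_ocr_export(html_text):
    """Return tag counts for an OCR-exported HTML document (hypothetical helper)."""
    checker = OcrExportCheck()
    checker.feed(html_text)
    checker.close()
    return {
        "images": checker.images,
        "page_breaks": checker.page_breaks,
        "line_breaks": checker.line_breaks,
    }


# Made-up sample input, only to illustrate the call.
sample = (
    "<html><body><p>First line<br>second line</p>"
    "<hr><p>Next page<br/></p></body></html>"
)
report = check_ocr_export(sample)
print(report)
```

A non-zero image count, or a page-break count of zero for a multi-page PDF, would suggest that the document should be saved again with the options described above.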
Second step: from the HTML to the XML document
This step, the most essential in the whole workflow, requires the GoldenGATE Document Editor, which was developed by Guido Sautter and is freeware. The editor is written in Java, so you should download and install the latest version of the Java Runtime Environment (www.java.com), which is also freeware. For markup instructions, see the GoldenGATE manual at Plazi.org. To open GoldenGATE, download the program, install it, and start it through GoldenGateStarter.jar.
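If double-clicking GoldenGateStarter.jar does not start the editor, a JAR file can usually be started from the command line with "java -jar". The following Python sketch wraps that call and first checks that a Java runtime is on the PATH. The installation directory and the helper names are hypothetical; only the file name GoldenGateStarter.jar comes from the instructions above.

```python
import shutil
import subprocess
from pathlib import Path

STARTER_JAR = "GoldenGateStarter.jar"  # file name taken from the instructions above


def build_launch_command(install_dir, java="java"):
    """Return the command list that starts the starter JAR in install_dir."""
    return [java, "-jar", str(Path(install_dir) / STARTER_JAR)]


def launch_goldengate(install_dir):
    """Start GoldenGATE from install_dir (a hypothetical installation path)."""
    if shutil.which("java") is None:
        raise RuntimeError("No 'java' executable on PATH; install the Java Runtime Environment first.")
    jar = Path(install_dir) / STARTER_JAR
    if not jar.is_file():
        raise FileNotFoundError(f"{jar} not found; check the installation directory.")
    # Run from the installation directory so relative resources resolve there.
    return subprocess.Popen(build_launch_command(install_dir), cwd=str(install_dir))


if __name__ == "__main__":
    # "GoldenGATE" is a placeholder; replace it with your own installation directory.
    print(" ".join(build_launch_command("GoldenGATE")))
```

The two checks mirror the prerequisites named in this step: an installed Java Runtime Environment and a downloaded GoldenGATE installation.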
Third step: validating the XML document against the TaxonX Schema
Once the XML document has been created and saved in its XSLT-transformed form, you should validate it against the TaxonX XML Schema. XMLHammer is a free validation tool; download it at www.xmlhammer.org. The TaxonX Schema is available at Plazi.org.
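A document can only be schema-valid if it is well-formed XML in the first place, so a quick well-formedness check can save a round trip to the validator. The sketch below uses only the Python standard library, which cannot validate against an XML Schema; it is a pre-check and does not replace validation against the TaxonX Schema. The helper name and the element names in the sample strings are placeholders, not TaxonX elements.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, root tag) if xml_text parses, else (False, parser message)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column of the first error.
        return False, str(err)
    return True, root.tag


# Placeholder documents; the element names are NOT taken from the TaxonX Schema.
good = "<doc><part>text</part></doc>"
bad = "<doc><part>text</doc>"  # <part> is never closed

for name, text in (("good", good), ("bad", bad)):
    ok, detail = check_well_formed(text)
    print(name, "well-formed" if ok else "not well-formed", "-", detail)
```

If the check fails, fix the reported position in the XML document before running the schema validation.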
Fourth step: uploading the valid XML document to the Search & Retrieval Server (SRS)
The upload to the SRS is done via GoldenGATE (option "Upload Document to SRS" in the File menu). Make sure that you upload only valid TaxonX (tx) documents; the minimum markup level is tx1. To create an account, contact Guido Sautter (firstname.lastname@example.org) or use this form, and you will receive the log-in data.