Optical Character Recognition

For Optical Character Recognition (OCR) we use the latest version of ABBYY FineReader.
See the following page for some hints on using ABBYY and the user patterns we created:

http://plazi.ch/ocr/madagascar_ocr.html

General recommendations

Requirements: MS Windows, ABBYY FineReader version 8.0 or higher

a) Recognition Languages:

Ant taxonomy papers are mainly written in a small set of “default” languages, so we recommend selecting multiple recognition languages for every text. This way no language is missed in the bibliography at the end of recent texts, and the languages do not have to be chosen again for each PDF document.

For ant taxonomy papers we recommend choosing the following languages:

English - French - German - Spanish - Italian - Latin

In addition, select "Anty_Species" and "Anty_Glossary", two ant-specific dictionaries that contain almost all known ant species names and the technical terms used in morphological descriptions (or, for spiders, lexicons_spider_taxa.txt).

b) User Pattern

ABBYY can be trained to create user patterns for text recognition. User patterns become important when old documents are scanned and ABBYY's recognition of the typeface fails. We are currently creating such training files for a number of taxonomic journals; some of them can be downloaded at the end of this page. To use one, download the pattern for the journal in question. Then open your PDF document in ABBYY and save the batch. Save the downloaded user pattern inside the batch folder you just created (via Windows Explorer). The user pattern will then appear in the user pattern window, where you have to activate the pattern of your choice.

c) Saving Options

ABBYY can save the OCRed documents in different file types. For later XML markup with GoldenGATE we recommend saving all pages in HTML format with the following options:

- remove all formatting

- simple (compatible with all browsers)

- keep line breaks

- use solid line as page break (keep page breaks)

- do not save with images
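
The reason for these options shows up in later processing: with line breaks and page breaks kept, the saved file can be split back into pages and lines. The sketch below is a minimal illustration in Python, not part of the ABBYY or GoldenGATE workflow. It assumes that the exported HTML marks page breaks with <hr> and line breaks with <br> or paragraph tags; the names PageSplitter and split_pages are hypothetical.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect lines per page. Assumption: <hr> = page break, <br>/<p> = line break."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pages = [[]]      # list of pages, each a list of text lines
        self._line = []        # text fragments of the line currently being read
        self._in_head = False  # True while inside <head> (title, styles)

    def _end_line(self):
        # Join the fragments, normalize whitespace and store non-empty lines.
        text = " ".join("".join(self._line).split())
        if text:
            self.pages[-1].append(text)
        self._line = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag in ("br", "p"):
            self._end_line()
        elif tag == "hr":
            self._end_line()
            self.pages.append([])  # start a new page

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False
        elif tag == "p":
            self._end_line()

    def handle_data(self, data):
        if not self._in_head:
            self._line.append(data)

    def close(self):
        super().close()
        self._end_line()  # flush text after the last tag


def split_pages(html_text):
    """Return a list of pages, each a list of lines, skipping empty pages."""
    parser = PageSplitter()
    parser.feed(html_text)
    parser.close()
    return [page for page in parser.pages if page]
```

Called on the text of a saved file, split_pages returns one list of lines per page, which makes it easy to check that the page count matches the scanned document.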

We further recommend keeping the saved batches, because wrongly numbered blocks and other errors from the OCR process can only be corrected in the batch.

Workflow:

1. Start ABBYY and open the document. Use PDF, TIFF or any other supported format. We use black-and-white PDFs at 300 dpi.

2. Select the language used in the document (“Read” button, “Options”) and make sure that the languages appearing in the bibliographic citations are selected as well (see also recommendation a).

3. Read all the pages.

4. Check the boxes drawn around the blocks of text in the “Image” window and correct them if necessary. Make sure that each block is set to the right type: text, picture or table. To change the type, move the mouse over the edge of the block, press the right mouse button and use “change block type” to select “picture” or “text”. Blocks can be complex polygons, which can be created manually with the “adds rectangular part to a block” tool in the “Image” window. Make sure that figure, table and other headers are not part of the running text but separate text blocks; otherwise the text blocks can easily get mixed up.

Make sure that pages with two text columns are marked with two separate blocks, numbered in a logical order. ABBYY recognizes the text line by line and cannot distinguish between two text columns.

The correct order of the text blocks is essential: when saving the HTML file, ABBYY writes the text in the order of the blocks.

Re-read those pages where the boxes had to be corrected.

During the reading process you can choose to create or improve a user pattern. To do so, go to “Read”, “Options”, “Training”, “Train user pattern”, close the menu and re-read the block. For male, female and other symbols, we recommend inserting not the symbol's character code but the written-out word in the respective language, which helps to extract content and mark up the text. We recommend [[worker]], [[queen]], [[male]], [[soldier]], and [[…]] for unrecognizable text or symbols.
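
For text that has already been exported with the symbols intact, the same convention can be applied afterwards. The following Python sketch is only an illustration of that idea, not the training procedure described above. The symbol-to-caste table is an assumption and has to be checked against the journal at hand; the names SYMBOL_PLACEHOLDERS and replace_symbols are hypothetical.

```python
# Assumed mapping from symbols to the recommended placeholders.
# Which symbol a given journal uses for which caste is an assumption here.
SYMBOL_PLACEHOLDERS = {
    "\u263F": "[[worker]]",   # ☿
    "\u2640": "[[queen]]",    # ♀
    "\u2642": "[[male]]",     # ♂
    "\u2643": "[[soldier]]",  # ♃
}


def replace_symbols(text):
    """Replace each known symbol in text with its written-out placeholder."""
    for symbol, placeholder in SYMBOL_PLACEHOLDERS.items():
        text = text.replace(symbol, placeholder)
    return text
```

Text without any of the listed symbols is returned unchanged, so the function can be run over every exported page.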

5. Check Spelling.

At this point, all ambiguous characters are marked in blue (or the color you chose in the options menu). To spell check, select only “stop at words with uncertain characters” in the options menu. Stopping at words not found in the dictionary would be better, but unless all taxonomic names have been entered, updating the word list takes a lot of manual work. Similarly, unless a way is found to build a custom dictionary for all the morphological and geographic terms, it may be better to stop only at uncertain characters.
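
As a rough idea of what a dictionary-based check involves, the Python sketch below lists the words of an exported text that are missing from a word list. It works outside ABBYY and is not a substitute for its spell checker. It assumes a plain-text list with one term per line; the names load_terms and unknown_words are hypothetical.

```python
import re


def load_terms(lines):
    """Build a lowercase term set from lines (assumed: one term per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def unknown_words(text, terms):
    """Return the words of text that are not in terms, in order of first appearance."""
    found = []
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text):
        key = word.lower()
        if key not in terms and key not in found:
            found.append(key)
    return found
```

A long result list is a sign that the word list still needs taxonomic names or technical terms, which is the manual work mentioned above.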

One problem is the confusion of the lowercase l with the digit 1 (one). It is best to check these cases and make sure that every wrongly recognized 1 is replaced with a lowercase l.
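
Such cases can be listed for manual review in exported text. The Python sketch below flags tokens in which a digit 1 touches a letter, or a lowercase l touches a digit. It only reports candidates and does not replace anything, since forms such as a figure label can be legitimate. The names SUSPECT and suspect_tokens are hypothetical.

```python
import re

# A digit 1 directly next to a letter, or a lowercase l directly next to a digit.
SUSPECT = re.compile(r"(?<=[^\W\d_])1|1(?=[^\W\d_])|(?<=\d)l|l(?=\d)")


def suspect_tokens(text):
    """Return the whitespace-separated tokens that contain a suspicious l or 1."""
    return [token for token in text.split() if SUSPECT.search(token)]
```

Each reported token still has to be compared with the page image before it is corrected.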

7. Save the batch.

8. Save results as html.

The following settings are recommended (see also recommendation c):

- Save as “HTML Document” and “Create a single file for all pages”

Under “Formats settings”:

- Retain layout: “Remove all formatting”

- Save mode: “Simple (compatible with old browsers)”

- Text settings: “keep line breaks”, “use solid line as page breaks”

- Picture settings: do not select

- Character settings: “Automatic”; Code page type: “Windows”

Download the User Pattern:

As explained above, you can train FineReader so that the character recognition rate is high for all pages or documents of the same type. Constructing such user patterns is relatively time-consuming, so we make available here all the user patterns for scientific journals containing information on the ants of Madagascar (and some other journals). These user patterns can be improved and completed by anyone, but may already help with an initial OCR.

To use one, download the user pattern and save it in the same folder as your batch (you have to create the batch first by opening the PDF and saving the batch in a folder). Then reopen your batch and choose the user pattern you want to use.
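
If many batches have to be prepared, the copy step can be scripted. The Python sketch below copies a downloaded pattern file into an existing batch folder. It is only a convenience under the assumption that the batch is an ordinary folder on disk, as described above; the function name install_user_pattern is hypothetical.

```python
import shutil
from pathlib import Path


def install_user_pattern(pattern_file, batch_folder):
    """Copy pattern_file into batch_folder and return the path of the copy."""
    pattern_file = Path(pattern_file)
    batch_folder = Path(batch_folder)
    # The batch must already exist: open the PDF in ABBYY and save the batch first.
    if not batch_folder.is_dir():
        raise FileNotFoundError(f"Batch folder not found: {batch_folder}")
    return Path(shutil.copy2(pattern_file, batch_folder))
```

After the copy, the batch still has to be reopened in ABBYY and the user pattern activated by hand.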