URL: https://github.com/gsautter/idaho-imagemarkup

License: BSD derivative

Dependencies (direct):

- icepdf-core.jar (by ICEsoft Technologies Canada, Corp.)

- ImageMagick (http://imagemagick.org/script/index.php)

- TesseractOCR (wrapped, github.com/tessunleashed/tesseract-ocr-unleashed)

- idaho-core

- idaho-extensions

Builds:

- ImageMarkup.jar (Image Markup data model; GAMTA wrapper for the Image Markup data model, which allows the markup generation tools that exist for GAMTA to be applied to Image Markup documents; IO facilities for Image Markup documents; Java Swing widgets for displaying Image Markup documents and individual pages; various utility libraries and widgets)

- ImageMarkupOCR.jar (OCR engine for Image Markup documents, which adds words to pages by running OCR on the page images; the actual OCR is performed by TesseractOCR (written in C/C++), with Java integration and communication via process IO streams)

- ImageMarkupOCR.bin.jar (native binaries of TesseractOCR, compiled for Windows, Linux, and macOS, plus the language data used by TesseractOCR; packaged as a separate JAR to reduce download volume on differential updates)

- ImageMarkupPDF.jar (facilities for converting PDF documents into the Image Markup data model, including OCR for scanned PDFs, decoding of embedded glyph-based fonts for born-digital PDFs, and extraction of figures embedded as images; uses ImageMagick via its command line interface to convert images embedded in PDFs in arbitrary formats to PNG for further processing)

- ImageMarkupPDF.bin.jar (native binaries of ImageMagick, compiled for Windows, Linux, and macOS; packaged as a separate JAR to reduce download volume on differential updates)
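
The OCR and PDF builds above both drive an external native tool from Java, either over process IO streams (TesseractOCR) or through a command line call (ImageMagick). The following is a minimal sketch of that general pattern using only the standard java.lang.ProcessBuilder API. It is not taken from the project's sources: the class ExternalToolSketch, its method names, and the file names in the usage note are hypothetical, and the executable name passed in is an assumption (ImageMagick installations differ in whether the tool is called "convert" or "magick").

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, for illustration only; not part of idaho-imagemarkup.
public class ExternalToolSketch {

    // Builds an ImageMagick-style command line: <executable> <input> png:<output>.
    // The "png:" prefix asks ImageMagick for PNG output regardless of the file extension.
    public static List<String> pngConvertCommand(String executable, Path input, Path output) {
        return List.of(executable, input.toString(), "png:" + output.toString());
    }

    // Starts the command, feeds 'input' to its stdin, and returns all bytes it writes to stdout.
    public static byte[] pipeThrough(List<String> command, byte[] input)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass the child's stderr through to our own stderr so it does not mix into the result.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Write stdin on a separate thread: if the child fills its stdout pipe
        // before it has read all input, a single-threaded write would block forever.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(input);
            } catch (IOException e) {
                // The child may exit before consuming all input; nothing more to send.
            }
        });
        writer.start();

        byte[] output = process.getInputStream().readAllBytes();
        writer.join();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command " + command + " exited with code " + exitCode);
        }
        return output;
    }
}
```

With these assumptions, a caller could run pngConvertCommand("convert", Path.of("figure.jp2"), Path.of("figure.png")) through ProcessBuilder for a file-based conversion, or use pipeThrough when the native tool reads its input from stdin and writes its result to stdout. Writing stdin from a second thread is a design choice to avoid pipe buffer deadlocks with larger page images.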