Our DocGen Track continues …
with explorations in the PDF space. We have been experimenting with a couple of programs that bridge the gap where Tesseract falls short. The first of the two is the newly released Nuance PDF Converter for Mac 3.0, which does an excellent job of deciphering even the poorest-quality scans and provides multilingual support. Although it is a commercial product, it is certainly worth the money. It is a standalone module built on OmniPage, the industry leader in the OCR space for years. It does two things really well: 1) it provides accurate optical character recognition (as we mentioned before), and 2) it creates first-generation interactive PDFs. These pros far outweigh the cons, such as the software being available only on the Mac platform (Jasper is already researching ways to get the program to run on Linux using an emulator).
Interactive PDFs
In short, interactive PDFs provide the online equivalent of paper forms that one can fill out on the computer and send electronically. These PDFs add a layer on top of an image PDF that can include internal hyperlinks to other sections of the document, external links to the World Wide Web or to an object on one’s computer, and form elements that can be filled out, digitally signed, and sent to a designated email address. Pretty neat!
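For anyone who wants to check quickly whether a given PDF carries such a form layer, here is a minimal Python sketch. It relies on one fact from the PDF specification: interactive forms are declared through an `/AcroForm` entry in the document catalog. The function names are hypothetical, and the byte scan is only a heuristic: it can miss forms whose catalog is stored in a compressed object stream, and it says nothing about hyperlinks or signatures.

```python
from pathlib import Path


def looks_interactive(pdf_bytes: bytes) -> bool:
    """Heuristic: report True if the raw bytes mention an /AcroForm entry.

    This does not parse the PDF; it only scans for the name that the
    PDF specification uses to declare an interactive form.
    """
    return b"/AcroForm" in pdf_bytes


def file_looks_interactive(path: str) -> bool:
    """Convenience wrapper (hypothetical name) that reads a file from disk."""
    return looks_interactive(Path(path).read_bytes())
```

A proper PDF library would give a more reliable answer; this is just a quick first pass over a folder of scans.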
Hybrid PDFs
Another interesting program that we are working with is LibreOffice, particularly with its “Hybrid PDF” functionality which delivers on the true promise of a portable document format. A “Hybrid PDF” is a PDF that anyone can view, but which includes the source document embedded within it so that people with modern office suites can also edit it if they want to. You can read more about Hybrid PDFs in this PDF about creating Hybrid PDFs (See also: “The Magic of Editable PDFs” from the author of that document).
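As a rough illustration of how a Hybrid PDF export might be scripted rather than clicked through, the Python sketch below builds a headless LibreOffice command line. It rests on several assumptions we have not verified against our own setup: that the `soffice` binary is on the PATH, that the installed LibreOffice is recent enough to accept JSON filter options on `--convert-to`, and that the PDF export property `IsAddStream` is the one that embeds the source document. The function name `hybrid_pdf_command` is hypothetical, and the `writer_pdf_Export` filter applies to Writer documents only.

```python
import json


def hybrid_pdf_command(source, outdir=".", soffice="soffice"):
    """Build (but do not run) a headless LibreOffice conversion command.

    Assumption: setting the PDF export property IsAddStream to true asks
    LibreOffice to embed the original document inside the PDF.
    """
    options = {"IsAddStream": {"type": "boolean", "value": "true"}}
    # Filter options are passed as compact JSON after the export filter name.
    target = "pdf:writer_pdf_Export:" + json.dumps(options, separators=(",", ":"))
    return [soffice, "--headless", "--convert-to", target, "--outdir", outdir, source]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine with LibreOffice installed. Building the command separately from running it makes it easy to inspect before committing to a batch conversion.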
So, we definitely have a lot of interesting tools to work with to complete our project, and although we were unable to use Tesseract to produce all of our document objects, there is a lot of interesting work being done in that space. Most notable are some newly hatched open source efforts with minimal documentation, which we are going to keep watch on in the months ahead. Our solution for the CCR will be a hybrid model: despite our open source ethos, we believe in a multimodal approach to solving a problem and in using the best tool for the project at hand @ #rdcHQ.
Onward!