MathML/DocGen Update = OCRmania!
- July 10th, 2012
Greetings from #rdcHQ! This week, the MathML/DocGen Track is emerging from a full week of research on the state-of-the-art of open source Optical Character Recognition (OCR) technology. We explored three OCR packages, Tesseract (originally developed by HP Labs and now an open source Google code project), ABBYY FineReader, and OmniPage – the latter two being commercial packages with some components powered by open source technology but with superior support for what is known as “page zoning”.
Ideally, we want to stay in a 100% open source environment (although we uncovered some very useful functionality in the commercial packages – more on this later when we cover Interactive PDFs). Unfortunately, Tesseract has some serious limitations with its page layout recognition which causes problems when rendering tables and we have quite a lot of content objects of this type. In simplest terms, OCR software can be customized to recognize object types that appear on a page, and it can also be trained to improve its recognition of individual characters in a particular document set (but this is outside of the scope of our research at this time). An example of some of the different objects on any given page is below. At the moment, we are focused on the data within the <table> object – i.e. headers, cells and rows:
In a perfect world, an OCR program with proper train data and zone files would readily recognize these distinct objects on a page and label them accordingly for future post-processing with a markup language such as HTML, XHTML or XML -
Fortunately, our research did uncover that there is a lot of hacking in the Tesseract space (a.k.a. “hacking tesseract“) and hooks exist for dealing with exactly the problem we are trying to solve. We are taking a closer look at the newly released OCRopus 0.5 to see if it resolves some of the issues we have with table generation.
In other MathML/DocGen news, we have authored 113 MathML equations and Jasper wrote a Ruby script that enables bulk conversion of our CCR .tif files within Tesseract which is currently undergoing testing. More news as it develops … Stay tuned!