MathML/DocGen Update = OCRmania!

Greetings from #rdcHQ! This week, the MathML/DocGen Track is emerging from a full week of research on the state of the art in open source Optical Character Recognition (OCR) technology. We explored three OCR packages: Tesseract (originally developed by HP Labs and now an open source Google Code project), ABBYY FineReader, and OmniPage. The latter two are commercial packages with some components powered by open source technology, but with superior support for what is known as “page zoning”.

Ideally, we want to stay in a 100% open source environment (although we uncovered some very useful functionality in the commercial packages – more on this later when we cover Interactive PDFs). Unfortunately, Tesseract has some serious limitations in its page layout recognition. These cause problems when rendering tables, and we have quite a lot of content objects of this type. In simplest terms, OCR software can be customized to recognize the object types that appear on a page, and it can also be trained to improve its recognition of individual characters in a particular document set (although training is outside the scope of our research at this time). An example of some of the different objects on any given page is below. At the moment, we are focused on the data within the <table> object, i.e. headers, cells and rows.

In a perfect world, an OCR program with proper training data and zone files would readily recognize these distinct objects on a page and label them accordingly for later post-processing with a markup language such as HTML, XHTML or XML.
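
As a rough illustration of that labeling step, the sketch below takes a table that has already been recognized (one header row plus data rows) and emits HTML markup for it. It is written in Python purely for illustration. The function name `table_to_html` and the sample values are hypothetical and are not part of Tesseract, OCRopus or either commercial package.

```python
from html import escape


def table_to_html(headers, rows):
    """Render recognized header and cell text as an HTML <table> string."""
    lines = ["<table>"]
    # Header cells become <th> elements in a single header row.
    header_cells = "".join(f"<th>{escape(h)}</th>" for h in headers)
    lines.append(f"  <tr>{header_cells}</tr>")
    # Each recognized row becomes a <tr> of <td> cells.
    for row in rows:
        cells = "".join(f"<td>{escape(c)}</td>" for c in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical recognized text, invented for this example only.
    print(table_to_html(["Header A", "Header B"], [["a1", "b1"], ["a2", "b2"]]))
```

Escaping the cell text matters here because recognized characters such as < or & would otherwise be read as markup.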

Fortunately, our research did uncover that there is a lot of hacking in the Tesseract space (a.k.a. “hacking tesseract”) and that hooks exist for dealing with exactly the problem we are trying to solve. We are taking a closer look at the newly released OCRopus 0.5 to see if it resolves some of the issues we have with table generation.

In other MathML/DocGen news, we have authored 113 MathML equations, and Jasper wrote a Ruby script, currently undergoing testing, that enables bulk conversion of our CCR .tif files with Tesseract. More news as it develops … Stay tuned!
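
For readers curious what bulk conversion involves, here is a minimal sketch. It is not Jasper's Ruby script, which is not reproduced here. It is a Python illustration that assumes the `tesseract` command-line program is installed and on the PATH, and that it is invoked in its basic form of an image name followed by an output base name. The function names and directory names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(input_dir, output_dir):
    """Return one tesseract command (as an argument list) per .tif file."""
    commands = []
    for tif in sorted(Path(input_dir).glob("*.tif")):
        # Tesseract takes an output *base* name and adds the extension itself.
        output_base = Path(output_dir) / tif.stem
        commands.append(["tesseract", str(tif), str(output_base)])
    return commands


def convert_all(input_dir, output_dir):
    """Run tesseract on every .tif in input_dir (assumes tesseract is on PATH)."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(input_dir, output_dir):
        # check=True stops the batch at the first file tesseract fails on.
        subprocess.run(command, check=True)
```

Calling `convert_all("scans", "ocr_text")`, where both directory names are placeholders, would write one text file per image. Keeping the command building separate from the execution makes it easy to inspect the batch before anything runs.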

Working with Text and Graphic Libraries

Last week in the SVG Track, we began working with more advanced typographical diagrams such as the example below, “Don’t Be The Fall Guy” (an infographic accompaniment to our work on stepladders for a prior document set).


This week, we are defining new graphical swatches for our library of patterns and beginning to bring dimension into the diagrams using blends, which are a characteristic of many of the illustrations in the CCR.

We now have 214 SVGs in our CCR repository, so we are beginning to pick up steam. Stay tuned!

Bonus deBUG Day + SVG Update

We are having too much fun to take a break this week, so the Cobwebs crew animators had a special meetup to organize the art and break down the script scene-by-scene. We had a screening of the “L’il Red” Walkthrough by professional animator and Friend of the Rural Design Collective Brian Main and discussed the clever animation and design techniques used in this interactive storybook for the iPad and iPhone, currently in progress.

We focused on background techniques, such as panning and layering, and how simple things like using a single color can help create a unique visual style. We looked at how elements can be cloned, reused in an illustration and modified slightly to create charming, simple animation effects. We also viewed animation cels from cartoon classics to explore a variety of styles. Our goal for next week is to have at least one background composed for every scene.

SVG Track / Diagrams I

In the SVG Track this week, work commenced on the first set of diagrams, which introduces simple 2-dimensional illustrations with typographical elements. These diagrams use techniques stressed in earlier lessons and begin to show how the various techniques can be combined to create technical illustrations.

Announcing #rdcHQ deBUG Merchandise!

We are proud to announce the launch of our new store featuring incredibly cool designs by the Rural Design Collective (#rdcHQ)! Our initial offering features eight fresh designs by Levi Thompson, in both black and white, which were created for our “Life on a Redwood Post” project. The store is currently in <beta> and we are still fixing a few bugs, but we are fully operational and you can order your own deBUG Tees (and other products) online today!

Shown above is one of our favorite configurations – an eggplant American Apparel T-shirt featuring the Cicadia. We chose Spreadshirt so we could showcase the excellent designs by the Rural Design Collective in their interactive Spreadshirt Designer, which is very well done. We understand it will gain more features in the future (we have an ask in for the ability to pare down the product offerings in the Designer). Very soon we will have more insects available, plus there are rumors that a “Cobwebs” shirt is on the horizon. A portion of the sales of these shirts will continue to fuel the good work of the Rural Design Collective and our projects, so please purchase a piece of wearable art (and they are indeed beautiful) … plus help spread the word about our new shop!

MathML / DocGen Update

This week in the MathML Track, we checked our completed set of Simple MathML equations and Jasper coded 40 Advanced MathML equations to put us ahead of schedule. We also began a systematic breakdown of the various styles for the content objects in the DocGen track, which we will be taking a closer look at next week. We are using Tesseract for our Optical Character Recognition (OCR). It works wonderfully with basic text and tables, but in more advanced tables it has difficulty distinguishing the vertical lines from the text within the cells. We will be doing extensive research as homework on the matter, and it is possible that we may need to upgrade from our current stable of open source tools. More news as it develops …

Stay tuned!
