Archive for the ‘MathML’ Category

DocGen / Gone Hybrid

Our DocGen Track continues with explorations in the PDF space. We have been experimenting with a couple of programs that bridge the gap where Tesseract falls short. The first is the newly released Nuance PDF Converter for Mac 3.0, which does an excellent job of deciphering even the poorest-quality scans and provides multilingual support. Although this is a commercial product, it is certainly worth the money. It is a standalone module built on OmniPage, the industry leader in the OCR space for years. It does two things really well: 1) it provides accurate optical character recognition (as we mentioned above), and 2) it creates first-generation interactive PDFs. These pros far outweigh the cons, such as the software being available only on the Mac platform (Jasper is already researching ways to get the program to run on Linux using an emulator).

Interactive PDFs

In short, interactive PDFs provide the online equivalent of paper forms that one can fill out on the computer and send electronically. These PDFs add a layer on top of an Image PDF that can include internal hyperlinks to other sections of the document, external links to the World Wide Web or an object on one’s computer, and form elements that can be filled out, digitally signed, and sent to a designated email address. Pretty neat!

Testing the tab order on an Interactive PDF (video)

Hybrid PDFs

Another interesting program that we are working with is LibreOffice, particularly its “Hybrid PDF” functionality, which delivers on the true promise of a portable document format. A “Hybrid PDF” is a PDF that anyone can view, but which includes the source document embedded within it so that people with modern office suites can also edit it if they want to. You can read more about Hybrid PDFs in this PDF about creating Hybrid PDFs (see also “The Magic of Editable PDFs” from the author of that document).

So, we definitely have a lot of interesting tools to work with to complete our project – and although we were unable to use Tesseract to produce all of our document objects, there is a lot of interesting work being done in that space, most notably newly hatched open source efforts with minimal documentation, which we are going to keep a watch on in the months ahead. Our solution for the CCR will be a hybrid model – despite our open source ethos, we believe in a multimodal approach to solving a problem and in using the best tool for the project at hand @ #rdcHQ.

Onward!

#rdcHQ MathML / SVG / DocGen Update

Busy, busy, busy at #rdcHQ! We successfully converted 311 MathML equations to SVG using SVGMath. You can view our progress here. The original equation is on the left, and our final SVG is on the right (all of these equations are now coded in MathML, which we use as the source for our conversion). We are currently in the process of checking our work against the original graphic. We also kicked off the DocGen portion of our program by batch-processing 200+ basic text content objects in Tesseract. The results are promising, and we are exploring ways we can train Tesseract for better results.
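For readers who want to try a similar batch run, here is a minimal sketch of how a folder of scans could be passed through the Tesseract command line from Python. It is not the script we used at #rdcHQ. The helper names (`build_tesseract_commands`, `run_batch`), the list of image suffixes, and the default language `eng` are assumptions made for illustration; the only thing taken from Tesseract itself is the standard `tesseract <image> <outputbase> -l <lang>` invocation, which writes `<outputbase>.txt`.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed set of scan formats; adjust to whatever the scans actually are.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def build_tesseract_commands(image_paths, out_dir, lang="eng"):
    """Return one `tesseract <image> <outputbase> -l <lang>` command per image."""
    out_dir = Path(out_dir)
    commands = []
    for image in sorted(Path(p) for p in image_paths):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not a scan
        output_base = out_dir / image.stem  # tesseract appends ".txt" itself
        commands.append(["tesseract", str(image), str(output_base), "-l", lang])
    return commands


def run_batch(image_paths, out_dir, lang="eng"):
    """Run every command and return the images that tesseract failed on."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for command in build_tesseract_commands(image_paths, out_dir, lang):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(command[1])  # remember the image path for a retry
    return failed
```

A call such as `run_batch(Path("scans").glob("*"), "ocr_text")` (both folder names are hypothetical) would leave one text file per scan in `ocr_text` and return the scans that need a second look.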

SVG Update

Our SVG team successfully completed work on Labels and Flowcharts. We are also cleaning up work on our Map collection that we converted to SVG using the Autotrace function. We’re currently cleaning up symbols and typography … more updates soon … Stay Tuned!

MathML + SVG / (More) Coding and Crosshatch

Greetings from #rdcHQ! This week, we coded the bulk of our equations in our MathML track (although we do fully expect to uncover a few more in the DocGen portion of our program)! We celebrate this milestone with a video short, “Coding in Amaya”:


How our computer screens look at #rdcHQ this Summer! … (Video by Nate)

… and work continues in the SVG Track. We are steadily completing work in our remaining groups. Shown here are “Flowing Well In An Enclosure” and “Beam Pumping Well With Cellar and Typical Wellhead Equipment”, Nos. 1100 and 1101, respectively.


More announcements soon – including the development of our first ever Fall Track … Never a dull moment at #rdcHQ!

#rdcHQ Mid-Program SVG and MathML Update

Last week at #rdcHQ, the crew kicked production into high gear in both the SVG and MathML tracks. We now have 232 graphics completed in the SVG track and 237 equations nearly completed in the MathML Track (half of the equations still need to be converted). This tally does not include the new work being done in DocGen, which will be our primary focus in the MathML/DocGen Track for the remainder of the program. Although our program ends at the close of the summer, we call this our “Mid-Program Status Report” as we always allow for one month of post-production after our annual launch party and art show.

Status: SVG Track

The following list shows the groupings that were created for the graphics in the CCR set. These groupings provide the framework for our lesson plan for the 2012 summer program. We also have two catch-all categories that we will explore this week: 1) images that are eligible for work with the auto-trace function and 2) images that are illegible and will require research to see if better original art is available.

SVG Track – Groupings for CCR
Group 1: Basic Lines – 21 Content Objects – COMPLETED
Group 2: Crosshatch – 27 Content Objects – COMPLETED
Group 3: Diagrams I – 44 Content Objects – COMPLETED
Group 4: Diagrams II – 50 Content Objects – COMPLETED
Group 5: Diagrams III – 42 Content Objects – In Progress
Group 6: Landscape – 22 Content Objects – In Progress
Group 7: Blends – 21 Content Objects – In Progress
Group 8: Perspective – 21 Content Objects – In Progress
Group 9: Textures – 12 Content Objects – To Do
Group 10: Charts – 25 Content Objects – To Do
Group 11: Labels – 17 Content Objects – In Progress
Group 12: Flow Charts – 16 Content Objects – In Progress
Group 13: Maps [10.2MB PDF File] – 79 Content Objects – In Progress
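
The group counts in the list above can be totalled by status with a few lines of Python. This is only an illustrative sketch: the data structure and the function name `tally_by_status` are assumptions, and the numbers are copied from the list as posted. Note that it counts whole groups, so its "Completed" figure is lower than the 232 finished graphics reported earlier, presumably because that number also includes finished items inside groups that are still in progress.

```python
from collections import defaultdict

# (group name, content objects, status), copied from the list in the post
SVG_GROUPS = [
    ("Basic Lines", 21, "Completed"),
    ("Crosshatch", 27, "Completed"),
    ("Diagrams I", 44, "Completed"),
    ("Diagrams II", 50, "Completed"),
    ("Diagrams III", 42, "In Progress"),
    ("Landscape", 22, "In Progress"),
    ("Blends", 21, "In Progress"),
    ("Perspective", 21, "In Progress"),
    ("Textures", 12, "To Do"),
    ("Charts", 25, "To Do"),
    ("Labels", 17, "In Progress"),
    ("Flow Charts", 16, "In Progress"),
    ("Maps", 79, "In Progress"),
]


def tally_by_status(groups):
    """Sum the content-object counts of all groups that share a status."""
    totals = defaultdict(int)
    for _name, count, status in groups:
        totals[status] += count
    return dict(totals)


totals = tally_by_status(SVG_GROUPS)
print(totals)                # {'Completed': 142, 'In Progress': 218, 'To Do': 37}
print(sum(totals.values()))  # 397
```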

Status: MathML Track

The MathML Track is divided into four groupings for our coding meetups. We currently have 3 of the 4 groups coded and will be running these through QA this week. The MathML Mixed category is a tricky one, as several of its equations have been flagged for straight HTML markup. A single content object also often contains several equations, so the true number of equations has yet to be determined. We are sorting through all of these matters at #rdcHQ!
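
One way to pin down the true number of equations is to count the `math` elements in each coded file rather than counting files. The sketch below is an assumption about how that could be done, not our actual QA step: it uses Python's standard `xml.etree.ElementTree`, expects well-formed XML that declares the MathML namespace, and the sample markup and the function name `count_equations` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The official MathML namespace URI
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical content object holding two equations separated by text
SAMPLE = """<div xmlns:m="http://www.w3.org/1998/Math/MathML">
  <m:math><m:mi>a</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:math>
  <p>where</p>
  <m:math><m:mi>b</m:mi><m:mo>=</m:mo><m:mn>2</m:mn></m:math>
</div>"""


def count_equations(xml_text):
    """Count every MathML <math> element, including the root if it is one."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter(f"{{{MATHML_NS}}}math"))


print(count_equations(SAMPLE))  # 2
```

Summing this count over all content objects in a group would give the number of equations rather than the number of content objects.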

MathML Track – Groupings for CCR
Group 1: MathML Simple – 50 Content Objects – COMPLETED and CONVERTED
Group 2: MathML Medium – 55 Content Objects – COMPLETED
Group 3: MathML Advanced – 49 Content Objects – COMPLETED and CONVERTED
Group 4: MathML Mixed (equations plus text) – 89 Content Objects – In Progress

This tally currently does not include the content objects in the DocGen portion of our program. All of these are currently in progress, and we will have more updates on this track in the month of August – which will be DocGen Month at #rdcHQ!

DocGen Track – Groupings for CCR
Basic Text – 270 Content Objects – Tesseract OCR Conversion
Tables – 373 Content Objects – Tesseract OCR Conversion
Forms – 1400+ Content Objects – In Progress

Go #rdcHQ Go!
