I’m learning a bit about OCR, and wanted to get some hands-on experience with the open source Tesseract engine to get a feel for how it works. I’m a long way from any reasonable visual or interaction design, but the result of today’s exploration is an HTML page where the original image is overlaid with machine-generated text in roughly the right location. This page looks like crap, but it’s a neat first step (click on the image below to see the full HTML page):
Tesseract
Tesseract is an open source OCR tool (Apache 2.0 license) that produces fairly accurate output (relative to its open source peers) for scanned, typewritten documents in English and many other languages.
On the Mac, we can easily install it with Homebrew:
brew install tesseract
The latest version supports lots of different input image types, via Leptonica, an open source C library for efficient image processing and image analysis operations.
If you just want to get plain text from an image, check the README. I wanted to display the generated text over the image, which Tesseract supports via the hOCR format.
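Generating that output from the command line looks roughly like this (file names are placeholders; depending on the Tesseract version, the result lands in scan.hocr or scan.html):

tesseract scan.png scan hocr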
Each word, line, and block of text is annotated with an HTML tag. I look at just the word element, which is generated as a span tag whose title attribute holds the bounding box:
<span class="ocrx_word" id="word_14" title="bbox 398 506 471 527" >WHOM</span>
I wrote a little JavaScript to build a style attribute from the bbox values in the title attribute:
Manuscript.bboxToStyle = function(bbox_str) {
  // bbox_str looks like "bbox 398 506 471 527"
  var arr = bbox_str.split(" ");
  var left_pos = "left:" + arr[1] + "px; ";
  var top_pos = "top:" + arr[2] + "px; ";
  var right_pos = "right:" + arr[3] + "px; ";
  var bottom_pos = "bottom:" + arr[4] + "px; ";
  return left_pos + top_pos + right_pos + bottom_pos;
};
Then I used jQuery to apply that to every word element:
$(document).ready(function() {
  // Copy each word's bbox from its title attribute into an inline style
  $(".ocrx_word").attr('style', function() {
    return Manuscript.bboxToStyle(this.title);
  });
});
Resulting in word elements that are positioned roughly where they appear in the image:
<span class="ocrx_word" id="word_14" title="bbox 398 506 471 527" style="left:398px; top:506px; right:471px; bottom:527px; ">WHOM</span>
I tried to experiment with the background-color of the words, but that’s not working for some reason. Complete source on GitHub.
Would love to hear from anyone else creating HTML UIs for OCR results, either with Tesseract or with other open source tools.
Very nice tool. Great idea. Definitely going to try it out. Thanks.
Most useful. I had to make two slight changes:
- the class for OCR words was named ocr_word, not ocrx_word, on my version of Tesseract for Ubuntu 12.04.
- I had to add the fragment “position: absolute;” to the generated style for each word. I didn’t have the bandwidth to add a CSS rule in the head setting it globally, as you did; I might try that.
I used opacity on the image to get the words to show more clearly; 0.4 turned out to be the right setting.
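Roughly, the extra CSS I mean amounts to something like this (the class name and the image selector depend on your Tesseract version and your markup):

.ocr_word { position: absolute; } /* so the left/top/right/bottom values from the bbox take effect */
img.page { opacity: 0.4; } /* fade the scan so the overlaid words show through */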
If you made this into pop-ups for each word instead of an overlay, it could be useful for some difficult-to-read fonts like blackletter. Of course, that assumes we can get good OCR results for blackletter; eMOP is working on that.
I could imagine some DH projects finding this a useful way to show page images with the transcription conveniently available on the same page. Proyecto Cervantes is one. Maybe if done by line or paragraph instead of by word.
Just thinking out loud.