Thanks to the prior work of Matt Christy at eMOP, I got started building Tesseract from source (on Mac OS X 10.8.4).

Here’s my slightly modified workflow:

svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr
cd tesseract-ocr
./autogen.sh
mkdir build
cd build
../configure
make
make install

Recently a makefile changed, and I needed to regenerate the build files, starting from the source code root:

autoreconf --force --install
cd build
../configure
make 
make install

Making a “build” directory makes it easier to keep track of source code changes with svn. I set up my global ignores to skip the interim files and directories.

vi ~/.subversion/config

Then I uncommented this line and added everything after .DS_Store:

global-ignores = *.o *.lo *.la *.al .libs *.so *.so.[0-9]* *.a *.pyc *.pyo
   *.rej *~ #*# .#* .*.swp .DS_Store *.in build config configure *.cache
   aclocal.m4 m4

So then I only see source code files that are added or modified when I check

svn status

Ever wonder what all those random files are at the root of some package source you’re playing with? And what exactly does the mystical configure command actually do?

Alexandre Duret-Lutz has created a fabulous Autotools Overview & Tutorial; it’s well worth flipping through the first 19 slides (38 pages of the PDF, since the incremental “builds” of each slide are exported as separate pages).

Later, starting at page 97, he begins a tutorial section, which I replicated in this git repo; the README has a little cheat sheet of the steps to set up and build the package.
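Roughly, those steps follow the standard Autotools developer loop (this is my summary of the usual flow, not a copy of that README):

autoreconf --install
./configure
make
make distcheck

The last step, make distcheck, builds a release tarball and then verifies that the tarball itself configures and builds cleanly.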

I’m learning a bit about OCR, and wanted to get some hands-on experience with the open source Tesseract to get a feel for how it works. I’m a long way from any reasonable visual or interaction design, but the result of today’s exploration is an HTML page where the original image is overlaid with machine-generated text in roughly the right location. This page looks like crap, but it’s a neat first step (click on the image below to see the full HTML page):


Tesseract

Tesseract is an open source OCR tool (Apache 2.0 license) that produces fairly accurate output (relative to its open source peers) for scanned, type-written documents in English and many other languages.

On the Mac, we can easily install it with homebrew:

brew install tesseract

The latest version supports lots of different input image types, via Leptonica, an open source C library for efficient image processing and image analysis operations.

If you just want to get plain text from an image, check the README file. I wanted to display the generated text over the image, which Tesseract supports via the hOCR format.
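As far as I can tell, you ask for hOCR by adding the hocr config file to the end of the command line (page.png and page are just placeholder names here); depending on the Tesseract version, the output lands in page.html or page.hocr:

tesseract page.png page hocr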

Each word, line, and block of text is annotated with an HTML tag. I look at just the word element, which is generated as a span tag whose title attribute holds the bounding box coordinates (left, top, right, bottom, in image pixels):

<span class="ocrx_word" id="word_14" title="bbox 398 506 471 527" >WHOM</span>

I wrote a little JavaScript to build a style string from the bbox values in the title attribute:

// namespace object (defined here if it doesn't already exist elsewhere)
var Manuscript = Manuscript || {};

Manuscript.bboxToStyle = function(bbox_str) {
  // bbox_str looks like "bbox 398 506 471 527": the keyword, then left, top, right, bottom in pixels
  var arr = bbox_str.split(" ");
  var left_pos = "left:" + arr[1] + "px; ";
  var top_pos = "top:" + arr[2] + "px; ";
  var right_pos = "right:" + arr[3] + "px; ";
  var bottom_pos = "bottom:" + arr[4] + "px; ";
  return left_pos + top_pos + right_pos + bottom_pos;
};

Then used jQuery to apply that to every word element:

$(document).ready(function() {
  $(".ocrx_word").attr('style', function() {
    // "this" is the word span; its title attribute carries the bbox string
    return Manuscript.bboxToStyle(this.title);
  });
});

Resulting in word elements that are positioned roughly where they appear in the image:

<span class="ocrx_word" id="word_14" title="bbox 398 506 471 527" 
style="left:398px; top:506px; right:471px; bottom:527px; ">WHOM</span>
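One detail that’s easy to miss: left/top/right/bottom only affect positioned elements, so the words also need to be taken out of normal flow inside the page container. Something along these lines works; I’ve sketched it with jQuery here rather than showing my actual stylesheet:

$(document).ready(function() {
  // hOCR wraps the whole page in a div with class ocr_page;
  // make it the positioning context, and position each word within it
  $(".ocr_page").css("position", "relative");
  $(".ocrx_word").css("position", "absolute");
});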

I tried to experiment with the background-color of the words, but that’s not working for some reason. Complete source on GitHub.

Would love to hear about anyone else creating HTML UI for OCR results, either with Tesseract or other open source tools.