Google Releases Open Source OCR Software

September 7th, 2006: Google has teamed up with the University of Nevada, Las Vegas (UNLV) to revive an old piece of HP Optical Character Recognition (OCR) software and re-release it as open source.

Called Tesseract, the software was originally developed by HP between 1985 and 1995 and was considered one of the better OCR prospects at the time. HP ceased working on it in 1995 and shelved it until a few years ago, when it decided the software would make an excellent donation to the open source community.

HP then handed Tesseract over to the Information Science Research Institute at UNLV, which approached Google for help and sponsorship to tidy up the code.

“UNLV... asked for our help in fixing a few bugs that had crept in since 1995 (ever heard of bit rot?),” writes Eric Case on the Google Code blog. “We tracked down the most obvious ones and decided a couple of months ago that Tesseract OCR was stable enough to be re-released as open source.”

It may seem strange for the search giant to be moving into OCR. However, Google is keen to point out that making information available online is its bag. “In a nutshell, we are all about making information available to users,” writes Case. “And when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing.”

At the moment Tesseract only supports the English language, and does not include a page layout analysis module, so it performs poorly on multi-column material. Case says that it also doesn't do well on greyscale and colour documents.

“It's not nearly as accurate as some of the best commercial OCR packages out there,” writes Case. “Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other Open Source OCR package out there.”
