A New Approach to OCR Quality

By Alexander Goerke

More than a year ago we announced the European grant we had received for a research and development project to improve OCR on historical documents. The goal of the project is to use unsupervised learning and clustering to improve OCR results on a specific page.

This is especially useful for deteriorated documents or low-quality scans, because in many of these cases a standard omnifont approach fails on characters that are simply no longer distinguishable.

The current results of this project, run with our partner company Lumex in Norway, are already quite impressive. The OCR Accuracy Extension improves standard OCR results by up to 15%: the lower the image quality, the higher the gain. For exotic typefaces (like Fraktur) the improvement is even larger, because historic fonts vary considerably over time. And since the software matches only image elements, it is also very fast.

The method used to improve OCR on a given document closely mirrors how human cognition works.

Imagine you are reading a document with very difficult handwriting. At first you can only make out some of the more distinct characters, but these let you infer the meaning of other characters, because you learn the writer's characteristic shapes as you go. The OCR Accuracy Extension does the same with unsupervised machine learning.

We use object detection and classification to cluster all character candidates on a specific page, and then use the easily recognisable characters to automatically label these clusters with their meaning (e.g. "these are all capital 'E's"). From each labelled cluster a prototype can be derived and applied in a second round to all the unknown characters. This lets the system identify even deteriorated or distorted samples confidently, boosting OCR quality.
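As a rough illustration of this loop, consider the sketch below. It is not the actual Accuracy Extension implementation: the `refine_page` function, the raw-pixel features, the cluster count, and the 0.9 confidence threshold are all illustrative assumptions. It clusters the glyph images of one page, names each cluster by majority vote of its confidently recognised members, derives a mean-image prototype per cluster, and then relabels low-confidence glyphs by nearest prototype.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def refine_page(glyphs, ocr_labels, ocr_conf, n_clusters=80, thresh=0.9):
    """glyphs: (N, H, W) normalised character images cut from one page;
    ocr_labels / ocr_conf: first-pass OCR label and confidence per glyph."""
    feats = glyphs.reshape(len(glyphs), -1).astype(float)

    # 1. Cluster every glyph on the page by visual similarity.
    cluster_ids = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(feats)

    # 2. Name each cluster by majority vote of its confidently recognised
    #    members and derive a prototype as the mean image of the cluster.
    prototypes, names = [], []
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        sure = [ocr_labels[i] for i in idx if ocr_conf[i] > thresh]
        if sure:
            names.append(max(set(sure), key=sure.count))
            prototypes.append(feats[idx].mean(axis=0))
    prototypes = np.array(prototypes)

    # 3. Second round: match every low-confidence glyph against the
    #    prototypes and take the label of the nearest one.
    refined = list(ocr_labels)
    for i in range(len(glyphs)):
        if ocr_conf[i] <= thresh and len(prototypes):
            nearest = np.linalg.norm(prototypes - feats[i], axis=1).argmin()
            refined[i] = names[nearest]
    return refined
```

In practice one would use shape descriptors rather than raw pixels and let the number of clusters adapt to the page, but the control flow is the same: confident characters teach the system what the uncertain ones must be.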

A complex example is the recognition of a complete old newspaper page (shown below). The page contains 16,837 characters, which is actually an advantage: a high number of available characters helps the automatic creation of good prototypes, which in turn improve the quality.

In this case the first pass of OCR with ABBYY FineReader 11 yields a decent quality of 78.4% correct characters when compared against a manually corrected ground-truth file. When the AE OCR booster is initialised by unsupervised learning, the recognition rate rises by about six percentage points to 84.5%. If, in addition, another page of the same newspaper is used for the learning step, the system gains more than ten percentage points, reaching 88.7%.
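These figures are character-level accuracies against a ground-truth transcript. The text does not say which scoring tool produced them, so purely as an illustration, here is the usual way such a number is computed, via edit distance:

```python
def char_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Character accuracy = 1 - edit distance / ground-truth length."""
    m, n = len(ocr_text), len(ground_truth)
    prev = list(range(n + 1))  # Levenshtein distance, computed row by row
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ocr_text[i - 1] == ground_truth[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 1.0 - prev[n] / max(n, 1)

# On the newspaper page above: 78.4% of 16,837 characters means roughly
# 0.216 * 16837 ≈ 3,637 character errors remain after the first OCR pass;
# the jump to 88.7% removes almost half of them.
```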

These are promising first results for improving OCR on difficult documents using unsupervised machine learning techniques. The project is ongoing and should yield even better results in the coming year, allowing researchers to access cultural heritage more easily and quickly.

An example from a very old book is displayed in our test and benchmark tool, the Accuracy Extension (AE) Studio. The green blocks highlight characters that have been corrected by the AE. The tooltip shows that in the word “solten” in the first line, a character that is an “o” in the ground truth (GT) was initially recognised as an “e” but has been corrected back to an “o” by the AE.

Another example is a typical old typewriter document (actually a telegram). All of the faint characters have been corrected. The text line at the bottom shows the comparison between the original OCR output and the magically corrected result.

Alexander Goerke is founder of Skilja, a European consultancy working on a project to improve OCR for historical documents with the support of Germany’s Federal Ministry of Education and Research (BMBF) and the European Union.