Limitations of using OCR for file classification

By John Martin

OCR is often used to obtain text from image-only files so that they can be classified. However, OCR has several limitations that can result in inaccurate or missing text, making text-based classification difficult or impossible:

  • Font Size. OCR may fail to convert characters set in very large or very small fonts. This can leave the most important characters and words unavailable to text-based systems.
  • Uni-Dimensional. With OCR, individual words have only one dimension: they’re either before or after other words. Page-coordinate information for characters is typically discarded or never captured, even though page coordinates can be quite useful for classification and attribute extraction (see the first sketch after this list).
  • Sequential Editing. OCR errors typically have to be corrected one at a time, with the same errors being edited over and over. Global spell checking can introduce errors of its own.
  • Case Sensitivity for Editing. Spell checking used to correct OCR text typically does not take letter case into account; e.g., cat and CAT are treated alike.
  • Languages. Many languages have special characters, and unless the correct OCR software is loaded, those characters can be lost or incorrectly recognized.
  • Non-Symmetrical DPI for Faxes. Faxes are often stored in files where the horizontal dots per inch differ from the vertical DPI, and OCR engines can have difficulty with this non-symmetrical resolution (see the second sketch after this list).
  • Partial Text. Document authors often incorporate graphics that contain visible text. However, the OCR software may detect some existing text, assume that OCR is not needed, and skip processing the document, leaving the text in the images invisible to text-only searching or analysis. A similar problem occurs when textual headers, footers, or legends are added to previously image-only PDFs: OCR systems may detect the presence of a text layer and not attempt to convert the image layer, even though it may hold the most important content (see the third sketch after this list).
  • Non-Textual Glyphs. Important non-textual characters or glyphs, e.g., logos or map symbols, often are not converted to characters by OCR, leaving them invisible to text analytics or text-based retrieval.
  • Inferring the Obvious. Graphical elements often provide the most obvious clues about how a file or document should be classified, e.g., the placement and size of logos or text blocks. Because those graphical elements may not be directly accessible to text-restricted systems, such systems are left trying to infer what is immediately obvious to anyone simply looking at the files.
  • Incorrect Document Boundaries. Image-only files often contain multiple documents per file, and OCR does not provide a way to correct document boundaries. This causes downstream problems for systems that classify files by comparing the words used within documents: embedded documents can be missed, and the ones that are classified can be misclassified. Similar issues arise with single-page TIFFs, where document boundaries are not obvious. For more information, see the blog postings Basic Assumptions Gone Wrong: ECM and Document Unitization, and Information Governance Lessons from 4 AFEs and a Daily Drilling Report.
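
To illustrate the uni-dimensional point, here is a minimal sketch assuming the Tesseract engine via the pytesseract wrapper and Pillow (tooling assumptions, not products named in this posting; the file name is hypothetical). It contrasts the plain text stream, which has no layout information, with the word-level coordinates the engine can report but which text-only pipelines typically throw away.

    import pytesseract
    from PIL import Image

    img = Image.open("scanned_page.png")  # hypothetical input file

    # Plain text output: a one-dimensional stream of words, no page coordinates.
    plain_text = pytesseract.image_to_string(img)

    # Word-level output: text plus left/top/width/height coordinates per word.
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    for word, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if word.strip():
            print(f"{word!r} at x={left}, y={top}, w={width}, h={height}")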
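
For the fax DPI issue, a common workaround is to resample pages to a square resolution before running OCR. The sketch below assumes Pillow and a hypothetical fax TIFF stored at an asymmetric resolution such as the standard 204x98 DPI; it is a pre-processing sketch, not part of any particular OCR product.

    from PIL import Image

    img = Image.open("fax_page.tif")  # hypothetical input file
    x_dpi, y_dpi = (float(v) for v in img.info.get("dpi", (200, 200)))

    if x_dpi != y_dpi:
        # Bilevel fax images resample more cleanly in grayscale.
        img = img.convert("L")
        # Stretch the lower-resolution axis so both axes share the same DPI.
        target_dpi = max(x_dpi, y_dpi)
        new_size = (
            round(img.width * target_dpi / x_dpi),
            round(img.height * target_dpi / y_dpi),
        )
        img = img.resize(new_size, Image.LANCZOS)
        img.info["dpi"] = (target_dpi, target_dpi)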
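
The partial-text problem can be checked for rather than assumed away. This sketch, which assumes the PyMuPDF library plus a hypothetical file name and character threshold, flags pages that contain images but little extractable text, so the mere presence of a text layer is not used as a reason to skip OCR.

    import fitz  # PyMuPDF

    MIN_CHARS = 50  # hypothetical threshold for "meaningful" text on a page

    doc = fitz.open("mixed_content.pdf")  # hypothetical input file
    for page in doc:
        text_len = len(page.get_text().strip())
        has_images = len(page.get_images(full=True)) > 0
        if has_images and text_len < MIN_CHARS:
            print(f"Page {page.number + 1}: images with little text - OCR candidate")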

A better approach is visual classification, which uses richer information about what documents actually look like rather than relying on extracted text alone. It provides scalable, consistent classification of all types of document files.

John Martin is founder & CEO of BeyondRecognition, LLC, a Houston-based technology company that provides visual classification and attribute extraction software and services. This posting is based on John’s forthcoming book, Guide to Managing Unstructured Content: Practical Advice on Gaining Control of Unstructured Content, due out later this year. You can sign up at http://beyondrecognition.net/guide-to-managing-unstructured-content/ to receive your copy of the book when it is available.