Seven Reasons Your Document May Not be Well-suited for Forms Recognition
Image recognition technologies, such as Optical Character Recognition (OCR) and Intelligent Character Recognition (ICR), can speed up forms processing, saving time and money. The accuracy of these technologies has increased markedly in recent years, with some software able to read almost all of the data on a document accurately (depending on the application). However, some forms pose greater challenges than others, particularly those that are unstructured in nature. Not all forms are good candidates for OCR and ICR, but these forms can often be changed or redesigned slightly so that the information on them is more easily recognised.
The following seven document characteristics offer clues as to why unstructured (and other) documents may not be suited to OCR and ICR, and what you might do to make them better candidates for processing.
1: Field label colour - unstructured data analysis for OCR/ICR relies heavily on finding keywords associated with the field of interest. When colour is introduced, the recognition engine has to work harder to detect the keyword(s). If they become obscured or drop out completely, locating the field becomes impossible. Colour, even white lettering on a black background, obscures the word(s) and can lead to either unlocated keywords or false positives. For this reason, organisations should consider using black text on a white background in forms design.
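As a rough illustration of why coloured labels get lost, the sketch below converts a label's text and background colours to greyscale and checks whether a fixed binarisation threshold still separates them. The function names, the threshold of 128 and the sample colours are assumptions made for illustration; real recognition engines use more sophisticated binarisation.

```python
def luminance(rgb):
    """Approximate greyscale value (0-255) using the ITU-R BT.601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def survives_binarisation(text_rgb, background_rgb, threshold=128):
    """Return True if the text stays dark and the background stays light.

    Assumes a simple engine that applies one global threshold and expects
    dark text on a light background (a hypothetical simplification).
    """
    text_is_dark = luminance(text_rgb) < threshold
    background_is_light = luminance(background_rgb) >= threshold
    return text_is_dark and background_is_light


# Hypothetical label colours as (red, green, blue) tuples.
black_on_white = survives_binarisation((0, 0, 0), (255, 255, 255))
white_on_black = survives_binarisation((255, 255, 255), (0, 0, 0))
navy_on_steel_blue = survives_binarisation((0, 0, 139), (70, 130, 180))
```

Under these assumptions only the black-on-white label survives. The reversed label and the blue-on-blue label both fail the check, which corresponds to a keyword that cannot be located.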
2: Too many keyword instances - if a keyword is repeated on the page, the difficulty of determining the appropriate data to extract rises sharply as the keyword count goes up. Even with zone location reference points, if the specific keyword is not consistently located in the identified zone, the recognition engine could return too many matches, increasing the number of false positives and making extraction nearly impossible. Organisations using keyword-heavy forms might consider alternative wording in certain places on specific documents.
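The effect of repeated keywords, and of narrowing the search to a zone, can be sketched as follows. The `Word` record, the coordinates and the zone format `(x0, y0, x1, y1)` are hypothetical stand-ins for whatever a real engine returns.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word, in page units
    y: float  # top edge of the word, in page units


def find_keyword(words, keyword, zone=None):
    """Return every word matching keyword, optionally limited to a zone."""
    hits = [w for w in words if w.text.lower() == keyword.lower()]
    if zone is not None:
        x0, y0, x1, y1 = zone
        hits = [w for w in hits if x0 <= w.x <= x1 and y0 <= w.y <= y1]
    return hits


def is_ambiguous(hits):
    """More than one candidate means the engine has to guess which label is meant."""
    return len(hits) > 1


# Hypothetical OCR output for a page that repeats the label "Total".
page = [
    Word("Date", 50, 50),
    Word("Total", 50, 100),
    Word("Total", 50, 400),
    Word("Total", 300, 700),
]

all_hits = find_keyword(page, "total")
footer_hits = find_keyword(page, "total", zone=(0, 650, 600, 800))
upper_hits = find_keyword(page, "total", zone=(0, 0, 600, 500))
```

Here the unrestricted search returns three candidates, the footer zone narrows that to one, and the upper zone still leaves two, so a zone only helps when the keyword lands in it consistently.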
3: Inconsistent distance from keyword - though technology has made advances in determining where data fields are located relative to keyword labels, these advances can often be eclipsed by inconsistencies in how people fill out forms and by offsets in the machine print itself. When the distance between the keyword and the data field varies too much, extraction can be nearly impossible.
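One simple way to picture the distance problem is a nearest-neighbour search with a tolerance. This is a minimal sketch, assuming each candidate value comes with a single `(x, y)` position; the function name, the positions and the tolerances are invented for illustration.

```python
import math


def nearest_value(keyword_pos, candidates, max_distance):
    """Return the text of the candidate closest to the keyword.

    candidates is a list of (text, (x, y)) pairs. Returns None when no
    candidate lies within max_distance of the keyword position.
    """
    best_text = None
    best_distance = max_distance
    for text, position in candidates:
        distance = math.dist(keyword_pos, position)
        if distance <= best_distance:
            best_text, best_distance = text, distance
    return best_text


date_label = (100, 100)  # hypothetical position of the keyword "Date"

# Value written right next to the label: found with a tight tolerance.
tidy = nearest_value(
    date_label, [("12/03/2015", (180, 100)), ("ACME Ltd", (100, 300))], 120
)

# Value written far from the label: a tight tolerance finds nothing.
drifted = [("12/03/2015", (320, 140)), ("ACME Ltd", (100, 300))]
missed = nearest_value(date_label, drifted, 120)

# Loosening the tolerance picks up a neighbouring field instead.
wrong = nearest_value(date_label, drifted, 350)
```

With the drifted value, the tight tolerance returns nothing and the loose tolerance returns "ACME Ltd", a false positive. That is the trade-off inconsistent spacing forces on the engine.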
4: Form density - density refers to how much information is on a page. If a form is produced with 6 to 8 point fonts, has many paragraphs of instructions, includes similar sections or repeats the same labels, keyword location and extraction become more complex. Often, the printed information will be too similar or too small, or the handwriting will be far larger than the area provided for it, obscuring its meaning and/or other areas of the form. Dense forms should often be broken into two or three pages during design.
5: Poor scan quality - while this is a universal issue in capture, it becomes an even bigger issue when unstructured forms processing comes into play. Poor quality leads to false positives and makes keyword location and data extraction nearly impossible. If the pixel count is too low, the letters, words and shapes will be obscured by pixelation, making it nearly impossible to extract usable data.
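To get a feel for what "pixel count" means for small print, the arithmetic below estimates the scan resolution from the image width and works out how many pixels tall a given font size is at that resolution. The helper names and sample page dimensions are assumptions; the only fixed fact used is that a typographic point is 1/72 of an inch.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def effective_dpi(pixel_width, paper_width_inches):
    """Scan resolution implied by the image width and the physical page width."""
    return pixel_width / paper_width_inches


def font_height_pixels(font_points, dpi):
    """Nominal font height in pixels at a given resolution.

    Lower-case letters are shorter than this nominal size, so the pixels
    actually available per character are fewer still.
    """
    return font_points / POINTS_PER_INCH * dpi


# Hypothetical scan: an 8.5 inch wide page captured 1700 pixels wide.
dpi = effective_dpi(1700, 8.5)
small_print_at_scan_dpi = font_height_pixels(6, dpi)
small_print_at_100_dpi = font_height_pixels(6, 100)
```

In this example the page works out to 200 dpi, where 6 point type is roughly 17 pixels tall. At 100 dpi the same type has only about 8 pixels of nominal height to carry each character.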
6: Poorly printed forms - even in this age of high-quality printing, print quality remains inconsistent, especially for forms that are publicly available on the web. When a form is printed from an electronically accessible file, the print quality depends on the individual’s printer setup and on their ability to print it correctly, at a reasonable size and on a decent printer.
7: Drop out colour - some forms are designed so that form design elements drop out during scanning, which is intended to help OCR/ICR engines read the important data better. There is also unintended drop out on highly stylised forms that use various shades of red, blue or green. Scanner optics often will not detect shades of red, so data printed in them will not show up in the scan. If the keyword locator is one of these elements, then dynamic unstructured data location is nearly impossible.
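Drop out can be pictured with a deliberately simplified model. The sketch assumes a scanner set to drop out a colour effectively records only that colour channel, so ink that is bright in that channel looks the same as white paper. The function names, the threshold and the ink colours are hypothetical.

```python
CHANNEL_INDEX = {"red": 0, "green": 1, "blue": 2}


def scanned_grey(rgb, dropout="red"):
    """Grey value recorded when only the drop-out channel is sensed."""
    return rgb[CHANNEL_INDEX[dropout]]


def drops_out(rgb, dropout="red", threshold=128):
    """Return True if the ink reads as light as paper and vanishes."""
    return scanned_grey(rgb, dropout) >= threshold


# Hypothetical ink colours as (red, green, blue) tuples.
red_label = (220, 40, 40)
black_text = (0, 0, 0)
blue_label = (40, 40, 200)

red_label_lost = drops_out(red_label, "red")
black_text_lost = drops_out(black_text, "red")
blue_label_lost_under_red = drops_out(blue_label, "red")
blue_label_lost_under_blue = drops_out(blue_label, "blue")
```

In this model the red label vanishes under red drop out while black text and the blue label remain, and the blue label vanishes only under blue drop out. If the vanished label was the keyword the engine relies on, there is nothing left to anchor the search.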
While these issues are not impossible to overcome with the right tools, they greatly affect an OCR/ICR engine’s ability to locate information and extract the important data. For those responsible for producing a workflow or advising business teams on data extraction projects, these seven issues should be key considerations. Knowing the potential forms processing pitfalls and how to avoid them can lead to significant improvements in data extraction.
Shane Cooper is with Parascript, a recognition technology provider. For more information on ICR, see http://info.parascript.com/not-all-icr-software-is-created-equal