You want me to OCR that!

By Brooke Martin

Thinking about OCR technology to help you extract data from paper or PDF documents? There are a number of things to keep in mind when you begin your evaluation. This discussion focuses on advanced data capture requirements and is meant to give you some high-level information and direction.

Some things to have ready when engaging a vendor:

  • Understand your document types: invoices, EFT transactions, work orders, etc.
  • Know how they come into your organization (email, paper and/or fax) and in what quantity
  • Have ample samples available to discuss
  • Know what data you want extracted from the documents; highlight the data sections in yellow
  • Know your benchmarks: how long does it take to key information into your system now? How many keystrokes does it take? As an example, how many invoices per day can an A/P clerk process?
  • Look for independent information, or network with your peers


The simplest, but perhaps most awkward, step of the document receipt process is the preparation and physical scanning of the documents. Opening and sorting mail, flattening and separating the documents, removing staples, and feeding the pages through a scanner are all highly manual tasks. You need to determine whether this is a skill and resource your organization is willing to invest in, because it greatly affects the success of your OCR project. If your documents come to you via email or some form of FTP, even better.

The next step in an OCR solution is to understand whether the documents need to be classified, or whether you will always scan the same type of document, such as invoices. The answer determines whether you need classification automation in the software.

How will the software be set up to recognise the data you want to collect? There are two primary industry methods:

  • Template based: you or the solution provider must program the OCR application for each document. In the case of invoices, each vendor has to be set up in the system. Pros: much more effective data recognition, and better workflows and data integration. Cons: you need a special skill set or advanced product knowledge in house, or you must rely on your solution provider. The project takes more time, which can add to project costs.
  • Learning based: learn as you go. The software is configured so your team can teach the system where the data is on each document, so the next time the system sees the same document, it gathers the data. Pros: lower project fees, the project is up and running faster, and it is easy to configure. Cons: not as advanced. Not all documents are created equal, and at times advanced programming will be needed. Work with your provider and let them assess your documents' complexity; if they are complex, go with a template-based system.
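
To make the template-based idea concrete, here is a minimal Python sketch. It assumes a template is simply a set of regular expressions per invoice vendor, applied to text the OCR engine has already produced. The vendor names, field names and patterns are hypothetical, and commercial products define templates in their own ways.

```python
import re

# Hypothetical per-vendor templates: each maps a field name to a regex
# with one capturing group. Real products define templates differently.
TEMPLATES = {
    "Acme Supply": {
        "invoice_number": r"Invoice\s*#\s*(\w+)",
        "total": r"Total\s+Due:\s*\$?([\d,]+\.\d{2})",
    },
    "Northwind": {
        "invoice_number": r"Inv\.\s*No\.\s*(\w+)",
        "total": r"Amount\s+Payable\s*\$?([\d,]+\.\d{2})",
    },
}


def extract_with_template(vendor, ocr_text):
    """Apply the vendor's template; fields that do not match come back as None."""
    template = TEMPLATES.get(vendor)
    if template is None:
        # No setup has been done for this vendor yet.
        raise KeyError(f"no template configured for vendor {vendor!r}")
    fields = {}
    for name, pattern in template.items():
        match = re.search(pattern, ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


sample = "ACME SUPPLY\nInvoice # A1042\nTotal Due: $1,250.00"
print(extract_with_template("Acme Supply", sample))
```

The sketch also shows where the cost comes from: every new vendor layout needs its own entry before any data can be captured.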

After some discovery, your vendor should be able to direct you to the solution that fits best. Does your vendor offer different data capture technologies?

Once a document is scanned and released to the OCR software, any document that needs validation (meaning the OCR failed to recognise some of the data) is presented to the user. The user interface is an important aspect of this process, and not all vendors are created equal. Does the process of identifying and correcting data seem user friendly? If not, your project could be at risk: low user adoption, frustrated users, abandoned use and loss of efficiency gains.
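
As a rough illustration of how documents end up in front of the user, the sketch below assumes the OCR engine reports a confidence score between 0 and 1 for each field and that anything below a threshold, or left empty, goes to the operator. The `Field` class, the 0.90 threshold and the sample values are assumptions for illustration, not any vendor's actual behaviour.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical OCR engine


def fields_needing_review(fields, threshold=0.90):
    """Return the fields an operator should check before export."""
    return [f for f in fields if f.confidence < threshold or not f.value.strip()]


document = [
    Field("invoice_number", "A1042", 0.99),
    Field("invoice_date", "2O19-03-14", 0.62),  # letter O misread for a zero
    Field("total", "", 0.95),                   # nothing was captured
]
for field in fields_needing_review(document):
    print(field.name)
```

Raising the threshold sends more fields to the operator and lowering it sends fewer, which is one of the trade-offs worth asking a vendor to demonstrate.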

Export: what type of data export do you need? Is it easily configurable? Where do you want to send the data? Does the vendor have integration experience? Where do you want the documents to go: ERP, CRM, DMS or EHR? Is there auditability in the process, and how important is that?

Data Accuracy

Like all good software, there are risks. There will be times when the software thinks it is correct and passes wrong information through the system without the operator knowing. Test for this in your evaluations. Understand this risk and decide what level is acceptable. You can mitigate it by putting controls into the capture and OCR process before the documents get to the verifying operation. Ask your vendor to explain how they would handle this.
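
One example of such a control, sketched here in Python, is a cross-field check that does not depend on the engine's own confidence: on an invoice, subtotal plus tax should equal the total. The function name, the one-cent tolerance and the sample amounts are assumptions for illustration; your vendor may offer different or additional checks.

```python
from decimal import Decimal


def totals_reconcile(subtotal, tax, total, tolerance=Decimal("0.01")):
    """True when subtotal + tax matches total within the tolerance."""
    difference = Decimal(subtotal) + Decimal(tax) - Decimal(total)
    return abs(difference) <= tolerance


# Values read correctly: the arithmetic holds.
print(totals_reconcile("1100.00", "150.00", "1250.00"))

# The total was misread as 7250.00. Even if the engine is confident,
# the arithmetic fails and the document can be sent for verification.
print(totals_reconcile("1100.00", "150.00", "7250.00"))
```

A check like this only catches errors in fields that are related to each other, so it reduces the risk rather than removing it.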

You need to know that you will not get 100% data extraction accuracy. What is important goes back to your current KPIs: can you scan and verify data from your documents faster than you can key it in by hand? If the answer is yes, your project can be successful. You may be presented with information that a certain document is not suitable for the software solution, and recognise that it is better to enter that document's data manually.
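
The comparison can be worked through with a few lines of Python. This is a minimal sketch with made-up timings; substitute the benchmarks you measured for your own documents. It assumes every OCR document is scanned, only a share of them stop for verification, and a share are unsuitable for OCR and are keyed by hand anyway.

```python
def docs_per_hour(seconds_per_doc):
    """Convert average handling time into hourly throughput."""
    return 3600 / seconds_per_doc


def ocr_seconds_per_doc(scan_s, verify_s, review_rate, manual_s, reject_rate):
    """Average operator time per document in the OCR workflow.

    review_rate: share of OCR documents that stop for verification
    reject_rate: share of documents unsuitable for OCR, keyed by hand instead
    """
    ocr_path = scan_s + review_rate * verify_s
    return (1 - reject_rate) * ocr_path + reject_rate * manual_s


# Hypothetical figures, not measurements.
manual_s = 180  # seconds to key one invoice by hand today
ocr_s = ocr_seconds_per_doc(
    scan_s=10, verify_s=60, review_rate=0.4, manual_s=manual_s, reject_rate=0.1
)

print(f"manual: {docs_per_hour(manual_s):.0f} documents per hour")
print(f"OCR:    {docs_per_hour(ocr_s):.0f} documents per hour")
print("OCR is faster" if ocr_s < manual_s else "manual entry is faster")
```

With these illustrative numbers the OCR workflow comes out ahead, but the result is sensitive to the review and reject rates, which is why real samples matter in an evaluation.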

Be prepared to ask for references and project challenges! 

Brooke Martin is a Senior Solutions Consultant with Canada’s Process Fusion Inc.