It's about time for imaging

Time is money when scanning and indexing paper documents, says Paul Montgomery in this imaging cost analysis.

There was a time when imaging was part of a monolithic proprietary solution, and the only way a company could proceed down the project path was along the route laid down for it by the vendor. Like so many other closed systems in IT, this has now changed for the better with the advent of open standards and greater interoperability.

This impetus has been reinforced by the constantly dropping price of high-quality scanners and imaging software, and the ease with which imaging components can be linked to software higher up the chain with such component and connectivity technologies as ODBC, COM, CORBA, ActiveX and Java. The average corporate user can now concentrate, more than ever before, on what the vendors like to call a "best-of-breed solution", in which the user is free to pick and choose according to specific needs.

The most pressing need for most companies in this area is to save money. The imperative is to get the process done for as little cost as possible, whilst retaining all of the functionality. As a simple lesson in economics, there are two elements to cost in imaging: capital equipment and labour. Because the former is intended to replace the latter, the balance between the two is a matter of waiting until capital costs for a particular task fall to the point where it becomes cost-effective to invest in replacing the labour previously assigned to that task.

Given that competent scanners can be bought for around $5,000 at the workgroup end, and physical components are becoming cheaper at the enterprise end as well, there are more opportunities now to substitute software for human effort. Essentially, this means decreasing the time that a scanner operator has to spend on each page. The average time to scan one page, given a whole lot of assumptions about the nature of the infrastructure behind the process, might be estimated at 20 seconds. If the operator works for $12 per hour, that's a cost of 6.67 cents per page. How do these numbers come down? By stealing seconds here and there, which add up over a full year to tens of thousands of dollars in savings per operator. That requires an in-depth analysis of the elements of the document capture process to ascertain if and where gains can be made at each step: batch preparation, scanning, optical character recognition (OCR), image clean-up, indexing, quality assurance (QA), rescanning, and release.
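As a rough sketch, that arithmetic can be spelled out in a few lines of Python. Every figure here, from the wage to the page time to the annual page volume, is an illustrative assumption; substitute your own organisation's numbers.

    WAGE_PER_HOUR = 12.00    # operator wage, dollars
    ANNUAL_PAGES = 360_000   # assumed fixed yearly workload per operator

    def labour_cost(seconds_per_page):
        # Annual labour cost to capture ANNUAL_PAGES at the given page time
        return ANNUAL_PAGES * seconds_per_page * WAGE_PER_HOUR / 3600

    base = labour_cost(20)  # the 20-second-per-page baseline
    print(f"Cost per page: {20 * WAGE_PER_HOUR / 36:.2f} cents")  # 6.67

    for saved in (1, 2, 5, 10):
        saving = base - labour_cost(20 - saved)
        print(f"Shave {saved:2d} s/page -> ${saving:,.0f} per year saved")

On these assumptions the baseline operator costs $24,000 a year, and shaving ten seconds from every page claws back half of that; a few such gains across several operators is where the tens of thousands come from.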

Batch Preparation

Scanning pages in batches, instead of individually, is the biggest time saver in itself. Enterprise entry-level scanners can typically achieve speeds of at least 40 pages per minute (ppm), but if the operator has to spend time inserting each document individually and then indexing it while the scanner is idle, those nominal speeds are useless. Batches of 100 or more pages make the speed mean something, as all of the documents are fed through in sequence, and the OCR and indexing software can do its work on one document while the next is being scanned. The table below* shows how much this can save (these figures are arbitrary, and it would be instructive to figure out what they would be for your organisation).

Apart from the savings from actually implementing batch processing, there is also the question of how large a batch should be. The temptation is to make it as large as possible, to minimise the time overhead taken up with loading and unloading the scanner and indexing station. This has to be balanced against the inevitable need to rescan some of the documents which are badly scanned. If an operator needs to spend five minutes wading through 500 pages just to find one misfed document, then it's not worth it.
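One way to think about the trade-off is as a simple model: the load/unload overhead is amortised across the batch, while the expected cost of hunting for a misfed page grows with it. The Python sketch below uses invented figures for the overhead, misfeed probability and page-search speed, so treat the shape of the curve, not the numbers, as the lesson.

    OVERHEAD_S = 120.0   # assumed seconds to load/unload per batch
    MISFEED_P = 0.002    # assumed probability a given page misfeeds
    SEARCH_S = 0.6       # assumed seconds to eyeball one page in a hunt

    def per_page_overhead(batch_size):
        # Expected extra seconds per page from batching effects alone
        fixed = OVERHEAD_S / batch_size                   # amortised load/unload
        hunting = MISFEED_P * SEARCH_S * batch_size / 2   # misfeed found halfway, on average
        return fixed + hunting

    for n in (50, 100, 200, 500, 1000):
        print(f"batch of {n:4d}: {per_page_overhead(n):.2f} s/page overhead")

The per-page overhead falls as batches grow, bottoms out (near 450 pages on these figures), then rises again as misfeed hunts get longer; the sweet spot will differ for every operation.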

OCR (optical character recognition)

As can be seen from the first table, indexing is by far the most time-consuming part of the process, which has led vendors to find ways to automate it. The first of these is optical character recognition, where bitmapped data is converted to the best-guess equivalent ASCII characters by software. OCR is particularly suited to forms, where there are defined zones which can be scanned for alphanumeric text, and is also used extensively for legal briefs, where the entire page is composed of printed text and so is relatively easy to recognise.

The problem with OCR is that it is unreliable. Its accuracy is dependent on the quality of the identification engine within the software, the quality of the scanner, and the quality of the original image, so if any of these are lacking, then the end result is an error rate which is more trouble than it is worth.

If, for the purposes of our simulation, we assume that accuracy is 97 percent, then to ascertain the error rate, the accuracy has to be raised to the power of the number of characters in each field. If each field has ten characters on average, the chance of every one being recognised correctly is 0.97 to the power of ten, or 74 percent. Thus the error rate is 26 percent, meaning that if there were four such fields on a page, seven out of every ten pages would need correcting in at least one field.
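For the sceptical, that arithmetic is easy to reproduce in Python; the 97 percent accuracy, field length and field count are the same assumed figures.

    char_accuracy = 0.97     # assumed per-character accuracy
    chars_per_field = 10
    fields_per_page = 4

    field_ok = char_accuracy ** chars_per_field        # about 74 percent
    page_needs_fix = 1 - field_ok ** fields_per_page   # about 70 percent

    print(f"Field read correctly: {field_ok:.0%}")
    print(f"Field error rate:     {1 - field_ok:.0%}")
    print(f"Pages needing at least one correction: {page_needs_fix:.0%}")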

This sounds bad, but how does it stack up against manual keying? If an indexing operator can type 10,000 characters per hour, it takes 3.6 seconds to type data from a 10-character field. To allow for the time spent looking as well as typing, it might be better to round that off to four seconds. For a 100-page batch with one such field per page, this equates to 400 seconds.

In our 97 percent example, if it takes two seconds to conduct OCR on a 10-character field, then the total time taken on a 100-page batch is 200 seconds, plus four seconds each for manual keying of the incorrectly recognised fields. In other words, (2 × 100) + (4 × 26), or 304 seconds.
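The comparison between the two approaches, on the same assumed figures, comes down to a few lines:

    PAGES = 100          # one 10-character field per page
    KEY_S = 4.0          # seconds to key one field by hand
    OCR_S = 2.0          # seconds for OCR to process one field
    FIELD_ERROR = 0.26   # from the 97 percent example above

    manual = PAGES * KEY_S                             # 400 seconds
    ocr = PAGES * OCR_S + PAGES * FIELD_ERROR * KEY_S  # 200 + 104 = 304 seconds

    print(f"Manual keying: {manual:.0f} s; OCR plus correction: {ocr:.0f} s")
    print(f"Saving: {1 - ocr / manual:.0%}")           # 24 percent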

This example is contrived, of course, and the assumptions made about the error rate of the automated process are arbitrary at best, but it represents a saving of almost a quarter of the time. Even if the assumptions seem generous in relation to your organisation's parameters, there are several ways to improve accuracy to the point where OCR is cost-effective.

Parts of the recognised text can be checked against existing database information. For instance, if two fields provide a surname and employee number, the first name can be deduced and checked against the OCR version. This also illustrates another technique, that of validation scripts, which perform simple checks on fields according to preset rules.
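A toy validation script might look like the Python below. The employee table, field names and number format are hypothetical, invented purely to show the shape of the technique.

    EMPLOYEES = {("SMITH", "E10442"): "JANE"}  # (surname, number) -> first name

    def validate(fields):
        problems = []
        # Preset rule: employee numbers look like 'E' plus five digits
        num = fields["employee_number"]
        if not (len(num) == 6 and num[0] == "E" and num[1:].isdigit()):
            problems.append(f"bad employee number: {num!r}")
        # Cross-check: deduce the first name from existing database data
        expected = EMPLOYEES.get((fields["surname"], num))
        if expected and expected != fields["first_name"]:
            problems.append(f"first name {fields['first_name']!r} should be {expected!r}")
        return problems

    print(validate({"surname": "SMITH", "employee_number": "E10442",
                    "first_name": "JANF"}))  # flags the OCR slip in 'JANE'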

The bitmap scanned image can be cleaned up to make it more palatable to the OCR engine. Techniques include: deskewing, or straightening a crooked page; deshading, to remove the small dots caused by a coloured background; despeckling and streak removal, which delete spots and stripes caused by foreign material in the scanner or faults in the software driver; line removal, to white out the guiding lines on typed or handwritten forms and pages; and edge enhancement, which rounds off the outline of a character.
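Several of these clean-up steps can be approximated with off-the-shelf tools. The sketch below uses the Pillow imaging library as an assumed stand-in (any package with filtering and rotation would serve): a median filter stands in for despeckling, a threshold for deshading, and rotate() undoes a skew angle estimated by some other means. The file names are hypothetical.

    from PIL import Image, ImageFilter

    def clean_page(path, skew_degrees=0.0):
        img = Image.open(path).convert("L")               # greyscale
        img = img.filter(ImageFilter.MedianFilter(3))     # despeckle stray dots
        img = img.point(lambda p: 255 if p > 160 else 0)  # drop shaded backgrounds
        if skew_degrees:
            img = img.rotate(skew_degrees, expand=True, fillcolor=255)  # deskew
        return img

    clean_page("form_0042.tif", skew_degrees=1.8).save("form_0042_clean.tif")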

If possible, text should be replaced by bar codes, which are much better suited to automated recognition. Bar code scanning software has a far easier time attaining error rates of less than half a percent, even when codes are read at an angle, especially since the symbology usually has error checking built in.
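As an example of that built-in error checking, the EAN-13 symbology reserves its final digit as a checksum computed from the other twelve, so a scanner can reject most misreads outright:

    def ean13_check_digit(first12):
        # Zero-indexed even positions weigh 1, odd positions weigh 3
        total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
        return (10 - total % 10) % 10

    print(ean13_check_digit("400638133393"))  # 1, so the full code is 4006381333931

If a misread changes any single digit, the checksum no longer matches and the read is rejected rather than passed through as bad data.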

The other part of the OCR error equation is the number of characters which have to be recognised. If these can be minimised, either by reducing the number of fields or the number of characters per field, then some of the burden is lifted from the software. Any movement in this direction also decreases the worth of the database and harms the chance of picking up errors by cross-checking, so, as with all of the other choices to be made, users have to be wary of the pitfalls.

* Time and cost estimates, plus other material in this article, adapted from "The Dynamics of Cost in Document Capture", a white paper from Kofax Image Products published in September 1996.