The user dilemma in document management

By Simon Kravis, Principal, Aleka Consulting

A document management system (DMS) is often regarded as a cure-all for every problem an organisation has with information retrieval and records management. While a multitude of DMS products are available for this purpose, with proven capability to provide such a solution, they all require users to work with the supplied software in a way that delivers the expected outcomes.

Unfortunately, the way in which users store and retrieve electronic documents is typically based on their experience with file and folder structures on a hard drive. If these habits prevail when using a DMS, the expected benefits may not materialise, despite the expense and disruption of the DMS rollout.

One of the major advantages of a DMS over file/folder storage is its ability to systematically store different versions of a document. File naming conventions can be used to implement version control in a shared file/folder environment with multiple authors, but like any convention, these may or may not be followed by, or even known to, all users.

Different versions of the same document can be detected automatically with reasonable reliability by comparing document text content. Similar text content means the documents are likely to be different versions, as most changes to documents in the course of their evolution are minor.

Unlike binary duplication of files, which can be detected rapidly and precisely using checksum algorithms such as MD5 or SHA-1, the estimated text similarity between documents depends on the similarity algorithm used, the similarity threshold supplied to the algorithm, and the process of extracting text from the documents.
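As a concrete illustration, a minimal Python sketch of binary duplicate detection via checksums might look like the following. The choice of SHA-1 and the idea of walking a folder tree are illustrative assumptions rather than a description of the tool actually used in the study.

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def sha1_of_file(path, chunk_size=65536):
        """Return the SHA-1 hex digest of a file, read in chunks."""
        digest = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def binary_duplicate_clusters(root):
        """Group files under 'root' by identical binary content."""
        clusters = defaultdict(list)
        for path in Path(root).rglob("*"):
            if path.is_file():
                clusters[sha1_of_file(path)].append(path)
        # Keep only clusters with more than one member, i.e. exact binary duplicates
        return {h: paths for h, paths in clusters.items() if len(paths) > 1}

Two files with the same digest are, for practical purposes, binary-identical, which is why the proportion of unique checksums can be measured quickly and unambiguously.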

There are many text extraction programs for common document formats such as Word, PDF and Excel.  Office programs also have internal methods of extracting text content. All these methods tend to give slightly different results and some may fail on documents from which text can be extracted successfully using another method.

For example, text extracted from PDF files created from Word is invariably slightly different from text extracted from the parent document. Search engines have faced this problem for some time: one very well-known search engine uses three methods of text extraction from PDF files, applying the first one that succeeds. (In this context a failure to extract text is much less noticeable, as the correct search result is simply missing.)
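The fallback pattern described above can be sketched in Python. The two extraction libraries shown here (pdfminer.six and PyPDF2) are assumptions chosen for illustration; they are not necessarily the methods used by any particular search engine.

    from pdfminer.high_level import extract_text as pdfminer_extract  # pdfminer.six
    from PyPDF2 import PdfReader

    def extract_with_pypdf2(path):
        """Concatenate the text of every page using PyPDF2."""
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    def extract_pdf_text(path):
        """Try several extractors in turn and return the first non-empty result."""
        extractors = (pdfminer_extract, extract_with_pypdf2)
        for extractor in extractors:
            try:
                text = extractor(path)
                if text and text.strip():
                    return text
            except Exception:
                continue  # this method failed; fall through to the next extractor
        return ""  # all methods failed; the document simply will not be searchable

The design choice is deliberate: an extractor that raises an exception or returns empty text is silently skipped, because for search or similarity purposes a missing result is preferable to a crash.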

PDF files pose particular problems for text extraction, as the format supports image representations of text, which are not searchable. Scanned documents are frequently stored in PDF format.

Where Optical Character Recognition (OCR) has been applied in order to create searchable text, the quality of the text created may be very poor if the original image is of poor quality. OCR quality has improved greatly in recent years as algorithms have improved and computing power has increased, and re-applying OCR to scanned documents processed many years ago may improve their searchability.
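Re-applying OCR to an old scanned PDF can be sketched briefly. The example assumes the pdf2image and pytesseract packages (and the underlying Poppler and Tesseract tools) are installed; it is a minimal sketch, not a production pipeline.

    from pdf2image import convert_from_path   # renders PDF pages to PIL images
    import pytesseract                         # wrapper around the Tesseract OCR engine

    def ocr_pdf(path, dpi=300):
        """Re-run OCR over a scanned PDF and return the recognised text."""
        pages = convert_from_path(path, dpi=dpi)  # one image per page
        return "\n".join(pytesseract.image_to_string(page) for page in pages)

Rendering at a higher DPI generally improves recognition on poor-quality scans, at the cost of processing time.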

I recently undertook an exercise to estimate the extent to which use of a DMS reduced the incidence of copies and different versions of documents. To do this, the prevalence of binary and text near-duplication was estimated in two similar document collections created in the course of two large software development projects running over several years between 2009 and 2011.

Both utilised 10 – 50 staff for design, management, development and testing. One project (FS) used a file share as a document store, and the other (SP) used the SharePoint DMS as the document repository. Both projects used the same folder structure for documents, but the SP project supported storage of multiple versions of a document in a systematic fashion, whereas the FS project did not.

Files in the SP collection were accessed via the Windows Explorer interface, and where multiple versions of SharePoint documents existed, only the most recent version was profiled. Near-duplication of text content was estimated using a word vector approach, which provided good computational efficiency.
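The word vector comparison can be sketched as follows. The tokenisation, the frequency-count vectors and the 0.8 cosine similarity threshold are illustrative assumptions, not the parameters actually used for the FS and SP collections.

    import math
    import re
    from collections import Counter

    def word_vector(text):
        """Build a word-frequency vector from extracted document text."""
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(words)

    def cosine_similarity(v1, v2):
        """Cosine similarity between two word-frequency vectors (0.0 to 1.0)."""
        common = set(v1) & set(v2)
        dot = sum(v1[w] * v2[w] for w in common)
        norm1 = math.sqrt(sum(c * c for c in v1.values()))
        norm2 = math.sqrt(sum(c * c for c in v2.values()))
        if norm1 == 0 or norm2 == 0:
            return 0.0
        return dot / (norm1 * norm2)

    def are_near_duplicates(text_a, text_b, threshold=0.8):
        """Flag two documents as near-duplicates if their vectors are similar enough."""
        return cosine_similarity(word_vector(text_a), word_vector(text_b)) >= threshold

Because the vectors ignore word order and layout, the comparison is cheap to compute but, as noted later, it also ignores graphic content entirely.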

Results

In the FS collection, 85% of the text files have a unique binary checksum and the largest cluster of binary duplicates has 15 members, but only 40% of text files have a unique word vector, with the largest cluster of near-duplicate files having 129 members. This indicates that near-duplication is much more common than exact binary duplication, although the difference will vary according to the parameters used in near-duplicate estimation.

In the SP collection, 80% of the files have a unique binary checksum and the largest binary checksum cluster has 23 members. Only 38% of the files have a unique word vector, and the largest near-duplicate cluster has 161 members.

The presence of multiple versions of files in the SP (SharePoint) collection would probably increase the proportion of near-duplicates somewhat, but the number of files stored in this way was not available. The similarity of duplicate spectrum profiles in the two collections indicates that although SharePoint can support document version control, files which were similar or identical to each other were not stored in SharePoint in this way.

Examination of some of the near-duplicate clusters in the FS collection indicated that a convention of including the version number, Final/Draft status and author initials in the file name was used in many cases. If this complex convention were adhered to by all users it would be possible to locate the definitive version of a document, but the version control facilities of SharePoint (at the time new to most of the project staff) were not used, and SharePoint was used very much like a file share by the majority of users.
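A convention of this kind can be checked programmatically. The pattern below, including the example name "Design Spec v2.1 Draft SK.docx", is a hypothetical reconstruction for illustration, not the convention actually observed on the project.

    import re

    # Hypothetical pattern: "<title> v<major>.<minor> <Draft|Final> <author initials>.<ext>"
    NAME_PATTERN = re.compile(
        r"^(?P<title>.+?) v(?P<version>\d+(?:\.\d+)?) "
        r"(?P<status>Draft|Final) (?P<author>[A-Z]{2,3})\.(?P<ext>\w+)$"
    )

    def parse_file_name(name):
        """Return the convention's components, or None if the name does not comply."""
        match = NAME_PATTERN.match(name)
        return match.groupdict() if match else None

    # parse_file_name("Design Spec v2.1 Draft SK.docx")
    # -> {'title': 'Design Spec', 'version': '2.1', 'status': 'Draft',
    #     'author': 'SK', 'ext': 'docx'}

The weakness of any such convention is visible in the code: a single missing space or misspelt status word makes a file invisible to the rule, which is exactly the kind of inconsistency that systematic version control is meant to remove.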

Conclusions

Analysis of collections of text documents from two software development environments reveals a modest level of binary checksum duplication (80-85% of files unique) but a much higher level of near-duplication of text content (38-40% unique), using a word vector similarity measure.

Although the absolute level of near-duplication measured depends to some extent on the parameters chosen, the comparison between collections is valid because near-duplication was estimated using the same parameters for each, and duplication levels did not vary significantly between the file share and the SharePoint document repository. The lack of reduction in near-duplication when using SharePoint may be attributable to user inexperience, but it does underscore the fact that the availability of version control in a document management system does not necessarily mean that it will be used.

Many of the documents created during a software development project consist of a graphic and a small amount of text. Documents which are quite different may be incorrectly flagged as similar because graphic content is ignored when assessing similarity. Profiling of a collection of administrative documents from a Commonwealth department indicated that 60% of the files contained unique text content, on the basis of word vector comparison using the same parameters as the FS and SP collections.

Anecdotally, use of the SharePoint repository was unpopular with users, owing to poor performance compared with a share drive, especially for large files. Document disposal, which is facilitated by the availability of a unique sentencing date in a document management system, was not an issue for these particular collections, as all development documentation is retained for as long as the software is in use.

Despite the apparent lack of reduction in near-duplication, the use of SharePoint may have been beneficial through the provision of easier web access to documents. In conclusion, the introduction of a DMS to replace a share drive needs careful planning, adequate provisioning and user training if it is to achieve the expected benefits.

Simon Kravis is the principal of information management consultants Aleka Consulting. http://www.alekaconsulting.com.au/contact.html