Can you search annotation/redaction data in your ECM?

By Xiaopeng He

Annotation and redaction capabilities are advanced features of document/image viewers. A group of users who share documents can collaborate by annotating the contents of those documents. Redactions provide content-level security: they protect sensitive information in portions of a document from users who otherwise have access to the entire document.

Redaction is a special form of annotation. Many document viewers and image viewers support annotation/redaction capabilities. In ECM repositories, annotation/redaction content is normally stored separately from the content of the documents it is associated with.

Such separation has several advantages. First, it allows the documents in the system to be annotated or redacted without the content of the documents being modified. Many users can annotate/redact a document simultaneously without worrying about losing annotation content to concurrency conflicts.
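A minimal sketch of this separation, assuming a hypothetical in-memory store (the names `Annotation` and `AnnotationStore`, the record fields and the sample data are illustrative and not taken from any ECM product): annotations are kept as their own records keyed by document ID, so adding one never touches the document bytes.

```python
from dataclasses import dataclass
from itertools import count
from threading import Lock


@dataclass(frozen=True)
class Annotation:
    """One annotation or redaction record (hypothetical schema)."""
    annotation_id: int
    document_id: str
    page: int
    kind: str        # e.g. "text", "arrow", "redaction"
    author: str
    text: str = ""   # text body or tooltip, if any


class AnnotationStore:
    """Keeps annotation records apart from document content."""

    def __init__(self):
        self._by_document = {}
        self._ids = count(1)
        self._lock = Lock()

    def add(self, document_id, page, kind, author, text=""):
        # Each call appends a new record. No existing record and no
        # document content is rewritten, so concurrent users do not
        # overwrite each other's work.
        with self._lock:
            note = Annotation(next(self._ids), document_id, page, kind, author, text)
            self._by_document.setdefault(document_id, []).append(note)
            return note

    def for_document(self, document_id):
        with self._lock:
            return list(self._by_document.get(document_id, []))


# Made-up sample data: the document bytes live in a separate store.
documents = {"contract.pdf": b"%PDF-1.7 original bytes"}
store = AnnotationStore()
store.add("contract.pdf", 3, "text", "alice", "Check the renewal clause")
store.add("contract.pdf", 3, "redaction", "bob")
```

In this sketch two users annotate the same page, yet `documents["contract.pdf"]` is never written to. A real repository would persist the records in a database rather than a dictionary.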

Second, it allows users to annotate/redact documents of different content formats with the same set of annotation/redaction objects. There are many file formats: PDF, MS Office, Text, TIFF, PNG and AutoCAD, just to name a few. Without a document viewer, an ECM application must rely on native applications to display the documents in the repository.

Some of these native applications come with their own annotation/redaction capabilities. For example, Adobe Acrobat allows users to annotate PDF documents. However, Microsoft Office supports a completely different set of annotations.

A document viewer control with annotation/redaction capabilities displays documents of many file formats, and it displays them embedded in web browsers or mobile apps so that users don’t have to switch back and forth among different applications. It also allows users to annotate/redact various documents with a predefined set of annotation objects, regardless of the native annotation data format that the native applications may have. For example, the 3Si hViewer supports the display of documents of many file formats inline and embedded in HTML5 browsers. It also supports the annotation/redaction of documents in many file formats including PDF and Microsoft Office.

Third, it allows annotation content to be treated as structured data, while document content is treated as unstructured data in the ECM repositories. Searching structured data is easier and more efficient than searching unstructured data.

With annotation/redaction content stored separately from the document content, annotations/redactions qualify as document indexes that can point to specific areas of document content.

Imagine a search hit that points to an annotation object on page Z of a document. By clicking the search hit, the user is taken to a document viewer displaying page Z of the document, where the annotation object resides, possibly with that object focused and selected in the viewer. This is a huge productivity improvement, given that many files in the repository may have multiple pages.

One extreme case we have run into is a single PDF file with over 22,000 pages! Providing indexing into a specific page of a multi-page document saves users time and effort.
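The idea of a search hit that lands on a specific page can be sketched as a toy inverted index over annotation text. This is an illustration under stated assumptions, not how any particular ECM product indexes its data: the `Hit` and `AnnotationIndex` names, the record fields and the sample annotations are all hypothetical.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Where a matching annotation lives (hypothetical fields)."""
    document_id: str
    page: int
    annotation_id: int


class AnnotationIndex:
    """Toy inverted index over annotation text and tooltips."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, document_id, page, annotation_id, text):
        hit = Hit(document_id, page, annotation_id)
        # Index every lower-cased word of the annotation text.
        for token in re.findall(r"\w+", text.lower()):
            self._postings[token].add(hit)

    def search(self, query):
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return []
        # A hit must contain every word of the query.
        matches = set.intersection(
            *(self._postings.get(token, set()) for token in tokens)
        )
        return sorted(matches)


# Made-up sample annotations, including one deep inside a very long file.
index = AnnotationIndex()
index.add("manual.pdf", 18250, 1, "Torque value superseded by bulletin 7")
index.add("manual.pdf", 42, 2, "See revised wiring diagram")  # e.g. an arrow tooltip
index.add("invoice.tiff", 1, 3, "Revised total approved")

for hit in index.search("revised"):
    print(f"{hit.document_id} page {hit.page} (annotation {hit.annotation_id})")
```

Each hit carries the document, the page and the annotation identifier, which is what a viewer would need in order to open the document at that page and select the annotation.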

This is all great. However, the question is whether your annotation data is searchable in the repository. If you create a text annotation in a document viewer and save the annotation object in an ECM repository, is that text searchable? Or if you create an arrow annotation in a document viewer and give the arrow object a tooltip, is that tooltip text searchable?

As corporate data volumes grow exponentially, answering these questions becomes more urgent than ever. Unless they are searchable, annotations/redactions do not qualify for document indexing.

Unfortunately, the answer to these questions is negative for many commercial ECM products. This is due to the data transparency issue described in another article titled “ECM Data Integrity & Transparency”.

To avoid the data transparency problem and other issues associated with annotation/redaction data in ECM repositories, some organisations take a step backwards by restricting the ability of document viewers to generate annotation/redaction content in the repository. This practice is commonly seen among popular EFSS (Enterprise File Synchronization and Sharing) sites such as Dropbox, Box, Syncplicity, Accellion, ShareFile, etc.

On these cloud storage services, users can view documents in a document viewer, but the viewer does not support annotation/redaction capabilities. Users simply cannot collaborate on the documents via annotations/redactions.

Disabling annotation/redaction features avoids the issues with annotation data, but this approach throws the baby out with the bathwater. Instead of facing the root issue and finding a sound solution, these applications throw away a significant productivity feature.

Many other EFSS services do not provide a document viewer at all. They rely on native applications to display the various document formats. We have not yet seen an ECM application that allows users to collaborate on documents via annotations/redactions and at the same time makes the annotation/redaction content searchable outside the document viewer.

Wouldn't it be nice for an ECM system to allow collaboration via annotations from a document viewer, while enabling the search component to find hits in annotation data?

Xiaopeng He is Founder & CEO of 3S International Corp, a Washington, DC-based consultant specialising in ECM solutions and viewer integrations.

http://www.3sillc.com