The RCAV Model for Managing Unstructured Content

By John Martin

John Martin’s Guide to Managing Unstructured Content: Practical Advice on Gaining Control of Unstructured Content provides a four-step model for gaining control of unstructured content. The book draws on John’s many years of migrating content, developing e-discovery and information management tools, and serving as a testifying and consulting expert on e-discovery. In this first part of the book, John gives advice and pointers for each of the four steps.

There are four basic steps to managing unstructured content. They are:

  • Rationalise. This is an inventory and collection process and answers the most basic of questions: “How many files do you have and where are they?”
  • Classify. This begins the process of creating useful intelligence from the content. It answers another basic question: “What’s in those files?” Accurate classification enables all downstream information governance activities, e.g., assigning useful record retention schedules and performing accurate, uncluttered retrieval.
  • Attribute. This involves a more granular determination: “What types of information do you want to pull or extract from within each classification?”
  • Validate. This involves measuring and confirming that all files have been accounted for and that resulting classifications and attributions are valid. It’s basically, “Do you have everything and is it accurate?”

In the following article we provide tips and points to consider for each of those four steps.

A. Rationalise - How Many Files & Where?

Managing unstructured content begins with inventorying it. Files must be located, counted, and copied for analysis and processing. Here are suggestions for rationalising unstructured content:

1. Anticipate Interruptions

Enterprise-scale file inventorying and copying is an ongoing, resource-intensive operation. Anticipate that it will be occasionally interrupted and be sure that your process can resume without having to repeat all the analysis.
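
To make resumability concrete, here is a minimal sketch in Python, assuming a simple JSON checkpoint file and placeholder paths of our own choosing; it records each completed directory so a restarted run skips work already done rather than repeating the analysis.

    import json
    import os

    CHECKPOINT = "inventory_checkpoint.json"   # hypothetical checkpoint file

    def load_done():
        """Return the set of directories already inventoried, if any."""
        try:
            with open(CHECKPOINT, "r", encoding="utf-8") as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()

    def save_done(done):
        """Persist progress so an interrupted run can resume."""
        with open(CHECKPOINT, "w", encoding="utf-8") as f:
            json.dump(sorted(done), f)

    def inventory(root):
        done = load_done()
        for dirpath, _dirnames, filenames in os.walk(root):
            if dirpath in done:
                continue                      # already processed before the interruption
            for name in filenames:
                print(os.path.join(dirpath, name))   # stand-in for real logging
            done.add(dirpath)
            save_done(done)                   # checkpoint after each directory

    if __name__ == "__main__":
        inventory(r"\\fileserver\share")      # hypothetical source location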

2. Beware the 260-Character Path/Filename Limit in Windows

Software for indexing, logging, or copying unstructured content files is often subject to the 260-character path/filename limit inherent in most Windows applications. That means that files whose path/filenames exceed that limit will not be examined by that software - they are simply invisible to it. They won’t be indexed, logged, or copied. We have found cases where 2% of files collected had been inaccessible to normal Windows applications because of overly long file paths. A short detection sketch appears after the list below.

This can happen in many ways, for example:

  • Zipped or compressed folders can contain multiple levels of folders that, although not troublesome in their original location, produce excessively long paths when unzipped or copied onto other, longer paths.
  • Drives or folders can be mapped to other drives or folders resulting in unexpectedly long paths.
  • Users sometimes use folder names to store notes about the files in the folder, e.g., “not to be included in the survey responses.” When folders get copied and moved to other locations, the total characters for all folders in the path can easily exceed 260 characters.
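
Here is a minimal detection sketch in Python, assuming the inventory runs on Windows; it walks the tree using the extended-length \\?\ prefix (so the walk itself is not blocked by the limit) and reports any paths over 260 characters. The root folder is a placeholder.

    import os

    MAX_WIN_PATH = 260   # classic MAX_PATH limit for many Windows applications

    def find_long_paths(root):
        """Yield full paths that exceed the classic Windows path limit."""
        # The \\?\ prefix asks the Windows APIs for extended-length path handling,
        # so the walk itself does not skip the deep folders we are trying to find.
        extended_root = "\\\\?\\" + os.path.abspath(root)
        for dirpath, _dirs, files in os.walk(extended_root):
            for name in files:
                full = os.path.join(dirpath, name)
                # Report the length of the "normal" path users and tools would see.
                plain = full.replace("\\\\?\\", "", 1)
                if len(plain) >= MAX_WIN_PATH:
                    yield plain

    if __name__ == "__main__":
        for p in find_long_paths(r"D:\collected"):   # hypothetical collection root
            print(len(p), p)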

 

3. Log All Files Reviewed

All files examined during the inventorying process should be logged whether or not they are copied. This permits audits of what was examined and can provide support for why any processing decisions were made. Ongoing, accurate logging goes to the very heart of data integrity.

4. Calculate Hash Values and Record in Log

Hashing algorithms have long been used on the Internet to provide a way to determine if emails or files were altered during transmission (see RFC 822 and subsequent standards). If two files have the same hash value they can be treated as identical, particularly when a collision-resistant algorithm from the SHA family is used. Hash values should be calculated for each file and stored in the log. This is a critical data integrity requirement. By doing this the files can be rehashed later to confirm whether they have changed in any way. The hash values also enable several other important functions, as described in several of the following tips.
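
A minimal sketch of chunked SHA-256 hashing with a CSV log follows; the log columns and file locations are assumptions for illustration, not a prescribed format.

    import csv
    import hashlib
    import os

    def sha256_of(path, chunk_size=1 << 20):
        """Hash a file in chunks so large files do not exhaust memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def hash_and_log(root, log_path="hash_log.csv"):
        with open(log_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["path", "size_bytes", "sha256"])
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    full = os.path.join(dirpath, name)
                    writer.writerow([full, os.path.getsize(full), sha256_of(full)])

    if __name__ == "__main__":
        hash_and_log(r"D:\collected")   # hypothetical collection root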

5. Identify NIST-Listed Software Files

Information governance concerns itself with the content created by or received by an enterprise. One way to focus on content is to exclude system files distributed by software companies, such as executables, help files, and templates. The National Institute of Standards and Technology (“NIST”) publishes a list of known system files and their hash values (its National Software Reference Library), and this NIST list can be used to identify such files. When such files are located, their presence should be noted on the log, including the filename, location, hash value, and whether the hash matched a NIST list entry.

6. Copy Only Deduped, DeNISTed Files

Hash values for content files should be compared to hash values of files already collected, and only new files that do not appear on the NIST list should be collected. Collecting only one instance or copy of each unique file minimises the number of files to be copied and permits the collection system to copy drives that are larger than the collection device itself, because it is not copying known system files or duplicate content files. Note that the detailed audit log retains all information about where the duplicate copies of all files were located; there is no need to copy bit-for-bit duplicates to track this information.
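
The core filter is a pair of set lookups, as in this sketch (the hash values shown are placeholders; in practice the NIST set would be loaded from the published list and the audit log would record every skipped location):

    def should_copy(file_hash, nist_hashes, seen_hashes):
        """Return True only for files that are neither known system files nor duplicates."""
        if file_hash in nist_hashes:
            return False          # NIST-listed system file: log it, don't copy it
        if file_hash in seen_hashes:
            return False          # duplicate content: log the extra location only
        seen_hashes.add(file_hash)
        return True

    # Example usage with placeholder values
    nist_hashes = {"a" * 64}                  # would come from the NIST list in practice
    seen_hashes = set()
    for h in ["a" * 64, "b" * 64, "b" * 64, "c" * 64]:
        print(h[:8], "copy" if should_copy(h, nist_hashes, seen_hashes) else "skip")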

7. Use Hash Values for File Names

Managing millions of files requires that they be stored in a standard fashion, i.e., not kept in their original folder paths, which may present numerous difficulties in terms of path length, inefficient numbers of files in individual folders, and possibly unacceptable characters. Using hash values for names has two significant advantages (a short sketch follows the list):

  • Collision-Free Names. While duplicate file names don’t present many problems if they are each in unique folders, when files are moved to new folders, as happens with standardised storage, there can be “collisions” when two or more files in the same folder have the same name. Basically, only one of them will be kept; the others with that file name will be essentially lost. This does not happen if hash values are used as names.
  • Self-Authentication. When the value obtained by hashing a file is used as its name it makes the file self-authenticating because the hash value can be recalculated and compared to the file’s name.
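
A short sketch of both ideas follows, assuming a SHA-256 store with a two-level folder fan-out (the fan-out is a common convention chosen here for illustration, not something the model prescribes):

    import hashlib
    import os
    import shutil

    def store_by_hash(src_path, store_root):
        """Copy a file into the store under its SHA-256 value; return the new path."""
        with open(src_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        # Fan out into sub-folders (e.g. ab/cd/abcd...) to avoid huge single folders.
        dest_dir = os.path.join(store_root, digest[:2], digest[2:4])
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, digest)
        if not os.path.exists(dest):   # a collision-free name means a duplicate is already stored
            shutil.copy2(src_path, dest)
        return dest

    def is_authentic(stored_path):
        """Self-authentication: the file's hash must still match its own name."""
        with open(stored_path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest() == os.path.basename(stored_path)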

 

8. Compress Collected Files

Collected files should be compressed when not being analysed in order to preserve data storage resources. By deNISTing, deduping, and compressing copied files, collection devices can often hold the unique content of source volumes considerably larger than the devices themselves.

9. Check for Duplicates in Container Files

Once copies of files have been centralised for processing, files that contain other files (e.g., *.ZIP or *.RAR files) need to be recursively opened, logging each new file or container file and its associated hash value. The goal is to identify the set of unique, deduped content files copied from the source files.
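
A sketch of recursive expansion for ZIP containers only, using Python's standard zipfile module; other container formats such as RAR would need additional libraries, and the file name is a placeholder.

    import hashlib
    import io
    import zipfile

    def walk_container(data, label, log):
        """Hash a blob, and if it is a ZIP archive, recurse into its members."""
        digest = hashlib.sha256(data).hexdigest()
        log.append((label, digest))
        if zipfile.is_zipfile(io.BytesIO(data)):
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                for info in zf.infolist():
                    if not info.is_dir():
                        walk_container(zf.read(info), label + "/" + info.filename, log)

    if __name__ == "__main__":
        log = []
        with open("collected.zip", "rb") as f:      # hypothetical container file
            walk_container(f.read(), "collected.zip", log)
        for name, digest in log:
            print(digest[:12], name)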

10. Anticipate Encrypted Files

Some files will be encrypted, making it impossible to accurately classify them or search on their contents. Decisions should be made ahead of time about how to process encrypted files, as there are several options, each with different costs and benefits. If files are encrypted they should be indicated on the log to be treated as possible exceptions, and the log should be updated if encrypted files are decrypted. Here are ways to handle encrypted files when processing large volumes of data for information governance or e-discovery purposes (a detection sketch follows the list):

  • Request passwords from users. This is time-consuming and fraught with difficulties, e.g., when users have resigned, died, or just don’t remember the password.
  • Brute force attempts. In this approach programs try virtually all possible passwords until a successful one opens the file. Depending on the length of the password this can be a rather quick process or a long and resource-intensive one. If a password for one file is found it can be tried against the other encrypted files to see whether the user re-used that same password.
  • Dictionary. Software can try all character strings found in the unencrypted files to see if the user had stored the password in one of those files.
  • Key recovery. Rather than try to find the password that the encrypting software uses to calculate the key that opens the file, programmatically generate all the possible keys.
  • Email. When the contents of individual emails have been encrypted, it may be possible to find an unencrypted version of the email by looking later in the thread on other custodians’ files.
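
Detection is usually the cheapest first step, so encrypted files can at least be flagged on the log. The sketch below checks the ZIP encryption flag bit and uses a rough string test for PDFs; both file names are placeholders, and the PDF test is a heuristic, not a full parser.

    import zipfile

    def zip_has_encrypted_members(path):
        """True if any member of a ZIP archive sets the encryption flag bit."""
        with zipfile.ZipFile(path) as zf:
            return any(info.flag_bits & 0x1 for info in zf.infolist())

    def pdf_looks_encrypted(path):
        """Rough heuristic: encrypted PDFs carry an /Encrypt entry in their trailer."""
        with open(path, "rb") as f:
            return b"/Encrypt" in f.read()

    if __name__ == "__main__":
        print(zip_has_encrypted_members("sample.zip"))   # hypothetical file names
        print(pdf_looks_encrypted("sample.pdf"))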

 

11. Encrypt Collected Files

Collected files represent potential security risks if they are stolen or unauthorised people gain access to them. To minimise those risks, encrypt the collected files. If passwords are used for both compression and encryption, security can be strengthened by using different passwords for the two processes.

12. Use Hash Values to Audit Compartmentalised Storage

Many organisations have content that is so sensitive it is supposed to be maintained only on servers physically disconnected from any other network. One way to find out if there has been “leakage” of this content is to calculate the hash values on the isolated or quarantined servers and then, when the general enterprise content is inventoried and hashed, compare hash values to see if copies of the secure content are being stored in unsecured locations or devices.
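
Once both inventories are hashed, the comparison itself is a simple set intersection, as in this sketch with placeholder values:

    def find_leakage(secure_hashes, enterprise_log):
        """Return log entries whose hashes match content from the quarantined servers.

        secure_hashes:  set of hash values computed on the isolated servers
        enterprise_log: iterable of (path, hash) tuples from the general inventory
        """
        return [(path, h) for path, h in enterprise_log if h in secure_hashes]

    # Placeholder example
    secure = {"deadbeef" * 8}
    general = [(r"\\share\public\memo.docx", "deadbeef" * 8),
               (r"\\share\public\notes.txt", "cafef00d" * 8)]
    for path, h in find_leakage(secure, general):
        print("possible leakage:", path)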

B. Classify — What Type of Content?

Consistent classification is the most critical challenge in managing unstructured content. If you can’t classify items, you can’t manage them. It has historically been very difficult to achieve consistent, scalable classification. Here are just a few of the downstream tasks that become much more difficult if not impossible without consistent classification:

  • Classification-based retrieval
  • Setting records retention periods
  • Determining user-level access rights
  • Setting department or business unit level access rights
  • PII detection
  • Setting system security specifications for content

 

These difficulties also lead to end-user workarounds that defeat many of the reasons for having ECM systems, e.g., users maintaining private but duplicative stashes of content. Note that file type is not a sufficiently useful file attribute for most information governance purposes. The file is just a container and information governance focuses on content. For example, a PDF could be virtually anything from a spreadsheet to a presentation, Word document, or website. Classification must go beyond file type to accurately label the type of content. Following are some considerations when setting up automated file classification systems:

1. Anticipate Constant Change

Changing business needs and changing regulatory requirements result in constant change in the documents used to perform business functions. Individual forms or templates change 10-15% per year on average, and many new document types are added each year. This constant change imposes the requirement that classification systems be flexible and responsive, ideally alerting administrators when new classifications are required.

2. Text Dependency has Limitations

Most automated classification systems rely on the presence of accurate textual representations of the files being classified. Classification systems must anticipate multiple problems with a text-based approach:

  • Language. Systems that seem to work fine with English documents may fail when presented with other languages that were not part of the original training sets or scripted rules. Multiple languages will also cause obvious problems with approaches based on taxonomies. Machine translation of content may not yield the desired classification accuracy.
  • Non-textual files. Many files have no text associated with them, e.g., files output as PDF or TIFF files from user software or captured as image-only documents by scanning or faxing. This may be a minor issue in some collections but in others non-textual files may account for appreciable percentages of all files. At the very least the percentage of non-text files ought to be measured to help determine what sort of remedial effort may be justified.
  • Poor-quality text files. Text layers can be created by optical character recognition (“OCR”) software, but the resultant associated text can be riddled with errors, making text-based classification very problematic. One area of particular concern is being able to classify all versions of the same document type consistently, e.g., to classify the original Word document with the PDF version and the scanned TIFF version.
  • Sentence dependency. Some auto-classification systems analyse text as presented in sentences and ignore non-sentence text. This causes them to fail to accurately classify documents like check lists, spreadsheets, PowerPoint presentations, and many forms-based documents.
  • Numeric Text. Text analytics and text search systems may ignore numeric text strings and evaluate pages and documents without considering their numeric text.

 

3. Address Document Unitisation Issues

One of the most basic, and often incorrect, assumptions about file classification is that there is just one document per file. Virtually all content management systems or file search systems permit only one set of fields to manage a file, e.g., there is one document type, one author, one date-created field. Document unitisation problems can arise when users assemble multiple documents or files, possibly even those created in different applications, into one PDF. Unitisation issues are also common in files created by scanning or faxing paper documents. The one-document-per-file limitation causes embedded documents to be “lost” in the sense that they are not represented in the fielded data that describes individual documents, e.g., Date, Author, Title, etc. Multi-document files can also cause errors in text analytics systems that depend on comparing the text of individual documents. At the very least, sampling of the various sources and storage locations should quantify the extent of this issue. The unitisation issue impacts all later downstream functions, e.g., assigning retention periods, accurate retrieval, and extracting document attributes.

4. Anticipate Scale & Granularity

Some text analytics systems - used, for instance, to place a few hundred thousand files into relatively large “buckets” such as Responsive or Non-Responsive for e-discovery - may not provide the granularity and scalability needed to support hundreds of classifications across millions of files. Anticipate your needs for scale and granularity when comparing competing solutions.

5. Anticipate “Do-Overs”

Business environments and missions evolve over time and those changes can affect how files or documents are classified. A well-designed classification system will permit reclassification without completely reprocessing large numbers of files.

6. Anticipate Multiple Classifications for Multiple Reasons

There are many reasons to classify unstructured content:

  • File share remediation to identify and remove unneeded files
  • Responsiveness/Relevance in discovery
  • Setting retention schedules
  • Setting access rights based on the work section or job classification of individual workers
  • Setting storage security requirements
  • PII detection/protection
  • To indicate need for specific image enhancements

 

Classification systems ought to permit multiple looks at the same content; otherwise the organisation may have to pay for and support multiple classification systems for multiple purposes.

7. Use Stakeholder Collaboration

Multiple heads are better than one for capturing enterprise knowledge about the business reasons why different document types are created and used and about the regulatory requirements that govern their use and disposition. Classification workflows that involve collaborative classification schemes will avoid many downstream problems. For example, if subject matter experts from the business unit, finance, and records management can all consider the same documents at the same time, they can help ensure that the most useful classifications for the whole enterprise are applied. This approach also pays huge dividends in the next step of Attribution.

8. Anticipate the Awareness Gap

While using collaborative teams will help ensure that the collective wisdom of the enterprise is brought to bear on the classification task, there will still be document types of which team members are simply unaware. They may do a good job classifying files or documents presented to them, but how do you know that substantially all files being classified have been considered?

9. Use Classification Matrix Approach

A typical file classification scheme has the business unit or function as the top level, with each unit or division handling the lower-level classifications. This can lead to overly complex and possibly inconsistent labels for files that are for all purposes the same types. For example, Finance might call one document type “Authorization for Expenditure,” Exploration & Production could call the same thing “AFE,” while Engineering might call it “Expenditure Authorization.” These different labels for the same type of document cause clutter and confusion downstream. An alternative approach is to build a matrix with the business units or functions listed in the left column and document types listed as column headers, with Xs or check marks within cells to indicate whether that document type is used in the unit or function. The advantage of this approach is that it minimises the number of classifications and helps ensure the use of consistent classification labels for the same documents.
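
A small illustration of the matrix idea follows, with invented business units and document types; the point is that one consistent label (“Authorization for Expenditure”) serves every unit that uses the document.

    # Rows: business units; columns: shared, consistently-labelled document types.
    # The values are invented for illustration.
    matrix = {
        "Finance":                  {"Authorization for Expenditure", "Invoice"},
        "Exploration & Production": {"Authorization for Expenditure", "Well Report"},
        "Engineering":              {"Authorization for Expenditure", "Design Spec"},
    }

    doc_types = sorted(set().union(*matrix.values()))
    header = "Business Unit".ljust(26) + " | " + " | ".join(doc_types)
    print(header)
    for unit, types in matrix.items():
        cells = [("X" if t in types else " ").center(len(t)) for t in doc_types]
        print(unit.ljust(26) + " | " + " | ".join(cells))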

10. Test Consistency

The top three criteria for classification are consistency, consistency, and consistency. Test and measure consistency. Do you get the same results when you reprocess the same content? When you do text searches do you find the same type of content has been assigned two or more classifications? Consider having an alternative classification technology re-classify already classified documents and determine which system performs more accurately.

11. Start-Up and Maintenance Costs

Different classification technologies involve different costs of consultants, in-house staff, computing resources, and per file or per gigabyte licensing fees. When evaluating competing alternatives, ask for explicit maximum costs and agreed accuracy levels to be included in a service level agreement (“SLA”).

C. Attribute — Extract What’s Important

Document attributes are things that are evident on the face of the document. Attribution involves identifying which of those data elements are significant for a business, legal, or regulatory purpose within each classification, and extracting them so they can be used for retrieval, reporting, or as part of a decision management tool. They can be used to populate or to validate structured database entries. Here are considerations when attributing files:

1. Identify Specific Desired Attributes for Each Classification

For each classification, the organisation should identify the types of data elements or attributes it wants to extract from members of that classification. This list becomes a checklist to consider when designing or performing attribution.

2. Examine Data Fields in Control or Retrieval Systems

If information from unstructured content is used to populate or to audit a process control system (e.g. the name of a loan applicant is used in a mortgage loan management system), those data elements should be considered for inclusion on the attribute list as automated extraction can speed input and auditing of such data.

3. Be Aware of Classification-Dependent Limitations

The ability to extract desired attributes often depends on the consistency with which initial classification is performed. If classes are consistent and files within classes are very much alike it will be much easier to extract data values. On the other hand, inconsistent or overlapping classifications will lead to poor quality attribution. Keep those limitations in mind when considering what to try to extract.

4. Be Aware of Text Conversion Dependence

Systems that use text conversion technology to provide textual values from image-only files will be subject to the limits of that technology. If the underlying OCR system won’t recognise characters with font size greater than 24, then any text of larger size won’t be available for auto-extraction, and the same will be true for minimum font size limitations. As discussed earlier, text dependence is also a significant issue for non-textual files, poor quality text files, and foreign language files.

5. Be Aware of Extraction Tool Dependence

The ability to extract desired attributes is dependent on the tools available for the extraction, e.g. can the extraction specifications include absolute page coordinate zones or positional specifications relative to other zones? There is no point in specifying attributes that the available tool set cannot identify.

6. Consider Non-Textual Attribute Extraction

Although people usually associate attribution with text extraction, be aware that non-textual elements, such as signatures or logos, can also be extracted and stored in a properly designed system.

7. Normalise Extracted Values

When designing attribution, provide a mechanism to normalise content so terms are stored in a consistent format. For example, in the Oil & Gas industry, the Well Number is a key data element, but it can occur in many slight variations, some with spaces, some with dashes, some with spelling variations. Normalising such data elements maximises the value of the attribution for reporting and retrieval purposes.
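
A sketch of one possible normalisation step: strip the spacing, dash, and case variants down to a single canonical form before storing or comparing. The 2-3-5 grouping used here is an illustration, not a statement of the API numbering standard.

    import re

    def normalise_well_number(raw):
        """Reduce spacing/dash/case variants of a well number to one canonical form."""
        digits_and_letters = re.sub(r"[^0-9A-Za-z]", "", raw).upper()
        # Re-insert separators in a single agreed format, e.g. groups of 2-3-5 characters.
        return "-".join([digits_and_letters[:2],
                         digits_and_letters[2:5],
                         digits_and_letters[5:]])

    for variant in ["42 501 12345", "42-501-12345", "42.501.12345"]:
        print(variant, "->", normalise_well_number(variant))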

8. Use and then Update External Authority Lists

Often management systems maintain authority lists that ideally represent all known values that can appear in a particular field, e.g., a list of American Petroleum Institute well numbers from a system that tracks active oil and gas wells. The authority list can help find in which document types those terms occur, and this can guide attribution decisions. If references to the terms are essentially random within a type, they can be ignored. If they are consistent or largely consistent, they can be considered for inclusion. Comprehensive attribute extraction typically results in identifying a significant number of additional values for the authority list, and those will need to be reconciled.

9. Involve All Key Stakeholders

Have all significant stakeholders involved in deciding what to extract, how to label each item, and how to format it. Ideally this collaboration would take place in the same room looking at examples of the file types at the same time.

10. Build Comprehensive Logs for Full Audit & Presentation

Logging of all extracted values permits the full auditing and authentication of those values. The system should know the document, page, and page coordinates for each extracted value. This not only greatly expedites auditing any control system to see if the supporting documentation confirms the source of the entered data, but it can also be used for on-screen presentations to minimise the “stare and compare” time otherwise spent comparing data from different documents.

11. Consider Full Text as an Attribute

One of the attributes of a file may be the text associated with the entire document without trying to identify specific attributes or field values in the full text. This is sometimes called “content enablement,” and it should be considered as a supplement to extracting field values. While full text can be useful, it will not provide for the precision available by searching specific fields in a database and will not provide the level of report formatting and sorting available with attributed data values. When extracting full text, always measure the quality at the character and word level to give a sense of how reliable it will be when used for retrieval or other analysis.

D. Validate — How Dependable Is It?

Validation involves measuring how accurate and complete the resulting data is. It addresses questions like: Did we account for all the files encountered during the inventory? Are the files classified accurately? Were the correct attributes extracted from the files, and were they properly formatted and loaded? Here are tips or hints for this phase:

1. Automate the Validation

Whatever the process involved in validation, it should be automated so validation is consistent and results can be compared over time. An automated process should also save the time of the people doing the validation, making it far more likely that they will validate data regularly. One sign of a non-automated process is if someone must use Excel to generate a random list of files to examine or has to use Excel to tally the results of having evaluated files.
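
A sketch of scripted sampling and tallying that replaces the ad hoc spreadsheet step; the classification labels, sample size, and file names are placeholders.

    import random
    from collections import Counter

    def sample_for_review(classified_files, per_class=5, seed=42):
        """Pick a fixed number of files from every classification for human review."""
        random.seed(seed)                       # repeatable samples aid comparison over time
        by_class = {}
        for path, label in classified_files:
            by_class.setdefault(label, []).append(path)
        return {label: random.sample(paths, min(per_class, len(paths)))
                for label, paths in by_class.items()}

    def tally(review_results):
        """review_results: iterable of (label, is_correct) pairs from the reviewers."""
        return Counter(label for label, ok in review_results if not ok)  # errors per class

    # Placeholder usage
    files = [("a.pdf", "Invoice"), ("b.pdf", "Invoice"), ("c.pdf", "AFE")]
    print(sample_for_review(files, per_class=2))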

2. Validate Deadline Compliance

Besides focussing on the results of the classification and attribution, it is also important to validate that the results were delivered on schedule. Mechanisms should be in place to document when files were available for classification and when the results were received.

3. Compare Before & After Totals

All files initially examined should be accounted for. The accounting should include categories like:

  • Files Examined
  • Files Uncompressed
  • NIST list
  • Content Duplicates
  • Unique Content Files
  • Encrypted/Decrypted
  • Unencrypted
  • Remaining Encrypted

The unique content files should also be accounted for:

  • Flagged for Disposition
  • Number Retained

 

The number of files retained should be compared to the number loaded into whatever target search or content management system is used.
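
The reconciliation itself is simple arithmetic, and automating it means every run gets checked. A simplified sketch with invented counts follows (it omits categories such as uncompressed and encrypted files for brevity).

    def reconcile(counts):
        """Confirm that every examined file is accounted for by some disposition."""
        accounted = (counts["nist_listed"] + counts["content_duplicates"]
                     + counts["unique_content"])
        assert counts["files_examined"] == accounted, "files unaccounted for"
        assert counts["unique_content"] == (counts["flagged_for_disposition"]
                                            + counts["retained"])
        assert counts["retained"] == counts["loaded_into_target"], "load count mismatch"

    reconcile({
        "files_examined": 1_000_000, "nist_listed": 150_000,
        "content_duplicates": 350_000, "unique_content": 500_000,
        "flagged_for_disposition": 200_000, "retained": 300_000,
        "loaded_into_target": 300_000,
    })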

4. Measure Classification Accuracy

Consistent, accurate classification is the lynchpin for practically any document-related information governance initiative so it is critical to assess the quality of the classification. This can be done in several ways:

  • Reprocess Benchmark Sets of Files. If classification involves writing new rules or scripts that assign classifications, it will be useful to keep a standard set of documents that is reprocessed after each rule/script change to see if the changes have adversely affected existing classifications. Regardless of how classifications are assigned, it will always be wise to periodically reprocess benchmark sets of documents or files to ensure that the system has not gone off track.
  • Peruse Files within Same Classifications. Another way to perform ad hoc review of classification consistency is to examine sets of documents all assigned the same classification. This need not be a lengthy process as systemic errors will often be obvious.
  • Review of Randomly Selected Documents. End users can randomly select documents, sort them by assigned document type, and evaluate whether the assigned document types are accurate. Depending on the sample size, this may not adequately assess classifications assigned to relatively few documents, as they may not be included in the sample. To combat this, the process used to select the random sample should ensure that at least some minimum number of files is selected from each classification.
  • Review Results of Full-Text Searches for Specific Document Types.
    Finding files correctly classified when you search by specifying the class can lead to a false sense of confidence about the quality of classification for all the content. The search results don’t show the files that should have been assigned that classification - they don’t show the false negatives. One way to gauge the extent of this is to try to retrieve documents within the desired classification without using the document type as a search parameter. Use full-text search to try to pull up files in the desired class. You may find files that look like they should have had the desired classification but didn’t. Maybe some classifications could be consolidated. One problem with this approach is that it is text-dependent and will not work with documents that have text-dependency problems (e.g., no text, poor text, non-English text).
  • Search for Sets of Related Document Types.
    In collections where one would expect to find several document types relating to the same transaction, event, or item, users can search for terms that describe that transaction, event, or item and see if all the expected document types are represented. For example, in a loan system the user could search for a specific borrower and see if all the expected document types associated with a loan are present (subject to the usual limitations of text-based search).
  • Evaluate Files with Residual Classification.
    It can be informative to review files or documents that were assigned the “Other” or “Unknown” classification. You may find there are several files not being classified by a text-based algorithm because of text-related issues (no text, poor text, language, lack of sentences). You may also discover new document types that should be included in the overall classification scheme.

 

5. Measure Accuracy of Attribute Extraction

It is vital to measure the accuracy of attribute extraction so the organisation can be sure it’s receiving value for the effort and funds expended on attribution and so it does not place unwarranted reliance on the results of attribution. Here are ways to confirm the accuracy of attributed data:

  • Compare extracted data value to source document. The best way to confirm accurate extraction is to go to source documents to confirm the data value. However, in some systems this will be a time-consuming process.
  • Compare extracted values to trusted authority list. If an extracted value matches one of the values on a trusted authority list, the organisation may be able to assume it was extracted accurately.
  • Confirm missing values. If the file or document classification was accurate and the attribution list identifies which values are associated with a particular document type, quality assurance efforts should include examining fields that are expected to have values but do not.
  • Focus on the key data elements. Some data elements will be more critical than others, e.g. the loan number will be a critical value in a home mortgage tracking system. Focus on the critical items.

 

6. Use Ad Hoc Analytics

Reviewing metrics associated with rationalisation, classification, and attribution can disclose many unexpected results. For example, document date is a commonly extracted document attribute (this is the date that appears on the face of the document, not the system date), and reviewing graphics showing volume on the Y-axis and time on the X-axis can make spikes or gaps readily apparent. Analytics can also make outliers more apparent, e.g., document dates well before or after expected date ranges. People familiar with the content should be able to explain these aberrations, e.g., a spike was caused by acquiring another company, or a gap was caused by a company-wide strike.
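
A sketch of the date-distribution check using only the standard library: tally extracted document dates by year and flag anything outside an expected range. The dates and range are invented.

    from collections import Counter
    from datetime import date

    def date_profile(document_dates, expected_range=(1990, 2025)):
        """Return per-year volumes plus any dates outside the expected range."""
        by_year = Counter(d.year for d in document_dates)
        lo, hi = expected_range
        outliers = [d for d in document_dates if not lo <= d.year <= hi]
        return by_year, outliers

    dates = [date(2001, 5, 1), date(2001, 6, 3), date(1887, 1, 1), date(2015, 9, 9)]
    volumes, outliers = date_profile(dates)
    for year in sorted(volumes):
        print(year, "#" * volumes[year])     # crude text histogram: volume by year
    print("outliers:", outliers)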

7. Double-Check Duplicate Tracking

Some software only tracks one location for duplicates. However, there can be many reasons to track back to all the original sources of duplicate files, so consider doing periodic audits to ensure that all copies are tracked.

8. Know What Didn’t Get Captured/Extracted

Validation should include looking at what wasn’t captured. For example, if the data value for Borrower’s Name should have been extracted from Loan Applications, the validation should include looking at Loan Applications with no Borrower’s Name listed to determine why attribution failed to provide that data value.

9. Reconcile Authority Lists

Most systems can generate authority lists, which are usually alphabetically sorted lists of the unique values that appear in specific fields. Reviewing such lists is often an easy way to spot data values that either shouldn’t be there or should be there in a different format. Someone familiar with the content may also notice that terms that should be included are not on the list. At the beginning of the process, domain experts should examine the new values being extracted from the unstructured content and compare those with the previous authority lists to confirm that the correct values are being extracted.

10. Compare Results

Other than not processing files that ought to be included, the biggest error that can occur is to have the wrong classifications assigned to files. This causes numerous downstream problems with classification-dependent attribution and loss of user confidence in being able to depend on classifications for retrieval. One way to validate the accuracy and consistency of the overall classifications is to apply a different classification technology to the same content and then compare any differences in the resulting classification groupings. It may disclose that both systems have strengths, or it may disclose that one is superior across the board.
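
A sketch of the comparison step: measure agreement between two classification runs and list the label pairs where they differ, so reviewers can focus on the disagreements. The hashes and labels are placeholders.

    from collections import Counter

    def compare_classifications(run_a, run_b):
        """run_a, run_b: dicts mapping file hash -> assigned classification."""
        common = run_a.keys() & run_b.keys()
        disagreements = Counter((run_a[h], run_b[h]) for h in common if run_a[h] != run_b[h])
        agreement = 1 - sum(disagreements.values()) / len(common) if common else 0.0
        return agreement, disagreements

    a = {"h1": "Invoice", "h2": "AFE", "h3": "Well Report"}
    b = {"h1": "Invoice", "h2": "Expenditure Authorization", "h3": "Well Report"}
    rate, diffs = compare_classifications(a, b)
    print(f"agreement: {rate:.0%}")
    for (label_a, label_b), n in diffs.most_common():
        print(f"{n} file(s): '{label_a}' vs '{label_b}'")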

11. Incorporate User Feedback

The real acid test is whether users find the classifications and attributions to be useful and reliable. If not, all the effort may be for naught. When the outputs are first loaded, users should be encouraged to share what they like or don’t like about the results. Are they looking for things they can’t find? Are they experiencing clutter in their search results? These are just some questions whose answers should be gathered and analysed so classification labels and attribute names and formatting can be adjusted going forward.

John Martin is the CEO and founder of BeyondRecognition, LLC, a Houston-based information governance technology company. John is a consulting and testifying expert on e-discovery and a US patent has been issued for his work in document technology. This article summarizes the first chapter of his new e-book, Guide to Managing Unstructured Content, available as a free download at http://beyondrecognition.net/download-john-martins-guide-to-managing-uns...