Automated Electronic Records Management? Are We There Yet?

By Tim Shinkle

Cloud providers have recently introduced powerful new cloud services for Big Data and artificial intelligence (AI). These services promise to finally harness the power of AI to automate electronic records management (ERM). But is the market finally ready? And will these new services finally convince the skeptics that AI can be used to effectively automate ERM?

Years ago, at the turn of the millennium, I was the Chief Technology Officer at a leading records management software provider called TruArc. TruArc had recently introduced and patented AutoRecords, the first ever commercially available use of AI for ERM, and we were hoping we’d get a huge jump on our competition. Although we had some success with AutoRecords, there were also some challenges. The main challenges were that it didn’t always work well enough to convince the skeptics, and that the market wasn’t ready.

One of the things we discovered with AutoRecords was that it is sometimes too risky to be a leader in an emerging market when that market might not be ready for new technology.

This is especially true when the industry isn’t mature enough, as we found when leveraging AI for ERM.

Although AI had been around for some time (back in 1996 IBM’s Deep Blue became the first machine to win a chess game against the reigning world champion, Garry Kasparov), AI was still not in mainstream use.

This was especially true in the Records Management industry, where paper records were still perceived as one of the primary challenges faced by records managers.

AI for ERM just wasn’t ready. Yes, AutoRecords could at times classify (or categorise) records with a high degree of accuracy, but other times it couldn’t. To complicate matters further, retention schedules sometimes contained hundreds or even thousands of record categories, and most record series had been developed for paper records.

It also didn’t help that organisations used record categories such as “other”. Records were often filed in the “other” category based on context that, at the time, was only available outside the computer.

Implementing AutoRecords revealed that most organisations weren’t ready for AI-based solutions. This was evident when we ran into problems during a study of AutoRecords performed by the US National Archives and Records Administration (NARA). The study involved legacy retention schedules originally developed for paper, and poorly suited training sets that ended up being ineffective for AutoRecords. Some people might argue this is the flaw with AI: isn’t AI supposed to adapt to your environment auto-magically? As it turns out, AI is like a child: it can’t start running before it crawls or walks. It needs to be taught, and prepared to run properly, over time and with some investment.

But it wasn’t just organisations’ readiness that was the problem with AutoRecords. There were plenty of challenges with AutoRecords itself that we had yet to figure out. A big one was betting our solution on a single point of failure: the classification (or category).

The AI was supposed to identify a classification, or category, under which to file a record in a particular record series. In the unstructured world of document management, knowing just one dimension of a document, such as its classification (category or subject), might help for searching, but it isn’t good enough to automate ERM. Unstructured documents tend to touch on multiple categories or subjects, for multiple reasons.

An example of the single classification problem can be explained with something as simple as a resumé. A resumé has a fairly distinct pattern, and AutoRecords was pretty good at learning what a resumé looked like. But saying something is a resumé often isn’t enough.

What if the resumé is only a draft, and only the final resumé is the record? How do we know which resumé is the final version? Further, what is the context surrounding the resumé? Was the resumé captured as part of a hiring process for employment?

Should the resumé be filed as part of a case file containing many different types of employment documents under a human resources classification? Just knowing something is a resumé isn’t always helpful.

Then there is the problem of false positives and false negatives. A document could simply be discussing a resumé without being one (a false positive), or a resumé describing job experience might be assigned a classification other than resumé when in fact that’s what it is (a false negative). As it turns out, people rarely depend upon a single piece of information to decide whether a document is a record. Why should AI be any different? We needed more dimensions, and guessing at the single best category was only giving us one piece of the puzzle.
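To make the point concrete, here is a minimal sketch, in Python, of the difference between the two approaches. The function names, signals, labels and confidence threshold are all hypothetical, invented for illustration rather than taken from AutoRecords.

```python
# Illustrative only: why a single classification is a fragile basis for a
# records decision, and how combining independent signals is more robust.

def decide_single_signal(classification: str, confidence: float) -> str:
    """File the document purely on the top classification."""
    if classification == "resume" and confidence >= 0.8:
        return "file-as-resume"          # wrong for drafts and discussions
    return "needs-human-review"

def decide_multi_signal(signals: dict) -> str:
    """Combine several dimensions before committing to an action."""
    looks_like_resume = signals.get("classification") == "resume"
    from_hr_process = signals.get("source") == "hr-applicant-tracking"
    is_final_version = signals.get("version_status") == "final"
    if looks_like_resume and from_hr_process and is_final_version:
        return "file-in-employment-case-file"
    if looks_like_resume and not from_hr_process:
        # e.g. an email merely discussing a resume: a likely false positive
        return "needs-human-review"
    return "needs-human-review"

print(decide_multi_signal({
    "classification": "resume",
    "source": "hr-applicant-tracking",
    "version_status": "final",
}))  # -> file-in-employment-case-file
```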

Other challenges included:

• Training – Taxonomies and training sets need to be as accurate as possible for effective machine learning. Developing a taxonomy often takes a high level of expertise, and finding a good training set is difficult. Maintaining a training set as things change over time became too difficult, time-consuming and expensive to perform with most organisations’ existing in-house resources (a minimal sketch of what such supervised training involves follows this list).

• Algorithms – Relying on a single algorithm or approach for the best (or even the top three) classifications was probably doomed from the start. One algorithm nominating its most likely candidate as “the” classification didn’t provide results robust enough for processing thousands or millions of records automatically, without human oversight and intervention.

• Scalability – When we introduced AutoRecords we were dealing with thousands of records at a time; we are now in the age of potentially billions of records at a time for some of our larger customers, and data volumes are only growing. Just recently, a large US federal agency tried to leverage in-house (non-cloud-based) AI services to process its records, only to realise too late that it would take years to process the records it already had, and that it would never catch up with the ingestion rate of new records being added.

• Change – Algorithms, technology, retention schedules and records management practices change all the time. AutoRecords needed to be updated, retrained and retooled constantly to plug into many different repositories and technologies. The technology required extensive integration upgrades and maintenance, often dealing with insufficient APIs.
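As promised in the Training bullet, here is a minimal sketch of the supervised-training burden described above. It uses scikit-learn (a library the article does not mention) purely for illustration; the record series and example texts are invented, and a real deployment would need thousands of accurate, continually maintained examples per category.

```python
# Illustrative sketch of supervised record classification, not AutoRecords'
# actual implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The training set: labelled examples for each record series. Curating and
# maintaining accurate examples like these, at scale and as schedules
# change, was exactly the expertise- and labour-intensive task described.
texts = [
    "Experienced accountant seeking new role, skills include...",
    "Invoice number 4471 for services rendered in March...",
    "Minutes of the quarterly board meeting held on...",
]
labels = ["resume", "invoice", "meeting-minutes"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Attached is my CV with ten years of experience"]))
```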

With AutoRecords, we faced all of the challenges outlined above and fell short of effectively addressing them.

In the end our company was bought for our ERM functionality, not AutoRecords. The purchasing company quickly abandoned AutoRecords, and it disappeared into the world of cool products that never succeeded (remember the dot-com bubble?).

So, you might ask, has anything changed in the world of AI that can address these challenges? We at Millican now believe the answer is a resounding YES, but perhaps not in the way we had originally intended to use AI.

Many technology companies, including some of the world’s largest, have invested billions of dollars into Big Data and AI Cloud services, where Big Data and AI are complementary services solving the most challenging problems AutoRecords had faced.

Companies such as IBM have gone on to do amazing things with AI; IBM’s Watson technology, for instance, won the quiz show Jeopardy! against the best human competitors. Google DeepMind’s AlphaGo recently beat one of the world’s best Go players, and many others are accomplishing spectacular achievements never before thought possible.

The lessons learned from these achievements are now being leveraged in Cloud APIs available to the public for solving many real-world problems. This is a perfect time to revisit AI for ERM and leverage these recent investments in cloud-based Big Data and AI services.

The latest artificial intelligence (aka cognitive computing) and Big Data cloud offerings provide a powerful assortment of services. We currently have the ability to crawl data sources found on-premises, on mobile devices and in the cloud, and via cognitive computing and Big Data cloud services we can understand this data in ways never before possible.

This unprecedented level of understanding gives us the ability to make better decisions on how best to manage our electronic records.

To understand where we are with cognitive computing, as compared to where we were with AutoRecords, look no further than the big technology vendors such as IBM, Microsoft, Google, DeepMind, Amazon and others. These technology companies have collectively invested billions of dollars in developing and exploiting cognitive computing technology (there are some very large open source initiatives as well, with the likes of TensorFlow, H2O and others).

Cognitive computing is helping solve real-world problems that humans have been unable to solve on their own. Recently IBM Watson solved a patient care problem that had stumped doctors for months. Google has even changed its approach to search: the Wired article “AI Is Transforming Google Search. The Rest of the Web Is Next” discusses how computers are now performing certain functions of search that, until recently, required human insight. The AI approach behind AutoRecords has gone from being ahead of its time to being mainstream and growing in leaps and bounds. There’s no denying that the AI market has matured.

Training

A big challenge we faced with AutoRecords was training sets (training sets are used to teach the cognitive services to recognise patterns in data). Having access to large, cleansed training sets is a challenge, especially as changes occur over time. Many of the available cognitive cloud services come pre-trained and ready to use. You can even try some of them online before you decide to invest (see http://alchemy-language-demo.mybluemix.net).
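For illustration, a call to such a pre-trained service might look like the hedged sketch below. The endpoint, request fields and response shape are hypothetical stand-ins, not any vendor’s actual contract; consult the service’s API reference (such as the AlchemyLanguage demo linked above) for the real one.

```python
# Hypothetical pre-trained cognitive text-analysis call over REST.
import requests

API_URL = "https://cognitive.example.com/v1/analyze"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Attached is my resume for the open analyst position.",
          "features": ["classification", "entities", "sentiment"]},
    timeout=30,
)
resp.raise_for_status()
analysis = resp.json()

# No local training set required: the service ships pre-trained.
print(analysis.get("classification"), analysis.get("entities"))
```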

Technology companies are training their cognitive APIs for both horizontal and vertical industries using training sets such as Wikipedia and large data sets from healthcare, banking and other industries.

IBM recently bought a company that services banks, in part to leverage that company’s knowledge of the banking industry; the knowledge can then be used to train IBM’s Watson to better serve its customers. This approach is far superior to what we faced with AutoRecords, where training was a significant challenge for customers to do on their own.

Multi-Dimensions

A second challenge we had with AutoRecords was that classification was a “single point of failure”. We depended entirely on the classification value provided by AutoRecords to understand the record and perform some action based on the result.

A single classification, and even the false positives and false negatives that come with it, is far less of a problem when decisions are spread across multiple dimensions, with each dimension providing valuable input for better overall decision making.

As discussed earlier, simply knowing a document is a resumé is often not enough. Cognitive services can now provide a much richer understanding of documents, providing dimensions for concepts, keywords, entities, relationships, sentiment, author and more.

As an example, for a hiring manager in Human Resources, knowing the author, their role in the organisation and their relationship to the person discussed in the resumé can make all the difference in how the document is managed. The machine can now make these connections without having to depend upon the single classification of “resumé”.
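One way to picture this richer, multi-dimensional understanding is the sketch below: an illustrative data shape (not any specific vendor’s schema) for an enriched document, and one hypothetical filing rule of the kind a hiring manager’s organisation might apply.

```python
# Illustrative enrichment record combining multiple cognitive dimensions.
from dataclasses import dataclass, field

@dataclass
class Enrichment:
    classification: str                            # e.g. "resume"
    concepts: list = field(default_factory=list)
    entities: list = field(default_factory=list)   # people, orgs, roles
    author: str = ""
    author_role: str = ""                          # e.g. "hiring-manager"
    sentiment: float = 0.0

def filing_decision(e: Enrichment) -> str:
    # The classification alone is no longer a single point of failure:
    # who wrote the document, and in what role, drives the decision too.
    if e.classification == "resume" and e.author_role == "hiring-manager":
        return "employment-case-file/human-resources"
    if e.classification == "resume":
        return "review-queue"
    return "general-correspondence"

print(filing_decision(Enrichment(classification="resume",
                                 author="j.smith",
                                 author_role="hiring-manager")))
```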

Patterns of who someone is, where they work, what they do, and the data they use, can all be leveraged when crawling and combining valuable information into large sets for cognitive processing and analysis.

Other Challenges Addressed

Today’s technology can meet the remaining challenges of algorithms, scalability and managing change over time:

• Algorithms – The algorithms used today are much more sophisticated, with the ability to self-train and leverage massive-scale data sets not previously accessible to on-premises solutions. There are also many more algorithms for specialised capabilities such as language, speech, visual recognition, data insights and more. These algorithms can be leveraged together or separately, depending upon the need.

• Scalability – Today’s cloud services are more scalable than ever before. New research and development in cloud computing and cloud platforms, with technology such as Docker and containerisation, are expected to keep pace with the volumes of data being produced on a massive scale. Microsoft, Google, IBM and others are even developing specialised reprogrammable computer chips to increase Cloud performance.

• Change – Change is now much more manageable for customers of Cloud services: as algorithms improve they can be swapped out, picking up where previous algorithms left off using the same training data sets. Cloud vendors can make these changes for you, without disruption to the service, as improvements occur in the AI industry over time. The Stanford report “Artificial Intelligence and Life in 2030” (http://ai100.stanford.edu/2016-report), part of Stanford’s One Hundred Year Study on Artificial Intelligence, discusses many of the changes that have occurred in the industry over time.

Putting All the Pieces Together

The combination of cognitive cloud services and Big Data analytics provides a powerful approach to understanding the value, cost and risk involved in optimised electronic records management. Big Data analytics provides a rich feature set for visualising data, including the ability to pull together multiple sources of related data, such as storage, litigation and compliance costs.

Big Data analytics is the mechanism for exploiting, across large volumes of electronic records, all the dimensions provided by the cognitive services. Big Data services are easily shared among different groups within the organisation, leveraging analytics for other use cases; Information Governance/electronic records management (ERM) is just one of many groups that can leverage their organisation’s investment in Big Data. Another piece of the puzzle is crawl technology. This technology can work behind a firewall to harvest multiple data sources found on-premises and pool the data in a central location for analysis (e.g., PostgreSQL, Apache Cassandra and others). Crawl technology can also be used to take the results of the analytics and execute compliance rules on the originating data sources, including decisions to manage data in place or transfer it to a central archive, on-premises or in the cloud.
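A minimal sketch of the crawl-and-pool step might look like the following. For self-containedness it uses SQLite where a real deployment would use the PostgreSQL or Apache Cassandra pools mentioned above, and the file-share path is hypothetical.

```python
# Illustrative crawl: walk an on-premises share, capture basic metadata,
# and pool it centrally for later cognitive enrichment and analytics.
import os
import sqlite3

db = sqlite3.connect("records_pool.db")
db.execute("""CREATE TABLE IF NOT EXISTS documents (
                  path TEXT PRIMARY KEY,
                  size_bytes INTEGER,
                  modified REAL)""")

for root, _dirs, files in os.walk("/mnt/fileshare"):  # hypothetical source
    for name in files:
        path = os.path.join(root, name)
        stat = os.stat(path)
        db.execute("INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
                   (path, stat.st_size, stat.st_mtime))
db.commit()

# Downstream, analytics run against this pool, and compliance decisions
# are pushed back to the originating sources.
```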

While three key pieces of technology (crawler/harvester, cognitive cloud services and Big Data analytics) provide a solution to the challenges of electronic records management, the final piece is cloud-based storage: a place to put all this data (e.g., the Hadoop Distributed File System, on-premises or in the cloud). As systems expire, a cloud-based ‘intelligent archive’, managed by the main components of the solution, provides a cost-effective location for records to be reused and managed over long periods of time.
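To round out the picture, here is a hedged sketch of the final disposition step: executing a retention rule against pooled records and deciding whether to manage in place, transfer to the intelligent archive, or dispose. The record series, retention periods and decision logic are all hypothetical.

```python
# Illustrative retention/disposition rule over pooled, enriched records.
from datetime import date, timedelta

RETENTION = {  # record series -> retention period (illustrative only)
    "employment-case-file": timedelta(days=7 * 365),
    "meeting-minutes": timedelta(days=3 * 365),
}

def disposition(series: str, cutoff_date: date, source_expiring: bool) -> str:
    retained_until = cutoff_date + RETENTION.get(series, timedelta(days=365))
    if date.today() >= retained_until:
        return "dispose"
    # If the originating system is being retired, move the record to the
    # intelligent archive; otherwise manage it in place.
    return "transfer-to-archive" if source_expiring else "manage-in-place"

print(disposition("meeting-minutes", date(2014, 6, 1), source_expiring=True))
```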

Future State

It is apparent that the days of the traditional ERM solution are coming to a close. The recent Wired article “The End of Code” by Jason Tanz suggests that soon we won’t program computers; instead we’ll train them like dogs. In the not-too-distant future we will spend our time training computers and asking the right questions to process our records, instead of buying and implementing expensive content and records management solutions and scripting rules manually. This is prompting some people to proclaim that it is the end of tech companies. We are already seeing a shift in the ERM industry away from big ERM software purchases, and a shift in budgets from traditional IT technology investments to investments in Big Data and the cloud.

In conclusion, we believe that the world of AI has fully matured since we first created AutoRecords more than a decade and a half ago, and that it is now powerful enough to effectively meet the enormously challenging requirements of modern enterprise records management. AI technology is ripe for taking over the task of chasing down massive amounts of data and determining how best to manage it over time. We are now able to use cognitive services to orchestrate tremendously powerful solutions for our customers, leveraging crawl technologies and cognitive and Big Data cloud services in meaningful and effective ways to serve the entire ERM industry.

Tim Shinkle is VP at Millican & Associates, a Florida, USA-based information management services firm.