AI Beyond the Hype
Lately we have seen a strong proliferation of classification and extraction technology providers that base their service offering solely on Artificial Intelligence (AI). Using various AI algorithms, they claim to extract all meaningful data from documents without any human intervention.
From my perspective, two drivers seem to be behind this (currently) booming trend. I think the stronger one is the appearance of Cloud services. The technology is simply made available as a service in the cloud and is then globally accessible. Barriers to market entry are rather low.
The second driver is the fact that almost every university is into AI nowadays. This produces a huge number of engineers who are familiar with writing code that taps the potential of AI algorithms. Since every engineer knows and understands documents, it’s just a short jump to using AI algorithms - often available as ‘freeware’ - to automatically extract data from documents.
With the proliferation of extraction AI, have we now arrived in data extraction paradise? Is the only thing left the agony of choice? Let’s have a closer look.
Initial results from AI-based algorithms are typically rather impressive. A document is fed into an AI system and it comes back with most or even all of the data required.
When things are booming, everyone likes to talk about the potential advantages, but I would like to point out some of the challenges of using pattern-based AI technology in our industry. These are some of the same reasons why the promise of self-driving cars, made years ago, has still not fully materialized: there are trade-offs in leaving it entirely to AI to simulate human decision-making processes.
Security
For usable results in practice, AI models must be trained using large amounts of real documents and data. This is called the pattern or training set. So, you’ll have to hand over copies of real customer documents to a third-party cloud environment for the system to be trained on. Since the AI system needs to be re-trained frequently (e.g. nightly), that data has to remain in the training set forever.
Furthermore, to keep the model up to date, you have to constantly feed the training set with new examples to keep it as accurate as possible. As a result, there is a constant potential risk that data from the training set might get exposed to prying eyes.
Intellectual Property
By providing documents and data to the AI technology provider you are enabling them to enlarge and potentially improve the model. Are you getting compensated for the contribution of ‘your’ data/IP? Are you completely sure that an improved model will never enable a competitor of yours to have a better AI solution and therefore a competitive advantage? Who even owns the IP contained within the model?
Approximate Accuracy
AI not precise? Isn’t this a big disappointment? By the nature of the beast, an AI system is never absolutely sure whether it is right or wrong, but will always try its best to return a result. All results are only approximately accurate. Various techniques are employed (e.g. the calculation of confidence levels) to better estimate the accuracy of results, but there will always be a gap between 100% right and the system’s results. Some pundits claim that today’s AI systems rarely get over 70% accuracy. Hmm...
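To make the idea of confidence levels concrete, here is a minimal Python sketch. It assumes a hypothetical extraction result in which every field carries a confidence score between 0.0 and 1.0, and it sends anything below a cut-off to a human for review. The field names, the values and the 0.9 threshold are all invented for illustration and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """One extracted value plus the model's own confidence estimate (hypothetical structure)."""
    name: str
    value: str
    confidence: float  # 0.0 (pure guess) to 1.0 (model is very sure)


def split_by_confidence(fields, threshold=0.9):
    """Split fields into those accepted automatically and those needing human review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Invented sample output of an extraction run on a single invoice.
fields = [
    ExtractedField("invoice_number", "INV-1042", 0.98),
    ExtractedField("total_amount", "1,250.00", 0.71),
    ExtractedField("due_date", "2021-03-31", 0.93),
]

accepted, needs_review = split_by_confidence(fields)
print("accepted:", [f.name for f in accepted])
print("needs review:", [f.name for f in needs_review])
```

Note that a high confidence score is still only an estimate: a field can pass the threshold and be wrong, which is exactly the gap described above.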
One more important thing: the accuracy of the AI system varies with each new training run on a changed pattern set. So, something that was interpreted correctly yesterday might be interpreted incorrectly today. The inconsistencies of AI systems are a true challenge!
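One way to spot this kind of inconsistency is to keep a fixed set of documents with known correct values and compare the results of two training runs against it. The Python sketch below is only an illustration under that assumption; the document IDs, field names and values are invented.

```python
def find_regressions(expected, previous_run, current_run):
    """Return the fields the previous model got right but the current model gets wrong."""
    regressions = []
    for key, truth in expected.items():
        was_right = previous_run.get(key) == truth
        is_right = current_run.get(key) == truth
        if was_right and not is_right:
            regressions.append(key)
    return sorted(regressions)


# Known correct values, keyed by (document, field). All data here is invented.
expected = {
    ("doc1", "total"): "100.00",
    ("doc1", "date"): "2021-01-05",
    ("doc2", "total"): "87.50",
}

# Output of yesterday's model: one mistake on doc2.
yesterday = {
    ("doc1", "total"): "100.00",
    ("doc1", "date"): "2021-01-05",
    ("doc2", "total"): "81.50",
}

# Output after last night's re-training: doc2 is fixed, but doc1's date broke.
today = {
    ("doc1", "total"): "100.00",
    ("doc1", "date"): "2021-05-01",
    ("doc2", "total"): "87.50",
}

print(find_regressions(expected, yesterday, today))
```

In this made-up example the overall error count is the same on both days (one mistake each), yet a field that was correct yesterday is wrong today. An aggregate accuracy figure alone would hide that.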
The High Priests of AI Wizardry
Well, you and I do not understand the workings of the AI system. In come some specialists that start to manipulate the pattern data and/or the algos. To explain what they are doing they might use expressions like ‘under or overfitted model’, ‘non-parametric learning’, ‘too much noise in the training set’, etc.
But these High Priests of AI Wizardry will not be able to explain the behavior of the AI system to you - and they may not fully understand it themselves! Experience tells me to be wary if a person of average intelligence is not able to understand how a system’s results are being produced. You should be able to understand it too (even if you are smarter than average).
Unexpected Human Intervention (Security II)
The AI provider has a problem: 100% accuracy was promised, but the technology falls short. How is this gap closed and the accuracy level raised as close as possible to 100%? Cheat, and hire some eyes and hands. That’s easily done with today’s service offerings on the internet. So, your documents might get exposed to unknown students or others who need to make a quick buck.
What if these people try to supplement their meager pay by acting on the information presented in front of their eyes? This is just another security problem that nobody at the AI technology provider ever wants to talk about.
The good news is that there are systems out there that can meet and beat the fashionable AI technologies of today. At TCG Process we have honed our classification and extraction technologies over the last two decades. It may take a few more man-hours initially to set them up, but they will run far more consistently and transparently. No wizardry here.
For your peace of mind: The classification and extraction technologies within TCG Process’ flagship product DocProStar will bring the same or better results without the problems described above. AI-based technologies are incorporated but are never fully relied on when delivering correct results for critical processes.
https://www.tcgprocess.com/en-en/australia/
Arnold von Büren is a Swiss entrepreneur with three decades of experience in capture and input management. He was a founding member of DICOM Group plc. and played an instrumental role in the acquisition of Kofax, Inc. USA, becoming Kofax CEO in 2000. From 2003 to 2006 Arnold was CEO of DICOM Group plc. Since 2007 he has been CEO of TCG Process, providing leading process automation software to businesses of all sizes and growing the company into a global organization with more than a dozen subsidiaries across Europe, the Americas and Asia Pacific.