Machine Learning: giving us all a bad name?


By Rachael Greaves

Machine Learning is a really common AI technology. People tend to assume that ML means machines teaching themselves – but really, ML means machines learning from people.

Once the machine has learned, or been taught, it can start to make its own predictions. But the process of learning can be very onerous, depending on the approach.

There are two main approaches to ML. One is supervised ML. In this approach, a large set of training data is curated and labelled, then shown to the machine. The machine learns to recognise new data that matches the examples it has been given.

This is a robust type of ML, but it has a significant disadvantage: it requires a lot of training data, and a lot of effort to curate that data. Various vendors have run proofs of concept of supervised ML for records and information management, and the feedback has been:

The AI needed a lot of training by our records team
We couldn’t come up with 1,000 good examples of a document for every rule
We had to spend time correcting or confirming the machine on every single match
Training each rule was so onerous that we had to limit the rules we applied
When rules change, we will have to train all over again
We couldn’t feasibly apply more than one ‘rule’ to a document
It was too much work to set up, and it created more work than it alleviated

But it doesn’t have to be this way! Supervised ML is really not scalable for a problem as complex as records management (which needs to apply retention, security, privacy and handling rules from multiple different instruments, and update them dynamically over the life of the record), or over data sets as large as corporate file shares, for example. There’s too much data, and each item is rarely just about one thing, so you really can’t simplify the rules to the point where supervised ML is comfortable.

Remember that even AFDA v2, ostensibly a ‘rolled up’ Records Authority with only 86 classes, actually has 256 separate rule types within those classes. At 1,000 examples per rule, that’s at least 256,000 documents you would need to find, cleanse, and curate for a supervised ML approach, before you ‘approve or deny’ the attempted matches.
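
For anyone who wants to check that figure, the arithmetic is sketched below. It assumes the 1,000 examples per rule mentioned in the vendor feedback above; the variable names are illustrative only.

```python
# Rough estimate of the curation effort for a supervised approach.
# Assumption: about 1,000 curated examples per rule, as in the feedback above.
EXAMPLES_PER_RULE = 1_000
AFDA_RULE_TYPES = 256  # separate rule types within AFDA v2's 86 classes

documents_to_curate = EXAMPLES_PER_RULE * AFDA_RULE_TYPES
print(f"{documents_to_curate:,} documents to find, cleanse and curate")
# prints "256,000 documents to find, cleanse and curate"
```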

And that is assuming one document only matches one class. But a document is never just ‘yellow, red or blue’. It’s a bit yellow, a bit blue, and mostly red. A contract is not just a ‘financial’ record. It can also be a record of core business, or relate to compensation, or even be subject to a ‘freeze’, such as those covering PFAS or natural disasters, which have arisen in recent Royal Commissions.
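
To make the ‘a bit yellow, a bit blue, and mostly red’ point concrete, here is a minimal sketch of the difference between keeping one label and keeping several. The class names, scores and threshold are all made up for illustration; they are not output from any real classifier or product.

```python
# Hypothetical relevance scores for a single contract (invented numbers).
scores = {
    "financial": 0.70,
    "core business": 0.45,
    "compensation": 0.30,
    "personnel": 0.05,
}

# Single-label view: keep only the best match and discard the rest.
single_label = max(scores, key=scores.get)

# Multi-label view: keep every class that clears a (hypothetical) threshold,
# so all of the applicable rules can be applied to the document.
THRESHOLD = 0.25
multi_label = [name for name, score in scores.items() if score >= THRESHOLD]

print(single_label)  # financial
print(multi_label)   # ['financial', 'core business', 'compensation']
```

With a single label, the contract’s core business and compensation aspects are simply lost.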

Multiple rules will always need to be applied, and those rules come from multiple types of instruments. That’s why Microsoft’s labelling approach also can’t work for records management.

Unsupervised ML, on the other hand, doesn’t need the records team to create and curate sample sets, and train the machine. The machine looks at the data itself, and finds its own patterns, clusters and dimensions. It doesn’t need humans to create training labels, and it doesn’t need humans to ‘mark’ every match it makes in order to learn and improve.
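
As a toy illustration of a machine finding its own groupings without labels, the sketch below clusters four short documents purely by how many words they share. It is a deliberately crude stand-in, assuming a simple word-overlap (Jaccard) measure and an arbitrary threshold; real products use far more sophisticated techniques, and the function names here are hypothetical.

```python
def tokens(text):
    """Lower-case word set for a document (a deliberately crude representation)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two word sets: 0.0 means nothing shared, 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose first member
    is similar enough; otherwise it starts a new cluster. No labels are used."""
    clusters = []  # each cluster is a list of indices into docs
    for i, doc in enumerate(docs):
        words = tokens(doc)
        for members in clusters:
            if jaccard(words, tokens(docs[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "invoice for consulting services payment due",
    "payment received for invoice consulting services",
    "staff leave request annual leave approved",
    "annual leave request for staff member",
]
groups = cluster(docs)
print(groups)  # [[0, 1], [2, 3]]
```

Nobody told the code that the first two documents are about invoices and the last two about leave; the grouping falls out of the data.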

It is a much faster, simpler and easier ML model from the client’s perspective. Whereas supervised ML puts the work back on the organisation to teach the machine, unsupervised ML keeps the burden on the vendor (where it really belongs) to develop sophisticated technology.

So don’t throw the baby out with the bathwater! AI done wrong can have really negative consequences that outweigh any potential benefits. You can (and should) have great quality, sophisticated ML as part of your AI and automation strategy. But it doesn’t have to hurt.

Rachael Greaves is the cofounder and Chief Executive Officer of Castlepoint Systems.
