Can technology classify records better than a human?
In their 1999 book Sorting Things Out: Classification and Its Consequences, Geoffrey Bowker and Susan Leigh Star wrote that ‘to classify is human’. Classification schemes, taxonomies, thesauri and the like help us to categorise, find, manage, and better understand all sorts of things in their specific context.
Humans almost instinctively sort things into categories – we classify them. We overlay or link them with some kind of form and structure, and in so doing endow them with a meaning and context that helps us to make sense of them, to make their management more efficient, to ‘add value’.
We seek to classify records for similar reasons: to give them business context, to establish evidence of business activities, and to facilitate their ‘description, control, links and determination of disposition and access status’ (AS ISO 15489.2-2002, 4.3.4.1).
Bowker and Star suggest that classification is the ‘sleeping beauty of information science’ and, certainly, classification is an essential record-keeping tool.
But the methods we are using to classify digital records, and so to assign business context to them, seem to be failing. As a profession, we are more often than not using, or attempting to use, paper-based methods, with less than successful outcomes.
Because these methods are failing, organisations are turning to alternative methods to find and group records into business or other contexts, including by deploying advanced search tools.
Search tools, of course, do not classify records. They provide little more than a temporary grouping of possibly related records, not necessarily all records relating to a specific business context.
However, search tools that can learn how to find and then keep records in context are increasingly being used to support legal e-Discovery practices. SharePoint 2013, for example, includes an e-Discovery site based on these ideas. Perhaps, with the assistance of records managers, these tools can be used to improve or replace our current methods, to herd the digital cats into their respective categories.
The problem of classifying digital records
Record-keeping classification schemes describe categories of business activity and the records those activities generate.
In most contemporary record-keeping systems, for both paper and digital records, classification terms are applied to the aggregation (i.e., a file or container); the individual records in the aggregation inherit those terms rather than having terms applied to each record.
Where the classification scheme is mapped to retention requirements, inherited classification facilitates the retention and disposal of aggregations of records.
Inherited classification assumes two things: first, that the aggregation has the most appropriate classification for the records it contains; and second, that all relevant records will be or have been placed by people in the correct (or any) classified aggregation.
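The inheritance model described above can be sketched as a small data structure. This is an illustrative sketch only: the class names, the classification term and the seven-year retention period are hypothetical, and are not drawn from any particular record-keeping system or retention schedule.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    title: str
    # The aggregation (file or container) the record was placed in, if any.
    container: Optional["Aggregation"] = field(
        default=None, repr=False, compare=False
    )

    @property
    def classification(self) -> Optional[str]:
        # A record carries no terms of its own; it inherits the container's.
        return self.container.classification if self.container else None


@dataclass
class Aggregation:
    name: str
    classification: Optional[str]
    records: list = field(default_factory=list)

    def add(self, record: Record) -> None:
        record.container = self
        self.records.append(record)


# Hypothetical mapping from classification term to retention period (years).
RETENTION_YEARS = {"Personnel - Recruitment": 7}


def retention_for(record: Record) -> Optional[int]:
    """Return the retention period a record inherits, or None if unknown."""
    return RETENTION_YEARS.get(record.classification)


recruitment_file = Aggregation("Recruitment 2012", "Personnel - Recruitment")
filed = Record("Interview notes")
recruitment_file.add(filed)
stray = Record("Offer letter draft saved on a network drive")

print(retention_for(filed))  # 7
print(retention_for(stray))  # None
```

The second result shows the weakness of the model: a record that is never placed in an aggregation inherits nothing, so neither its business context nor its retention period can be determined from the scheme.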
In practice, probably no more than 5% of all digital records created or received by organisations end up in aggregations in record-keeping systems, let alone in aggregations that carry a classification. The rest are stored, unclassified, on network drives, in email folders and, increasingly, in the cloud.
Records managers might respond that this situation is tenable - not everything is a record and so not everything needs to be kept as a record. On the other hand, most records managers know that a lot of digital records are not captured in a record-keeping system or classified.
There are a number of reasons why the use of record-keeping systems and, by extension, the classification of digital records, has not been successful, and why as a profession we need to find better ways to classify them, including by looking at what technology can do for us.
For a start, the way that people have created or captured digital records over the past 20 years has been mostly divorced from any record-keeping system; people manage digital records in network drives and email folders, applying their own rudimentary and mostly uncontrolled forms of classification to the names of folders.
These ‘systems’ abound with the oft-forgotten remnants (and multiple duplicates) of digital objects, including records. In many organisations, backup processes that should be used only for disaster recovery have become the de facto archives of digital records.
Most people who work in organisations understand the need to keep records. But saving a record to a record-keeping system is often an additional step that, in most cases, still leaves the original record stored in its drive or folder. As well, copying a record from a ‘user-classified’ network or email folder to a record-keeping system with a different classification structure doesn’t make sense to them.
Records managers seem to be stuck between a rock of records classification theory and the hard place of end user classification practice.
A second reason why the classification of digital records via record-keeping systems has failed is that records management theory tells us that not everything is a record, and that only records should receive the special attention (and classification) that record-keeping systems provide. Records managers persist in the belief that EDRM solutions are the answer to the digital nightmare we face, and yet EDRM success stories are not common.
Courts, on the other hand, do not differentiate between records and non-records. A record – that is, evidence – of a business activity is a record regardless of whether it has been captured or managed in a record-keeping system.
There is not, to our knowledge, a single piece of case law in Australia or Malaysia in which all evidence of a particular business activity was discovered because the organisation had an effective classification system, or even an EDRMS.
In the past year, a number of records management professionals, including Adelle Ford from Recordkeeping Innovation and Cassie Findlay from State Records NSW, have called for records managers to shift their thinking in relation to the management and classification of digital records, and ‘to reinvent our practice for the digital environment’.
Shifting our thinking means finding new ways to manage the sheer volume of digital information created and received by organisations, to maintain our relevancy as a profession. We need to lead with innovative and useable options, not offer solutions that aren’t working or delivering real returns on investment.
These options could, potentially, include e-discovery methods to find, aggregate and classify digital records in their business context.
Classification and e-discovery
The processes involved in legal discovery and the classification of records have a similar outcome - to identify and group all related business records for a given business context. A typical subpoena for the production of records is likely to ask for ‘any and all records (including all different types) that relate to a given subject’.
The problem for most contemporary organisations receiving such a request is that, even with a classification scheme applied in an EDRMS, they cannot be entirely sure that all records required can be found or produced. The message will go out from the legal team seeking records in any location and in any form.
Although a well-implemented classification scheme should, in theory, be able to identify all records that provide evidence of a specific business activity, the reality is that there are likely to be many more records that haven’t been classified.
To satisfy legal requirements, and in the absence of any realistic record-keeping options (such as producing a container containing all known records), organisations turn to discovery methods such as search and review to produce all related business records.
The IT department may even be required to attempt to recover records from backup tapes or network drive folders with obscure and fanciful names like ‘John’, ‘1999’, ‘Peters Stuff’ or ‘Xmas Photos 1989’.
But surely one of the stated purposes of record-keeping classification is to identify evidence of business activities? Wouldn’t a request for the production of records be much easier to meet if all that was required was to produce all records classified against a particular function, activity and/or related transactions?
Of course, the classification of records is not just about supporting discovery or the production of evidence. It is about linking records with business actions and providing business context for those records, and utilising that classification to support other record-keeping activities including retention and disposal.
What, really, is the difference between a subpoena for ‘any and all records (about a given subject), in any format’, and a business requirement to keep and/or find any and all related records in their business context? In simple terms, the former is a legal demand that cannot easily be ignored, while the latter is our rarely achieved ideal state. Why aren’t the two aligned?
We believe it is largely because the method used to classify digital records, which requires them to be stored in classification-based aggregations in an EDRMS, has failed as a record-keeping process, un-mandated by an uncaring business. The problem is not with classification; it is with how it is implemented, as Ford noted.
Can technology-assisted review help?
One option that appears to have potential is an e-Discovery method known as technology-assisted review, used with expert guidance and input from records managers. Courts in the United States have started to make successful use of these methods in place of traditional manual methods.
A 2011 research paper published in the Richmond Journal of Law and Technology (University of Richmond, Virginia) examined whether technology-assisted review in e-discovery could be more effective and more efficient than exhaustive manual review. (Grossman and Cormack, 2011)
The paper noted a general perception among many lawyers that technology-based tools were inferior to manual reviews; a sentiment that would hold just as well with ‘records managers’ in place of ‘lawyers’, and ‘classification’ in place of ‘reviews’.
The authors found that technology-based tools were in fact more efficient than manual review and produced superior results, as measured by recall and precision. They also found that manual review was ‘far from perfect’.
The paper stated that the objective of an e-discovery review was:
‘… to identify as many relevant documents as possible. The fraction of relevant documents identified is known as recall, while the fraction of identified documents that are relevant is known as precision. That is, recall is a measure of completeness, while precision is a measure of accuracy or correctness’.
The overall effectiveness of a review is summarised by a measure known as F1, or ‘the harmonic mean of recall and precision’. Relevance, on the other hand, ‘remains elusive’, and requires human intervention. This intervention, according to some, should be carried out only by lawyers, who claim to be ‘better able to assess relevance and privilege than non-lawyers’.
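The three measures quoted above can be worked through on a small example. The sketch below is illustrative only: the document numbers are invented, and the function names are our own rather than anything defined in the paper.

```python
def recall(relevant: set, identified: set) -> float:
    """Fraction of the relevant documents that the review identified."""
    return len(relevant & identified) / len(relevant) if relevant else 0.0


def precision(relevant: set, identified: set) -> float:
    """Fraction of the identified documents that are actually relevant."""
    return len(relevant & identified) / len(identified) if identified else 0.0


def f1(r: float, p: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * r * p / (r + p) if (r + p) else 0.0


# Invented example: 8 documents are truly relevant; the review returns 10,
# of which 6 are relevant and 4 are not.
relevant = {1, 2, 3, 4, 5, 6, 7, 8}
identified = {1, 2, 3, 4, 5, 6, 9, 10, 11, 12}

r = recall(relevant, identified)
p = precision(relevant, identified)
print(round(r, 3))         # 0.75
print(round(p, 3))         # 0.6
print(round(f1(r, p), 3))  # 0.667
```

Because F1 is a harmonic mean, it is pulled towards the lower of the two figures, so a review cannot score well by being complete but inaccurate, or accurate but incomplete.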
The paper further noted that: ‘It is well established that human assessors will disagree in a substantial number of cases as to whether a document is relevant, regardless of the information need or the assessors’ expertise and diligence.’
This sentence could just as easily have said that ‘end users will disagree on the most appropriate classification of a record (that is, its relevance to a specific part of the classification), regardless of their expertise’.
While the paper was not specifically focussed on classification, the concept was more or less identical: whether a computer-based algorithm could achieve a better result than manual review (based on pre-defined keyword searches) in relation to the grouping of related (digital) records in a specific business context.
The paper noted a previous study (Voorhees) that found that the practical upper bound for manual review was 65% precision and 65% recall ‘since that is the level at which humans agree with one another’. By comparison, a computer algorithm-based review, targeting documents from the Enron email archive, and having humans review only 1.9% of documents to help the system ‘learn’, achieved average recall and precision rates of 76.7% and 84.7%, with no recall lower than 67.3%.
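To put the two sets of figures on a single scale, the harmonic mean (F1) of each pair can be calculated. This is our own arithmetic on the averages quoted above, not a figure taken from the paper; an F1 averaged across individual review topics would not necessarily give the same value.

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


# Practical upper bound for manual review, as quoted above.
manual = f1(0.65, 0.65)
# Average recall and precision quoted above for the algorithm-based review.
assisted = f1(0.767, 0.847)

print(f"{manual:.3f}")    # 0.650
print(f"{assisted:.3f}")  # 0.805
```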
If records managers could help a system to learn how to classify digital records, success rates for the classification of those records could be as high as indicated in these results.
The application of technology to assist with legal review, a method commonly known as ‘predictive coding’, is not just theory.
In October 2011, Andrew Peck, a US magistrate judge for the Southern District of New York, wrote that computer-assisted coding ‘works at least as well if not better than keywords or manual review’, particularly in large-data-volume cases where it may ‘save … significant amounts of legal fees in document review’. Against objections, predictive coding was approved for review and production of electronic documents.
The classification of records remains a valid business and record-keeping requirement, but it needs to be implemented differently in the digital world to be successful. Technological methods, including predictive coding, may be one answer.
These methods will still rely on the expert assistance of records managers to ‘seed’ search algorithms correctly by locating and identifying records that are relevant to a specific part of the classification scheme or schemes. And, of course, records must be classified accurately and persistently so that, when linked to retention requirements, all related records can be disposed of appropriately.
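The idea of ‘seeding’ can be illustrated with a deliberately simple sketch. It does not represent how any e-discovery product works: it swaps in a basic word-count comparison (cosine similarity) for the far more sophisticated machine learning used in predictive coding, and the classification terms, seed texts and threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Count the lower-cased words in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count profiles (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(seeds: dict) -> dict:
    """Build one word-count profile per classification term from seed records."""
    profiles = {}
    for term, texts in seeds.items():
        profile = Counter()
        for text in texts:
            profile.update(tokens(text))
        profiles[term] = profile
    return profiles


def classify(text: str, profiles: dict, threshold: float = 0.2):
    """Return the best-matching term, or None to flag the record for review."""
    scores = {term: cosine(tokens(text), p) for term, p in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Seed records chosen by a records manager for each (hypothetical) term.
seeds = {
    "Personnel - Recruitment": [
        "interview schedule for shortlisted candidates",
        "offer letter for the successful candidate",
    ],
    "Finance - Procurement": [
        "purchase order for office furniture",
        "supplier invoice and purchase approval",
    ],
}
profiles = train(seeds)

print(classify("candidates shortlisted for second interview", profiles))
print(classify("invoice for furniture purchase", profiles))
print(classify("xmas photos", profiles))
```

The first two calls return the recruitment and procurement terms respectively, and the third returns None because nothing in the seed records resembles it. Even in this toy form the quality of the outcome depends on the quality of the seed records, which is where the records manager's expertise comes in.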
Current methods to classify digital records need to change to be successful. Technology-based classification is an option that needs further exploration.
This article was inspired by a draft PhD thesis titled ‘Electronic records classification in Malaysia: A case study at syariah courts’, written by Umi Asma’ Mokhtar. The authors would like to acknowledge feedback on this article from James Lappin (UK).
Andrew Warland is an experienced, Sydney-based, information management consultant currently working at UnitingCare NSW.ACT as the Information Architect, Document and Records Management.
Umi Asma’ Mokhtar is a lecturer in records and information management with the Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia.