Predictive coding emerges as ediscovery’s data salvation

Tuesday, September 11, 2012 - 16:23

A massive growth in the data volumes that must be navigated in ediscovery has led to the rise of a technique known as predictive coding. Maureen Duffy of law firm Freehills examines why 2012 has been such a pivotal year for the new technology.

Every January thousands of people from around the world descend on New York City to try to gain some insight into where the legal technology industry is heading. The conference is called LegalTech, and its format includes vendor exhibitions and streams of expert panels sharing their experiences and predictions for the coming year. This year the major theme was predictive coding.

Maureen Duffy is National Practice Coordinator in the Information Logistics Group at Freehills, and is a lawyer licensed to practice in Australia and the USA. These are her personal views.

To set the scene, the information governance panels noted that data volume is exploding worldwide: the digital universe exceeded 1.8 zettabytes (1.8 billion terabytes) in 2011 and is expected to double every two years thereafter. Frighteningly, one third of this information will need to be managed by business for compliance purposes (IDC’s 2011 Digital Universe Study).
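
As a rough illustration of what doubling every two years implies, the short Python sketch below projects the 2011 figure forward. The function name and the assumption of smooth compound growth are illustrative only; they are not taken from the IDC study.

```python
def projected_zettabytes(year, base_year=2011, base_zb=1.8, doubling_years=2.0):
    """Project the size of the digital universe, assuming smooth doubling."""
    return base_zb * 2 ** ((year - base_year) / doubling_years)

# Print the projection at two-year intervals.
for year in (2011, 2013, 2015, 2017):
    print(year, round(projected_zettabytes(year), 1), "ZB")
```

On that assumption the 1.8 zettabytes of 2011 becomes 14.4 zettabytes by 2017, an eightfold increase in six years.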

The speakers also confirmed that organisations everywhere are struggling to manage this volume of information in any meaningful way. In highlighting trends, many information governance panels observed that there is a direct relationship between large volumes of electronically stored information (ESI) and the cost of conducting discovery. This was certainly not a new trend to those of us in the ediscovery business!

The predictive coding sessions continued the theme that large volumes of ESI complicate the discovery process and increase costs. Panellists declared that exploding data volumes require a new approach; more specifically, the cost of legal review needs to come down. Rand issued a report this year called “Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery,” by Nicholas M Pace and Laura Zakaras.

In the report, their research confirmed that legal review consumed $US0.73 of every dollar spent on ESI production. The report went on to highlight that there are limited opportunities to speed up human review, and so technology would have to provide the answer if the cost of review, and thus the cost of discovery, is to fall. It was claimed that, if used correctly, predictive coding is likely to be that technology.

What is predictive coding?

Simply put, predictive coding is software that uses algorithms to infer the meaning of words from patterns and associations in the text. Based on these associations, the software groups like with like to categorise and rank documents.

The term predictive coding is generating some controversy, as the software company Recommind patented its algorithm and trade marked the phrase “predictive coding”. After LegalTech, the phrase became the new buzzword in the ediscovery industry.

There is now great debate about the use of that phrase as there are many software companies that have products using a variety of algorithms that achieve a similar result. This is why there is a strong push from these other companies to use phrases like computer assisted review (CAR) or technology assisted review (TAR).

It is also felt that these titles more appropriately represent the fact that it is a workflow and process in which technology is assisting the legal review, rather than the suggestion that it is a piece of software that is completing the review.

At LegalTech in January, from the time we started to listen to the first predictive coding panellists, it was clear that if (or when) this type of technology becomes generally accepted by the courts and the clients, lawyers are going to face a profound technology driven change to the way legal review and discovery is conducted.

Up to this point, most lawyers had hesitated to use this type of software because the courts had not endorsed these technologies, and no lawyer wanted to be the first case and risk exposing their client to an unfavourable and expensive decision.

The two groups of stakeholders that could influence the lawyers and accelerate the adoption of this type of technology are clients and the courts. At LegalTech, these two groups had some interesting things to say.

From the panel discussions it was obvious that the clients who were speaking had already embraced the technology. They felt that predictive coding was a powerful, cost effective response to a financially crushing problem, and thus they were seeking to engage law firms that would use this type of technology. Convincingly, they stated that all clients who face large scale litigation are looking for a better experience when it comes to the time and cost spent on large discoveries.

The Lehman bankruptcy case was referred to as an example of how the volume of data is becoming so great that any form of human review can no longer be considered a financially viable option. In Lehman, it was explained, the bankruptcy administrator was provided with 350 billion pages of information; a team of 100 contract lawyers working 60 hours a week would take 200 years to review 1% of the material.
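
The arithmetic behind that claim can be checked with a few lines of Python. The figure of 52 working weeks a year is an assumption made here, not something stated at the conference, and the variable names are illustrative.

```python
TOTAL_PAGES = 350_000_000_000   # pages provided to the bankruptcy administrator
LAWYERS = 100                   # contract lawyers on the review team
HOURS_PER_WEEK = 60
WEEKS_PER_YEAR = 52             # assumption: no weeks off
YEARS = 200

# Pages in 1% of the material.
pages_to_review = TOTAL_PAGES // 100

# Total reviewer-hours available to the team over 200 years.
reviewer_hours = LAWYERS * HOURS_PER_WEEK * WEEKS_PER_YEAR * YEARS

# Review speed each lawyer would have to sustain.
pages_per_hour = pages_to_review / reviewer_hours

print(f"{pages_to_review:,} pages over {reviewer_hours:,} reviewer-hours")
print(f"about {pages_per_hour:.0f} pages per lawyer per hour")
```

On these assumptions each reviewer would need to get through roughly 56 pages an hour for two centuries, which shows that the quoted figures hang together and why reviewing the remaining 99% by hand is not realistic.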

Court response?

The only remaining question was how the courts would respond to the suggestion that a computer, rather than a person, would complete the review of documents for a discovery production.

Judge Andrew Peck (United States Magistrate Judge and winner of the 2012 Champion of Technology award from Law Technology News) appeared on and moderated several panel discussions on predictive coding or CAR. During these appearances, he discussed how the combination of data volumes, legal review costs and the lack of cooperation between lawyers is crippling the US justice system to the point that parties will soon no longer see it as a financially viable avenue for settling disputes.

Judge Peck explained that modern technologies must be used to help create a solution to this modern problem. He also made it clear that technology alone is not the answer.

The problem can only be addressed by using a combination of technology, legal knowledge and expertise both as to the facts and the law. The successful use of the technology is heavily dependent on proper workflows and a defensible validation process.

In fact, Judge Peck repeatedly said it is the overall workflow and validation processes used that will be scrutinised by the courts and confirmed that he had no interest in knowing or understanding how the complex algorithms in the software worked. He stated that these types of software systems have been used and accepted in disciplines outside the law for a long time, and thus he simply accepted that it does work.

Judge Peck referred to an article that he had written, “Search, Forward: Will manual document review and keyword searches be replaced by computer-assisted coding?” (Law Technology News, October 2011).

In that article, he argued that there is no need for the courts to approve CAR, explaining that keyword searching is accepted and used in almost every case and yet has never been approved in any court decision in the US. He reiterated this view at the LegalTech conference. Even so, Judge Peck made clear at LegalTech that he was looking for a case in which he could approve the process and technology, as he felt that the risk of being the first case caused lawyers a lot of concern. Once there was an opinion from the courts accepting the technology and processes, the revolution of predictive coding based discovery would begin.

That day came sooner than any of us expected. Within the month, Judge Peck had issued an opinion approving the parties’ joint request and agreed protocol outlining the process and methodology for the defence firm to use predictive coding technology to complete its discovery. The case is Da Silva Moore v Publicis Groupe, No. 11-CV-1279, 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 24, 2012).

This was quickly followed by a Virginia state court judge approving the defence request to use predictive coding over the objections of the plaintiff in Global Aerospace v Landow Aviation, No. CL 61040 (Va. Cir. Ct. April 23, 2012).

Finally, on 13 July 2012, Judge Scheindlin, a federal judge, issued an opinion in National Day Laborer Organizing Network et al. v United States Immigration and Customs Enforcement Agency et al., 2012 U.S. Dist. LEXIS 97863, in which she found keyword searching to be inadequate to find relevant documents and described predictive coding technologies as the emerging best practice.

Together these three cases now provide approval by the courts of the use of these CAR technologies in the US.

In an online interview conducted on June 25, 2012, Master Whitaker of the Queen’s Bench Division of the High Court (UK) was asked to share his thoughts on the courts’ attitude to the use of predictive coding software. He explained that the use of these technologies would certainly be approved in the UK.

Although there has been no comment from the bench in Australia, there is no reason to believe that the use of this technology would not be approved. There has been a strong movement to streamline discovery and move towards “quick and efficient justice”, resulting from recommendations from law reform commissions and changes to the civil procedure acts. This has led to modernised practice notes and court rules that are designed to encourage the parties to cooperate in narrowing discovery obligations and to attend case conferences to address issues surrounding the production (Federal Practice Note 6).

Predictive coding - The process

Although the review is accelerated, all the usual steps in the Electronic Discovery Reference Model (EDRM, www.edrm.net) still need to take place, and this process requires both ediscovery experts and legal counsel.

The clients who spoke at LegalTech accessed the software through their external law firms, and the Rand report confirmed that the use of predictive coding software and legal review are predominantly managed by outside counsel. Based on Judge Peck’s discussion, it seems that counsel is in the best position to put in place the necessary defensible workflow and to negotiate the process protocols with the other parties.

To start, potentially relevant data is identified, preserved, collected, analysed and loaded into the predictive coding software.

Once the data is loaded, the legal team (preferably senior members) reviews a “seed” set of documents for relevance and privilege. These documents are a statistically selected random sample of the whole document collection; individual documents that are clearly relevant or irrelevant can also be added to the seed set, which is then used to train the computer.
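
A minimal Python sketch of assembling such a seed set follows. The function name, the document identifiers and the sample size are hypothetical, and a simple random sample stands in for whatever statistical sampling method a particular product or protocol actually uses.

```python
import random

def build_seed_set(doc_ids, sample_size, hand_picked=(), rng_seed=2012):
    """Return a random sample of doc_ids plus any hand-picked documents."""
    rng = random.Random(rng_seed)   # fixed seed so the sample can be reproduced
    sample = rng.sample(list(doc_ids), sample_size)
    # Append the hand-picked documents, dropping duplicates but keeping order.
    return list(dict.fromkeys(sample + list(hand_picked)))

# Hypothetical collection of 10,000 documents.
collection = [f"DOC-{n:05d}" for n in range(1, 10001)]

# 500 randomly sampled documents plus two that counsel already knows matter.
seed_set = build_seed_set(collection, 500, hand_picked=["DOC-00042", "DOC-09999"])
print(len(seed_set))
```

Recording the random seed is one possible way of making the selection repeatable if the process is later questioned.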

The computer then categorises and ranks the remainder of the documents based on the training from the seed set. Consistent categorisation is something computers do well; if a mistake is found, adjustments to the categorisation of the seed set can be made and the computer can re-categorise the documents.

This iterative process continues until the legal team is satisfied that the categorisation and ranking are correct. A decision is then made that all documents below a chosen rank are not relevant and those above it are potentially relevant. The potentially relevant documents are then reviewed by a secondary review team prior to inclusion in a discovery production.
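
The training, ranking and cutoff steps described above can be sketched in Python. This is a toy under stated assumptions: the word-weighting scheme (smoothed log-odds over the words in the seed set), the function names, the sample documents and the cutoff of zero are all invented for illustration and do not represent the algorithm of Recommind or any other vendor.

```python
import math
from collections import Counter

def train(seed_docs):
    """seed_docs: list of (text, is_relevant). Returns word -> log-odds weight."""
    rel, irr = Counter(), Counter()
    for text, is_relevant in seed_docs:
        # Count each word once per document.
        (rel if is_relevant else irr).update(set(text.lower().split()))
    n_rel = sum(1 for _, r in seed_docs if r)
    n_irr = len(seed_docs) - n_rel
    # Add-one smoothing keeps the logarithms finite for one-sided words.
    return {
        w: math.log((rel[w] + 1) / (n_rel + 2)) - math.log((irr[w] + 1) / (n_irr + 2))
        for w in set(rel) | set(irr)
    }

def score(text, weights):
    """Sum the weights of the distinct words in a document; unseen words count 0."""
    return sum(weights.get(w, 0.0) for w in set(text.lower().split()))

def rank(docs, weights):
    """docs: dict of id -> text. Returns (id, score) pairs, highest score first."""
    scored = [(doc_id, score(text, weights)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def split_at_cutoff(ranked, cutoff):
    """Divide ranked documents into potentially relevant and not relevant."""
    above = [doc_id for doc_id, s in ranked if s >= cutoff]
    below = [doc_id for doc_id, s in ranked if s < cutoff]
    return above, below

# Hypothetical seed set coded by the legal team.
seed = [
    ("merger price negotiation email", True),
    ("board approved merger terms", True),
    ("lunch menu friday", False),
    ("office party friday invitation", False),
]
# Hypothetical unreviewed documents.
docs = {
    "D1": "draft merger terms for board",
    "D2": "friday lunch order",
    "D3": "merger price update",
}

weights = train(seed)
ranked = rank(docs, weights)
potentially_relevant, not_relevant = split_at_cutoff(ranked, cutoff=0.0)
print(potentially_relevant)   # ['D1', 'D3']
print(not_relevant)           # ['D2']
```

In practice the legal team would inspect the ranking, correct any seed documents that were coded wrongly, and run the training and ranking steps again before settling on a cutoff.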

These workflows are what Judge Peck was most interested in, and they are what establish a defensible discovery process.

Will predictive coding technology transform the practice of law?

No, it alone is not going to be the cause of the transformation. The transformation has come from the fact that technology permeates every aspect of our personal and work lives; it is the size of the digital universe, its exponential rate of growth, and the fact that almost all evidence is now digital.

At LegalTech, Ralph Losey, lawyer and writer of the e-Discovery Team blog, stated that never before in history has a single generation of lawyers faced such a dramatic transformation of the format of the evidence that they need to manage in a case.

In conclusion, predictive coding is only one of the cures that have emerged to assist us in navigating this new information landscape, but it certainly won’t be the last.

