Rise of the lawyerbot?

Does predictive coding mean the end of the lawyer’s role in ediscovery? Freehills’ Andrew Caspersonn wonders whether we really can make do with smart machines.

I was browsing through some online articles on Early Case Assessment (or ECA) the other day and came across one on a particular piece of software. The software was being promoted as an end-to-end solution, and the article contained an interesting quote that I have paraphrased except for the words in italics:

“This product has been designed to scan data sets and assess data in the early stages of litigation, i.e. it conducts an early case assessment. The product is also made to preserve, cull, and collect electronically stored information (ESI) as well as analyse and report on what it has collected and conduct a first-pass review.”

I have a technology background and like to see a good piece of software given credit for making a process more efficient. I’m a bit concerned though, by the claim that the piece of software is performing early case assessment (let alone that it conducts a first-pass review…).

This prompted me to put some thought into ECA, which led me to the related topic of predictive coding, or Technology Assisted Review (TAR).

Software can be used to achieve a goal or solve a problem, but it can’t achieve the goal or solve the problem by itself. The tool doesn’t assess the case, it provides information about the data.

In my mind, software that claims to perform ECA is a little misleading. Whilst it might help a person filter, view or analyse the data, it doesn’t perform early case assessment; it simply provides the user with information about the data. Incredibly useful data, but still just data.

For example, if you point one of these tools at three people’s email databases and get it to ingest all the data, the tool may well group related documents using powerful algorithms, help you remove spam and provide detailed reports. It might show you who is emailing whom and even make you a cup of tea, but it won’t provide you with legal analysis about the merits of your case.

Early Data Analysis

Within Freehills, we have started using the term Early Data Analysis (or EDA) to describe how we use tools that provide a quick way to see what data has been collected, find gaps or identify additional custodians. We see it as a more accurate name for what these tools can do.

The tools don’t perform ECA and rather than making lawyers redundant, such tools probably make the senior lawyer’s role even more important at the beginning of a matter as they seek to understand what data they are dealing with.

In summary, whilst advanced software is making large volumes of data easier to understand and review, the process still requires smart people to put that information together with an understanding of the case for it to be of any use in the real world.

Andrew has 15 years’ experience as a litigation support analyst at Freehills. He has recently moved to Perth from Sydney to head up the Information Logistics group in the Freehills Perth office. These are his personal views.

Predictive coding

Predictive coding or Technology Assisted Review (TAR) goes hand in hand with EDA. 

TAR in its simplest form uses the same sorts of powerful algorithms used in the EDA phase to group documents with similar concepts and propagate coding between them. In this way, instead of reviewing the entire corpus of, say, 100,000 documents, experienced reviewers code a smaller set, say 10,000 documents, and this set is used as the ‘seed’ to propagate the coding to the remaining 90,000 documents.
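To make the propagation idea concrete, here is a deliberately minimal sketch, not any vendor’s actual algorithm: each unreviewed document simply takes the coding of the most similar seed document, with similarity measured by cosine distance over raw word counts. Real TAR tools use far more sophisticated text analytics; the document texts and codes below are invented for illustration.

```python
from collections import Counter
import math

def vectorize(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def propagate(seed, unreviewed):
    """Give each unreviewed doc the code of its most similar seed doc."""
    results = {}
    for doc_id, text in unreviewed.items():
        vec = vectorize(text)
        best_text, best_code = max(seed, key=lambda s: cosine(vec, vectorize(s[0])))
        results[doc_id] = best_code
    return results

# seed: (text, code) pairs already reviewed by the senior lawyers
seed = [
    ("project alpha contract draft terms", "relevant"),
    ("lunch on friday anyone", "not relevant"),
]
unreviewed = {
    "doc1": "revised contract terms for project alpha",
    "doc2": "friday lunch plans",
}
print(propagate(unreviewed=unreviewed, seed=seed))
```

The point of the sketch is the shape of the process, not the maths: the human coding decisions sit in the seed set, and the software only extends them.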

I think it’s safe to say that the same principles will apply as with EDA. The tool itself doesn’t make the key decisions. Rather, the tool interprets the decisions made by your experienced reviewers and propagates the decisions to the documents it has deemed to have the same concepts.

Likewise, the predictive coding software needs to be in the right hands. It requires experienced reviewers who understand the issues to code the starting set of documents and somebody who understands the statistical concepts to run the processes.

Understanding these key statistical concepts will, I think, make or break the use of predictive coding in Australia: concepts like random sampling, used both to create the initial set of documents to be reviewed and to create the sets that check the accuracy of the predictive coding.


Starting with EDA and moving on to TAR, let’s have a look at a not uncommon but hypothetical situation and see how good software + smart people + an understanding of the case can work. (This scenario is based on what could happen and has been simplified for the sake of time and space.) It should not be used as a real-world example.

It’s Thursday afternoon, 4pm and a portable hard drive arrives on your desk. It might be something you know about, or it might be a total surprise. A quick look at the drive (after virus scanning of course) shows 45 PST (Microsoft email store) files, each approximately 1.5 GB in size. They look like they have come out of an email archiving program and have the helpful names May2012_01.pst, May2012_02.pst through to May2012_45.pst.

Traditionally we would have run a linear process involving one-off searches and/or processing all emails that could take weeks before anything was available to be looked at by the legal team. This was because our processes were not as refined and the tools were slower and more cumbersome to use. 

Using some of the newer tools on the market, the process is much more streamlined and flexible. In our hypothetical, the approximately 70GB is ingested or sucked into the tool overnight and the following morning we have the ability to analyse the data received. 

First up, a litigation support expert reviews the logs, checks the exceptions, possibly OCRs any documents that were image-based and gets a picture of the data. System and known irrelevant files can also be removed at this stage. Once that’s done, say by Friday lunchtime, a senior lawyer and the litigation support expert can sit down together and ‘play’ with the data.

While this step might be focused on the discovery process, it’s even more useful for the legal team to get an understanding of the data. Searches can be run and the data displayed in many different ways: from date graphs that show potential key periods of time or gaps in the emails, to correspondence links showing the frequency of emails between certain people.

In our hypothetical, the senior lawyer decides to look at the email links. They select one of the known key players and start to look at who they were emailing in the critical period. This analysis shows a disproportionate number of emails going to two people not previously considered as having been involved in the transaction. The senior lawyer looks at a sample of emails for the first person and quickly realises they are all personal and could be ignored at least initially. 

The emails to and from the second person, however, are mostly related to the project, and further checks with the client show that the second person was a key individual who had recently left and so was missed in the initial email extraction. An urgent request can now be sent to the client to provide this additional person’s emails.

Key words garnered from the quick review of the above correspondence are then tested in real time. The senior lawyer and litigation support expert flag a group of potentially hot documents for quick upload, and later in the day these are loaded to the review platform for review by the entire team.

So by the end of Friday, the legal team have useful information about what documents they have been provided with, a gap in the data has been identified, potentially key documents are available to the whole team and everyone can go home and enjoy their weekend. Well that’s the idea anyway.

The beauty of the way these systems are designed is that when more information comes to light or the scope of the matter changes on Tuesday the following week, it’s easy for the lawyer and litigation support expert to sit down again and tweak the process. 

When required, the rest of the potentially relevant data can be moved in bulk to the review database and the discovery review process can kick off, which moves us on to TAR.

For this part of the hypothetical I am most grateful for the inspiration provided by Ralph Losey’s recent blog posts on the topic which can be found at http://e-discoveryteam.com/. Ralph goes into great detail about an example he is using to train his team using the Enron data set.

In our hypothetical it’s two months down the track and discovery orders have been made. By refining key words, custodian lists and date ranges, the number of documents that are to be reviewed is reduced to 150,000. Unfortunately, the timeframe is rather short with only three weeks allocated to the review phase.

In a standard linear review, every document is looked at and coded by a lawyer or paralegal. This is the current standard way of discovering documents. Depending on the percentage of relevant documents and how detailed they are, review rates vary from 300 to 1,000 documents a day per reviewer. The higher figures assume less than 30 seconds per document, which might be possible if the majority of documents are short emails, but in most cases is unrealistic.

Even with the higher review rates, it is easy to see that it would be incredibly difficult to complete the review within three weeks even with a large team of reviewers.
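The arithmetic behind that difficulty is worth spelling out. Taking the article’s own figures of 150,000 documents and 300 to 1,000 documents per reviewer per day, and assuming (for illustration) team sizes of 10 and 20 reviewers:

```python
docs = 150_000  # documents to review, per the hypothetical

for rate in (300, 1_000):        # documents per reviewer per day
    for team in (10, 20):        # assumed team sizes, for illustration only
        days = docs / (rate * team)
        print(f"{team} reviewers at {rate}/day: {days:.1f} working days")
```

Even the most optimistic combination, 20 reviewers sustaining 1,000 documents a day, needs 7.5 working days with no slack for quality checking, which shows why a three-week deadline is so tight for a linear review.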

TAR gives us a process that can hopefully reduce the time to complete the review. Instead of reviewing all 150,000, we get the tool to select a random sample of documents using the required statistical confidence levels. Depending on the confidence levels used, that could be fewer than 500 documents. In our hypothetical, let’s start with 1,000 documents for review.
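Where does a figure like “fewer than 500 documents” come from? One common way to size a simple random sample, Cochran’s formula, depends only on the confidence level and the margin of error you are prepared to accept (the specific z-scores and margins below are illustrative, not prescribed by any court or tool):

```python
import math

def sample_size(z, margin, p=0.5):
    """Cochran's sample-size formula for a large population.

    z      -- z-score for the confidence level (1.96 for 95%, 2.576 for 99%)
    margin -- acceptable margin of error, e.g. 0.05 for +/-5%
    p      -- assumed proportion; 0.5 is the conservative worst case
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

print(sample_size(1.96, 0.05))   # 95% confidence, +/-5% margin
print(sample_size(2.576, 0.02))  # 99% confidence, +/-2% margin
```

At 95% confidence with a 5% margin of error the formula gives 385 documents, comfortably under 500; tightening to 99% confidence and a 2% margin pushes the sample into the thousands, which is why the choice of confidence level matters so much in practice.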

These 1,000 documents are allocated to the two senior lawyers who have the best understanding of the matter. As they don’t need a matter briefing, they can start straight away. Assuming they can work on it pretty solidly, they should be finished after two or three days of review.

At this point, we get the software to propagate the review coding across the entire set of documents. We kick it off on Wednesday night as it takes a couple of hours to run. On Thursday morning we get the same lawyers to check what the computer has done. 

In our case, the initial review by the lawyers found 10 documents to be relevant. We would then expect the propagation to code a similar percentage of the entire set as relevant, which it does, returning 1,500 documents.
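That 1,500 figure is just the seed set’s relevance rate scaled up to the whole corpus, which is easy to check:

```python
seed_relevant, seed_size = 10, 1_000   # figures from the hypothetical
corpus = 150_000

rate = seed_relevant / seed_size       # 1% of the seed set was relevant
expected = rate * corpus
print(f"{rate:.1%} of seed relevant -> expect ~{expected:.0f} in the corpus")
```

If the propagation had returned a number wildly out of line with this back-of-the-envelope estimate, that itself would be a signal to stop and investigate before relying on the coding.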

We could leave the process here and discover the 1,510, but that would place too high a reliance on the software, so we refine the process by reviewing the 1,500 documents marked relevant as well as a sample of, say, 1,000 from the remaining documents deemed not relevant.

Our two senior lawyers start on this next set of documents and work through the weekend to have it complete by Tuesday afternoon of week two. A few documents marked relevant were changed to not relevant and vice versa. This now gives us 3,500 documents that have been reviewed, and on Tuesday evening we re-run the propagation.

The re-run changes the figures slightly. Instead of re-reviewing the whole set, the senior lawyers take Wednesday to review the changes and see if they agree with the software’s decisions. A couple of changes are made but nothing significant as the decisions were borderline.

Thursday is spent looking through the documents deemed to be irrelevant to see if there are documents that have been missed. While two documents were found to be relevant, they were only marginally relevant, so the decision is made to use the documents marked relevant, plus any hosts/attachments of those documents that were marked not relevant, as the production set.

Friday and into the first part of the following week is then used to finalise any privilege review, complete any masking and prepare the data and images for exchange. The three week deadline is met.

As I noted at the beginning, the hypothetical is not a real-world example, nor is it an example of best practice. Its aim is to illustrate the benefits that can be gained by combining smart people and smart technology to deal with the large volumes of electronic data we are seeing more and more in disputes these days.