Law firm tackles the unsearchable with contentCrawler

New Zealand IP law firm AJ Park has adopted DocsCorp's contentCrawler to provide an automated solution to convert legacy and newly-profiled PDF documents in its Autonomy iManage DMS to searchable PDFs.

The firm wanted to ensure that every document is 100% text searchable, minimising the risks associated with failing to produce documents on demand or failing to recognise conflicts of interest when taking on new clients. The service provides the firm with a single, enterprise-wide OCR solution that eliminates the need for multiple OCR workflows and processes.

AJ Park works with clients across a range of sectors including biotechnology, chemical, electrical and electronics, mechanical and manufacturing, materials and nanotechnology, IT and software industries. With offices in Auckland, Wellington and Sydney, the firm counts over a third of New Zealand’s top 100 companies and almost half of the global Fortune 500 companies as clients.

Stephen Field, a System Engineer at the firm, said: “We concluded that image-based documents in the Autonomy iManage DMS represented a serious risk to the firm. So, we started to look for a solution.”

The firm has been a DocsCorp client for several years, using its pdfDocs products for creating and editing PDF documents. It was through this relationship that the firm became aware of the contentCrawler product.

Stephen recalls how they obtained the contentCrawler audit tool to put some actual numbers on the scale of the problem and to build the business case for resolving it. “After running the audit tool on a section of the Autonomy iManage database, we concluded that there was about 30% of non-searchable content. When you have 4 million plus documents stored in Autonomy iManage, this is a sizeable number of documents being omitted from searches,” says Stephen.
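
As a rough check on the scale those figures imply, the audit result can be extrapolated to the whole library. The following is a minimal sketch, assuming the 30% share found in the audited section holds across all 4 million documents; the function and variable names are illustrative and are not part of contentCrawler.

```python
def estimate_non_searchable(total_documents: int, non_searchable_percent: int) -> int:
    """Extrapolate a sampled non-searchable share to the full library."""
    # Integer arithmetic avoids floating-point rounding on large counts.
    return total_documents * non_searchable_percent // 100


# Figures quoted in the article; applying the 30% rate uniformly is an assumption.
total_documents = 4_000_000
estimate = estimate_non_searchable(total_documents, 30)
print(f"Estimated non-searchable documents: {estimate:,}")
```

Under that assumption, roughly 1.2 million documents would have been missing from search results.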

The firm initially had two concerns. First, it wanted reassurance that contentCrawler would not modify or change the actual appearance of the document. In fact, contentCrawler does not modify the original documents; it simply adds a text layer to facilitate indexing and searching. Further assurances were given that it would preserve any annotations that might have been on the original, that it was 99.9% accurate, and that it supported more than 180 languages.

Secondly, the firm did not want to double up on storage. Here, contentCrawler offered the firm a number of options for saving documents back into Autonomy iManage: documents could be saved as a new version or could replace the original. AJ Park decided to replace the original with the new searchable PDF.

The firm decided to proceed with the purchase and deployment of contentCrawler. But before commencing, the IT department made a number of decisions on how contentCrawler would tackle the enormous library of over 4 million documents. The first decision was to automate the entire process.

The process would be end-to-end and fully automated, with contentCrawler assessing and converting documents, then saving the results in place of the originals, with no intervention from staff. This would allow the firm to run the contentCrawler service 24/7 to complete the task as quickly as possible. contentCrawler can also be run as a manual process with built-in “Hold for Review” options prior to the OCR and/or “Save to” stages.
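
The flow described above (assess, convert, save, with optional review holds before the OCR and save stages) can be pictured with a small sketch. This is a hypothetical model for illustration only: the `Document` class, the `process` function and its parameters are assumptions and do not represent contentCrawler's actual interface or internals.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    name: str
    searchable: bool
    log: List[str] = field(default_factory=list)


def always_approve(doc: Document, stage: str) -> bool:
    """Default reviewer for fully automated runs: never blocks a document."""
    return True


def process(
    doc: Document,
    hold_before_ocr: bool = False,
    hold_before_save: bool = False,
    review: Callable[[Document, str], bool] = always_approve,
) -> str:
    """Run one document through assess, OCR and save, honouring review holds."""
    doc.log.append("assess")
    if doc.searchable:
        return "skipped"  # already has a text layer, nothing to do
    if hold_before_ocr and not review(doc, "ocr"):
        return "held"
    doc.log.append("ocr")  # stand-in for adding a text layer
    if hold_before_save and not review(doc, "save"):
        return "held"
    doc.searchable = True  # the searchable copy replaces the original
    doc.log.append("save")
    return "replaced"


# Fully automated run: no holds, so the document goes straight through.
scan = Document("scan.pdf", searchable=False)
print(process(scan), scan.log)
```

With both hold flags left off, every image-based document passes through all three stages unattended, which corresponds to the automated mode the firm chose; switching a flag on hands the decision to the `review` callback instead.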

In addition to running the crawl as an automated process, AJ Park decided to tackle the problem in two stages. The first stage would focus on the conversion of all the legacy documents year by year, and the second would handle all newly-profiled documents. contentCrawler gives organisations the flexibility to work in either of two modes, or in both, precisely for this reason.

The Backlog mode handles all legacy documents whereas Active Monitoring processes recently profiled documents. Once the legacy PDF documents in the firm’s Autonomy iManage database had all been converted to searchable PDFs, the IT department turned their attention to ensuring all newly-profiled documents would be handled in a similar way. This provided the firm with a single, back-end OCR solution that eliminated the need for multiple OCR workflows and processes. It also allowed end users to forget about OCR and focus on other tasks.

DocsCorp has made a trial version of contentCrawler available to allow organisations to determine how much non-searchable content they have in their content repositories.