How to Leverage Unstructured Data

By Philip Dodds and Lecia Pearce

A few years ago, talk of using unstructured data sources in the enterprise was relatively rare, but that is rapidly changing. The reason: a thirst for data.

In today’s world, data is used both to run the day-to-day business and to drive it forward. Data assists you in finding new customers, guides you in predicting growth, and aids you in uncovering new opportunities.

Incorporating unstructured sources with traditional structured data will help you find missing data, validate the data you have, and capture whole new sources of data to drive your organization forward.

Validation against unstructured sources can be a powerful tool, since those sources are often the point of origin of the data (e.g., legal contracts and mortgage documents). This capability can be critical when businesses are attempting to satisfy regulatory requirements.

Augmenting existing structured data with information from rich, unstructured sources allows businesses to add concepts that were traditionally not available, such as user sentiment.

Driven by these two main motivations, businesses have started looking to this once-neglected collection of PDFs, Excel files, and Word documents as a source of valid and critical information. As they investigate ways to make this data accessible, it has become apparent that the standard technology toolkit is missing some capabilities needed to drive this adoption.

To leverage these unstructured sources, businesses must transition from the world of traditional Enterprise Document Management to a much richer and more complex one, where a range of capabilities comes into play.

Scoping the Problem

Before we go any further, let’s clarify what we mean by the term “unstructured data”.

Unstructured data typically refers to information that is not captured in a predefined, machine-readable structure. A diverse group of items falls under that umbrella, including:

  • Documents
  • Social media
  • Narrative or description fields
  • Emails

Though the list goes on, these types of data are extremely common in most organizations, often accounting for roughly 80% of their storage.

So why would you want to access this information?

In general, unstructured data lives at the edge of an enterprise’s data ecosystem. It’s data collected at the beginning of a process (a credit approval), gathered as your enterprise interacts with clients (emails, support tickets, complaints), or captured in your outbound reporting (regulatory or fiscal).

These edges are best thought of as spaces where rigid structure can be problematic to the process, so people interact in a semi-formal or informal way.

Most of this data would then be linked back to the enterprise’s systems. However, due to the cost of managing structured data, ingestion has typically been limited to information critical enough to justify that cost.

With the rise of the data-driven enterprise, business users want access to more information and more insights. The growth in machine learning (ML) and processing capacity has allowed us to start reaching into those unstructured sources to identify the data as needed.

Whether that means leveraging a FICO score for use in a risk model, validating the details of contracts, or assessing risk based on non-disclosure agreements, there is now a need to make these rich unstructured sources addressable so that we can incorporate them into the data ecosystem.

As you identify reasons to extract this rich data, you must consider the capabilities required to do so.

The list of capabilities outlined here is not exhaustive, but does highlight the key parts of an architecture that will allow you to journey into the world of addressing unstructured data.

Parsing

Unstructured data does not come in a single format. Documents have a range of technical formats (PDF, Word, Excel, images, text files, emails, etc.) and each format is a combination of content (text, images, tables, etc.) and metadata (filename, author, modification date, etc.). The most basic document parsing involves accessing the native formats and extracting this content and metadata.

Often, people start this work with tools like Apache Tika. However, these tools alone typically deliver limited value, because understanding a document often requires its layout and structure, not just its text. The way we understand information presented in documents is complex.
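
As a minimal sketch, assuming the tika Python package (a wrapper that talks to an Apache Tika server, which requires Java) and a hypothetical file name, basic content and metadata extraction looks something like this:

    from tika import parser  # pip install tika; starts a Java Tika server on demand

    # Tika auto-detects the format (PDF, Word, email, ...) and parses it
    parsed = parser.from_file("loan_agreement.pdf")  # hypothetical file

    metadata = parsed["metadata"]  # author, content type, dates, ...
    text = parsed["content"]       # flat text only -- all layout is lost

    print(metadata.get("Content-Type"))
    print(text[:500] if text else "no text extracted")

Notice what comes back: a metadata dictionary and one flat string. The spatial relationships that the following paragraphs argue are often essential have already been discarded.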

Documents use a wide range of presentations to convey conceptual information, from the narrative structure of a contract, to the structured form of a credit application, to a mix of both. Information can be presented in a rich form (a mortgage document) or a very terse form (a credit approval memorandum).

Before parsing the information, you need to consider your specific use case. If you are simply determining the sentiment of the text, or trying to find dates to understand temporal information, then extracting the raw text and discarding the document’s layout may be fine. However, if you want a very specific piece of information, you may need that layout to find it.

For example, labels exist on forms to identify the information associated with them. If a document contains multiple values of the same type (e.g., dates), extracting all of them without the context of their source labels may be worthless.

Labels are often placed to the left of their values, but may also be placed above or below them. Extracting all text from a document without taking this placement into account may render the data useless. It is also common to find differing layouts within the same document (labels to the left, at the top, or in tables); treating the entire document in one uniform fashion may produce still more meaningless data.
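
To make this concrete, here is a sketch of layout-aware extraction using the pdfplumber library (one option among several; the file name and the “Date:” label are hypothetical). It finds the value to the right of a label on the same line by comparing word coordinates:

    import pdfplumber

    def value_right_of(page, label, y_tolerance=3):
        """Return the text immediately to the right of a label on the same line."""
        words = page.extract_words()  # each word carries x0/x1/top/bottom coordinates
        matches = [w for w in words if w["text"] == label]
        if not matches:
            return None
        lab = matches[0]
        # Keep words on (roughly) the same line that start to the right of the label
        same_line = [w for w in words
                     if abs(w["top"] - lab["top"]) <= y_tolerance and w["x0"] > lab["x1"]]
        same_line.sort(key=lambda w: w["x0"])
        return " ".join(w["text"] for w in same_line) or None

    with pdfplumber.open("credit_application.pdf") as pdf:  # hypothetical form
        print(value_right_of(pdf.pages[0], "Date:"))

A label placed above its value would need a different rule, which is exactly why a single uniform pass over a document so often fails.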

Normalization

Since documents come in different formats, it’s useful to normalize them into a unified one. There are two approaches you can take: converting all formats to a single existing format (possibly HTML), or creating an abstract format that isn’t one of the native formats.

The approach you decide to use will be driven by the amount of value you believe there is in creating a new format. For example, you might want to support capabilities like annotation, lineage, or spatial structuring, which would lead you to a new format.

There are pros and cons to each approach, and your decision may shift as the needs of your organization evolve and as you find yourself needing to support new capabilities. In either case, you’ll need to develop a strategy to store these normalized forms, since you will need to track lineage of data back to the normalized form rather than the original document.
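
As an illustration of the second approach, the sketch below shows what a minimal abstract format might hold; the class and field names here are our own invention, not any standard:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ContentNode:
        """One piece of content: a paragraph, table, label, image caption, ..."""
        node_type: str                            # e.g. "paragraph", "table", "label"
        text: Optional[str] = None
        bbox: Optional[tuple] = None              # (x0, top, x1, bottom) spatial info
        children: list = field(default_factory=list)
        tags: dict = field(default_factory=dict)  # annotations added by later steps

    @dataclass
    class NormalizedDocument:
        source_uri: str                           # lineage back to the original file
        source_format: str                        # "pdf", "docx", "eml", ...
        metadata: dict = field(default_factory=dict)
        root: ContentNode = field(default_factory=lambda: ContentNode("root"))

Because every node can carry spatial information, annotations, and a pointer back to its source, this form supports the lineage and enrichment capabilities discussed below.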

Navigation and Pre-processing

Any real attempt to get information from a document will quickly move beyond the normalized form alone, and you will likely find that additional processing and preparation must be applied.

For example, you may need to identify and annotate the structure of the document (headers, footers, sections, outlines, etc.). You may also find that a pre-processing step needs to modify a normalized form (removing irrelevant data or labelling entities for a later step).

Rather than combining parsing and pre-processing into a single action, it’s best to separate this pre-processing (document enrichment) into a series of discrete steps that can be applied to a document. This approach allows you to combine steps across various document formats as needed.
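
Reusing the NormalizedDocument sketch from the normalization section, an enrichment step can be as simple as a function from document to document; the header rule below is deliberately naive:

    def tag_headers(doc: NormalizedDocument) -> NormalizedDocument:
        """Naive enrichment step: tag short, upper-case paragraphs as headers."""
        for node in doc.root.children:
            if node.text and node.text.isupper() and len(node.text) < 60:
                node.tags["structure"] = "header"
        return doc

Because each step shares the same signature, steps can be mixed, reordered, and reused regardless of the document’s original format.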

Pipelines

Pipelines are a natural way to combine the capabilities above into a meaningful and controlled approach. They represent the connections between documents, processing steps, and data in a way that easily communicates your problem-solving approach to non-technical users.
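
In its simplest form, a pipeline is just an ordered list of such steps applied in sequence. The sketch below continues the earlier examples; the step names in the final comment are hypothetical:

    from typing import Callable, List

    Step = Callable[[NormalizedDocument], NormalizedDocument]

    def run_pipeline(doc: NormalizedDocument, steps: List[Step]) -> NormalizedDocument:
        """Apply each enrichment step in order, keeping the processing history explicit."""
        for step in steps:
            doc = step(doc)
        return doc

    # Hypothetical pipeline: structure first, then entity labelling, then extraction
    # enriched = run_pipeline(doc, [tag_headers, label_entities, extract_label_values])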

Pipelines also allow you to track evolution of your processing. Once you successfully identify and extract the data specified in your original requirements, you’ll often discover additional data gathering opportunities against the same source, which generates new requirements. This phenomenon shows how rich unstructured data can be.

We have seen waves of new requirements originate from a single set of documents as capabilities start to bear fruit.

Don’t forget the Process

While you align all of these capabilities in your organization, one thing to bear in mind is that these efforts must sit within a well-defined process. All of this work should be captured and tracked in a development process that allows it to be controlled and managed.

Often people approach data projects — and ML specifically — as something beyond normal development. While there are some differences, you should not abandon those traits that help ensure quality in a software project.

Ensure you have version control in place. Build out testing to validate your work. Manage your deliverables so you are not just digging ever deeper into your data, but providing visible and usable results early and often.

You will need to build trust with a business community that will not be familiar with the technology. This will be much easier with tooling that allows you to show how a document was processed and where the data was found.

Final thoughts

It’s still early days in the enterprise adoption of these new data sources. While exciting, these times are not without risk. Making use of these unstructured data sources will require some re-alignment in most organizations.

It’s also worth emphasizing that this isn’t only an IT initiative. Business subject-matter experts (SMEs) are critical to making sense of the data sources, since there are no schemas or definitions to describe the documents you will need to address. Meanwhile, the ML tools currently available are constantly evolving.

With this ongoing evolution in mind, you should separate some concerns: focus on building infrastructure that can work with a range of documents and models, so that your solution does not suddenly become obsolete because of a new ML breakthrough.

The world of unstructured data will no doubt change in the coming years, but one thing is fairly certain: it is not going away. Gartner estimates that 80% of enterprise data is unstructured, and that volume is growing at an incredible rate. While managing that data is a challenge, harnessing it to drive your business is an equally massive opportunity.

Philip Dodds is CTO / Director of Analytics & Machine Learning and Lecia Pearce is Senior Engineer / Data Scientist at Infobelt, Inc., a US developer of information records management and compliance solutions.