Rewiring to Tackle Unstructured Data

By Xavier Pornain

Extracting value from (Big) Data requires the right tools for extraction. In the pursuit of this value, large enterprises and administrations have spent very large sums of money on tools for data analysis and business intelligence. But these tools, mature as they are, can only tackle about 10% to 20% of the data most companies deal with today: structured data. They only address the structured data from enterprise applications such as ERP, CRM, and SCM, pulled together in data warehouses and online analytical tools.

However, according to the leading industry analyst firm IDC, unstructured data accounts for 90% of enterprise data, a figure that includes data from sources outside the enterprise but relevant to its business processes. Unstructured data is mostly human-generated textual data, in many different languages, from documents like project or research reports, internal and external publications, product descriptions, work and governance procedures, and, of course, emails. Add semi-structured data to this, like log files and machine-generated data in general, and you understand the enormous volumes of data that remain untreated today.

Why is it that companies and administrations have, according to IDC, “significantly overinvested” in dealing with structured data and underinvested in dealing with unstructured data?

The down-to-earth, pragmatic answer may be “because they were lacking the tools to deal with unstructured, textual data.” That is, however, only part of the truth. Tools for syntactic and semantic analysis, known as Natural Language Processing (NLP), have been around for some time. They have become increasingly performant in recent years and are able to deal with sizeable chunks of enterprise data, whether that data is generated and kept inside or outside the enterprise.

We believe the disregard for unstructured data has deeper, cultural reasons: Our scientists, engineers, and programmers are trained to “structure the world” to understand it and dominate it. In that view, unstructured data is just data that you haven’t structured yet. It is a bit embarrassing: work unfinished, to be done ASAP.

You may dig even deeper for the roots and come to the book of Genesis: the creation is an act of creating a divine order – separation of the earth and the skies, of land and water, of light and darkness. And the mission given to humans to proliferate and dominate the world means emulating this process of creating order.

Coming back down to earth and to our daily chores of extracting value from data, the unfortunate truth is that the volume of unstructured data doubles every year, while the volume of structured data grows by only 20%. The effort of “structuring the world” is clearly doomed, about as successful as the quest for the Grail.

It appears, then, that our IT people, and possibly our managers, are wrongly wired: They need to accept unstructured data as a fact of life and make the best of it, instead of fighting it. Luckily for them, if they overcome their shame at the existence of unstructured data, they have the tools at their disposal to extract the value hidden (until now) in this treasure trove.

The news is even better for them: Today’s tools for analysing both structured and unstructured data are less costly and can be put in place more rapidly than many of the tools they spent their budgets on in the past.

Getting rewired may be all it takes to come to grips with a problem that seemed intractable and embarrassing in the past. But “rewiring” means a change in culture, and that is known to be one of the hardest problems of all.

How will this cultural change come about? Quite simply, by competitive pressure. Some enterprises are implementing ingenious projects today, using real-time big data analytics that include analytics of unstructured data. Others will need to follow suit. And they will need to do more than just copy. Inspiration rather than imitation is needed to ensure long-term competitiveness.

Xavier Pornain is VP of Sales for Sinequa.