How to ensure your unstructured data is AI-ready

By Russ Kennedy

While unstructured data accounts for 75% of enterprise data, it often goes unanalysed. Here's how to get it ready for AI.

Today’s largest organisations increasingly depend on insights from advanced analytics and artificial intelligence (AI) to make all manner of business decisions, large and small. Yet a great deal of valuable data remains largely untapped: unstructured data.

Depending on which analyst report you read, unstructured data accounts for anywhere from 70 to 80 per cent of all enterprise data, and it’s growing, yet this data has proved difficult to analyse. Organisations that can overcome the obstacles to analysing it stand to gain a significant competitive advantage over those that don’t.

The biggest hurdle to analysing unstructured data is the challenge of scale. It’s easy to plug into a file system that’s just a few terabytes in size, but large enterprises may have hundreds of millions of files that collectively represent multiple petabytes (PB) of storage. Even worse, those files are typically stored in multiple silos that are often physically separated by vast distances.

As a result, any large organisation that attempts to analyse a substantial amount of its unstructured data on-premises will find the process extremely cumbersome and expensive. Certainly, IT can deploy analytics and AI on-prem using clusters and frameworks such as Hadoop, but that doesn’t address the problem of accessing data stored in disparate silos. Large distances introduce unavoidable latency, so the process will be slow, and replicating that much data is both expensive and complex.

If data is stored in the cloud, however, providing access to it is much easier, especially since the major cloud providers now offer some very sophisticated AI, machine learning (ML) and advanced analytics services, such as Amazon EMR, Amazon Textract, Google BigQuery ML and Azure AI. Whether you want to analyse video, text or image files, there’s a cloud service you can employ, and, in many cases, you don’t need to be a data scientist to use it, as these services provide simple point-and-click interfaces.
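As an illustration, here is a minimal sketch of what calling one of those services programmatically can look like, using Amazon Textract through the boto3 SDK to pull text out of a scanned image already sitting in S3. The bucket, object key and region below are placeholders, not anything from a real deployment.

```python
# Minimal sketch: extracting text from a scanned image already stored in S3
# using Amazon Textract via boto3. Bucket, key and region are placeholders.
import boto3

textract = boto3.client("textract", region_name="eu-west-2")

response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "example-unstructured-data",
                           "Name": "scans/invoice-0001.png"}}
)

# Print each detected line of text with Textract's confidence score
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(f"{block['Confidence']:.1f}%  {block['Text']}")
```

The point is less the specific API than the pattern: once the files are in object storage, text, image and video analysis becomes a service call rather than an infrastructure project.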

Plus, cloud storage is object-based, a format well suited to big data because it’s highly scalable, non-hierarchical and easily accessible. With an object store, you can go straight to the data you need by its key, without navigating a folder structure or directory tree, and, even better for analytics and AI, objects carry a great deal of metadata alongside the data itself, providing extra information to produce better insights.
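To make that concrete, here is a brief, illustrative sketch of both points using Amazon S3 and boto3: an object is fetched directly by key, and its metadata can be read without downloading the object at all. The bucket, key and metadata shown are assumptions made for the example.

```python
# Illustrative sketch: direct, key-based access to an object and its metadata
# in Amazon S3 via boto3. Bucket and key names are assumptions.
import boto3

s3 = boto3.client("s3")

# Read system and user-defined metadata without downloading the object itself
head = s3.head_object(Bucket="example-unstructured-data",
                      Key="video/site-survey-042.mp4")
print(head["ContentLength"], head["LastModified"], head.get("Metadata", {}))

# Fetch the object body directly by key -- no directory tree to walk
obj = s3.get_object(Bucket="example-unstructured-data",
                    Key="video/site-survey-042.mp4")
payload = obj["Body"].read()
```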

The trick, of course, is getting all that unstructured data into the cloud in the first place. Moving or copying multiple petabytes of data is no easy task. Even with a 1 GB/s (gigabyte-per-second) connection, transmitting 10 PB of data takes roughly four months of continuous transfer. If time is not of the essence, there are tools that can move the data over the network. Whichever tool you use, however, it must understand the original format (most likely a file system) to read the data and then be able to write it in the cloud’s object store format.
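That four-month figure is easy to sanity-check with back-of-the-envelope arithmetic, assuming a perfectly sustained 1 GB/s link and no protocol overhead:

```python
# Back-of-the-envelope check of the transfer time quoted above, assuming a
# perfectly sustained 1 GB/s (gigabyte per second) link with no overhead.
data_bytes = 10 * 10**15        # 10 PB
throughput = 1 * 10**9          # 1 GB/s
seconds = data_bytes / throughput
print(f"{seconds / 86400:.0f} days")    # ~116 days, roughly four months
```

In practice, sustained throughput is usually lower and the transfer competes with production traffic, so real-world timelines tend to be longer still.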

Amazon has a service called AWS Snowmobile that will physically pull a semi-truck up to your site, copy up to 100 PB of data into a ruggedised storage container, and transport it directly to an AWS data centre. The process takes a few weeks, but that is still much faster than transferring such a high volume of data over the wire.

Of course, if you copy all that data to the cloud, IT now has to manage not only the file data stored in various silos around the globe, but also the duplicates in the cloud. So if you had 20 PB of data on-premises, you now have 40 PB to manage, encrypt, back up and secure.

There is an alternative to copying on-premises unstructured data to the cloud: store it all in the cloud in the first place. But simply storing file data in AWS S3 or Azure Blob Storage won’t necessarily work for all applications and use cases.

Putting aside the fact that these cloud storage services are object stores, which are not natively suited to storing files (and that’s a huge issue), hyperscale providers build their data centres in sparsely populated areas where real estate is inexpensive. That typically puts them hundreds or even thousands of miles from customers, who tend to be located in or near urban areas. Even at the speed of light, those distances introduce significant latency, which makes accessing files in the cloud painful and slow.

So while you may be able to analyse your unstructured data easily in the cloud, it may be all but inaccessible to your applications and users.

Thankfully, there are now multiple file data services that store the master copy of all file data in the cloud, but cache the most frequently used files locally so users get the performance they expect.
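The caching pattern itself is simple to sketch. The toy example below, with made-up bucket and path names, shows the general idea of serving hot files from a bounded local cache and falling back to the cloud copy on a miss; it is not how any particular vendor implements it.

```python
# Toy sketch of the caching pattern: the authoritative copy lives in object
# storage, while recently used files are served from a bounded local cache.
# Bucket name, cache size and keys are illustrative assumptions.
from collections import OrderedDict

import boto3

s3 = boto3.client("s3")
BUCKET = "example-file-service-master"

class LocalCache:
    def __init__(self, max_items=1000):
        self._items = OrderedDict()
        self._max = max_items

    def get(self, key: str) -> bytes:
        if key in self._items:                       # cache hit: serve locally
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: pull the master copy from the cloud
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        self._items[key] = body
        if len(self._items) > self._max:             # evict least recently used
            self._items.popitem(last=False)
        return body
```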

In these services, changes to files are uploaded to the cloud and then synchronised back to the caches at every other location. All data is encrypted using a key controlled only by the customer, so the service provider cannot access it. And because the cloud copy is itself replicated across multiple locations within the provider’s infrastructure, backup effectively takes place automatically.
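The customer-controlled-key model is worth illustrating in the abstract. The sketch below uses the Python cryptography package’s Fernet recipe purely to show the principle that only ciphertext ever leaves the customer’s control; it says nothing about how any specific service implements encryption or key management, which in practice involves dedicated KMS or HSM infrastructure.

```python
# Illustration of the principle only: data is encrypted with a key the
# customer generates and holds, so the provider only ever sees ciphertext.
# This is NOT any vendor's actual implementation.
from cryptography.fernet import Fernet

customer_key = Fernet.generate_key()     # generated and kept by the customer
cipher = Fernet(customer_key)

ciphertext = cipher.encrypt(b"file contents before upload")

# Only the ciphertext is sent to the cloud; without customer_key the
# provider cannot recover the plaintext.
plaintext = cipher.decrypt(ciphertext)
assert plaintext == b"file contents before upload"
```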

As a result, organisations can get all of their unstructured data into the cloud, where it can be easily fed into cloud-based AI, ML and analytics services, yet they still get local performance for file data with just a single copy of their data to manage.

The days of on-premises analysis of big data — especially big unstructured data — are rapidly coming to a close. Only the cloud possesses the scale and ubiquitous access required. The challenge is getting such a large amount of data into the cloud and then mitigating the cost and complexity of managing it. With the rise of hybrid cloud file services, enterprise IT can not only simplify file data management, but also ensure this valuable data is ready for AI and other analytics.

Russ Kennedy is chief product officer at Nasuni.