GenAI's Hidden Challenge: Mastering Unstructured Data

By Andy Milburn

As organisations enter a new phase of technological evolution, with a widespread move towards smarter systems built on Artificial Intelligence (AI) and Large Language Models (LLMs), the volume and accessibility of the data that feeds these systems will become critical.

AI is already in use across the modern enterprise, from fraud detection to personalised marketing, and its basic use is largely understood.

An LLM is an advanced type of AI model designed to understand, generate, and process natural language. This can extend to writing code as well, and LLMs are capable of being taught complex relationships and nuances in communication.

LLMs are themselves built using machine learning techniques, typically leveraging deep neural networks, and are trained on massive datasets of text from diverse sources such as books, articles, and websites. As they become more sophisticated, LLMs can begin to work through reasoning tasks, solve problems, and create original content. How well they do so comes down to the quality of the information they are being ‘fed.’

Unstructured data such as natural language, images and videos provides rich context for LLMs to learn from. To learn properly, make nuanced decisions and generate accurate, human-like responses, LLMs need clean, organised datasets. Unstructured data, which is often chaotic and disorganised, must therefore be properly categorised, labelled, and made accessible to ensure meaningful training; otherwise the quality of the language the models produce will suffer. Poorly organised data leads to biases, errors, or irrelevant insights, which undermines the reliability of AI and will likely have an adverse effect on the organisation too.
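As a rough illustration of what "categorised, labelled, and made accessible" can mean at the simplest level, the Python sketch below walks a folder and records a coarse category and size for each file. It is a minimal example built on assumptions, not a description of any particular product: the function names (`label_file`, `build_catalogue`, `summarise`) and the record fields are hypothetical, and the category is merely guessed from the file extension.

```python
import mimetypes
from collections import Counter
from pathlib import Path


def label_file(path: Path) -> dict:
    """Return a small metadata record for one file (hypothetical schema)."""
    # Guess the MIME type from the file name only; the content is not inspected.
    mime, _ = mimetypes.guess_type(path.name)
    # Use the top-level MIME type ("text", "image", "video", ...) as a coarse label.
    category = mime.split("/")[0] if mime else "unknown"
    return {
        "path": str(path),
        "category": category,
        "size_bytes": path.stat().st_size,
    }


def build_catalogue(root: str) -> list[dict]:
    """Walk a directory tree and label every regular file found under it."""
    return [label_file(p) for p in sorted(Path(root).rglob("*")) if p.is_file()]


def summarise(catalogue: list[dict]) -> Counter:
    """Count how many files fall into each category."""
    return Counter(record["category"] for record in catalogue)
```

Calling `summarise(build_catalogue("/some/share"))` on a folder of your own gives a first picture of what the estate contains. Real labelling would go much further (ownership, age, sensitivity, content-based classification), but even a catalogue this simple turns an opaque pile of files into something that can be queried.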

Organising unstructured data enables better semantic understanding and cross-referencing of related information, which in turn improves the business outcome that the LLM is designed to deliver.

Better visibility of unstructured data, which makes up as much as 90 per cent of the data estate for many companies, will also improve real-time decision-making within an organisation. In applications built on real-time interaction, such as fraud detection, customer support or diagnostics, AI depends on instantly accessible data. Visible and well-organised unstructured data allows LLMs to retrieve relevant insights quickly, improving response times.
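To make the link between organisation and retrieval speed concrete, here is a toy Python sketch of an inverted index: documents are indexed once, and each query then becomes a handful of dictionary lookups rather than a scan of every file. The function names and sample documents are invented for illustration, and production systems typically rely on dedicated search or vector-retrieval technology rather than anything this simple.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the ids of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Intersect the posting sets of all query terms.
    matches = [index.get(term, set()) for term in terms]
    return set.intersection(*matches)


# Hypothetical sample documents, for illustration only.
docs = {
    "ticket-1": "Customer reports suspected card fraud on a recent purchase",
    "ticket-2": "Customer asks how to reset a password",
    "note-7": "Fraud team guidance on card chargebacks",
}
index = build_index(docs)
results = search(index, "card fraud")
```

Here `results` contains the two documents that mention both "card" and "fraud". The up-front work of building the index is what makes each later lookup cheap, which is the same trade-off the article describes at enterprise scale.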

As LLMs scale to support diverse industries such as healthcare, legal and manufacturing, they require domain-specific unstructured data to be accessible and organised so that they can be fine-tuned and adapted to context. Disorganised data creates bottlenecks in deploying scalable AI systems.

It may seem obvious, but disorganised data carries unnecessary costs as well. There is the expense of holding onto a large, unwieldy data estate, plus the cost of moving files and information around. Organising unstructured data up front reduces inefficiencies in AI workflows, which in turn cuts many of the costs related to data preparation, processing, and storage for LLM applications. Spending a little time and capital up front to organise and gain visibility into data can have a big impact on the bottom line down the track.

It also pays to have one eye on the future. As AI systems evolve, they will require ever-more dynamic access to constantly growing data sources. A visible and well-organised repository of unstructured data ensures long-term adaptability and scalability of LLMs to new challenges and datasets.

The future of AI and LLM technology hinges on the visibility, organisation, and accessibility of unstructured data. As organisations continue to generate vast amounts of data, investing in effective data management systems will be key to unlocking the full potential of LLMs. By addressing the challenges posed by unstructured data, businesses can harness these advanced models to drive innovation, improve decision-making, and ensure ethical, scalable, and cost-effective AI applications.

Andy Milburn is Regional Director, APJ, Datadobi.