IBM Unlocking Unstructured Data for Generative AI
Unstructured data – buried in contracts, spreadsheets, and presentations – is one of the most valuable but underutilized resources in the enterprise. IBM is evolving watsonx.data to help organizations activate this data to drive more accurate, effective AI.
IBM says its evolution of watsonx.data will bring together an open data lakehouse with data fabric capabilities – like data lineage tracking and governance – to help clients unify, govern, and activate data across silos, formats, and clouds. Enterprises will be able to connect their AI apps and agents with their unstructured data using watsonx.data, which tests show can lead to 40% more accurate AI than conventional RAG.
IBM is also introducing watsonx.data integration, a single-interface tool for orchestrating data across formats and pipelines, and watsonx.data intelligence, which uses AI-powered technology to extract deep insights from unstructured data. They will be available as standalone products, with select capabilities also available through watsonx.data – maximizing client choice and modularity.
To complement these products, IBM recently announced its intent to acquire DataStax, which excels at harnessing unstructured data for generative AI. With DataStax, clients can access additional vector search capabilities. Further, watsonx is now integrated as an API provider within Meta's Llama Stack, enhancing enterprises' ability to deploy generative AI at scale and with openness at the core.
Edward Calvesbert, Vice President, Product Management, watsonx Platform, writes, “Enterprises are facing a major barrier to accurate and performant generative AI, especially agentic AI. But the barrier is not what most business leaders think.
“The problem is not inference costs or the elusive ‘perfect’ model. The problem is data.
“Organizations need trusted, company-specific data for agentic AI to truly create value: the unstructured data inside emails, documents, presentations, and videos. It is estimated that in 2022, 90% of data generated by enterprises was unstructured, but IBM projects only 1% is accounted for in LLMs.
“Unstructured data can be immensely difficult to harness. It is highly distributed and dynamic, locked inside diverse formats, lacks neat labels, and often needs additional context to fully interpret. Conventional Retrieval-Augmented Generation (RAG) is ineffective at extracting its value and cannot properly combine unstructured and structured data.
“IBM's new capabilities will enable organizations to ingest, govern and retrieve unstructured (and structured) data – and from there, scale accurate, performant generative AI.”