Leading AI Models Fail Accuracy Tests

A comprehensive evaluation of 37 major AI language models reveals significant weaknesses in factual accuracy that could pose compliance and operational risks for organisations deploying artificial intelligence tools.

The study by the University of Hong Kong's Business School found that while leading models such as GPT-5 and Claude 4 Opus performed best overall, all tested systems struggled with "factual hallucinations" – generating plausible but incorrect information that contradicts real-world facts.

Professor Jack Jiang, who led the research through the Artificial Intelligence Evaluation Laboratory, said hallucination control capability directly impacts the credibility of AI systems in professional settings including knowledge services, customer service and intelligent navigation.

The evaluation tested models on two types of hallucination: factual errors, where output conflicts with real-world information, and faithfulness errors, where models fail to follow user instructions or produce content that contradicts the input context.

Results showed two GPT-5 variants achieved the highest overall scores, 86 and 84, followed closely by two Claude 4 Opus models at 83 and 80. However, even the top-performing models scored below 75 on factual accuracy tasks, indicating room for improvement in enterprise-critical applications.

For compliance and risk managers, the findings highlight potential vulnerabilities when deploying AI tools for document analysis, regulatory reporting or customer communications where factual accuracy is paramount.

The study found models generally excelled at following instructions precisely but were more prone to fabricating facts – a pattern that could mislead decision-makers relying on AI-generated insights for business-critical processes.

Chinese models including ByteDance's Doubao 1.5 Pro showed balanced performance but lagged behind international leaders, while reasoning-focused models performed better than general-purpose versions at avoiding hallucinations.

The research comes as organisations increasingly integrate AI capabilities into Microsoft 365 and other enterprise platforms, making hallucination control a critical consideration for digital transformation initiatives.

Information managers implementing AI workflows should establish validation processes and human oversight mechanisms to mitigate risks from factual inaccuracies, particularly in regulated industries where compliance failures carry significant penalties.
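As a minimal, hypothetical sketch of what such a validation step might look like in practice, the snippet below routes an AI-generated draft to a human review queue unless every citation it carries can be traced to an approved source list. The Draft structure, the approved-source set and the status labels are illustrative assumptions, not recommendations from the study.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    """An AI-generated draft awaiting validation (hypothetical structure)."""
    text: str
    citations: list[str] = field(default_factory=list)
    status: str = "pending"


def validate_draft(draft: Draft, approved_sources: set[str]) -> Draft:
    """Escalate drafts whose citations are missing or fall outside an approved source list.

    Flagged drafts go to a human reviewer rather than being released automatically;
    nothing in this sketch calls a real model or external API.
    """
    unverified = [c for c in draft.citations if c not in approved_sources]
    if not draft.citations or unverified:
        draft.status = "needs_human_review"   # missing or unverifiable sources
    else:
        draft.status = "auto_approved"        # every citation traceable to an approved source
    return draft


if __name__ == "__main__":
    sources = {"annual_report_2024.pdf", "regulatory_filing_q3.pdf"}
    draft = Draft(
        text="Revenue grew 12% year on year.",
        citations=["annual_report_2024.pdf"],
    )
    print(validate_draft(draft, sources).status)  # auto_approved
```

In regulated settings the escalation rule would typically be stricter and logged, but the shape stays the same: automated checks narrow the queue, and a human signs off on anything the checks cannot verify.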

The full evaluation tested models on information retrieval, misinformation identification and contradictory-prompt scenarios to assess their ability to maintain factual consistency and contextual accuracy.
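For illustration only, a toy harness along those lines might score a model on whether its answers contain an expected fact (factual consistency) and whether they stay consistent with supplied context (faithfulness). The test cases, the substring check and the ask_model callable below are simplifying assumptions and are not the laboratory's actual protocol.

```python
# Hypothetical factual cases: each pairs a prompt with a fact a truthful answer must contain.
FACTUAL_CASES = [
    {"prompt": "In what year did GDPR enforcement begin?", "must_contain": "2018"},
    {"prompt": "What is the capital of Australia?", "must_contain": "Canberra"},
]

# Hypothetical faithfulness cases: the answer must stay consistent with the supplied context.
FAITHFULNESS_CASES = [
    {
        "context": "The contract terminates on 30 June 2026.",
        "prompt": "When does the contract terminate?",
        "must_contain": "2026",
    },
]


def evaluate(ask_model, cases, use_context: bool = False) -> float:
    """Return the fraction of cases where the model's answer contains the expected string.

    `ask_model` is any callable that takes a prompt and returns text; real benchmarks
    use far richer answer matching than a substring check.
    """
    passed = 0
    for case in cases:
        prompt = case["prompt"]
        if use_context:
            prompt = f"{case['context']}\n\n{prompt}"
        answer = ask_model(prompt)
        passed += case["must_contain"].lower() in answer.lower()
    return passed / len(cases)


if __name__ == "__main__":
    # A stub "model" that happens to answer correctly, just to show the harness running.
    stub = lambda prompt: "Canberra, in 2018 and 2026."
    print(f"factual score:      {evaluate(stub, FACTUAL_CASES):.2f}")
    print(f"faithfulness score: {evaluate(stub, FAITHFULNESS_CASES, use_context=True):.2f}")
```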
