Treasury Copilot Trial Reveals Mixed Results

An evaluation of Microsoft's Copilot AI assistant at the Australian Department of the Treasury has revealed both promising benefits and significant limitations of the technology, according to a newly released report.
The 14-week trial, involving 218 Treasury staff between May and August 2024, found that while Copilot showed clear benefits for basic administrative tasks, it fell short of initial expectations for more complex work. Only about one-quarter of participants reported using Copilot frequently by the trial's end, with many citing limitations in the product's capabilities and accuracy.
"The gap between expectations and reality adversely impacted product use," the report noted, with 59% of participants reporting that Copilot supported only 0-25% of their weekly workload, compared to pre-trial expectations that it would support 25-50% or more.
“Unrealistically high expectations at the trial outset may have contributed to the problem, as some staff were discouraged by the performance of the product and gave up using it.”
However, the trial revealed some unexpected positive outcomes, particularly in workplace accessibility and inclusion. Staff members who were neurodivergent, working part-time, or experiencing medical conditions reported that Copilot helped them manage their work more effectively, such as catching up on missed meetings or overcoming procrastination barriers.
“There were 4 use cases initially proposed for Copilot: generating structured content, supporting knowledge management, synthesising and prioritising information, and undertaking process tasks. The consensus from participants was that these use cases were appropriate for the Treasury context, but that Copilot was not appropriate for more complex tasks, mostly due to the limitations of the product itself. Participants expressed concerns about functionality relative to other generative AI products on the market.”
The evaluation found that Copilot was most effective at basic administrative tasks such as summarising meeting minutes, finding files and information on SharePoint Online, developing draft plans and documents, and adapting the tone of writing. One participant reported saving approximately six hours on a procurement review task using Copilot, while others found value in its ability to assist with coding and data analysis.
During the trial period, the proportion of participants who agreed that they struggled to find the information or documents they needed to complete their job fell by 11 percentage points. Qualitative responses also indicated that participants felt Copilot assisted with record-keeping and with recovering lost corporate knowledge in some areas.
Trial participants reported difficulties and concerns with ‘prompt engineering’, including trouble finding the right prompt to use, unhelpful outputs from prompts, and low-quality outputs for complex tasks.
There were also concerns that Copilot misattributed statements in documents, leading to incorrect summaries.
“Co-pilot often created fictional information when asking it to generate output,” said one trial participant.
The report estimated that an APS6-level staff member would need to save only 13 minutes per week on administrative tasks to offset the licence cost, suggesting potential cost-effectiveness despite the limited adoption.
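As a rough illustration of how a break-even figure of this kind can be derived (the salary, on-cost and licence numbers below are assumptions for illustration only, not figures from the report), the calculation divides the weekly licence cost by the staff member's effective cost per minute of work time:

    # Rough break-even sketch: how many minutes of saved time per week would
    # offset a per-seat Copilot licence? All figures below are illustrative
    # assumptions, not values taken from the Treasury report.

    ANNUAL_STAFF_COST = 120_000.0   # assumed APS6 salary plus on-costs (AUD)
    HOURS_PER_WEEK = 37.5           # standard APS full-time hours
    WEEKS_PER_YEAR = 52
    ANNUAL_LICENCE_COST = 600.0     # assumed per-seat licence cost (AUD)

    # Effective staff cost per minute of work time.
    cost_per_minute = ANNUAL_STAFF_COST / (WEEKS_PER_YEAR * HOURS_PER_WEEK * 60)

    # Weekly licence cost, and the weekly time saving needed to match it.
    weekly_licence_cost = ANNUAL_LICENCE_COST / WEEKS_PER_YEAR
    break_even_minutes = weekly_licence_cost / cost_per_minute

    print(f"Staff cost per minute:  ${cost_per_minute:.2f}")
    print(f"Break-even time saving: {break_even_minutes:.1f} minutes per week")

With these assumed inputs the break-even point lands at roughly 11 minutes per week, the same order of magnitude as the report's 13-minute estimate; the exact figure depends on the salary and licence costs the evaluators used.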
A key finding was the need for better training and support. "There was scope for more education and training to support participant onboarding," the report stated, with many participants requesting more tailored guidance throughout the trial period.
The evaluation makes seven recommendations for future AI implementation, including providing clear use cases, taking a phased approach to rollout, and developing guidelines for transparent use of AI tools. It also emphasises the importance of monitoring both work outcomes and staff wellbeing in future implementations.