Unified Agents Fail 55% of Real-World Tasks
Based on research by CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu
Imagine an AI assistant that can look at a screen, search the web, and write code, all within a single workflow, to solve complex problems. While these unified digital agents are becoming common, we do not yet know whether they can truly handle real-world tasks that require juggling multiple skills at once. Current testing methods often fail to reveal how well these systems actually perform when several skills must work together on a single, high-stakes task.
To close this gap, researchers have introduced CocoaBench, a new benchmark built from human-designed tasks that demand flexible combinations of vision, search, and coding. Unlike previous tests that check each ability in isolation, this benchmark evaluates agents through nothing more than a plain task instruction and an automatic function that judges the final result. Because the harness never depends on an agent's internal design, testing stays reliable and scalable across different agent infrastructures.
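To make that protocol concrete, here is a minimal Python sketch of what instruction-plus-verifier evaluation could look like. The names here (`Task`, `verify`, `evaluate`, `echo_agent`) are illustrative assumptions, not the actual CocoaBench interface; the key idea from the paper is simply that the agent receives only an instruction and is scored by an automatic check on its final output.

```python
# A minimal sketch of outcome-based agent evaluation. All names are
# hypothetical illustrations, NOT the CocoaBench API: the described
# protocol only requires a task instruction plus an automatic verifier
# that judges the final result.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    instruction: str                    # the only input the agent sees
    verify: Callable[[str], bool]       # automatic pass/fail judge of the final result

def evaluate(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Run the agent on each task and return the overall success rate."""
    passed = 0
    for task in tasks:
        final_result = agent(task.instruction)  # agent internals stay a black box
        if task.verify(final_result):
            passed += 1
    return passed / len(tasks)

# Toy usage: one task whose verifier checks the final answer string.
if __name__ == "__main__":
    tasks = [Task(instruction="Compute 2 + 2 and reply with the number.",
                  verify=lambda out: out.strip() == "4")]
    echo_agent = lambda instruction: "4"  # stand-in for a real unified agent
    print(f"success rate: {evaluate(echo_agent, tasks):.1%}")
```

Because the harness never inspects the agent's intermediate steps, the same task set can score any agent stack, which is what allows evaluation to scale across different infrastructures.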
The results are startling. Even the best system evaluated achieved a success rate of just 45.1% on these challenging tasks. The analysis reveals that current agents struggle significantly with reasoning, planning, correct tool use, and understanding visual information. These failures highlight a substantial gap between today's technology and the reliable performance needed for practical applications such as software engineering or deep research.
The takeaway is clear: while progress is being made, unified digital agents are far from ready to be fully trusted in complex environments. There is substantial room for improvement in how these systems plan their actions and ground them in what they see on screen. Developers must strengthen these specific weak points before expecting AI assistants to handle the unpredictability of real-world work.