AI Agents Fail at Basic Cross-App Tasks
Based on research by Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu
We rely on computers to handle complex professional workflows, yet our most advanced AI agents still struggle to navigate between different programs. New research reveals a startling gap: while AI can master isolated tasks, it largely collapses when asked to coordinate across multiple applications as a human professional would. This isn't just a minor glitch; it is a fundamental barrier to truly autonomous work.
Researchers have introduced WindowsWorld, a benchmark designed to test GUI agents in realistic, multi-application environments. Unlike previous tests that focused on single apps, this study evaluates how well agents can juggle tasks across seventeen common desktop applications. The benchmark includes 181 complex tasks, nearly 80% of which require coordinating across multiple programs. These tasks are generated to mirror real-world professional activities, demanding that the AI switch contexts, manage data, and execute multi-step procedures without human intervention.
The results are sobering. Leading large models and agents achieved success rates below 21% on these multi-application tasks, a dramatic drop from their performance on simple, single-app challenges. The agents frequently stall at early sub-goals when conditional reasoning across three or more applications is required. Even when they do not fail outright, their execution is inefficient, often taking far more steps than a human would. This suggests that current AI lacks the contextual awareness needed for true professional automation.
The takeaway is clear: we are far from AI that can reliably handle complex, cross-platform work. Until models can seamlessly coordinate across multiple applications with human-like efficiency, they will remain limited to simple, isolated tasks. The path to true autonomy requires solving the multi-application coordination problem, not just improving individual app interactions.