
CUA-Suite Unveils 55 Hours of Expert Videos to Train Smarter Desktop Agents

Based on research by Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li

Forget the 2 million screenshots that have defined training data for computer-use agents; a new benchmark shows how hard it is for current models to handle real-world desktop work without continuous video. Researchers from Stockholm Technical and AI Consult present CUA-Suite, a large dataset featuring 10,000 human-demonstrated tasks across 87 professional applications.

The core problem is the scarcity of high-quality video data. Existing datasets like ScaleCUA offer fewer than 20 hours of content, which is not enough to teach agents complex workflows. CUA-Suite changes this by providing 55 hours of expert screen recordings at 30 fps, about 6 million frames in total, complete with detailed cursor traces and reasoning annotations. Unlike sparse data that shows only final click coordinates, this continuous stream preserves the full timing and logic of human interaction.
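The frame count follows directly from the recording length and frame rate. A quick sanity check is sketched below; the helper names are illustrative, and the second calculation assumes (this post does not state it) that a screenshot collection can be compared to video by treating each image as one 30 fps frame.

```python
SECONDS_PER_HOUR = 3600

def total_frames(hours: float, fps: int) -> int:
    """Number of frames in `hours` of video recorded at `fps`."""
    return int(hours * SECONDS_PER_HOUR * fps)

def equivalent_hours(frames: int, fps: int) -> float:
    """Hours of video that `frames` images would fill at `fps`.

    Assumption: each screenshot is counted as a single video frame.
    """
    return frames / (fps * SECONDS_PER_HOUR)

# 55 hours at 30 fps, the figures quoted for CUA-Suite.
print(total_frames(55, 30))  # 5940000, i.e. about 6 million

# 2 million screenshots expressed as 30 fps video (assumed comparison).
print(round(equivalent_hours(2_000_000, 30), 1))  # 18.5
```

Under that assumption, 2 million screenshots amount to roughly 18.5 hours of footage, which is in line with the "fewer than 20 hours" figure quoted for existing data.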

Current AI models struggle in these professional environments, failing on about 60% of tasks. The new suite also includes UI-Vision, a benchmark for testing planning skills, and GroundCUA, which provides over 3.6 million UI annotations for grounding agents in on-screen elements. By releasing everything publicly, the team aims to boost research into screen parsing and visual world models. Because continuous video is a superset of the sparse formats, developers can convert it into the inputs that existing frameworks expect without losing information. The release marks a shift from sparse images to rich video for building general-purpose computer-use agents.
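To illustrate why a dense recording is a superset of sparse click data, here is a minimal sketch that reduces a per-frame cursor trace to the click coordinates a screenshot-based pipeline would store. The `CursorSample` record and `to_sparse_clicks` function are hypothetical and invented for this example; they are not the CUA-Suite schema or API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CursorSample:
    """One per-frame cursor reading (hypothetical format, not the dataset's)."""
    frame: int
    x: int
    y: int
    pressed: bool  # True while the mouse button is held down

def to_sparse_clicks(trace: List[CursorSample]) -> List[Tuple[int, int, int]]:
    """Keep only the frames where the button goes from released to pressed."""
    clicks = []
    previously_pressed = False
    for sample in trace:
        if sample.pressed and not previously_pressed:
            clicks.append((sample.frame, sample.x, sample.y))
        previously_pressed = sample.pressed
    return clicks

# A toy trace: the cursor moves, clicks, holds, releases, then clicks again.
trace = [
    CursorSample(0, 10, 10, False),
    CursorSample(1, 20, 15, False),
    CursorSample(2, 30, 20, True),
    CursorSample(3, 30, 20, True),
    CursorSample(4, 40, 25, False),
    CursorSample(5, 50, 30, True),
]
print(to_sparse_clicks(trace))  # [(2, 30, 20), (5, 50, 30)]
```

The conversion only works in one direction: the click list can always be derived from the full trace, but the cursor path and timing between clicks cannot be recovered from the click list alone.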

Source: CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents by Xiangru Jian et al., https://arxiv.org/abs/2603.24440


This post was generated by staik AI based on the academic publication above.