
The 60-Frame Fix for Computer Automation

Based on research by Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li


Researchers have identified the bottleneck holding back AI desktop assistants: current models fail roughly 60% of the time on professional software because they have never seen how humans actually move a mouse.

The new CUA-Suite dataset addresses this by providing 55 hours of continuous 30 fps video recordings in which experts perform 10,000 complex tasks across 87 different applications. Unlike previous resources that rely on sparse screenshots, this collection captures every pixel and cursor trace in real time, preserving the full temporal dynamics of actions like hovering, clicking, and scrolling.
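To see why continuous 30 fps video matters, consider aligning low-level cursor events with the frames that show them. The sketch below is illustrative only: the event schema, field names, and helper functions are hypothetical and not taken from the CUA-Suite release; only the 30 fps frame rate comes from the article.

```python
from dataclasses import dataclass

FPS = 30  # CUA-Suite recordings are captured at 30 frames per second

@dataclass
class CursorEvent:
    # Hypothetical event schema for illustration; the real dataset's
    # on-disk format may differ.
    t: float   # seconds since recording start
    x: int     # cursor x position in pixels
    y: int     # cursor y position in pixels
    kind: str  # "move", "click", "scroll", ...

def frame_index(t: float, fps: int = FPS) -> int:
    """Map an event timestamp to the nearest video frame index."""
    return round(t * fps)

def align_events_to_frames(events):
    """Group cursor events by the frame they land on, so each frame
    can be paired with the low-level actions it depicts."""
    by_frame: dict[int, list[CursorEvent]] = {}
    for ev in events:
        by_frame.setdefault(frame_index(ev.t), []).append(ev)
    return by_frame

# Example: a hover followed by a click 100 ms later ends up on
# two distinct frames (30 and 33), something a sparse-screenshot
# dataset would collapse into a single still image.
trace = [CursorEvent(1.00, 420, 310, "move"),
         CursorEvent(1.10, 421, 311, "click")]
aligned = align_events_to_frames(trace)
```

At 30 fps, events only 100 ms apart still fall on separate frames, which is exactly the temporal detail (hover before click, scroll momentum) that sparse screenshots throw away.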

This shift from static data to continuous video allows agents to learn the nuance of interaction rather than just guessing coordinates. The dataset also includes 56,000 labeled screenshots and a rigorous benchmark called UI-Vision to test how well models can interpret complex user interfaces. While early tests show that foundation action models struggle with professional tools, this rich multimodal corpus offers the necessary training ground to build generalist agents capable of handling real-world desktop workflows.

Source: "CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents" by Xiangru Jian et al., available at https://arxiv.org/abs/2603.24440


This post was generated by staik AI based on the academic publication above.