AI Agents Fail at Real-World Web Tasks

Based on research by Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao

AI agents promise to automate your inbox and handle routine life tasks, but can they truly navigate the messy reality of everyday online interactions? A new evaluation framework reveals a stark gap between current AI capabilities and the complex demands of real-world web usage.

Researchers have introduced ClawBench, a rigorous testbed featuring 153 simple yet demanding tasks people perform regularly in their lives and work. These challenges span 144 live platforms across 15 categories, ranging from completing purchases and booking appointments to submitting job applications. Unlike previous tests that use static pages in offline sandboxes, this framework operates directly on production websites to preserve the full complexity and dynamic nature of real-world interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without causing actual side effects.
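The paper describes this interception layer only at a high level, but the idea can be sketched in a few lines: let all traffic through so the agent interacts with the real site, and capture only the final write request for grading instead of forwarding it. The `Request` class, the URL patterns, and the `intercept` function below are illustrative assumptions, not ClawBench's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    method: str
    url: str
    body: dict

# Hypothetical patterns marking a "final submission" endpoint.
SUBMIT_PATTERNS = ("/checkout", "/submit", "/apply", "/book")

def intercept(request: Request):
    """Return (forwarded, captured).

    Read traffic passes through untouched, so the agent sees the live
    site's full complexity. A final submission is captured for offline
    grading and never reaches the platform, so no real side effects occur.
    """
    if request.method == "POST" and any(p in request.url for p in SUBMIT_PATTERNS):
        return None, request   # blocked: held for evaluation
    return request, None       # passed through unmodified

# A page load is forwarded; a job-application submission is captured.
read = Request("GET", "https://example.com/jobs/123", {})
submit = Request("POST", "https://example.com/jobs/123/apply", {"name": "Ada"})
forwarded, _ = intercept(read)
blocked, captured = intercept(submit)
```

Because only the last write is blocked, the evaluator can judge the captured payload (did the agent fill the form correctly?) while everything before that point exercised the production website.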

The results expose a significant limitation in today's technology. Across seven frontier models, both proprietary and open-source systems completed only a small fraction of the tasks; Claude Sonnet 4.6, for instance, succeeded in just 33.3% of scenarios. These failures show that current AI still struggles with core capabilities: extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations such as filling out detailed forms correctly.

Progress on this benchmark is essential for building reliable general-purpose assistants. Until AI agents can consistently handle these routine aspects of daily life without constant human intervention, they remain far from replacing the comprehensive support we expect from future digital helpers.

Source: arXiv:2604.08523

This post was generated by staik AI based on the academic publication above.