AI's 3D Intelligence Scores Are a Lie
Based on research by Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen
Current tests for how well AI models understand 3D space are fundamentally broken. They rely on annotations repurposed from static 3D perception and ignore how modern vision-language models actually consume video, producing inflated scores that hide real-world failures. This means we have been overestimating the spatial intelligence of our most advanced AI systems.
The core issue is a disconnect between how benchmarks are constructed and how models are actually run. Previous studies often reused 3D annotations originally designed for static perception, treating them as ground truth for video analysis. This introduces severe artifacts: objects clearly visible in the video are missed, identities are mislabeled, and geometric details such as size are corrupted. Worse, these tests assume the AI sees every frame of a scene, while in reality models typically operate on a sparse sample of frames, making many benchmark questions impossible to answer correctly regardless of the model's true ability.
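To make the mismatch concrete, here is a minimal sketch of uniform frame sampling, the strategy most VLM pipelines use, plus a check for whether a question is even answerable from the sampled frames. The function names, the 16-frame budget, and the per-frame visibility set are all illustrative assumptions, not details from the paper.

```python
def sample_frames(num_frames: int, budget: int) -> list[int]:
    """Uniformly pick `budget` frame indices from a video of `num_frames`."""
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]


def is_answerable(visible_in: set[int], sampled: list[int]) -> bool:
    """A question about an object is only fair if that object appears
    in at least one frame the model actually receives."""
    return any(frame in visible_in for frame in sampled)


# An object visible only in frames 200-230 of a 1,000-frame scan is never
# seen under a 16-frame budget, so the question cannot be answered.
sampled = sample_frames(num_frames=1000, budget=16)
object_frames = set(range(200, 231))
print(is_answerable(object_frames, sampled))  # False
```

At sparse budgets this situation comes up constantly, which is exactly why annotations built from the full frame set overstate what a model was ever shown.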
Researchers address this with ReVSI, a new evaluation protocol designed to reflect how VLMs actually operate. They re-annotated objects and geometry across 381 scenes from five different datasets using professional 3D tools, and regenerated every question-answer pair with rigorous bias mitigation and human verification. The benchmark also provides variants at several frame budgets, enabling precise diagnosis of how visibility limits performance.
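A natural way to use those variants is a frame-budget sweep: score the same question set at each budget and watch how accuracy moves. The sketch below assumes a hypothetical benchmark loader and model interface (`variant`, `answer`, and the item fields are my inventions, not the paper's API); it only illustrates the diagnostic idea.

```python
FRAME_BUDGETS = [8, 16, 32, 64]


def sweep(model, benchmark) -> dict[int, float]:
    """Score identical QA pairs under each frame budget, so accuracy
    changes can be attributed to visibility rather than reasoning."""
    accuracy = {}
    for budget in FRAME_BUDGETS:
        correct = total = 0
        for item in benchmark.variant(frames=budget):  # hypothetical API
            prediction = model.answer(item.frames, item.question)
            correct += int(prediction == item.answer)
            total += 1
        accuracy[budget] = correct / total
    return accuracy
```

A curve that climbs steeply with the budget suggests earlier scores were bounded by what the model could see, not by what it could reason about.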
The results reveal systematic failure modes that previous benchmarks obscured. By aligning evaluation conditions with actual model inputs, ReVSI exposes the true limits of current spatial reasoning capabilities. This offers a more reliable and diagnostic assessment, forcing developers to confront the gap between theoretical potential and practical performance in 3D understanding.