AI’s Next Big Leap: Hearing and Seeing Together
Based on research by You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng
Imagine a world where AI doesn't just see or hear, but truly experiences both simultaneously. This is the promise of Audio-Visual Intelligence, a frontier that is rapidly reshaping how machines perceive and interact with reality. As foundation models grow more powerful, the ability to bridge sound and sight is no longer a novelty—it is becoming the core of next-generation AI.
Researchers are now focusing on unified architectures that process audio and visual data jointly. This shift moves beyond simple recognition to enable complex tasks such as generating realistic video from sound or building interactive dialogue systems. Recent industrial breakthroughs, such as Meta MovieGen and Google Veo-3, underscore the urgent push toward models that handle dynamic, temporal signals and support more natural, controllable human-computer interaction.
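To make "processing audio and visual data jointly" concrete, here is a minimal, hypothetical sketch of late fusion in PyTorch: embeddings from an audio encoder and a visual encoder (stood in for by random tensors here) are projected into a shared space, concatenated, and passed to a joint prediction head. The class name, dimensions, and classification task are illustrative assumptions, not the architecture of any specific model discussed in the survey.

```python
# Minimal, illustrative sketch of audio-visual late fusion (an assumed design,
# not the method of any particular system named in the article).
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # Fuse by concatenation, then predict jointly over both modalities.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_emb, video_emb):
        a = torch.relu(self.audio_proj(audio_emb))
        v = torch.relu(self.video_proj(video_emb))
        fused = torch.cat([a, v], dim=-1)  # joint audio-visual representation
        return self.classifier(fused)


# Toy usage with random tensors standing in for real encoder outputs.
model = AudioVisualFusion()
audio = torch.randn(4, 128)   # e.g., pooled spectrogram features
video = torch.randn(4, 512)   # e.g., pooled frame features
logits = model(audio, video)
print(logits.shape)           # torch.Size([4, 10])
```

Real systems typically replace this simple concatenation with cross-modal attention and temporal alignment, but the sketch captures the basic idea of learning a single representation from both streams.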
However, the field faces a significant hurdle: fragmentation. Despite rapid progress, the literature is scattered across inconsistent taxonomies and divergent evaluation protocols, making it difficult to compare results or build on prior work. Without shared standards, systematic comparison and knowledge integration stall, and the community struggles to make sense of the explosion in research output.
This comprehensive survey aims to consolidate the chaos into a coherent framework. By establishing a unified taxonomy and curating key datasets and benchmarks, it offers a structured path forward. The ultimate takeaway is clear: to unlock the full potential of large-scale audio-visual intelligence, the industry must move from isolated experiments to a standardized, integrated approach that addresses critical challenges in synchronization, spatial reasoning, and safety.