Why Let Humans Teach Them
Based on research by Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang
Imagine a student solving complex equations without ever checking an answer key, instead critiquing their own work in real time until the answers improve on their own. This is no longer science fiction: multimodal AI models are beginning to do exactly that by judging themselves. Researchers have unveiled a training framework that lets large multimodal models improve their reasoning skills entirely without supervision.
The core conflict challenges an industry standard: for years, improving AI math performance has depended on expensive human annotations or on distilling intelligence from more powerful teacher models, both of which are difficult to scale. The new method breaks this reliance with an internal balancing act the authors call bounded judge-based modulation. Instead of relying on external answers, the system samples multiple reasoning paths for every question, treats those paths as a group, and converts their absolute judge scores into relative advantages, effectively teaching the model to distinguish good logic from bad without any human feedback. This self-consistent signal drives continuous improvement across five major mathematical benchmarks using only unlabeled data. The result is a scalable path toward models that teach themselves better reasoning strategies, suggesting that genuinely autonomous improvement in AI development is within reach.
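The group-to-relative-advantage step can be sketched in a few lines. This is a minimal illustration of group-relative advantage normalization under stated assumptions, not the paper's actual implementation; the function name and the example judge scores are hypothetical.

```python
import statistics

def group_relative_advantages(scores):
    """Convert absolute judge scores for a group of sampled reasoning
    paths into relative advantages (illustrative sketch).

    Paths scored above the group mean receive a positive advantage and
    below-mean paths a negative one, so the policy is nudged toward the
    better paths in the group without any external labels.
    """
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0  # guard against zero spread
    return [(s - mean) / std for s in scores]

# Hypothetical judge scores for four sampled reasoning paths
scores = [0.9, 0.4, 0.6, 0.1]
advantages = group_relative_advantages(scores)
```

Because the advantages are centered on the group mean, they sum to zero: the model is rewarded only for producing reasoning paths that outperform its own other attempts on the same question, which is what makes the signal self-consistent.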
Zhengxian Wu et al., When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning, https://arxiv.org/abs/2603.21289