
Your AI paints the wrong picture when you add too many reference images

Based on research by Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu

When asked to generate images from multiple visual inputs, current AI models quickly fail. Instead of blending references seamlessly, performance degrades sharply as the number of input images grows, producing incoherent results.

Researchers have now identified the culprit: a data bottleneck. Existing training sets overwhelmingly consist of single-image pairs, leaving models unprepared for the structured, long-context supervision required to understand relationships among many sources. To address this, the team introduced MacroData, a collection of 400,000 training samples. Each entry can contain up to ten reference images and is organized along one of four task dimensions: customization, illustration, spatial reasoning, and temporal dynamics. This coverage lets models learn dense dependencies between multiple inputs without breaking down.
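To make the dataset's shape concrete, here is a minimal sketch of what one MacroData-style sample might look like. The field names and validation logic are illustrative assumptions for this post, not the paper's actual schema; only the constraints (up to ten references, four task dimensions) come from the source.

```python
from dataclasses import dataclass

# Hypothetical sketch of a multi-reference training sample; field names
# are assumptions, not the paper's published schema.
@dataclass
class MultiRefSample:
    instruction: str            # text prompt tying the references together
    reference_images: list[str] # paths to 1-10 reference images
    target_image: str           # path to the ground-truth output image
    task_dimension: str         # one of the four task dimensions

# The four dimensions named in the paper (short labels are assumptions).
VALID_DIMENSIONS = {"customization", "illustration", "spatial", "temporal"}

def validate(sample: MultiRefSample) -> bool:
    """Check a sample against the dataset's stated constraints."""
    return (
        1 <= len(sample.reference_images) <= 10
        and sample.task_dimension in VALID_DIMENSIONS
    )

sample = MultiRefSample(
    instruction="Place the cat from image 1 in the room from image 2",
    reference_images=["cat.png", "living_room.png"],
    target_image="composite.png",
    task_dimension="customization",
)
print(validate(sample))  # True
```

A real loader would pair such records with image tensors, but the point here is simply the structure: many references, one target, one task label.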

Beyond the dataset, the team also released MacroBench, a standardized benchmark of 4,000 samples designed to evaluate generative coherence across different task scales. Testing confirms that fine-tuning on MacroData yields substantial improvements in multi-reference generation. Ablation studies further show that cross-task co-training provides synergistic benefits and offers effective strategies for handling long-context complexity, suggesting that better data organization is key to scalable multi-reference image synthesis.

The full dataset and benchmark will be made publicly available soon. Chen et al., "MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data", https://arxiv.org/abs/2603.25319

Source: arXiv:2603.25319

This post was generated by staik AI based on the academic publication above.