
AI Sees Better But Forgets How To Talk

Based on research by Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

Turning a standard language model into a system that sees images often breaks its ability to write and reason with words. The new visual training shifts the model's internal representations, creating interference that even extra fine-tuning struggles to undo. Previous fixes added complex new layers to keep vision and text apart, but these methods bloat the system and slow it down.

Researchers have now found a simpler path with LinguDistill, an adapter-free technique that restores lost language skills without changing the model's architecture. The central difficulty was teaching the model using its own original, frozen text-only version as the teacher; the team solved it by sharing internal memory caches between corresponding layers, letting the pure-text expert guide the multimodal student directly. By selectively training on language-heavy data while leaving visual tasks untouched, the method recovers roughly 10% of the performance lost on language and knowledge tests, and it maintains comparable performance on vision-heavy tasks where the original model already excelled.

The result shows that fixing modality-specific degradation does not require extra modules, offering an efficient and practical recipe for building better multimodal AI.
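The "selective" part of the objective is the easiest piece to picture in code. Below is a minimal PyTorch sketch, not the authors' implementation: the function name, the `is_text_only` flag, the temperature, and the `alpha` weight are all illustrative assumptions, and the paper's layer-wise cache sharing is simplified here to plain logit-level distillation. The idea it captures is that a frozen copy of the original text-only model supplies a distillation signal only on language-heavy examples, while image-grounded examples keep the ordinary task loss.

```python
import torch
import torch.nn.functional as F

def selective_distillation_loss(student_logits, teacher_logits, labels,
                                is_text_only, temperature=2.0, alpha=0.5):
    """Blend the usual task loss with a distillation term, applied only on
    language-heavy (text-only) examples; vision examples keep the plain
    task loss so visual skills stay untouched.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids, -100 for ignored positions
    is_text_only: (batch,) bool, True where the example has no image
    """
    # Standard next-token cross-entropy on every example.
    task_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1), ignore_index=-100, reduction="none",
    ).view(labels.shape).mean(dim=-1)                      # (batch,)

    # Temperature-scaled KL divergence toward the frozen text teacher.
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits.detach() / t, dim=-1),    # teacher is frozen
        reduction="none",
    ).sum(dim=-1).mean(dim=-1) * (t * t)                   # (batch,)

    # Selectivity: the teacher only guides text-only examples.
    mask = is_text_only.float()
    return (task_loss + alpha * mask * kl).mean()
```

In practice the mask would simply record whether a batch contains image tokens; vision batches flow through with the unchanged task loss, which is what lets the method preserve the model's visual performance while repairing its language side.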

Source: LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation, arXiv:2604.00829 (https://arxiv.org/abs/2604.00829)

This post was generated by staik AI based on the academic publication above.