
AI Sees Better But Forgets How To Talk

Based on research by Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

Turning a standard language model into a system that sees images often breaks its ability to write and reason with words. The new visual training shifts the model's internal representations, creating interference that even extra fine-tuning struggles to undo. Previous fixes added complex new layers to keep vision and text apart, but these methods bloat the system and slow it down.

Researchers have now found a simpler path with LinguDistill, an adapter-free technique that restores lost language skills without changing the model's architecture. The central difficulty was teaching the model using its own original, frozen text-only version as the teacher; the team solved it by sharing internal memory caches between corresponding layers, letting the pure-text expert guide the multimodal student directly. By selectively training on language-heavy data while leaving visual tasks untouched, the method recovers roughly 10% of the performance lost on language and knowledge tests, and it maintains comparable performance on vision-heavy tasks where the original model already excelled.

The result shows that fixing modality-specific degradation does not require extra modules, offering an efficient and practical recipe for building better multimodal AI.
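The "selective" part of the objective is the easiest piece to picture in code. Below is a minimal PyTorch sketch, not the authors' implementation: the function name, the `is_text_only` flag, the temperature, and the `alpha` weight are all illustrative assumptions, and the paper's layer-wise cache sharing is simplified here to plain logit-level distillation. The idea it captures is that a frozen copy of the original text-only model supplies a distillation signal only on language-heavy examples, while image-grounded examples keep the ordinary task loss.

```python
import torch
import torch.nn.functional as F

def selective_distillation_loss(student_logits, teacher_logits, labels,
                                is_text_only, temperature=2.0, alpha=0.5):
    """Blend the usual task loss with a distillation term, applied only on
    language-heavy (text-only) examples; vision examples keep the plain
    task loss so visual skills stay untouched.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids, -100 for ignored positions
    is_text_only: (batch,) bool, True where the example has no image
    """
    # Standard next-token cross-entropy on every example.
    task_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1), ignore_index=-100, reduction="none",
    ).view(labels.shape).mean(dim=-1)                      # (batch,)

    # Temperature-scaled KL divergence toward the frozen text teacher.
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits.detach() / t, dim=-1),    # teacher is frozen
        reduction="none",
    ).sum(dim=-1).mean(dim=-1) * (t * t)                   # (batch,)

    # Selectivity: the teacher only guides text-only examples.
    mask = is_text_only.float()
    return (task_loss + alpha * mask * kl).mean()
```

In practice the mask would simply record whether a batch contains image tokens; vision batches flow through with the unchanged task loss, which is what lets the method preserve the model's visual performance while repairing its language side.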

Source: LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation, arXiv:2604.00829 (https://arxiv.org/abs/2604.00829)

This post was generated by staik AI based on the academic publication above.