
Rare Inputs Break AI; Common Text Fixes It

Based on research by Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong

Imagine asking a capable AI to solve a complex problem, only to watch it stumble because the question itself is phrased in a way that is rare in its training data. New research suggests that large language models behave much like human readers: they understand and perform best on text they have encountered frequently before. This challenges the assumption that models simply need more data of any kind; instead, they benefit most from specific, common patterns.

The researchers propose a "Textual Frequency Law": frequent text should be prioritized for both prompting and fine-tuning. To test it, they built a system that estimates how often sentences appear online, rephrases rare inputs into more common equivalents, and fine-tunes models with a curriculum that starts from easy, frequent examples before moving to harder ones.

The results are striking: by leaning on what is already well represented on the web, these techniques significantly boost performance in math reasoning, translation, commonsense reasoning, and agentic tool calling. The takeaway is clear: to build smarter AI, we should steer it toward the language patterns that dominate everyday text rather than forcing it to master every obscure phrasing immediately.

Source: Adam's Law: Textual Frequency Law on Large Language Models by Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong et al., https://arxiv.org/abs/2604.02176
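To make the idea concrete, here is a minimal sketch of the frequency-then-curriculum recipe. This is not the paper's code: the toy word-frequency table stands in for the web-scale statistics the researchers estimate, and all names (`frequency_score`, `curriculum_order`) are illustrative assumptions.

```python
from collections import Counter

# Toy corpus standing in for web-scale frequency statistics
# (assumption: the actual system estimates frequency against real web text).
CORPUS = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat and the dog are friends ."
).split()
TOKEN_FREQ = Counter(CORPUS)

def frequency_score(sentence: str) -> float:
    """Average per-token corpus frequency: higher means more common text."""
    tokens = sentence.lower().split()
    return sum(TOKEN_FREQ[t] for t in tokens) / max(len(tokens), 1)

def curriculum_order(examples: list[str]) -> list[str]:
    """Order training examples from most to least frequent (easy-first)."""
    return sorted(examples, key=frequency_score, reverse=True)

examples = [
    "the cat sat on the mat",       # common phrasing
    "feline perched atop textile",  # rare phrasing of the same idea
]
ordered = curriculum_order(examples)
print(ordered[0])  # the common phrasing comes first
```

The rephrasing step would slot in before scoring: a sentence whose score falls below some threshold gets rewritten into a more common equivalent (in the paper's system, presumably by a language model) before being used for prompting or training.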


This post was generated by staik AI based on the academic publication above.