Break the 1 Million Token Limit With This New Memory Model

Based on research by Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li

Long-term memory remains elusive for artificial intelligence. While human brains effortlessly retain a lifetime of experience, current large language models are typically capped at around one million tokens of context by architectural limitations. Existing workarounds either incur prohibitive latency or lose precision as context grows, failing on complex tasks like summarizing huge corpora or running agents with long interaction histories.

Memory Sparse Attention (MMSA) is a new framework designed to break these barriers. By decoupling memory capacity from reasoning capacity, the model achieves linear complexity in both training and inference. It remains exceptionally stable when scaling from 16,000 to 100 million tokens, showing less than 9 percent degradation compared to frontier models. Innovations such as scalable sparse attention and document-wise RoPE allow the system to manage memory content dynamically without slowing down.

Even more impressively, combining KV-cache compression with Memory Parallel techniques enables inference over sequences of up to 100 million tokens on just two A800 GPUs. The approach also supports multi-hop reasoning across scattered data segments, proving that scaling memory capacity no longer requires sacrificing speed or accuracy. Together, these results provide a scalable foundation for endowing general-purpose models with intrinsic, lifetime-scale memory capabilities previously thought impossible to achieve efficiently.
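The post doesn't spell out how MMSA's sparse attention works internally, but the general family of techniques it belongs to is easy to illustrate. The sketch below (my own toy construction, not the paper's algorithm; all names are invented) shows block-sparse attention where each query scores cheap mean-pooled block summaries and attends only to the top-k most relevant memory blocks, so cost per query depends on `top_k * block` rather than on total memory length:

```python
import numpy as np

def sparse_block_attention(q, k, v, block=4, top_k=2):
    """Toy block-sparse attention (illustrative, not MMSA itself).

    Each query attends only to the top_k memory blocks whose
    mean-pooled key summary scores highest, so per-query cost is
    O(top_k * block) instead of O(sequence length)."""
    n, d = k.shape
    nb = n // block
    # Mean-pool keys per block to get cheap block-level summaries.
    k_blocks = k[: nb * block].reshape(nb, block, d).mean(axis=1)
    out = np.zeros((q.shape[0], v.shape[1]))
    for i, qi in enumerate(q):
        # Score blocks against the query, keep only the top_k.
        scores = k_blocks @ qi
        keep = np.argsort(scores)[-top_k:]
        idx = np.concatenate(
            [np.arange(b * block, (b + 1) * block) for b in keep]
        )
        # Exact softmax attention, restricted to the kept tokens.
        att = np.exp(k[idx] @ qi / np.sqrt(d))
        att /= att.sum()
        out[i] = att @ v[idx]
    return out
```

When `top_k` covers every block this reduces to ordinary full attention; shrinking `top_k` trades a little precision for memory capacity, which is the trade-off the paper claims to manage well at scale.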
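Document-wise RoPE is likewise not detailed in this summary, but the underlying idea of rotary embeddings with per-document positions can be sketched. In the toy version below (my own assumption about the mechanism; function names are invented), position indices restart at zero at each document boundary, so rotary phases stay bounded within each document no matter how long the concatenated memory stream grows:

```python
import numpy as np

def document_wise_positions(doc_lengths):
    # Position indices restart at 0 for each document, keeping
    # within-document relative positions small regardless of how
    # many documents the memory stream concatenates.
    return np.concatenate([np.arange(n) for n in doc_lengths])

def apply_rope(x, positions, base=10000.0):
    # Standard rotary embedding: rotate consecutive feature pairs
    # by position-dependent angles. x has shape (seq, dim), dim even.
    dim = x.shape[1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    ang = np.outer(positions, inv_freq)          # (seq, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

pos = document_wise_positions([3, 2, 4])
# → [0 1 2 0 1 0 1 2 3]
```

Because rotations preserve vector norms and each document's positions never exceed its own length, this scheme sidesteps the position-extrapolation failures that plague absolute indexing at 100M-token scale.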

MMSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens (Chen et al.)

Source: arXiv:2603.23516

This post was generated by staik AI based on the academic publication above.