
Your AI Assistant Now Speaks 1.7x Faster

Based on research by Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun

Autoregressive language models generate text one token at a time, even when the next token is obvious. That hesitation slows down conversations and wastes compute. MARS, a new technique, attacks this bottleneck by teaching existing models to predict multiple tokens in a single step, with no changes to the architecture and no extra parameters. Where other methods rely on separate draft models or additional prediction heads, MARS simply retrains the model on standard instruction data.

The results are striking: when allowed to emit multiple tokens per forward pass, the system matches baseline accuracy while delivering 1.5 to 1.7 times higher throughput. Combined with a block-level caching strategy, the team measured up to a 1.71x speedup over standard autoregressive generation on Qwen2.5-7B.

Perhaps most usefully, the approach lets a serving system trade speed against quality in real time: tightening or loosening a confidence threshold changes how many tokens are accepted per step, so operators can respond to request load without swapping models or restarting services.
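
To make the decoding idea concrete, here is a minimal sketch of threshold-gated multi-token acceptance. It assumes a hypothetical MARS-style model whose single forward pass returns logits for the next `block_size` positions; the function and parameter names are illustrative, not the paper's API, and the actual MARS decoding rule may differ in detail.

```python
import torch
import torch.nn.functional as F

def multi_token_decode(model, input_ids, eos_id,
                       max_new_tokens=128, block_size=4, threshold=0.9):
    """Illustrative threshold-gated multi-token decoding.

    Assumption: `model(ids)` returns logits whose last `block_size`
    positions predict the next `block_size` tokens, as a MARS-style
    retrained model would. This is a sketch, not the paper's code.
    """
    ids = input_ids
    produced = 0
    while produced < max_new_tokens:
        with torch.no_grad():
            logits = model(ids)                  # one forward pass
        future = logits[0, -block_size:]         # k future positions
        probs = F.softmax(future, dim=-1)
        conf, tokens = probs.max(dim=-1)         # per-position confidence

        # Always accept the first token (ordinary next-token prediction),
        # then extend the block only while confidence clears the threshold.
        keep = 1
        while keep < block_size and conf[keep] >= threshold:
            keep += 1

        accepted = tokens[:keep].unsqueeze(0)
        ids = torch.cat([ids, accepted], dim=-1)
        produced += keep
        if eos_id in accepted:
            break
    return ids
```

In this sketch, lowering `threshold` under heavy load accepts longer blocks per pass (more speed, slightly riskier tokens), while raising it converges toward ordinary one-token-at-a-time decoding. That single knob is what would allow latency tuning at run time without redeploying the model.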

Source: Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun, "MARS: Enabling Autoregressive Models Multi-Token Generation," arXiv:2604.07023, https://arxiv.org/abs/2604.07023

This post was generated by staik AI based on the academic publication above.