Parallel Decoding: How Diffusion Slashes OCR Latency
Based on research by Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang
Autoregressive text generation imposes a fundamental latency bottleneck on document processing; a new diffusion framework breaks it, delivering 3.2x faster inference.
Researchers have fundamentally reimagined Optical Character Recognition (OCR) not as a sequential transcription task, but as an inverse rendering problem. Current systems force models to read left-to-right like humans, causing cumulative errors and significant lag when parsing complex documents filled with tables or formulas. This approach treats the layout as a byproduct of serialization rather than an inherent visual property.
The solution, MinerU-Diffusion, bypasses this serial constraint using parallel denoising under visual conditioning. Instead of predicting one token after another, the model iteratively refines the entire document structure simultaneously from noise. By employing a block-wise decoder and an uncertainty-driven curriculum learning strategy, the system achieves stable training on long sequences without the error propagation typical of autoregressive models.
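The iterative refinement loop described above can be sketched in a few lines. This is a minimal toy illustration of confidence-driven parallel decoding, not MinerU-Diffusion's actual architecture: the predictor, vocabulary, step count, and keep ratio are all placeholder assumptions, and a real system would condition every prediction on the document image.

```python
import random

MASK = "<mask>"

def toy_predictor(seq, position):
    """Stand-in for the denoising network: proposes a token and a
    confidence score for one masked position. (Hypothetical; a real
    model conditions on visual features of the page.)"""
    vocab = ["the", "cat", "sat", "on", "mat"]
    token = vocab[position % len(vocab)]
    confidence = random.uniform(0.3, 1.0)
    return token, confidence

def parallel_denoise(length, steps=4, keep_ratio=0.5):
    """Iteratively refine an all-masked sequence. Each step predicts
    every masked position in parallel, then commits only the most
    confident predictions and re-masks the rest
    (uncertainty-driven unmasking)."""
    seq = [MASK] * length
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # Predict all masked positions simultaneously.
        proposals = {i: toy_predictor(seq, i) for i in masked}
        # Commit the most confident fraction; leave the rest masked
        # for refinement in the next step.
        ranked = sorted(masked, key=lambda i: -proposals[i][1])
        n_keep = max(1, int(len(masked) * keep_ratio))
        for i in ranked[:n_keep]:
            seq[i] = proposals[i][0]
    # Final pass: fill any positions still masked.
    for i, t in enumerate(seq):
        if t == MASK:
            seq[i] = toy_predictor(seq, i)[0]
    return seq
```

The key contrast with autoregressive decoding is that each step touches every unresolved position at once, so the number of sequential model calls is a small constant (`steps`) rather than growing with sequence length.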
In practice, this shifts the performance ceiling for document parsing. The framework drastically reduces reliance on linguistic priors, allowing it to handle dense mathematical notation and complex tables more robustly than traditional vision-language models. This represents a tangible leap in speed and accuracy for automated data extraction at scale.
Source: "MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding" by Hejun Dong et al., https://arxiv.org/abs/2603.22458