Why Reasoning AI Fails Before It Improves
Based on research by Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo
A popular belief holds that reinforcement learning makes models genuinely smarter while supervised fine-tuning merely teaches them to memorize, but new findings suggest this clean divide misses the mark. The researchers find that fine-tuning models to reason with long chains of thought does yield generalization benefits, but only under specific conditions that depend on how long the model is trained and what data it sees.
The study reveals that poor cross-domain performance often stems from stopping training too early rather than from a fundamental flaw in the approach. When training continues longer, performance dips initially before recovering and then improving, so short training runs unfairly underestimate a model's true potential. Data quality is equally critical: low-quality solutions hinder progress, whereas verified long chains of thought consistently boost results across different tasks.
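To make the early-stopping pitfall concrete, here is a minimal, self-contained sketch. The dip-then-recover curve in `eval_cross_domain`, its constants, and the patience-based `early_stop_step` criterion are all illustrative assumptions, not the authors' setup; the point is only that a naive stopper halts inside the dip and reports a score far below what a longer run reaches.

```python
import math

# Hypothetical cross-domain accuracy curve: a dip around step 300,
# then recovery past the starting point as training continues.
# The constants are invented to reproduce the dip-then-recover shape.
def eval_cross_domain(step: int) -> float:
    baseline = 0.55
    dip = 0.12 * math.exp(-((step - 300) / 150) ** 2)
    recovery = 0.25 / (1 + math.exp(-(step - 600) / 100))
    return baseline - dip + recovery

# Naive early stopping: give up after `patience` evaluations
# without improvement, returning the best step seen so far.
def early_stop_step(patience: int = 2, eval_every: int = 100,
                    max_steps: int = 1200) -> int:
    best, bad, best_step = -1.0, 0, 0
    for step in range(eval_every, max_steps + 1, eval_every):
        score = eval_cross_domain(step)
        if score > best:
            best, bad, best_step = score, 0, step
        else:
            bad += 1
            if bad >= patience:
                return best_step
    return best_step

stop = early_stop_step()
print(f"early stop at step {stop}: acc={eval_cross_domain(stop):.3f}")
print(f"full run to step 1200:   acc={eval_cross_domain(1200):.3f}")
```

Running this, the stopper quits near step 100 at roughly 0.53 accuracy, while the full run reaches roughly 0.80, mirroring how short runs undersell the approach.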
However, this improvement comes with a hidden trade-off that changes the conversation entirely. While reasoning capability grows significantly, safety alignment tends to degrade during the same process. This asymmetry shifts the question from whether reasoning fine-tuning works at all to exactly what conditions enable it and what costs must be accepted. Finally, stronger base models learn deep procedural patterns such as backtracking, while weaker ones merely mimic surface-level verbosity, showing that model capability matters as much as data quality.
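Under the same caveat, a small sketch shows one way to watch the trade-off as it develops: score each checkpoint on both a reasoning benchmark and a safety benchmark, and flag steps where capability rose while safety slipped below a floor. The checkpoint steps, the 0.90 floor, and both metrics are placeholders for illustration, not measurements from the paper.

```python
from dataclasses import dataclass

@dataclass
class CheckpointEval:
    step: int
    reasoning_acc: float    # e.g. held-out math/logic accuracy
    safety_refusal: float   # e.g. refusal rate on harmful prompts

# Flag checkpoints where reasoning improved over the previous
# checkpoint while safety fell below an acceptable floor.
def flag_tradeoff(history: list[CheckpointEval],
                  safety_floor: float = 0.90) -> list[int]:
    flagged = []
    for prev, cur in zip(history, history[1:]):
        if cur.reasoning_acc > prev.reasoning_acc and cur.safety_refusal < safety_floor:
            flagged.append(cur.step)
    return flagged

# Placeholder numbers showing the asymmetry: reasoning climbs
# steadily while safety erodes at the same time.
history = [
    CheckpointEval(step=0,    reasoning_acc=0.48, safety_refusal=0.97),
    CheckpointEval(step=400,  reasoning_acc=0.55, safety_refusal=0.93),
    CheckpointEval(step=800,  reasoning_acc=0.63, safety_refusal=0.88),
    CheckpointEval(step=1200, reasoning_acc=0.70, safety_refusal=0.84),
]
print("checkpoints needing safety remediation:", flag_tradeoff(history))
```

Here steps 800 and 1200 get flagged: reasoning kept improving while refusal rates sank below the floor, which is exactly the asymmetry the study warns about.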