AI Produces Science Without Actually Thinking

Based on research by Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan

AI systems are increasingly tasked with conducting autonomous scientific research, but a new study reveals a disturbing gap: these agents can produce results without actually reasoning like scientists. This matters because if the underlying logic is flawed, the knowledge they generate cannot be trusted, regardless of how accurate the final output appears.

Researchers analyzed over 25,000 runs of large language model-based scientific agents across eight domains, examining both task performance and the epistemological structure of the agents' reasoning. The findings show that the base model drives behavior far more than the surrounding software scaffolding. Crucially, the agents ignored available evidence in 68% of their reasoning processes and rarely combined convergent evidence from multiple tests to refine their beliefs.

The core problem is an illusion of competence. These agents persist in flawed reasoning patterns even when successful examples are provided as context, and the unreliability compounds over repeated trials, especially in complex domains. Outcome-based evaluations fail to detect these deep-seated failures because the agents can still execute workflows and produce outputs that look correct on the surface.
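To make that distinction concrete, here is a minimal, hypothetical sketch (not taken from the paper; the trace format and checks are invented for illustration) contrasting an outcome-only check with a process-aware check. An agent trace whose final answer happens to be correct can pass the first while failing the second, because none of its intermediate steps ever consulted the evidence:

```python
# Hypothetical illustration of why outcome-only evaluation can miss broken reasoning.
# The Step/Trace structures and checks below are invented for this example; they
# are not the instruments used in the study.

from dataclasses import dataclass

@dataclass
class Step:
    claim: str
    cites_evidence: bool  # did this step reference any observation or test result?

@dataclass
class Trace:
    steps: list[Step]
    final_answer: str

def outcome_check(trace: Trace, expected: str) -> bool:
    # Scores only the final output, as a typical benchmark does.
    return trace.final_answer == expected

def process_check(trace: Trace) -> bool:
    # Requires every intermediate claim to be grounded in evidence.
    return all(step.cites_evidence for step in trace.steps)

# An agent that guesses the right answer while ignoring its own experiments:
lucky = Trace(
    steps=[
        Step("Hypothesis: compound X is stable", cites_evidence=False),
        Step("Conclusion follows from the hypothesis", cites_evidence=False),
    ],
    final_answer="stable",
)

print(outcome_check(lucky, "stable"))  # True  -> looks competent
print(process_check(lucky))            # False -> reasoning never touched evidence
```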

The takeaway is stark: current AI agents execute scientific tasks but do not exhibit the self-correcting patterns essential to genuine scientific inquiry. Until reasoning itself becomes a primary training target, the knowledge produced by these systems remains unjustified by the process that generated it. We must stop confusing workflow execution with actual scientific understanding.

Source: arXiv:2604.18805

This post was generated by staik AI based on the academic publication above.