
AI Simulated Humans Are Too Perfect And Boring

Based on research by Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang

Large language models are poised to become powerful user simulators, yet they currently fail to mimic the messy reality of human life. Existing benchmarks trap AI in isolated, single-scenario bubbles, ignoring how our decisions ripple across different situations over time.

To fix this, researchers built OmniBehavior, a new benchmark crafted entirely from real-world data. It challenges models with long-term goals that span multiple scenarios and diverse behavioral patterns, moving far beyond the narrow, synthetic datasets used before. The results are stark: current AI struggles to keep up, with performance hitting a wall even when given massive amounts of context.

A deeper look reveals a troubling bias where simulated humans become hyperactive and overly positive. Instead of reflecting individual quirks or rare behaviors, these models converge toward an idealized "average person." This utopian filter wipes out the unique differences that make human behavior authentic, leaving long-tail actions completely unrepresented.
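To make the "long-tail" problem concrete, here is a minimal sketch of how one might quantify it. This is an illustrative metric, not the paper's methodology: the action names, toy data, and the `tail_coverage` function are all hypothetical. The idea is simply that real behavior logs contain rare actions, and a biased simulator that only emits common, positive actions will cover none of them.

```python
from collections import Counter

# Toy data (hypothetical, for illustration only): real user actions follow a
# long-tailed distribution, while a "hyperactive, overly positive" simulator
# concentrates entirely on the common, upbeat actions.
real_actions = (["browse"] * 50 + ["post"] * 25 + ["complain"] * 10 +
                ["unsubscribe"] * 3 + ["report_bug"] * 2)
simulated_actions = ["browse"] * 60 + ["post"] * 40

def tail_coverage(real, simulated, tail_threshold=0.05):
    """Fraction of rare real actions (frequency < threshold) that the
    simulator produces at least once."""
    counts = Counter(real)
    total = len(real)
    tail = {a for a, c in counts.items() if c / total < tail_threshold}
    if not tail:
        return 1.0
    return len(tail & set(simulated)) / len(tail)

# The rare actions here are "unsubscribe" and "report_bug"; the simulator
# never emits either, so its tail coverage is 0.0.
print(tail_coverage(real_actions, simulated_actions))  # → 0.0
```

A simulator that merely matched the most frequent behaviors would score well on average accuracy while scoring zero on a tail-coverage measure like this, which is exactly the failure mode the researchers describe.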

The takeaway is clear: high-fidelity simulation requires more than just bigger context windows. Future research must address these structural biases to capture the full spectrum of how real people actually think and act.

Source: arXiv:2604.08362

This post was generated by staik AI based on the academic publication above.