Discover hidden agent vulnerabilities with our new red-teaming breakthrough
Based on research by Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An, Seanie Lee
While previous safety tests focused on detecting harmful text from large language models, a new gap has emerged regarding multi-step actions using tools like the Model Context Protocol. Researchers argue that standard methods miss critical risks that only appear when an AI actually performs tasks rather than merely generating text. To address this, the team developed T-MAP, a technique that uses execution paths to automatically craft attacks that bypass safety filters and successfully execute harmful actions. The method proved highly effective, outperforming existing tools in achieving malicious goals across various environments. Crucially, it revealed previously unknown weaknesses in cutting-edge models including GPT-4, Gemini 1.5 Pro, Qwen3.5, and GLM-Edge, proving that safety strategies must evolve alongside the rise of autonomous agents.