Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Shubham Kumar, Narendra Ahuja·ArXiv cs.AI·AI Safety·May 4, 2026

arXiv:2605.00123v1 Announce Type: new Abstract: Safety trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in higher-stakes settings may similarly be vulnerable to such attacks. Prior work has studied jailbreak success by examining the model's intermediate representations, identifying directions in this sp...

Read full article →

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Related Articles