Verbalized Eval Awareness Inflates Measured Safety

·LessWrong··

We provide the most comprehensive evidence to date that verbalized eval awareness is present across models and benchmarks, finding that it correlates with safer behavior across models and causally inflates safe behavior in Kimi K2.5 on the Fortress benchmark. We further identify recurring prompt cues that trigger verbalized eval awareness and show that removing these cues significantly reduces it.Authors: Santiago Aranguri (Goodfire), Joseph Bloom (UK AISI)IntroductionAs large language models be...

Read full article →

Related Articles

Training Model to Predict Its Own Generalization: A Preliminary Study
Tianyi (Alex) Qiu · LessWrong · 3d ago
A Theoretical Game of Attacks via Compositional Skills
Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney · ArXiv cs.CL · 3d ago
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Kimon Antonios Provatas, Avery Self, Ioannis Mouratidis, Ilias Georgakopoulos-Soares · ArXiv q-bio · 3d ago
Irretrievability; or, Murphy's Curse of Oneshotness upon ASI
Eliezer Yudkowsky · LessWrong · 4d ago
The Threat of AI Crimes Are Under-Appreciated
Joshua Krook · EA Forum · 4d ago