Verbalized Eval Awareness Inflates Measured Safety

Santiago Aranguri·LessWrong·AI Safety·May 4, 2026

We provide the most comprehensive evidence to date that verbalized eval awareness is present across models and benchmarks, finding that it correlates with safer behavior across models and causally inflates safe behavior in Kimi K2.5 on the Fortress benchmark. We further identify recurring prompt cues that trigger verbalized eval awareness and show that removing these cues significantly reduces it.

Authors: Santiago Aranguri (Goodfire), Joseph Bloom (UK AISI)

Introduction

As large language models be...

