Verbalized Eval Awareness Inflates Measured Safety

LessWrong

We provide the most comprehensive evidence to date that verbalized eval awareness is present across models and benchmarks, finding that it correlates with safer behavior across models and causally inflates safe behavior in Kimi K2.5 on the Fortress benchmark. We further identify recurring prompt cues that trigger verbalized eval awareness and show that removing these cues significantly reduces it.

Authors: Santiago Aranguri (Goodfire), Joseph Bloom (UK AISI)

Introduction

As large language models be...

