Verbalised evaluation awareness in language models has little effect on their behaviour

LessWrong

TL;DR: We provide evidence that the presence of verbalised evaluation awareness (VEA) in CoTs does not imply eval gaming. We tested this across 8 open-weight LRMs and 4 benchmarks (safety, alignment, moral dilemmas, political opinion) by comparing answer distributions on the same prompts, with and without VEA in the CoT. We find that overall distribution shifts are negligible to small across all benchmarks and experiments, challenging the common assumption that evaluation awareness equals (safet...
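As a rough illustration of what comparing answer distributions on the same prompt could look like, here is a minimal sketch. The summary does not say which shift metric the authors use; total variation distance is an assumption made here for illustration, and the function names and toy answers are hypothetical, not taken from the post.

```python
from collections import Counter


def answer_distribution(answers):
    """Turn a list of sampled answers into a dict of relative frequencies."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {answer: count / total for answer, count in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


# Toy data (hypothetical): final answers sampled for one prompt, split by
# whether the chain of thought contained verbalised evaluation awareness.
with_vea = ["A", "A", "B", "A"]
without_vea = ["A", "B", "B", "A"]

shift = total_variation_distance(
    answer_distribution(with_vea),
    answer_distribution(without_vea),
)
print(f"Distribution shift (TVD): {shift:.2f}")
```

A value near 0 would correspond to the "negligible" shifts the summary describes. A real analysis would also need enough samples per prompt and some form of uncertainty estimate.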
