Verbalised evaluation awareness in language models has little effect on their behaviour
TL;DR: We provide evidence that verbalised evaluation awareness (VEA) in a model's chain of thought (CoT) does not imply evaluation gaming. We tested this across 8 open-weight large reasoning models (LRMs) and 4 benchmarks (safety, alignment, moral dilemmas, political opinion) by comparing answer distributions on the same prompts, with and without VEA in the CoT. We find that overall distribution shifts are negligible to small across all benchmarks and experiments, challenging the common assumption that evaluation awareness equals (safet...
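To make "comparing answer distributions" concrete, here is a minimal sketch of one way such a shift could be quantified. It assumes categorical answers and uses total variation distance as the metric. Both are illustrative assumptions, not necessarily what the article uses, and the function names and toy labels are hypothetical.

```python
from collections import Counter


def answer_distribution(answers):
    """Normalise a list of categorical answers into a probability distribution."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint).

    Assumption: this metric is chosen for illustration only.
    """
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Hypothetical toy data: answers to the same prompts, split by whether
# the CoT contained verbalised evaluation awareness.
with_vea = ["refuse", "refuse", "comply", "refuse"]
without_vea = ["refuse", "comply", "comply", "refuse"]

shift = total_variation_distance(
    answer_distribution(with_vea),
    answer_distribution(without_vea),
)
print(f"distribution shift (TVD): {shift:.2f}")
```

A value near 0 would correspond to the "negligible to small" shifts described above; what threshold counts as small is a judgment call that this sketch does not settle.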