Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness

7vik·LessWrong·Community·June 30, 2026

Authors: Satvik Golechha, Sid Black, Joseph Bloom

Work done as part of the Model Transparency team at UK AISI. We consider this a small set of follow-up experiments that contributes more conceptual clarity and discussion than our previous work.

Executive Summary

In our recent work replicating MacDiarmid et al. with open models, we informed LLMs about vulnerabilities in a code environment, explicitly asked them not to exploit the hacks, and showed that during RL they learned to reward hack any...


