Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness
Authors: Satvik Golechha, Sid Black, Joseph Bloom

Work done as part of the Model Transparency team at UK AISI. We consider this a small set of follow-up experiments that contributes more conceptual clarity and discussion than our previous work.

Executive Summary

In our recent work replicating MacDiarmid et al. with open models, we informed LLMs about vulnerabilities in a code environment, explicitly asked them not to exploit the hacks, and showed that during RL they learned to reward hack any...