Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness

LessWrong

Authors: Satvik Golechha, Sid Black, Joseph Bloom

Work done as part of the Model Transparency team at UK AISI. We consider this a small set of follow-up experiments that contributes more conceptual clarity and discussion than our previous work.

Executive Summary

In our recent work replicating MacDiarmid et al. with open models, we informed LLMs about vulnerabilities in a code environment, explicitly asked them not to exploit the hacks, and showed that during RL they learned to reward hack any...

