Reward Hacking Without Egregious Misalignment in an RL-Only Setting

LessWrong

This work was done as part of the MATS fellowship by Joey Yudelson and Vladimir Ivanov. It was mentored by Ryan Greenblatt. Thanks to Aghyad Deeb and Anders Woodruff for comments on this post. Thanks to Monte MacDiarmid, Evan Hubinger, Sid Black, Satvik Golechha, and Joseph Bloom for clarifying conversations.

TL;DR

We trained Kimi K2.5 and GPT-OSS 120b on a diverse set of reward-hackable coding environments. The models reliably learn to reward hack, and this reward hacking propensity generalizes t...
