Research note on negated reward hacking
This work was done as part of BlueDot's Technical AI Safety Project Sprint and should be treated as an informal report of preliminary results obtained over a couple of days.

The code is available on GitHub, the negated dataset and model checkpoints are available on HuggingFace, and the hack rollouts are available at this viewer.

Introduction

Emergent misalignment (EM) is a phenomenon where narrow fine-tuning can cause broadly misaligned behaviors in current LLMs. Both Anthropic (MacDiarmid et al.,...