Research note on negated reward hacking

LessWrong

This work was done as part of BlueDot's Technical AI Safety Project Sprint and should be treated as an informal report of preliminary results obtained over a couple of days.

The code is available on GitHub, the negated dataset and model checkpoints are available on HuggingFace, and the hack rollouts are available at this viewer.

Introduction

Emergent misalignment (EM) is a phenomenon where narrow fine-tuning can cause broadly misaligned behaviors in current LLMs. Both Anthropic (MacDiarmid et al.,...

