Consistency Training while Mitigating Obfuscation via Rate Matching

LessWrong

Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

Links: Paper | Code

TL;DR. Models condition their behavior on extraneous input features in undesirable ways: for example, on evaluation-likeness (resulting in evaluation gaming) or on the user's preferred answer (resulting in sycophancy). Consistency training teaches a model to behave the same whether or not an extraneous feature/cue is present in the input. Existing methods do this by fine-tuning LLMs to generate responses (BCT)...
