Consistency Training while Mitigating Obfuscation via Rate Matching

Sohaib Imran · LessWrong · Community · July 1, 2026

Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

Links: Paper | Code

TL;DR. Models condition their behavior on extraneous input features in undesirable ways: for example, on evaluation-likeness (resulting in evaluation gaming), or on the user's preferred answer (resulting in sycophancy). Consistency training teaches a model to behave the same whether or not an extraneous feature/cue is present in the input. Existing methods do this by fine-tuning LLMs to generate responses (BCT)...


