Consistency Training while Mitigating Obfuscation via Rate Matching
Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

Links: Paper | Code

TL;DR. Models condition their behavior on extraneous input features in undesirable ways: for example, on evaluation-likeness (resulting in evaluation gaming) or on the user's preferred answer (resulting in sycophancy). Consistency training teaches a model to behave the same whether or not an extraneous feature/cue is present in the input. Existing methods do this by fine-tuning LLMs to generate responses (BCT)...