Can a stronger model fake being a weaker one? Mostly not

Rob Kopel·LessWrong·Community·June 14, 2026

TL;DR: Frontier models can be prompted into a weaker model's capability tier, but not its identity: they adopt a generic weaker-model error pattern, not a specific predecessor's per-question fingerprint. Targeted sandbagging, where a stronger model throttles down to a weaker one without reasoning, was largely a null result. One smaller, intriguing concern: prompting successor models to predict a predecessor's mistakes through latent (out-of-context) reasoning measurably improves imi...
