Can a stronger model fake being a weaker one? Mostly not

LessWrong

TL;DR: Frontier models can be prompted into a weaker model's capability tier, but not its identity: they adopt a generic weaker-model error pattern, not a specific predecessor's per-question fingerprint. Targeted sandbagging, where a stronger model throttles down to a weaker one without reasoning, came out as a largely null result. One smaller, intriguing concern: prompting successor models to predict a predecessor's mistakes through latent (out-of-context) reasoning measurably improves imi...
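The difference between matching a capability tier and matching a per-question fingerprint can be pictured with a small sketch. This is only an illustration under stated assumptions, not the post's methodology: the function names (`accuracy`, `error_overlap`), the choice of Jaccard overlap as the agreement measure, and the toy answer vectors are all hypothetical.

```python
def accuracy(correct):
    """Fraction of questions answered correctly (the 'capability tier')."""
    return sum(correct) / len(correct)


def error_overlap(a, b):
    """Jaccard overlap between the sets of questions each model gets wrong.

    A high value means the two models fail on the same questions
    (a shared 'fingerprint'); a low value means they fail on different ones.
    """
    wrong_a = {i for i, ok in enumerate(a) if not ok}
    wrong_b = {i for i, ok in enumerate(b) if not ok}
    union = wrong_a | wrong_b
    return len(wrong_a & wrong_b) / len(union) if union else 1.0


# Made-up toy data (1 = correct, 0 = wrong), for illustration only.
predecessor = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]  # hypothetical weaker model
imitation = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]    # hypothetical stronger model imitating it

print("accuracy:", accuracy(predecessor), accuracy(imitation))
print("error overlap:", round(error_overlap(predecessor, imitation), 3))
```

In this toy case both vectors have the same accuracy, yet they share only one wrong question out of seven that either gets wrong. That is the shape of result the summary describes: same tier, different fingerprint.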
