Fragile Correctness: Cases of reasoning harming performance

LessWrong

Sometimes a reasoning model appears to pass through the correct answer before ending up wrong.

Motivation

Figure 1: From the Opus 4.8 system card (page 196)

Figure 1 shows that Opus 4.8 on max thinking has a lower pass rate on SWE-Bench Pro than Opus 4.8 on x-high thinking. There are further examples of this in the Fable and Mythos system card in the appendix (Figures A1 and A2). This counter-intuitive result means that using more tokens has reduced accuracy. Inference-time scaling helps on average,...

