Fragile Correctness: Cases of reasoning harming performance
Sometimes a reasoning model appears to pass through the correct answer before ending up wrong

Motivation

Figure 1: From the Opus 4.8 system card (page 196)

Figure 1 shows that Opus 4.8 on max thinking has a lower pass rate on SWE-Bench Pro than Opus 4.8 on x-high thinking. The appendix has further examples of this from the Fable and Mythos system card (Figures A1 and A2). This counter-intuitive result means that spending more tokens reduced accuracy. Inference-time scaling helps on average,...