Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
TL;DR: The GSM-Symbolic paper (ICLR 2025) purported to show that language models rely on pattern matching rather than genuine reasoning, by demonstrating that perturbing questions so they break the pattern of the original catastrophically reduces model performance. Rerunning the experiments in March 2026 with GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5 shows that we precisely replicate the original findings only when we do not audit out examples that may actually be ...