Predicting Rare LLM Failures with 30× Fewer Rollouts

Santiago Aranguri · LessWrong · Community · May 13, 2026

TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space.

Authors: Francisco Pernice (MIT), Santiago Aranguri (Goodfire)

Introduction

A harmful behavior that occurs once in a million rollouts will rarely surface during pre-deployment testing, yet will almost inevitably appear after release. Labs usually have another resource available: less safety-...
