Predicting Rare LLM Failures with 30× Fewer Rollouts
TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space.

Authors: Francisco Pernice (MIT), Santiago Aranguri (Goodfire)

Introduction

A harmful behavior that occurs once in a million rollouts will rarely surface during pre-deployment testing, yet will almost inevitably appear after release. Labs usually have another resource available: less safety-...
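
To make the "once in a million" claim and the logit-space idea concrete, here is a minimal Python sketch. It rests on assumptions that are not from the article: rollouts are treated as independent, the interpolation is taken to be a plain linear mix of the two models' logits, and the names `prob_at_least_one`, `interpolate_logits`, and `alpha` are hypothetical. It illustrates the general idea only and is not the authors' estimator.

```python
import math


def prob_at_least_one(p: float, n: int) -> float:
    """Chance that an event with per-rollout rate p appears at least once
    in n rollouts, assuming the rollouts are independent."""
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny p does not lose precision.
    return -math.expm1(n * math.log1p(-p))


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_logits(base_logits, unsafe_logits, alpha):
    """Linear mix in logit space (an assumed form of the interpolation):
    alpha = 0 returns the base model's logits, alpha = 1 the less-safe variant's."""
    if len(base_logits) != len(unsafe_logits):
        raise ValueError("both models must score the same vocabulary")
    return [(1 - alpha) * b + alpha * u
            for b, u in zip(base_logits, unsafe_logits)]


if __name__ == "__main__":
    rate = 1e-6  # "once in a million rollouts"
    for n in (10_000, 1_000_000, 100_000_000):
        print(f"n={n:>11,}  P(at least one occurrence) = {prob_at_least_one(rate, n):.4f}")

    # Toy next-token logits over a 3-token vocabulary (made-up numbers).
    base = [2.0, 0.5, -3.0]
    unsafe = [1.0, 0.5, 1.0]
    for alpha in (0.0, 0.5, 1.0):
        probs = softmax(interpolate_logits(base, unsafe, alpha))
        print(f"alpha={alpha:.1f}  probs={[round(q, 3) for q in probs]}")
```

Under the independence assumption, a test suite of 10,000 rollouts has only about a 1% chance of surfacing a one-in-a-million behavior, while 100 million rollouts in deployment surface it with near certainty. The second loop shows how moving `alpha` toward the less-safe variant shifts probability mass toward tokens the base model almost never emits, which is what makes rare behaviors cheaper to observe.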