Logit ROCs: Monitor TPR is linear in FPR in logit space

Kerrick Staley · Redwood Research · AI Safety · April 12, 2026

Summary: We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR) is linear in logit(FPR). Equivalently, benign and attack scores can be modeled with logistic distributions. This gives two practical benefits. It lets practitioners estimate ...
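The linear relationship stated in the summary can be illustrated with a minimal sketch. It assumes the model has the form logit(TPR) = a · logit(FPR) + b and fits `a` and `b` by ordinary least squares. The function names (`fit_logit_roc`, `predict_tpr`), the choice of least squares, and the ROC points are all hypothetical. The points are synthetic, generated to lie exactly on a line in logit space, and are not results from the article.

```python
import numpy as np


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(p) - np.log1p(-p)


def expit(x):
    """Inverse of logit (the logistic sigmoid)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))


def fit_logit_roc(fpr, tpr):
    """Least-squares fit of logit(TPR) = a * logit(FPR) + b.

    Assumption: plain least squares is used here for illustration only;
    the article may use a different fitting procedure.
    """
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    a, b = np.polyfit(logit(fpr), logit(tpr), deg=1)
    return a, b


def predict_tpr(fpr, a, b):
    """Predicted TPR at a given FPR under the fitted line."""
    return expit(a * logit(fpr) + b)


# Synthetic (made-up) ROC points that lie exactly on a line in logit space.
true_a, true_b = 0.8, 3.0
fpr = np.array([0.001, 0.003, 0.01, 0.03, 0.1])
tpr = expit(true_a * logit(fpr) + true_b)

a, b = fit_logit_roc(fpr, tpr)
tpr_at_half_percent = float(predict_tpr(0.005, a, b))
print(f"slope={a:.3f}, intercept={b:.3f}, "
      f"predicted TPR at 0.5% FPR={tpr_at_half_percent:.3f}")
```

On the logistic-distribution reading: if benign scores follow a logistic distribution with location 0 and scale s0, and attack scores follow one with location mu and scale s1, then sweeping the threshold gives logit(TPR) = (s0/s1) · logit(FPR) + mu/s1. Under that assumption the slope `a` is the ratio of the two scales and the intercept `b` is the separation of the distributions in units of the attack scale.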


