Logit ROCs: Monitor TPR is linear in FPR in logit space

·Redwood Research··

SummaryWe study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR) is linear in logit(FPR). Equivalently, benign and attack scores can be modeled with logistic distributions. This gives two practical benefits. It lets practitioners estimate ...

Read full article →

Related Articles

Training Model to Predict Its Own Generalization: A Preliminary Study
Tianyi (Alex) Qiu · LessWrong · 3d ago
A Theoretical Game of Attacks via Compositional Skills
Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney · ArXiv cs.CL · 3d ago
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Kimon Antonios Provatas, Avery Self, Ioannis Mouratidis, Ilias Georgakopoulos-Soares · ArXiv q-bio · 3d ago
Irretrievability; or, Murphy's Curse of Oneshotness upon ASI
Eliezer Yudkowsky · LessWrong · 4d ago
Verbalized Eval Awareness Inflates Measured Safety
Santiago Aranguri · LessWrong · 4d ago