Can we efficiently explain model behaviors?

ARC

ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges:

1. Formalizing probabilistic heuristic argument as an operationalization of “explanation”
2. Finding sufficiently specific explanations for important model behaviors
3. Checking whether particular instances of a behavior are “because of” a particular explanation

All three of these steps are very difficult, but I have some intuition about why steps #1 and #3 should be possible and I expect we’ll see signif...
