Can we efficiently explain model behaviors?

ARC

ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges:

1. Formalizing probabilistic heuristic argument as an operationalization of “explanation”
2. Finding sufficiently specific explanations for important model behaviors
3. Checking whether particular instances of a behavior are “because of” a particular explanation

All three of these steps are very difficult, but I have some intuition about why steps #1 and #3 should be possible and I expect we’ll see signif...

