Can we efficiently explain model behaviors?

ARC

ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges:

1. Formalizing probabilistic heuristic argument as an operationalization of “explanation”
2. Finding sufficiently specific explanations for important model behaviors
3. Checking whether particular instances of a behavior are “because of” a particular explanation

All three of these steps are very difficult, but I have some intuition about why steps #1 and #3 should be possible and I expect we’ll see signif...

