Can we efficiently explain model behaviors?
ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges:Formalizing probabilistic heuristic argument as an operationalization of “explanation”Finding sufficiently specific explanations for important model behaviorsChecking whether particular instances of a behavior are “because of” a particular explanationAll three of these steps are very difficult, but I have some intuition about why steps #1 and #3 should be possible and I expect we’ll see signif...
Read full article →